Variable definitions that control algorithmic decisions below. These are placed here to draw our attention to them: if we wish to change our decisions throughout the pipeline, we can adjust these values in one place rather than having to hunt them down below.

#parameters that control our algorithmic decisions about what constitutes bad head motion
#we put these here to draw our attention to them and to make them easy to change in one place
fd_spike_mm <- 0.3 #movement greater than this threshold (in mm) is considered a movement spike
p_spikes_cutoff <- 0.2 #if there are more than this proportion of spikes (based on fd_spike_mm), exclude the subject
fd_max_mm <- 10 #any movement greater than this amount should lead the subject to be excluded
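
To make the intent concrete, here is a minimal sketch (not the actual pipeline code) of how these parameters might drive an exclusion decision. The function exclude_subject and its input fd are invented for illustration; fd is assumed to be a numeric vector of framewise displacement values (in mm), one per volume.

#a hypothetical exclusion rule built on the parameters above
exclude_subject <- function(fd, spike_mm = fd_spike_mm,
                            p_cutoff = p_spikes_cutoff, max_mm = fd_max_mm) {
  p_spikes <- mean(fd > spike_mm) #proportion of volumes that are movement spikes
  p_spikes > p_cutoff || max(fd) > max_mm #TRUE means exclude the subject
}

exclude_subject(c(0.05, 0.12, 0.08, 0.45, 0.10)) #FALSE: 20% spikes, max 0.45 mm

Because the thresholds enter only as default arguments, changing a decision means editing one value at the top of the script rather than several hard-coded numbers scattered through the pipeline.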

Introduction

One of the most important skills you learn along the data science path is not specific to R, but is crucial to solving data problems. People have called this different things, such as “thinking algorithmically”, “thinking like an engineer”, or “solving problems like a computer programmer.” If you’re not an expert on algorithms and are neither a computer scientist nor an engineer, you might ask, “What does this really mean?” The answer is hard to convey in the abstract; it depends heavily on the experiences you accrue as you solve data problems.

My goal here is to convey some of the lessons I’ve learned over the past 10 years and give you an approximate framework for solving data management and analysis problems that extend beyond the basics of data wrangling and programming.

Thesis

One of the greatest threats to reproducible science is the use of data analysis scripts that are poorly organized, poorly documented, and that reflect the sometimes unpredictable sequence of decisions made by the scientist. If it seems hacky, it probably is hacky – and may not be reproducible!

When we solve analysis problems using code, we often begin by hacking our way through the forest. Each step is solved sequentially without necessarily anticipating or representing the additional steps that lie ahead. In part, this is inevitable: we cannot easily predict the complexity of the problem or the intermediate solutions we will test. Therefore, we may make decisions about how to handle a problem, only to reverse our approach later on. One difficulty is that we may not ‘undo’ all of our earlier steps, or we may suffer from copy-paste errors in the code that lead to inconsistencies in the solution. These human errors are inevitable, but they pose a serious threat to reproducibility. Therefore, we need to implement programming practices that mitigate the risk of code yielding unexpected, invalid, or unpredictable results.

Literate programming documents like R Markdown and Jupyter notebooks make the flow of an analysis more accessible, but they do not by themselves enforce coding practices that make analyses less vulnerable to mistakes.

Corollary: The more complex the problem, the more we must test each piece of the pipeline as we scale up from prototype to general solution.
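
Unit tests are one concrete way to do this. Here is a small sketch using the testthat package (assumed to be installed); the helper p_spikes is hypothetical and exists only for this example.

library(testthat) #assumes the testthat package is installed

#hypothetical helper: proportion of volumes whose movement exceeds a threshold
p_spikes <- function(fd, threshold) mean(fd > threshold)

test_that("p_spikes computes the proportion of movement spikes", {
  expect_equal(p_spikes(c(0.1, 0.5, 0.2), threshold = 0.3), 1/3)
  expect_equal(p_spikes(c(0.1, 0.1), threshold = 0.3), 0)
})

Once a piece passes tests like these, we can build on it with confidence as the pipeline grows.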

The value of functional programming

One of the greatest assets to reproducible (data) science is the use of programming paradigms that enforce good practices. In particular, functional programming, in which we view data analysis as a set of predictable transformation functions, supports a clearheaded approach to data problems that blocks many programming crutches.
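
As a toy illustration (the function names here are invented for the example), an analysis step becomes a chain of small transformations, each with predictable inputs and outputs:

#each function takes data in and returns transformed data out, with no side effects
center <- function(x) x - mean(x)
rescale <- function(x) x / sd(x)
standardize <- function(x) rescale(center(x))

standardize(c(2, 4, 6, 8)) #each step can be tested and trusted in isolation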

When we write a function, we demand that the caller provide specific inputs (arguments) and that the function produce a predictable set of outputs. This should be an unbreakable covenant that leads us to trust functions (and, in a way, to forget how they work) after we have validated that they work as expected.

Functions should validate their inputs. For example, if we are expecting a set of time codes for when events happened in the experiment, times should not be negative! Or, if the function is expecting a 2D matrix and we pass in a vector, what should happen?
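
Here is a minimal sketch of such checks in base R; the function get_event_durations and its expectations are invented for illustration, assuming event_times is a numeric vector of event onset times in seconds.

#a minimal input-validation sketch
get_event_durations <- function(event_times) {
  stopifnot(is.numeric(event_times), is.null(dim(event_times))) #numeric vector, not a matrix
  if (any(event_times < 0)) stop("event_times contains negative values")
  diff(event_times) #time elapsed between successive events
}

get_event_durations(c(0, 2.5, 6, 10)) #returns 2.5 3.5 4.0
#get_event_durations(c(-1, 3)) #throws an error: times cannot be negative

Failing loudly at the point of entry is far better than letting a bad input propagate silently through the rest of the pipeline.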

Motivating example: diagnosing and making decisions about head movement in fMRI

I know most of you aren’t in the neuroimaging world, and my goal isn’t to confuse you with all of the details (as people like to say, “it’s complicated”). The example is not pitched at the imaging crowd; rather, it demonstrates a real-world data problem that has parallels to many other data types (behavior during experiments, self-reports distributed across files and folders, etc.).

With regard to fMRI, I’ll simply convey that head movement during a scan is a huge bugaboo because it fundamentally corrupts the signal. To wit: