Baby Steps for a Reproducible Workflow in R

Author

Juuso Repo

Published

September 12, 2024

The idea of methods reproducibility is to provide sufficient documentation of procedures and data to enable repeating the same procedures, in a similar or different context (Goodman et al., 2016). A reproducible workflow, then, is a systematically organized sequence of steps to accomplish this. As the value placed on reproducibility continues to rise within and beyond academia, mastering these practices is increasingly beneficial. This tutorial aims to present the basics rather than delve into all the complexities of all advanced stuff. It covers steps from preparing data for analysis to creating reproducible tables and figures, to be included in a paper or presentation.

The basic steps for reproducible workflow covered here are:

flowchart LR
    step1("1. Set up the project <br> ")
    step2("2. Prepare and analyse data <br> and generate outputs")
    step3("3. Prepare code and data <br> for sharing")
    step4("4. Share code and data in public")
    
    step1 --> step2 --> step3 --> step4

A reproducible workflow enables you to:

Adjust code or data and efficiently re-run all analyses.
Conduct a review before publishing, covering phases from data preparation to generation of outputs.
Understand your research processes months later, effectively collaborating with your future self.
Share or pass along your project and provide the necessary code for others to extend your work.
Enhance the credibility and trustworthiness of your research.

The guide and provided code snippets serve as a template for initiating new projects or creating reproducible versions of existing ones. The best way to learn is to download the notebooks, experiment with the examples in RStudio, and adapt them to your project and preferences. To follow along, ensure that you have R and RStudio installed and up to date.

Tip

The tutorial is a work in progress. For any suggestions for improvements or requests for additional content, please drop a message to the author.

1 STEP ONE: Creating project and folder structure

We begin with creating a tidy nest for all files related to the study/project. Or at least for the files related to data and analysis! Creating a coherent folder structure that can be used in all your future research projects will facilitate collaboration and make your life easier - with more time for reproduction!

1.1 Create a RStudio project

In RStudio, start fresh and select File - New Project - New Directory - Quarto Project. It creates you a project folder with few files, and shows you that folder in the Files window.
Copy and run the below code in RStudio console window to download this notebook to your project folder. (or download the zip file and copy the files to your project folder).

download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/babysteps.qmd", "babysteps.qmd")
Open the notebook in RStudio.
Run the below code chunk to install required packages. The Here package simplifies file path management by making all paths relative to the project root (top-level folder).

Code: Load packages

## Load packages: knitr and here
# Here  for file path management and Knitr for dynamic report generation
# Check if packages are installed, and install if not
if (!requireNamespace("knitr", quietly = TRUE)) {install.packages("knitr")}
if (!requireNamespace("here", quietly = TRUE)) {install.packages("here")}
# Load packages
require(knitr)
require(here)

1.2 Creating the project folder structure

The code below will create a coherent folder structure for your project. You can modify the folder names for your own project as needed. Alternatively, you may create folders without code. Note these good practices when working with folders and files:

Naming conventions for folders and files: only lowercase letters; no spaces but-dashes (or_underscores) instead. Prefix numbers with leading zeros can be used to ensure automatic sorting (01-data, 02-scripts…).
Relative file paths. Always use relative folder paths for reproducibility. Relative means data/file.csv instead of absolute one: C:/Users/Documents…./file.csv
Here -function. Wrap the file path with the here function to ensure the path is always relative to the project top-level folder (root) and works in all contexts, e.g., here("data/file.csv"). Requires R package Here.
Separate folders for public sharing. For our reproducible example, we will only need the folders: data, scripts, and supplementary for public sharing.

Run the code below to create your folders.

# List folder names into a vector object, folders not used for tutorial are commented out
folders <- c(
  "01-data",
    "01-data/raw", # Unmodified, original raw data
    "01-data/processed", # Data after cleaning or transformations
    "01-data/metadata", # Codebooks, dictionaries etc.
#   "01-data/methods", # Ethics, protocols, licenses etc.
  "02-scripts", # Scripts for public sharing
  "03-supplementary", # Additional material for public sharing
  "04-notebooks", # Notebooks for running analyses and writing notes
  "05-outputs",
    "05-outputs/figures", # Main graphs and visualizations
    "05-outputs/tables", # Main result tables
#   "05-outputs/manuscripts", # Drafts and final versions of papers
#   "05-outputs/presentations", # Slides, posters, etc.
  "99-archive" # Archived materials, old versions 
)    
# Create folders (loop through the vector) 
for (folder in folders) {dir.create(folder, recursive = TRUE, showWarnings = TRUE)}
# Print message
message("Project folders created successfully.")

If you want to move your previous work into the new project, just copy the files to the respective folders. Create new subfolders as needed.

Note

Why was the reproducibility advocate bad at hide and seek?

Because they always left a trail to reproduce their steps!

2 STEP TWO: Preparing and analysing data and generating reproducible tables and figures

Using notebooks instead of plain text .R scripts offers many benefits. Notebooks integrate code, results, and notes in one document, making them more readable, enhancing collaboration and reproducibility. In this tutorial, we will use Quarto notebooks (.qmd) which is a next generation version of R Markdown and becoming the new gold standard. If you are used to using R scripts, the outputs in notebook can look a bit different. But you can still see the full outputs in the console windown after running your code.

2.1 Preparing data in a reproducible way

For reproducibility, it is vital to keep the raw data untouched and create code which includes all steps done for processing and preparing the data for analysis. All data cleaning done in R ensures the transparency of your research.

A good open science practice is to simulate data when planning analysis or preregistering your study (Peikert et al., 2021). Simulated data can also be used for sharing when sharing the original data is not possible.

flowchart LR
    step0("Simulate <br>and plan")
    step1("Collect")
    step2("Process")
    step3("Analyse and <br>generate outputs")
    step4a("Figures")
    step4b("Tables")
    step5("Manuscript or<br> presentation")
    step6("Text")
    
    step0 --> step1 --> step2 --> step3 --> step4a
    step3 --> step4b
    step4a --> step5
    step4b --> step5
    step6 --> step5

For this tutorial, we simulate a study on baby steps. In the first notebook we create raw data and a sample codebook.

Download the simulate-data notebook by running the code below (if not downloaded already)
Open the notebook and run all code.
If you are reading this online, you can read the notebook here

# Download simulate-data notebook (if not downloaded)
if (file.exists(here("04-notebooks/00-simulate-data.qmd"))) {
  print("Notebooks already exists.")
} else {
download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/04-notebooks/00-simulate-data.qmd", 
              "04-notebooks/00-simulate-data.qmd")
}
message("Notebook saved in your Notebooks -folder. Open and run the code.")

Prepare-data -notebook is an example and a template for creating reproducible steps for processing raw data.

Download the prepare-data notebook by running the code below (if not downloaded already)
Open the notebook and run all code.
If you are just reading this online, you can read the notebook here

# Download prepare-data notebook (if not downloaded)
if (file.exists(here("04-notebooks/01-prepare-data.qmd"))) {
  print("Notebook already exists.")
} else {
download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/04-notebooks/01-prepare-data.qmd", 
             "04-notebooks/01-prepare-data.qmd")
}
message("Notebook saved in your Notebooks -folder. Open and run the code.")

Tip

You can create a new notebook by selecting: File - New File - Quarto document (or R Notebook). For organizing your analyses in multiple notebooks, a good practice is to number them following your analytic plan / workflow.

2.2 Creating reproducible analyses, tables and figures

The next notebook shows how to create reproducible tables and figures in R. Also in APA style. This will avoid the need for copy-pasting values from R to a word processor, ensuring fewer errors and - better reproducibility! You can export the formatted table into Word, PowerPoint, HTML, or PDF.

Download the notebook by running the code below.
Open the notebook to follow and run the examples.
If you are reading this online, you can read the notebook here

# Download a notebook for creating tables (if not downloaded)
if (file.exists(here("04-notebooks/02-create-tables-and-figures.qmd"))) {
  print("Notebooks already exists.")
} else {
download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/04-notebooks/02-create-tables-and-figures.qmd", 
              here("04-notebooks/02-create-tables-and-figures.qmd"))
}
message("Descript -notebooks downloaded and saved in your notebooks -folder. Open and try!")

Note

Why did the scientist break up with reproducibility?
Because they wanted a relationship with fewer variables!

3 STEP THREE: Preparing data and code for sharing

flowchart LR
    step0("Prepare data<br> for sharing")
    step1("Prepare notebooks<br> for sharing")
    step2("Extract code from<br> notebooks with purl")
    step3("Test the code scripts")
    step1a("Share notebooks<br>with colleagues")
    step4("Share code and<br> data in public")
    
    step1 --> step2 --> step3 --> step4
    step1 --> step1a
    step0 --> step4

3.1 Preparing code for public sharing

Comment the code. Make sure the code you plan to share is properly commented. Good practice is to write the comments before you write the code. If you lack comments, you can use Chat GPT to assist and review your code (with caution, naturally). With the prompt below, you can keep adding your scripts one at a time. Replace your code with the commented version, and test the code. Example prompt:

Dearest AI, please revise and improve the comments in my R script to increase its clarity and convey its purpose, without changing the code itself. Identify any errors separately and alert me to potential reproducibility issues. I will submit sections of the code sequentially for your review.

Use README files. For complicated scripts and analyses, you can add a README text file with extensive documentation.
For a style guide for coding in R, see: style.tidyverse.org. It includes best practices for e.g., naming objects, and tools for reviewing your code.
Document R version and packages used. Documenting ‘dependencies’ ensures that future researchers can replicate the exact computational environment in which your analysis was conducted. With the report package, you can create this as a supplementary file with the code below.

# Create a supplementary file with R version and packages used
# load report package
if (!requireNamespace("report", quietly = TRUE)) {install.packages("report")}
require(report)

# Create dependencies report
dependencies <- report(sessionInfo())

# Export report as a supplementary text file 
writeLines(dependencies, here("03-supplementary/dependencies.txt"))

# print a summary of the report
message("Dependencies summary saved in the 03-supplementary -folder. A brief summary:")
summary(report(sessionInfo()))

The analysis was done using the R Statistical language (v4.2.0; R Core Team,
2022) on Ubuntu 22.04.4 LTS, using the packages report (v0.5.8), here (v1.0.1)
and knitr (v1.45).

Tip

If your notebook is getting messy, you can collapse/hide a code chunk from the small arrow on the top-left corner of the code chunk. You can also write “#| include:false” in the beginning of the chunk to suppress all output from that code block.

3.1.1 Extracting code from notebooks - the PURL function

As you may have noticed from the example notebooks, we used notebooks also for personal notes and exploratory analyses. How can one manage all that in the same file? With help from the ‘purl’ function!

The “purl” function from the knitr package will extract all executable code from the notebooks and save them in in an .R script file - for public sharing. The trick is that it only extracts the code chunks with comments, not the notes (text outside code chunks) or results. In addition, the code chunks marked with purl=FALSE (in the title, see example below), will not be included in the .R script file created. Thus, you can keep draft analyses and personal notes in the same notebook with the code for public sharing.

Run the code below to create .R script files from the notebooks.
Open an .R script file created to view the result.

# Extract simulate-data: use purl to extract r-code from notebook into script file
purl(
    here("04-notebooks/00-simulate-data.qmd"), # the source notebook 
    output=here("02-scripts/00-simulate-data.R"), # the output script file
    documentation = 1 # to include only the code chunks
    )

# Extract prepare-data
purl(
    here("04-notebooks/01-prepare-data.qmd"), 
    output=here("02-scripts/01-prepare-data.R"), 
    documentation = 1 
    )

# Extract create-tables-and-figures
purl(here("04-notebooks/02-create-tables-and-figures.qmd"), 
     output=here("02-scripts/02-create-tables-and-figures.R"),
     documentation = 1)

3.2 Running the analyses from R script files

After the R script files are created, it is better to test the scripts before sharing. However, all modifications should be made in notebooks, so you will not get lost with different versions. After adjustments, just run the purl function again to replace the script files.

With the code below, you can run all the analyses without opening the R script files. The full workflow in one code chunk!

Restart R Session in RStudio: Select Session > Restart R (good practice to ensure reproducibility).
Run the below code chunk. Maybe better to run one row at a time, to check for errors (ctrl-enter).

# Clear all objects from memory to ensure reproducibility
rm(list = ls())
# reload here as we emptied it from memory
require(here)

# 0 Simulate data
source(here("02-scripts/00-simulate-data.R"))

# 1 Prepare data
source(here("02-scripts/01-prepare-data.R"))

# 2 Run analyses to create tables and figures
# source(here("02-scripts/02-create-tables-and-figures.R"))

Note

Labeling each code chunk with “#| label: your-label” will split the script file into easily readable chunks.

1 STEP ONE: Creating project and folder structure

1.1 Create a RStudio project

1.2 Creating the project folder structure

2 STEP TWO: Preparing and analysing data and generating reproducible tables and figures

2.1 Preparing data in a reproducible way

2.2 Creating reproducible analyses, tables and figures

3 STEP THREE: Preparing data and code for sharing

3.1 Preparing code for public sharing

3.1.1 Extracting code from notebooks - the PURL function

3.2 Running the analyses from R script files

3.3 Sharing notebooks with co-authors

3.4 Preparing data for sharing

4 STEP FOUR: Sharing code and data in open repository