Baby Steps for a Reproducible Workflow in R

Author

Juuso Repo

Published

September 12, 2024

The idea of methods reproducibility is to provide sufficient documentation of procedures and data to enable repeating the same procedures, in a similar or different context (Goodman et al., 2016). A reproducible workflow, then, is a systematically organized sequence of steps to accomplish this. As the value placed on reproducibility continues to rise within and beyond academia, mastering these practices is increasingly beneficial. This tutorial aims to present the basics rather than delve into all the complexities of all advanced stuff. It covers steps from preparing data for analysis to creating reproducible tables and figures, to be included in a paper or presentation.

The basic steps for reproducible workflow covered here are:

flowchart LR
    step1("1. Set up the project <br> ")
    step2("2. Prepare and analyse data <br> and generate outputs")
    step3("3. Prepare code and data <br> for sharing")
    step4("4. Share code and data in public")
    
    step1 --> step2 --> step3 --> step4

A reproducible workflow enables you to:

The guide and provided code snippets serve as a template for initiating new projects or creating reproducible versions of existing ones. The best way to learn is to download the notebooks, experiment with the examples in RStudio, and adapt them to your project and preferences. To follow along, ensure that you have R and RStudio installed and up to date.

Tip

The tutorial is a work in progress. For any suggestions for improvements or requests for additional content, please drop a message to the author.

1 STEP ONE: Creating project and folder structure

We begin with creating a tidy nest for all files related to the study/project. Or at least for the files related to data and analysis! Creating a coherent folder structure that can be used in all your future research projects will facilitate collaboration and make your life easier - with more time for reproduction!

1.1 Create a RStudio project

  • In RStudio, start fresh and select File - New Project - New Directory - Quarto Project. It creates you a project folder with few files, and shows you that folder in the Files window.

  • Copy and run the below code in RStudio console window to download this notebook to your project folder. (or download the zip file and copy the files to your project folder).

    download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/babysteps.qmd", "babysteps.qmd")

  • Open the notebook in RStudio.

  • Run the below code chunk to install required packages. The Here package simplifies file path management by making all paths relative to the project root (top-level folder).

Code: Load packages
## Load packages: knitr and here
# Here  for file path management and Knitr for dynamic report generation
# Check if packages are installed, and install if not
if (!requireNamespace("knitr", quietly = TRUE)) {install.packages("knitr")}
if (!requireNamespace("here", quietly = TRUE)) {install.packages("here")}
# Load packages
require(knitr)
require(here)

1.2 Creating the project folder structure

The code below will create a coherent folder structure for your project. You can modify the folder names for your own project as needed. Alternatively, you may create folders without code. Note these good practices when working with folders and files:

  • Naming conventions for folders and files: only lowercase letters; no spaces but-dashes (or_underscores) instead. Prefix numbers with leading zeros can be used to ensure automatic sorting (01-data, 02-scripts…).

  • Relative file paths. Always use relative folder paths for reproducibility. Relative means data/file.csv instead of absolute one: C:/Users/Documents…./file.csv

  • Here -function. Wrap the file path with the here function to ensure the path is always relative to the project top-level folder (root) and works in all contexts, e.g., here("data/file.csv"). Requires R package Here.

  • Separate folders for public sharing. For our reproducible example, we will only need the folders: data, scripts, and supplementary for public sharing.

Run the code below to create your folders.

# List folder names into a vector object, folders not used for tutorial are commented out
folders <- c(
  "01-data",
    "01-data/raw", # Unmodified, original raw data
    "01-data/processed", # Data after cleaning or transformations
    "01-data/metadata", # Codebooks, dictionaries etc.
#   "01-data/methods", # Ethics, protocols, licenses etc.
  "02-scripts", # Scripts for public sharing
  "03-supplementary", # Additional material for public sharing
  "04-notebooks", # Notebooks for running analyses and writing notes
  "05-outputs",
    "05-outputs/figures", # Main graphs and visualizations
    "05-outputs/tables", # Main result tables
#   "05-outputs/manuscripts", # Drafts and final versions of papers
#   "05-outputs/presentations", # Slides, posters, etc.
  "99-archive" # Archived materials, old versions 
)    
# Create folders (loop through the vector) 
for (folder in folders) {dir.create(folder, recursive = TRUE, showWarnings = TRUE)}
# Print message
message("Project folders created successfully.")

If you want to move your previous work into the new project, just copy the files to the respective folders. Create new subfolders as needed.

Note

Why was the reproducibility advocate bad at hide and seek?

Because they always left a trail to reproduce their steps!

2 STEP TWO: Preparing and analysing data and generating reproducible tables and figures

Using notebooks instead of plain text .R scripts offers many benefits. Notebooks integrate code, results, and notes in one document, making them more readable, enhancing collaboration and reproducibility. In this tutorial, we will use Quarto notebooks (.qmd) which is a next generation version of R Markdown and becoming the new gold standard. If you are used to using R scripts, the outputs in notebook can look a bit different. But you can still see the full outputs in the console windown after running your code.

2.1 Preparing data in a reproducible way

For reproducibility, it is vital to keep the raw data untouched and create code which includes all steps done for processing and preparing the data for analysis. All data cleaning done in R ensures the transparency of your research.

A good open science practice is to simulate data when planning analysis or preregistering your study (Peikert et al., 2021). Simulated data can also be used for sharing when sharing the original data is not possible.

flowchart LR
    step0("Simulate <br>and plan")
    step1("Collect")
    step2("Process")
    step3("Analyse and <br>generate outputs")
    step4a("Figures")
    step4b("Tables")
    step5("Manuscript or<br> presentation")
    step6("Text")
    
    step0 --> step1 --> step2 --> step3 --> step4a
    step3 --> step4b
    step4a --> step5
    step4b --> step5
    step6 --> step5

For this tutorial, we simulate a study on baby steps. In the first notebook we create raw data and a sample codebook.

  • Download the simulate-data notebook by running the code below (if not downloaded already)
  • Open the notebook and run all code.
  • If you are reading this online, you can read the notebook here
# Download simulate-data notebook (if not downloaded)
if (file.exists(here("04-notebooks/00-simulate-data.qmd"))) {
  print("Notebooks already exists.")
} else {
download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/04-notebooks/00-simulate-data.qmd", 
              "04-notebooks/00-simulate-data.qmd")
}
message("Notebook saved in your Notebooks -folder. Open and run the code.")

Prepare-data -notebook is an example and a template for creating reproducible steps for processing raw data.

  • Download the prepare-data notebook by running the code below (if not downloaded already)
  • Open the notebook and run all code.
  • If you are just reading this online, you can read the notebook here
# Download prepare-data notebook (if not downloaded)
if (file.exists(here("04-notebooks/01-prepare-data.qmd"))) {
  print("Notebook already exists.")
} else {
download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/04-notebooks/01-prepare-data.qmd", 
             "04-notebooks/01-prepare-data.qmd")
}
message("Notebook saved in your Notebooks -folder. Open and run the code.")
Tip

You can create a new notebook by selecting: File - New File - Quarto document (or R Notebook). For organizing your analyses in multiple notebooks, a good practice is to number them following your analytic plan / workflow.

2.2 Creating reproducible analyses, tables and figures

The next notebook shows how to create reproducible tables and figures in R. Also in APA style. This will avoid the need for copy-pasting values from R to a word processor, ensuring fewer errors and - better reproducibility! You can export the formatted table into Word, PowerPoint, HTML, or PDF.

  • Download the notebook by running the code below.
  • Open the notebook to follow and run the examples.
  • If you are reading this online, you can read the notebook here
# Download a notebook for creating tables (if not downloaded)
if (file.exists(here("04-notebooks/02-create-tables-and-figures.qmd"))) {
  print("Notebooks already exists.")
} else {
download.file("https://raw.githubusercontent.com/juusorepo/ReproRepo/master/04-notebooks/02-create-tables-and-figures.qmd", 
              here("04-notebooks/02-create-tables-and-figures.qmd"))
}
message("Descript -notebooks downloaded and saved in your notebooks -folder. Open and try!")
Note

Why did the scientist break up with reproducibility?
Because they wanted a relationship with fewer variables!

3 STEP THREE: Preparing data and code for sharing

flowchart LR
    step0("Prepare data<br> for sharing")
    step1("Prepare notebooks<br> for sharing")
    step2("Extract code from<br> notebooks with purl")
    step3("Test the code scripts")
    step1a("Share notebooks<br>with colleagues")
    step4("Share code and<br> data in public")
    
    step1 --> step2 --> step3 --> step4
    step1 --> step1a
    step0 --> step4

3.1 Preparing code for public sharing

  • Comment the code. Make sure the code you plan to share is properly commented. Good practice is to write the comments before you write the code. If you lack comments, you can use Chat GPT to assist and review your code (with caution, naturally). With the prompt below, you can keep adding your scripts one at a time. Replace your code with the commented version, and test the code. Example prompt:

Dearest AI, please revise and improve the comments in my R script to increase its clarity and convey its purpose, without changing the code itself. Identify any errors separately and alert me to potential reproducibility issues. I will submit sections of the code sequentially for your review.

  • Use README files. For complicated scripts and analyses, you can add a README text file with extensive documentation.
  • For a style guide for coding in R, see: style.tidyverse.org. It includes best practices for e.g., naming objects, and tools for reviewing your code.
  • Document R version and packages used. Documenting ‘dependencies’ ensures that future researchers can replicate the exact computational environment in which your analysis was conducted. With the report package, you can create this as a supplementary file with the code below.
# Create a supplementary file with R version and packages used
# load report package
if (!requireNamespace("report", quietly = TRUE)) {install.packages("report")}
require(report)

# Create dependencies report
dependencies <- report(sessionInfo())

# Export report as a supplementary text file 
writeLines(dependencies, here("03-supplementary/dependencies.txt"))

# print a summary of the report
message("Dependencies summary saved in the 03-supplementary -folder. A brief summary:")
summary(report(sessionInfo()))
The analysis was done using the R Statistical language (v4.2.0; R Core Team,
2022) on Ubuntu 22.04.4 LTS, using the packages report (v0.5.8), here (v1.0.1)
and knitr (v1.45).
Tip

If your notebook is getting messy, you can collapse/hide a code chunk from the small arrow on the top-left corner of the code chunk. You can also write “#| include:false” in the beginning of the chunk to suppress all output from that code block.

3.1.1 Extracting code from notebooks - the PURL function

As you may have noticed from the example notebooks, we used notebooks also for personal notes and exploratory analyses. How can one manage all that in the same file? With help from the ‘purl’ function!

The “purl” function from the knitr package will extract all executable code from the notebooks and save them in in an .R script file - for public sharing. The trick is that it only extracts the code chunks with comments, not the notes (text outside code chunks) or results. In addition, the code chunks marked with purl=FALSE (in the title, see example below), will not be included in the .R script file created. Thus, you can keep draft analyses and personal notes in the same notebook with the code for public sharing.

  • Run the code below to create .R script files from the notebooks.

  • Open an .R script file created to view the result.

# Extract simulate-data: use purl to extract r-code from notebook into script file
purl(
    here("04-notebooks/00-simulate-data.qmd"), # the source notebook 
    output=here("02-scripts/00-simulate-data.R"), # the output script file
    documentation = 1 # to include only the code chunks
    )

# Extract prepare-data
purl(
    here("04-notebooks/01-prepare-data.qmd"), 
    output=here("02-scripts/01-prepare-data.R"), 
    documentation = 1 
    )

# Extract create-tables-and-figures
purl(here("04-notebooks/02-create-tables-and-figures.qmd"), 
     output=here("02-scripts/02-create-tables-and-figures.R"),
     documentation = 1)

3.2 Running the analyses from R script files

After the R script files are created, it is better to test the scripts before sharing. However, all modifications should be made in notebooks, so you will not get lost with different versions. After adjustments, just run the purl function again to replace the script files.

With the code below, you can run all the analyses without opening the R script files. The full workflow in one code chunk!

  • Restart R Session in RStudio: Select Session > Restart R (good practice to ensure reproducibility).

  • Run the below code chunk. Maybe better to run one row at a time, to check for errors (ctrl-enter).

# Clear all objects from memory to ensure reproducibility
rm(list = ls())
# reload here as we emptied it from memory
require(here)

# 0 Simulate data
source(here("02-scripts/00-simulate-data.R"))

# 1 Prepare data
source(here("02-scripts/01-prepare-data.R"))

# 2 Run analyses to create tables and figures
# source(here("02-scripts/02-create-tables-and-figures.R"))
Note

Labeling each code chunk with “#| label: your-label” will split the script file into easily readable chunks.

3.3 Sharing notebooks with co-authors

Best way to share some analyses with your colleagues? Notebooks are great for that. They can be read in any browser and can include code and your notes and questions for your colleagues. Although maybe intimidating to show others your code, it’s an important step in making your research reproducible. With this example, we render/knit one notebook into html format for sharing,

  • Open one of the example notebooks, for example 01-prepare-data.qmd

  • Select “Render” from the toolbar on top of your notebook. This process converts the notebook into an HTML file. To alter the output format, add “format: html” at the notebook header, replacing html with pdf, docx, or another desired output format.

3.4 Preparing data for sharing

For full reproducibility, share your raw data when possible. Check these steps before sharing:

  • Consent. Ensure you have consent from the study participants and your institution to share the data.

  • Format. Make sure your data is in open file format like .csv.

  • Documentation. Include necessary documentation in the metadata folder (e.g., codebooks, readme, licenses, ethics). Use open file formats. See example codebook in the simulate-data notebook.

  • De-identify. If needed, prepare a de-identified version of your raw data to manage the risk of identifying individuals in the dataset.

Note

What’s a reproducible researcher’s favorite movie?
“Groundhog Day” – they love seeing the same results every time!

4 STEP FOUR: Sharing code and data in open repository

The final step is to select an open repository for public sharing. Open Science Framework (OSF) is one option, which we will be using here. It is a free, open-source service that connects the entire research lifecycle. Some alternatives are: Zenodo, Figshare, and Github.

  1. Create a user and/or login at: https://osf.io/
  2. Create a new project in OSF (you can keep the project private when testing)
  3. Create necessary folders with corresponding names with your project folders. With our example, we would create folder data/raw, data/metadata, scripts, and supplementary.
  4. Upload the files to the folders created
  5. Create a DOI (permanent link) for your project in OSF and start sharing!

*Congratulations on taking the initial steps! Feel free to adapt these ideas and codes to fit your personal style and needs. With R, numerous paths lead to the same destination.

To expand the basic steps, e.g., producing a full manuscript in RStudio, managing version control with tools like Git, and ensuring computational reproducibility through containerization technologies such as Docker, see resources. *