Prepare data

Published

September 12, 2024

1 Prepare data with EDA

This notebook provides an example workflow and template for preparing raw data for analysis. The focus is on cleaning, coding, and transforming the data, steps that are crucial for ensuring the reliability and validity of the analysis. All processing steps should be documented, including the determination of the sample size, any data exclusions, manipulations, and the creation of variables for the study.

EDA (Exploratory Data Analysis) is integrated into the process to understand the distribution of the data and identify any issues that need addressing. The processed data is then saved in CSV format. This approach keeps the raw data untouched and makes all processing steps transparent and reproducible.

flowchart LR
    step0("Load<br> raw data") -->
    step1("Modify<br> Data Types") -->
    step2("Initial<br> EDA") -->
    step3("Data<br> Cleaning") -->
    step4("Data<br> Transformations") -->
    step5("Post-Cleaning<br> EDA") -->
    step6("Save <br>the data")

Why prepare data in a notebook instead of a script file?

  • Working in a notebook, rather than a script file, lets you include exploratory data analysis and personal notes alongside the processing steps, which helps ensure the data are processed correctly. With the “purl” function, explained in the Babysteps tutorial, only the code chunks are extracted and saved in a script file for public sharing; the full notebook is intended for personal and collegial use. Feel free to adapt the notebook to your own style and purposes.
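A minimal sketch of that extraction step (the filename prepare-data.Rmd is hypothetical; substitute your own notebook file):

# Extract only the code chunks into a plain R script for sharing
knitr::purl("prepare-data.Rmd", output = "prepare-data.R")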

1.1 Load raw data and packages

Overview: Start by loading required R packages and the dataset into your analysis environment. For the tutorial, we load the raw data created in the Simulate-data notebook.

# Load required packages
require(here)
require(tidyverse)

# Load / download data
if (file.exists(here("01-data/raw/babysteps-rawdata.csv"))) {
  dataset <- read.csv(here("01-data/raw/babysteps-rawdata.csv"))
} else {
  dataset <-
    read.csv(
      "https://raw.githubusercontent.com/juusorepo/ReproRepo/master/01-data/raw/babysteps-rawdata.csv"
    )
}

1.2 Check and Modify Data Types

Before diving into EDA, it’s crucial to ensure that each variable in your dataset is stored in the most appropriate data type. Correct data types can improve computational efficiency and are essential for appropriate analysis techniques.

  • Action Steps: Start by listing the current data types of variables in your dataset. This helps identify any variables that may be incorrectly typed, such as numeric variables recognized as character data due to formatting issues.
# Display the current data types of all variables in the dataset
str(dataset)
'data.frame':   300 obs. of  7 variables:
 $ BabyID     : int  1 1 1 2 2 2 3 3 3 4 ...
 $ StepType   : chr  "Walking" "Walking" "Walking" "Walking" ...
 $ AgeMonths  : int  18 18 18 17 17 17 21 21 21 16 ...
 $ Wave       : chr  "T1" "T2" "T3" "T1" ...
 $ SleepHours : int  9 9 9 9 8 7 14 13 15 14 ...
 $ PuzzleTime : int  49 47 48 43 48 35 54 41 55 80 ...
 $ GiggleCount: int  4 4 4 4 4 4 4 4 4 4 ...
  • Convert Data Types: Based on the initial inspection, we convert variables to their correct data types. Common conversions include transforming character variables that represent categories into factors and ensuring numeric variables are not mistakenly treated as character or factor types.
# Converting character variables to factors
dataset <- dataset %>% mutate(across(where(is.character), as.factor))

1.3 Initial Exploratory Data Analysis

Before further processing, you may want to conduct an exploratory data analysis (EDA) to gain insight into the dataset’s distribution and characteristics. There are various ways to do this, including packages for automated EDA (e.g., DataExplorer, GGally, SmartEDA, Hmisc). For this tutorial, we use DataExplorer, which works well for smaller datasets.

  • Action Steps: Run the code below to create an EDA report. The report is saved in the Notebooks folder.
# Load/Install DataExplorer package
if (!requireNamespace("DataExplorer", quietly = TRUE)) {install.packages("DataExplorer")}
require(DataExplorer)

# Create the report
dataset %>%
    create_report(
        output_file = paste("EDA-report-initial", Sys.Date(), sep="-"), # filename
        report_title = "Initial EDA Report - Babysteps Dataset",
        y = 'PuzzleTime' # to set the outcome variable
    )
message("The EDA report was created and saved in your notebooks folder")

1.4 Data Cleaning

This step involves identifying and correcting issues in your dataset, such as missing values, errors, and outliers, as well as standardizing variable names, to ensure data quality.

1.4.1 Identifying Missing Values

  • Overview: Begin by identifying missing values in your dataset. Missing data can occur for various reasons, from non-response in surveys to errors in data entry.
  • Action Steps: Use descriptive statistics to identify missing patterns and decide on appropriate methods for handling them, such as imputation or exclusion, based on the nature of your data.
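For a quick overview of missingness, a minimal sketch with dplyr (no particular handling method is assumed here; adapt to your own dataset):

# Count missing values per variable
dataset %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Inspect rows that contain any missing value
dataset %>%
  filter(if_any(everything(), is.na))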

1.4.2 Correcting Errors

  • Overview: Data entry errors, inconsistencies in response formats, and other inaccuracies can significantly affect your analysis.
  • Action Steps: Validate data ranges (e.g., ages within plausible limits) and consistency (e.g., gender coded uniformly). Correct identified errors where possible, or note them for exclusion or special consideration.
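As an illustration, a sketch of simple range and consistency checks (the 12-24 month range is an assumption for this toy dataset, not a rule from the tutorial):

# Flag implausible ages (12-24 months assumed plausible for this example)
dataset %>%
  filter(AgeMonths < 12 | AgeMonths > 24)

# Check that categorical codes are consistent (e.g., no stray StepType or Wave levels)
dataset %>%
  count(StepType, Wave)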

1.4.3 Handling Outliers

  • Overview: Outliers can influence statistical analyses and may represent either genuine phenomena or data errors.
  • Action Steps: Identify outliers through visual (e.g., boxplots) and statistical methods. Investigate their origins and decide whether to keep, adjust, or remove them, documenting your rationale.
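A sketch of both approaches, using ggplot2 (loaded with tidyverse) for the visual check and a simple z-score rule (the 3-SD cutoff is an illustrative choice, not a recommendation):

# Visual check: boxplot of PuzzleTime by StepType
ggplot(dataset, aes(x = StepType, y = PuzzleTime)) +
  geom_boxplot()

# Statistical check: flag observations more than 3 SDs from the mean
dataset %>%
  mutate(z_puzzle = as.numeric(scale(PuzzleTime))) %>%
  filter(abs(z_puzzle) > 3)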

1.4.4 Standardize Variable Names

Ensure all variable names are in lowercase to maintain consistency across your dataset. This can help avoid case-sensitive errors in your analysis scripts.

# Lowercase all variable names (good practice)
dataset <- dataset %>% rename_all(tolower)

Tip: Check out the clean_names() function from the janitor package to clean up variable names.
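A minimal sketch of that alternative (janitor is not otherwise used in this tutorial):

# Install janitor if needed and standardize names to snake_case
if (!requireNamespace("janitor", quietly = TRUE)) {install.packages("janitor")}
dataset <- janitor::clean_names(dataset)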

1.5 Data Recoding and Transformation

This step encompasses the processes of adjusting your variables to better fit your analysis needs and preparing your data through various transformations. It ensures that your dataset is in the optimal form for analysis, addressing both the structure of your data and the scales of measurement.

1.5.1 Recoding Variables

  • Overview: Recoding involves adjusting existing variables to better fit your analysis needs, such as combining categories of a nominal variable or changing measurement scales.
  • Action Steps: Clearly define your recoding rules and apply them uniformly across your dataset. Document changes to ensure transparency and reproducibility.
# Recode wave into an integer for regression analyses
dataset$wave <- as.integer(gsub("T", "", dataset$wave))

# Create a new categorical variable agegroup
dataset <- dataset %>%
  mutate(agegroup = case_when(
    agemonths <= 14 ~ "12-14 months",
    agemonths <= 20 ~ "15-20 months",
    TRUE ~ "21-24 months"
  ))

1.5.2 Normalizing Scales

  • Overview: Variable scales may need normalization, especially when combining data from different sources or preparing for certain statistical analyses.
  • Action Steps: Apply normalization techniques, such as z-score standardization or min-max scaling, to adjust scales. Choose a method appropriate for your data distribution and analysis requirements.
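A sketch of both techniques (the sleephours_z and puzzletime_01 columns are illustrative and not used later in the tutorial):

# Z-score standardization (mean 0, SD 1)
dataset <- dataset %>%
  mutate(sleephours_z = as.numeric(scale(sleephours)))

# Min-max scaling to the 0-1 range
dataset <- dataset %>%
  mutate(puzzletime_01 = (puzzletime - min(puzzletime)) / (max(puzzletime) - min(puzzletime)))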

1.5.3 Creating Dummy Variables

  • Overview: Dummy variables are used to represent categorical data in binary form, which is necessary for many types of statistical modeling.
  • Action Steps: Convert categorical variables into dummy variables as needed.
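A sketch using base R's model.matrix() to expand the steptype factor into 0/1 indicator columns (the resulting column names depend on the factor levels in your data):

# Expand the steptype factor into indicator (dummy) columns and bind them back
steptype_dummies <- as.data.frame(model.matrix(~ steptype - 1, data = dataset))
dataset <- bind_cols(dataset, steptype_dummies)

Note that many R modeling functions (e.g., lm, glm) create dummy variables from factors automatically, so an explicit step like this is only needed when your analysis method requires them.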

1.6 Post-Cleaning EDA

After cleaning and transforming your data, perform another round of EDA to verify the effects of the preparation steps and ensure the dataset is ready for analysis. Look for any remaining issues in data quality and structure post-cleaning.

# Create the EDA report, define the outcome variable
dataset %>%
    create_report(
        output_file = paste("EDA-report-processed", Sys.Date(), sep="-"), # filename
        report_title = "EDA Report with processed data - Babysteps Dataset",
        y = 'puzzletime'
    )
message("The EDA report was created and saved in your notebooks folder")

1.7 Save the processed data

# Save in CSV format into the processed data subfolder
write.csv(dataset, here("01-data/processed/babysteps.csv"), row.names = FALSE)
message("The processed dataset was saved in processed data folder")
The processed dataset was saved in processed data folder

2 Conclusion

This workflow guides you from loading your data to ensuring it’s analytically ready, highlighting the iterative nature of data preparation. By incorporating initial and post-cleaning EDA, you continuously validate your steps and decisions, enhancing the quality of your research.

To follow the full tutorial, the next step is Creating reproducible outputs.