Simulating data (with codebook)

Published

September 12, 2024

As part of the Babysteps for Reproducibility tutorial, this notebook generates simulated raw data and saves it in CSV format. It also creates an example codebook, a metadata document that outlines the dataset’s contents, structure, and generation process. Simulating data before working with empirical data is good practice: it lets you test analytical methods, explore potential outcomes, and check that the analysis workflow runs without errors. This preparatory step helps in planning for the needs and challenges of working with empirical data.

Action steps:

Run the "Load packages" chunk, then the "Create codebook" chunk to create a sample codebook.
Run the "Simulate data" chunk to simulate and save the raw data.
Proceed to the next step: Preparing data for analysis.

Understanding all the details of the code below is not necessary for following the tutorial.

View code: Load packages

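# Install tidyverse and here if they are missing, then load them
for (pkg in c("tidyverse", "here")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(tidyverse)  # data manipulation (provides tibble())
library(here)       # project-relative file paths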

1 Create an example codebook

View code: Create codebook

# Define and create a codebook
codebook <- "
# Codebook for Longitudinal Baby Steps Dataset

## Dataset Overview

**Title:** Longitudinal Study on Baby Steps and Developmental Progress

**Description:** The dataset follows 100 baby participants across three observation points to monitor how their mobility type (Crawling, Toddling, Walking) affects their ability to solve puzzles and their engagement with the task, as indicated by their giggle counts. The data has three observation points for each baby. The data is simulated for planning a (fictional) study.

**Author:** Jane Doe

**Creation Date:** March 11, 2024

**Last Updated:** March 12, 2024

**Version:** 1.0

## Contact Information

**Author Contact:** Jane Doe (email)

**Institution:** University of Data Science

## License

This dataset is made available under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). Users are free to share and adapt the material for any purpose, even commercially, under the following terms: appropriate credit must be given, a link to the license provided, and indication if changes were made.

## Citation

Please cite this dataset as:

Doe, J. (2024). Longitudinal Study on Baby Steps and Developmental Progress. University of Data Science. Version 1.0. [DOI or URL]

## Variables

-   BabyID
    -   Description: Unique identifier for each baby participant.
    -   Type: Integer
    -   Example Values: 1, 2, 3, ..., 100
-   StepType
    -   Description: The type of mobility the baby is primarily using at the time of observation.
    -   Type: Categorical
    -   Possible Values: Crawling, Toddling, Walking
-   AgeMonths
    -   Description: Age of the baby at the time of observation, in months.
    -   Type: Integer
    -   Range: 12-24 months
-   ObservationPoint
    -   Description: The stage of observation in the longitudinal study.
    -   Type: Categorical
    -   Possible Values: Start (12-14 months), Midway (15-20 months), End (21-24 months)
-   PuzzleTime
    -   Description: Time it takes for the baby to solve a simple puzzle, measured in seconds.
    -   Type: Numeric
    -   Example Values: Values can range based on puzzle difficulty and baby's skill level.
-   GiggleCount
    -   Description: Number of times the baby giggles while solving the puzzle. Used as a proxy for enjoyment or engagement.
    -   Type: Integer
    -   Example Values: Non-negative values, varying by individual and task.

### Collection Method

Data were collected through direct observation of baby participants in a controlled environment, ensuring that puzzle difficulty was consistent across observations.

## Use Cases

This dataset is intended for research on early childhood development, specifically examining the relationship between motor skills, problem-solving abilities, and engagement in activities.

## Ethics Approval

This study received ethics approval from the Institutional Review Board at the University of Data Science, approval number #12345.

## Acknowledgements

We thank the participants and their families for their time and contribution to this study. This research was supported by the Early Development Research Grant #67890.
"
# Save the codebook content to a file
writeLines(codebook, here("01-data/metadata/codebook-babysteps.md"))
message("Codebook created and saved in folder: 01-data/metadata.")

Codebook created and saved in folder: 01-data/metadata.
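
Because the codebook is plain Markdown, its variable list can be compared against the column names of a dataset to spot undocumented or missing variables. The sketch below shows one possible way to do this in base R. The helper `codebook_variables()` and the toy inputs are hypothetical and not part of the tutorial code; the sketch assumes that each variable name appears as a top-level list item (`-   Name`), as in the codebook above.

```r
# Hypothetical helper: extract top-level list items ("-   Name") from codebook text
codebook_variables <- function(codebook_text) {
  lines <- strsplit(codebook_text, "\n", fixed = TRUE)[[1]]
  # Top-level items start in the first column with "-" followed by a single word
  hits <- grepl("^-\\s+[A-Za-z_][A-Za-z0-9_]*\\s*$", lines)
  sub("^-\\s+", "", trimws(lines[hits], which = "right"))
}

# Toy inputs for illustration only (assumed, not the tutorial's data)
toy_codebook <- paste(
  c("## Variables",
    "",
    "-   BabyID",
    "    -   Type: Integer",
    "-   StepType",
    "    -   Type: Categorical",
    "-   GiggleCount",
    "    -   Type: Integer"),
  collapse = "\n"
)
toy_columns <- c("BabyID", "StepType", "SleepHours")

documented <- codebook_variables(toy_codebook)

# Columns present in the data but not described in the codebook
undocumented_columns <- setdiff(toy_columns, documented)
# Variables described in the codebook but absent from the data
missing_columns <- setdiff(documented, toy_columns)
```

In a real project, `toy_codebook` could be replaced by the codebook text and `toy_columns` by `names()` of the dataset; any non-empty result suggests that the codebook and the data have drifted apart.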

2 Simulate data

View code: Simulate data

# Simulate data for the babysteps analysis example
# Set seed for reproducibility
set.seed(123) 

# Set parameters
n_babies <- 100
observations_per_baby <- 3  # Each baby is observed in three waves

# Create baby IDs and waves
baby_ids <- rep(1:n_babies, each = observations_per_baby)
wave <- rep(c("T1", "T2", "T3"), times = n_babies)

# Assign step types randomly but consistently for each baby across observations (for simplicity)
step_type <- rep(sample(c("Crawling", "Toddling", "Walking"), n_babies, replace = TRUE), each = observations_per_baby)

# Simulate age in months: start ages of 12-22 months plus 0, 1, 2 months across the three waves
age_months_start <- sample(12:22, n_babies, replace = TRUE)
# Repeat the 0:2 offsets once per baby so that they line up with T1, T2, T3 within each baby
age_months <- rep(age_months_start, each = observations_per_baby) + rep(0:2, times = n_babies)
# Cap age_months at 24 months as a safeguard
age_months <- pmin(age_months, 24)

# Simulate variable sleep_hours across waves
sleep_hours_adjustment <- sample(-1:1, n_babies * observations_per_baby, replace = TRUE)
sleep_hours <- rep(sample(8:16, n_babies, replace = TRUE), each = observations_per_baby) + sleep_hours_adjustment

# Simulate puzzle_time with variability
noise <- rnorm(n = n_babies * observations_per_baby, mean = 0, sd = 5)
puzzle_time <- round(120 - (age_months - 12) * 3 - ifelse(step_type == "Walking", 20, ifelse(step_type == "Toddling", 15, 10)) - (18 - sleep_hours) * 4 + noise)

# Simulate giggle_count based on the simulated characteristics
giggle_count <- round(
  ifelse(step_type == "Walking", 5, ifelse(step_type == "Toddling", 4, 3)) -
  (age_months - min(age_months)) / (max(age_months) - min(age_months)) * 3 +
  (sleep_hours - min(sleep_hours)) / (max(sleep_hours) - min(sleep_hours)) * 2
)

# Clamp giggle_count to the range 3-10
giggle_count <- pmin(pmax(giggle_count, 3), 10)

# Create the data frame
data <- tibble(
  BabyID = baby_ids,
  StepType = step_type,
  AgeMonths = age_months,
  Wave = wave,
  SleepHours = sleep_hours,
  PuzzleTime = puzzle_time,
  GiggleCount = giggle_count
)

# Save the raw data into a csv file
write.csv(data, here("01-data/raw/babysteps-rawdata.csv"), row.names = FALSE)
message("Raw data was created and saved in folder: 01-data/raw.")
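
After simulating longitudinal data, a few quick structural checks can confirm that the result behaves as intended. The following is a minimal sketch in base R, using a small hypothetical data frame `toy` with the same column names as the simulated data; the values and the object names are assumptions for illustration only.

```r
# Toy data with the same structure as the simulated data (hypothetical values)
toy <- data.frame(
  BabyID = rep(1:2, each = 3),
  Wave = rep(c("T1", "T2", "T3"), times = 2),
  AgeMonths = c(12, 13, 14, 20, 21, 22)
)

# Each baby should have exactly three rows, one per wave
rows_per_baby <- table(toy$BabyID)
all_waves_present <- all(rows_per_baby == 3)

# Age should not decrease from one wave to the next within a baby
age_steps <- lapply(split(toy$AgeMonths, toy$BabyID), diff)
age_never_decreases <- all(unlist(age_steps) >= 0)

# Ages should stay within the 12-24 month range stated in the codebook
age_in_range <- all(toy$AgeMonths >= 12 & toy$AgeMonths <= 24)
```

The same checks could be run on the simulated data by substituting it for `toy`; wrapping them in `stopifnot()` would stop the script as soon as one of them fails.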

To follow the full tutorial, the next step is Preparing data for analysis.