Computational Reproducibility in Statistics

Hanne Oberman

PhD candidate at Utrecht University

Why bother?

We would like our results to be as fully reproducible as possible:

A. Reproducibility is one of the pillars of science

  • It is the standard by which to judge scientific claims
  • It supports the cumulative growth of knowledge and avoids duplication of effort

B. Reproducibility may greatly benefit you

  • You’ll develop better work habits
  • Better teamwork - especially with new team members
  • Changing or amending your work is much easier
  • Higher research impact - more likely to be picked up and cited

Crisis?

Definition

A result is reproducible when the same analysis steps performed on the same dataset consistently produce the same answer.

Definition

Research results are replicable if there is sufficient information available for independent researchers to make the same findings using the same procedures.

True or false?

In computational sciences - such as statistics - simply having the data and code means that the results are not only replicable, but fully reproducible.

Reproducibility of R scripts

Reproducible research is not the norm:

74% of R files failed to complete without error
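
One way to find out whether your own scripts would complete is to run each of them in a fresh R session and record the exit status. The sketch below is only an illustration and is not the procedure behind the 74% figure; the helper name `check_scripts()` and the default folder name are hypothetical, and only base R is used.

```r
# Hypothetical helper: run every R script in a folder in its own
# fresh R session and report which ones complete without error.
check_scripts <- function(dir = "analysis") {
  scripts <- list.files(dir, pattern = "\\.[Rr]$", full.names = TRUE)
  rscript <- file.path(R.home("bin"), "Rscript")

  # An exit status of 0 means the script ran to completion.
  # Scripts are started from the current working directory.
  status <- vapply(scripts, function(path) {
    as.integer(system2(rscript, c("--vanilla", shQuote(path)),
                       stdout = FALSE, stderr = FALSE))
  }, integer(1))

  data.frame(script = basename(scripts), completed = unname(status == 0L))
}
```

Calling `check_scripts("analysis")` from the project root returns one row per script. Running in a separate session matters because objects or packages that happen to be loaded in your interactive session cannot hide missing definitions.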

Reproducibility spectrum

Making research reproducible

  • reproducible documents
  • research compendiums

Reproducible but not replicable

# run_simulation() stands for your own simulation function
set.seed(1)
run_simulation()
set.seed(2)
run_simulation()
set.seed(3)
run_simulation()
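
To make the role of the seed concrete, here is a minimal sketch. `run_simulation()` is not defined on the slide, so the version below is a hypothetical stand-in (the mean of 100 standard-normal draws). Re-running with the same seed gives back exactly the same number, while a different seed gives a different one.

```r
# Hypothetical stand-in for the undefined run_simulation():
# the mean of n draws from a standard normal distribution.
run_simulation <- function(n = 100) {
  mean(rnorm(n))
}

set.seed(1)
first <- run_simulation()
set.seed(1)
second <- run_simulation()
identical(first, second)  # same seed: the result is reproduced exactly

set.seed(2)
third <- run_simulation()
identical(first, third)   # different seed: the number changes
```

If the conclusions of a study change with the seed, the result can be reproduced from the code but may not hold up under replication.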

Teaching reproducibility

Markup Languages and Reproducible Programming in Statistics

Course aims

Course aims include the development of a publication-ready reproducible research compendium that contains:

  • a manuscript typeset with a markup language,
  • data and code,
  • everything that allows for successful reproduction and reuse of the materials (e.g. a license).

In our course, students are taught various tools and languages, such as Quarto markdown, version control with git, and reproducible environments for R with renv.

Full course aims

  1. Students develop fundamental knowledge and understanding of the state of the art in statistical markup languages and reproducible programming and development

  2. They can determine the most effective markup strategies to address a typesetting problem

  3. They can efficiently organize a reproducible programming and development process

  4. They can produce repositories up to the standards of international programming and coding conventions and initiatives

  5. They can produce publications up to the typesetting standards of international peer-reviewed journals

Markup languages

Version control

Licensing

Sharing R code

Research compendiums

Course weeks

  1. Markup languages
  2. Quarto markdown
  3. Version control (with git and GitHub)
  4. Reproducible research in statistics
  5. Developer portfolios
  6. Re-usable R code (with R packages and Shiny)

Reusable course elements

Missing element(s)?

Poster

Takeaways

  • reproducibility is important
  • we should all learn reproducible workflows
  • we should teach reproducible workflows

Thank you!

Research Compendiums

Research compendium

Definition

A research compendium is a collection of all digital parts of a research project including data, code, texts…

The collection is created in such a way that reproducing all results is straightforward.

The compendium serves as a means for distributing, managing, and updating the collection.

Basic compendium

A basic research compendium is just a folder…

compendium/
├── data
│   └── my_data.csv
├── analysis
│   └── my_script.R
├── requirements.txt
└── README.md
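
The layout above can be scaffolded from R. This is a minimal sketch using base R only; the helper name `create_compendium()` is hypothetical, and the files it creates are empty placeholders for you to fill in.

```r
# Hypothetical helper that scaffolds the basic layout shown above.
create_compendium <- function(root = "compendium") {
  # Sub-folders for data and analysis scripts
  dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
  dir.create(file.path(root, "analysis"), recursive = TRUE, showWarnings = FALSE)

  # Empty placeholder files; files that already exist are left untouched
  files <- file.path(root, c("data/my_data.csv", "analysis/my_script.R",
                             "requirements.txt", "README.md"))
  file.create(files[!file.exists(files)])

  invisible(files)
}
```

Calling `create_compendium()` in an empty working directory creates the `compendium/` folder; calling it again does not overwrite anything.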

(Not so) basic compendium

… but it can become extensive…

.
├── paper/
│   ├── paper.qmd
│   └── references.bib
│
├── figures/
│
├── data/
│   ├── raw_data/
│   └── clean_data/
│
└── templates/
    └── journal_template.csl

(Not so) basic compendium

…or even executable!

.
├── _targets.R
├── R/
│   ├── functions_data.R
│   ├── functions_analysis.R
│   └── functions_visualization.R
└── data/
    └── input_data.csv
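
For illustration, a `_targets.R` file for this layout might look like the sketch below. It assumes the targets package is installed; the functions `read_data()`, `fit_model()`, and `plot_results()` are hypothetical and would have to be defined in the `R/functions_*.R` files.

```r
# _targets.R: pipeline definition, read when you call targets::tar_make()
library(targets)

# Load the functions defined in the R/ folder
tar_source()

# Hypothetical pipeline: read_data(), fit_model(), and plot_results()
# are assumed names, not functions from any package.
list(
  # Track the input file so that changes to it invalidate later steps
  tar_target(data_file, "data/input_data.csv", format = "file"),
  tar_target(input_data, read_data(data_file)),
  tar_target(model, fit_model(input_data)),
  tar_target(figure, plot_results(model))
)
```

Running `targets::tar_make()` from the project root then builds every target in dependency order and skips the ones that are already up to date.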

Guidelines

  • Completeness
  • Organization
  • Economy
  • Transparency
  • Documentation
  • Access
  • Provenance
  • Metadata
  • Automation
  • Review

In practice

Research Data Management Support workshop:

Writing Reproducible Manuscripts in R and Python

Compendium step-by-step

  • Think about a good folder structure
    • Split up ‘read-only’, ‘human-generated’, and ‘project-generated’ files
  • Create folder structure (main directory and subdirectories)
    • Add a landing page in the form of a README document
    • Make the compendium executable (to automatically generate the results; optional)
    • Make the compendium into a git repository (optional)
  • Add all files needed for reproducing the results of the project
    • Avoid ‘hard-coded’ parameters or human intervention in the execution
  • Make the compendium as clean and easy to use as possible
    • Include a citation file and a LICENSE file with info on how it can be used
  • Publish your compendium
    • E.g. on Zenodo (optional, more on this in the last course week)
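
As an illustration of the advice to avoid hard-coded parameters, the sketch below collects all settings in one place and passes them to the analysis as arguments. All names and values are hypothetical, and the bootstrap is just a small stand-in for a real analysis.

```r
# All tunable values live in one place instead of being scattered
# through the script. Names and values are hypothetical.
params <- list(
  input_file = file.path("data", "my_data.csv"),  # relative, not absolute
  seed       = 123,
  n_boot     = 1000
)

# The analysis takes its settings as arguments, so it can be re-run
# with other values without editing the function body.
bootstrap_mean <- function(x, n_boot, seed) {
  set.seed(seed)
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  quantile(boot_means, c(0.025, 0.975))
}

# Example call with made-up data; in a compendium, x would come from
# read.csv(params$input_file).
x <- c(4.1, 5.3, 3.8, 6.0, 5.5, 4.7)
bootstrap_mean(x, params$n_boot, params$seed)
```

Because the seed and the number of bootstrap samples are arguments, the same call gives the same interval on every run, with no manual editing between runs.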

References

Markup Languages and Reproducible Programming in Statistics team (2024). Course materials. URL: www.gerkovink.com/markup

Utrecht University (2024). Course description. URL: https://osiris-student.uu.nl/#/onderwijscatalogus/extern/cursus?cursuscode=202000010&taal=en&collegejaar=huidig

The Turing Way Community (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (1.0.2). DOI: 10.5281/ZENODO.3233853

Utrecht University (2023). Best Practices for Writing Reproducible Code. URL: utrechtuniversity.github.io/workshop-computational-reproducibility

Utrecht University (2023). Writing Reproducible Manuscripts in R & Python. URL: utrechtuniversity.github.io/workshop-reproducible-manuscripts

Eglen, S., & Nüst, D. (2024). CODECHECK. URL: codecheck.org.uk