Computational Reproducibility in Statistics

Hanne Oberman

PhD candidate at Utrecht University

Why bother?

We would like our results to be as fully reproducible as possible:

A. Reproducibility is one of the pillars of science

It is the standard by which to judge scientific claims
It supports the cumulative growth of knowledge and avoids duplication of effort

B. Reproducibility may greatly benefit you

You’ll develop better work habits
Better teamwork, especially with new team members
Changing or amending your work becomes much easier
Higher research impact: your work is more likely to be picked up and cited

Crisis?

Definition

A result is reproducible when the same analysis steps performed on the same dataset consistently produce the same answer.

Definition

Research results are replicable if there is sufficient information available for independent researchers to make the same findings using the same procedures.

True or false?

In computational sciences - such as statistics - simply having the data and code means that the results are not only replicable, but fully reproducible.

Reproducibility of `R` scripts

Reproducible research is not the norm:

74% of R files failed to complete without error
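
As a rough illustration of what "complete without error" means in practice, the sketch below runs a script in a new environment and reports whether it finished. The helper name `runs_without_error()` and the two throwaway scripts are assumptions made for this example; this is not the procedure used to obtain the figure above, and a new environment is weaker than a fresh R session.

```r
# Run a script in a new environment and report whether it completed without error
runs_without_error <- function(path) {
  tryCatch({
    source(path, local = new.env())
    TRUE
  }, error = function(e) FALSE)
}

# Two throwaway scripts: one self-contained, one relying on an object it never creates
good_script <- tempfile(fileext = ".R")
bad_script <- tempfile(fileext = ".R")
writeLines("x <- 1 + 1", good_script)
writeLines("y <- undefined_object + 1", bad_script)

good_ok <- runs_without_error(good_script)
bad_ok <- runs_without_error(bad_script)
c(good = good_ok, bad = bad_ok)
```

Scripts that depend on objects left over in the author's workspace, on absolute paths, or on packages that are not installed typically fail this kind of check.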

Reproducibility spectrum

Making research reproducible

reproducible documents
research compendiums

Reproducible but not replicable

# Each seeded run can be repeated exactly; different seeds give different draws
set.seed(1)
run_simulation()
set.seed(2)
run_simulation()
set.seed(3)
run_simulation()
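
A minimal sketch of the seed pattern above, assuming a hypothetical `run_simulation()` that returns the mean of a random sample (the slides do not define it). Re-running with the same seed gives an identical result, while each new seed gives a different estimate, so one seeded run being repeatable does not show that the finding holds across runs.

```r
# Hypothetical stand-in for run_simulation(): mean of 100 standard-normal draws
run_simulation <- function(n = 100) {
  mean(rnorm(n))
}

# Same seed, same result: the run is reproducible
set.seed(1)
first_run <- run_simulation()
set.seed(1)
rerun <- run_simulation()
identical(first_run, rerun)

# Different seeds: each run can be repeated exactly, but the estimates differ
results <- vapply(1:3, function(seed) {
  set.seed(seed)
  run_simulation()
}, numeric(1))
results
```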

Teaching reproducibility

Markup Languages and Reproducible Programming in Statistics

Course aims

Course aims include the development of a publication-ready reproducible research compendium that contains:

a manuscript typeset in a markup language,
data and code,
everything that allows for successful reproduction and reuse of the materials (e.g. a license).

In our course, students are taught various tools and languages, such as Quarto markdown, version control with git, and reproducible environments for R with renv.

Full course aims

Students develop fundamental knowledge and understanding of the state of the art in statistical markup languages and reproducible programming and development
They can determine the most effective markup strategies to address a typesetting problem
They can efficiently organize a reproducible programming and development process
They can produce repositories up to the standards of international programming and coding conventions and initiatives
They can produce publications up to the typesetting standards of international peer-reviewed journals

Markup languages

Version control

Licensing

Sharing `R` code

Research compendiums

Course weeks

Markup languages
Quarto markdown
Version control (with git and GitHub)
Reproducible research in statistics
Developer portfolios
Re-usable R code (with R packages and Shiny)

Reusable course elements

Missing element(s)?

Poster

Takeaways

reproducibility is important
we should all learn reproducible workflows
we should teach reproducible workflows

Thank you!

Research Compendiums

Research compendium

Definition

A research compendium is a collection of all digital parts of a research project including data, code, texts…

The collection is created in such a way that reproducing all results is straightforward¹

The compendium serves as a means for distributing, managing, and updating the collection²

Basic compendium

A basic research compendium is just a folder…

compendium/
├── data
│   └── my_data.csv
├── analysis
│   └── my_script.R
├── requirements.txt
└── README.md
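
The folder above could be set up from R along these lines. This is a sketch that builds the skeleton in a temporary directory with empty placeholder files; the file names are copied from the tree, and the README text is an assumption.

```r
# Create the basic compendium skeleton in a temporary directory
root <- file.path(tempdir(), "compendium")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "analysis"), showWarnings = FALSE)

# Placeholder files; replace these with real data, code, and documentation
file.create(file.path(root, "data", "my_data.csv"))
file.create(file.path(root, "analysis", "my_script.R"))
file.create(file.path(root, "requirements.txt"))
writeLines("# compendium", file.path(root, "README.md"))

# List everything that was created, relative to the compendium root
created <- list.files(root, recursive = TRUE)
created
```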

(Not so) basic compendium

… but it can become extensive…

.
├── paper/
│   ├── paper.qmd
│   └── references.bib
│
├── figures/
│
├── data/
│   ├── raw_data/
│   └── clean_data/
│
└── templates/
    └── journal_template.csl

(Not so) basic compendium

…or even executable!

.
├── _targets.R
├── R/
│   ├── functions_data.R
│   ├── functions_analysis.R
│   └── functions_visualization.R
└── data/
    └── input_data.csv
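
For orientation, a `_targets.R` file for a layout like this might look roughly as follows. It is a sketch that assumes the targets package is installed; the helper functions `read_input_data()`, `fit_model()`, and `plot_results()` are hypothetical and would need to be defined in the `R/functions_*.R` files.

```r
# _targets.R: pipeline definition read by targets::tar_make()
library(targets)

# Load the functions defined in R/ (tar_source() sources the R/ folder by default)
tar_source()

# Hypothetical helpers, assumed to be defined in R/functions_*.R:
#   read_input_data(), fit_model(), plot_results()
list(
  # Track the input file so that changes to it invalidate downstream targets
  tar_target(input_file, "data/input_data.csv", format = "file"),
  tar_target(input_data, read_input_data(input_file)),
  tar_target(model, fit_model(input_data)),
  tar_target(figure, plot_results(model))
)
```

Calling `targets::tar_make()` from the compendium root would then run only the steps whose inputs or code have changed.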

(Not so) basic compendium

Guidelines

Completeness
Organization
Economy
Transparency
Documentation
Access
Provenance
Metadata
Automation
Review

In practice

Research Data Management Support workshop:

Writing Reproducible Manuscripts in R and Python

Compendium step-by-step

1. Think about a good folder structure
   - Split up ‘read-only’, ‘human-generated’, and ‘project-generated’ files
2. Create folder structure (main directory and sub directories)
   - Add a landing page in the form of a README document
   - Make the compendium executable (to automatically generate the results; optional)
   - Make the compendium into a git repository (optional)
3. Add all files needed for reproducing the results of the project
   - Avoid ‘hard coded’ parameters or human intervention in the execution
4. Make the compendium as clean and easy to use as possible
   - Include a citation file and a LICENSE file with info on how it can be used
5. Publish your compendium
   - E.g. on Zenodo (optional, more on this in the last course week)
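
One way to read the advice on ‘hard coded’ parameters is to keep every tunable setting in a single place and use paths relative to the compendium root. The sketch below illustrates that idea; the parameter names and values, and the `run_analysis()` function, are hypothetical.

```r
# Keep every tunable setting in one list at the top of the script
# (all values here are hypothetical placeholders)
params <- list(
  seed = 2024,
  n_draws = 50,
  data_file = file.path("data", "my_data.csv")  # relative path, not an absolute one
)

# The analysis reads its settings from `params` instead of embedding literals
run_analysis <- function(params) {
  set.seed(params$seed)
  draws <- rnorm(params$n_draws)
  c(mean = mean(draws), sd = sd(draws))
}

result <- run_analysis(params)
result
```

Because the seed and sample size live in `params`, changing the analysis means editing one list rather than hunting through the script, and the run needs no manual intervention.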

References

Markup Languages and Reproducible Programming in Statistics team (2024). Course materials. URL: www.gerkovink.com/markup

Utrecht University (2024). Course description. URL: https://osiris-student.uu.nl/#/onderwijscatalogus/extern/cursus?cursuscode=202000010&taal=en&collegejaar=huidig

The Turing Way Community (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (1.0.2). DOI: 10.5281/ZENODO.3233853

Utrecht University (2023). Best Practices for Writing Reproducible Code. URL: utrechtuniversity.github.io/workshop-computational-reproducibility

Utrecht University (2023). Writing Reproducible Manuscripts in R & Python. URL: utrechtuniversity.github.io/workshop-reproducible-manuscripts

Eglen, S., & Nüst, D. (2024). CODECHECK. URL: codecheck.org.uk
