Last updated: 2018-12-05

This page describes how to download the data and code used in this analysis, set up the project directory and rerun the analysis. We have used the workflowr package to organise the analysis and insert reproducibility information into the output documents. The packrat package has also been used to manage R package versions and dependencies.

Getting the code

All the code and outputs of the analysis are available from GitHub at https://github.com/Oshlack/combes-organoid-paper. If you want to replicate the analysis you can either fork the repository and clone it, or download the repository as a zipped directory.

Once you have a local copy of the repository you should see the following directory structure:

• analysis/ - Contains the R Markdown documents with the various stages of analysis. These are numbered according to the order they should be run.
• data/ - This directory contains the data files used in the analysis with each dataset in its own sub-directory (see Getting the data for details). Processed intermediate data files will also be placed here.
• output/ - Directory for output files produced by the analysis, each analysis step has its own sub-directory.
• docs/ - This directory contains the analysis website hosted at http://oshlacklab.com/combes-organoid-paper, including image files.
• R/ - R scripts with custom functions used in some analysis stages.
• packrat/ - Directory created by packrat that contains details of the R packages and versions used in the analysis.
• README.md - README describing the project.
• .Rprofile - Custom R profile for the project including set up for packrat and workflowr.
• .gitignore - Details of files and directories that are excluded from the repository.
• _workflowr.yml - Workflowr configuration file.
• combes-organoid-paper.Rproj - RStudio project file.

Installing packages

Packages and dependencies for this project are managed using packrat. This should allow you to install and use the same package versions as we have used for the analysis. packrat should automatically take care of this process for you the first time that you open R in the project directory. If for some reason this does not happen you may need to run the following commands:

install.packages("packrat")
packrat::restore()

Note that a clean install of all the required packages can take a significant amount of time when the project is first opened.

Getting the data

In this project we have used three scRNA-seq datasets: two batches of kidney organoids (the first containing three samples and the second a single organoid) and a human fetal kidney dataset published by Lindstrom et al. The organoid datasets can be downloaded from GEO accession number GSE114802 and the fetal kidney dataset from GEO GSE102596.
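If the GEOquery Bioconductor package is available, the supplementary files can be fetched directly from R. This is a sketch and not part of the original pipeline; the file names inside each supplementary archive may differ from the layout described below, so check them after extraction.

```r
# Sketch: fetch GEO supplementary files with GEOquery (Bioconductor).
# Assumes GEOquery is installed; not part of the original analysis code.
library(GEOquery)

# Organoid datasets (two 10x batches)
getGEOSuppFiles("GSE114802", baseDir = "data")

# Lindstrom et al. human fetal kidney dataset
getGEOSuppFiles("GSE102596", baseDir = "data")
```

The downloaded archives still need to be extracted and renamed to match the directory structure described in the next section.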

Once the datasets have been downloaded they need to be extracted, placed in the correct directories and renamed. The analysis code assumes the following directory structure inside the data/ directory:

• Lindstrom/
• GSM2741551_count-table-human16w.tsv
• Organoid123/
• barcodes.tsv
• genes.tsv
• matrix.mtx
• Organoid4/
• barcodes.tsv
• genes.tsv
• matrix.mtx

Additional data files used during the analysis are provided as part of the repository. Intermediate data files created during the analysis will be placed in data/processed. These are used by later stages of the analysis so should not be moved, altered or deleted.
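Before starting the analysis it can be worth checking that the layout matches what the code expects. The following sketch (not part of the original repository) creates the expected skeleton and warns about any missing files, using the file names listed above:

```r
# Sketch: create the expected data/ layout and check the required files exist.
# File names match the structure described above; adjust paths as needed.
datasets <- list(
    "data/Lindstrom"   = "GSM2741551_count-table-human16w.tsv",
    "data/Organoid123" = c("barcodes.tsv", "genes.tsv", "matrix.mtx"),
    "data/Organoid4"   = c("barcodes.tsv", "genes.tsv", "matrix.mtx")
)

for (dir in names(datasets)) {
    # Create the directory if it does not already exist
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    # Warn about any expected files that are not present
    missing <- !file.exists(file.path(dir, datasets[[dir]]))
    if (any(missing)) {
        warning("Missing in ", dir, ": ",
                paste(datasets[[dir]][missing], collapse = ", "))
    }
}
```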

Running the analysis

The analysis directory contains the following analysis files:

• 01_Organoid123_QC.Rmd
• 02_Organoid4_QC.Rmd
• 03_Organoids_Integration.Rmd
• 04_Organoids_Clustering.Rmd
• 04B_Organoids_Nephron.Rmd
• 04C_Organoids_Trajectory.Rmd
• 04D_Organoids_Figures.Rmd
• 05_Lindstrom_QC.Rmd
• 06_Combined_Integration.Rmd
• 07_Combined_Clustering.Rmd
• 07B_Combined_Nephron.Rmd
• 07C_Combined_Trajectory.Rmd
• 07D_Combined_Figures.Rmd
• 08_Crossover.Rmd
• 99_Methods.Rmd

As indicated by the numbering they should be run in this order. If you want to rerun the entire analysis this can be easily done using workflowr.

workflowr::wflow_build(republish = TRUE)

It is important to consider the computer and environment you are using before doing this. Running this analysis from scratch requires a considerable amount of time, disk space and memory. Some stages of the analysis also assume that multiple (10) cores are available for processing. If you have fewer cores available you will need to change the following line in the relevant files and provide the number of cores that are available for use.

bpparam <- MulticoreParam(workers = 10)
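One way to avoid hard-coding the worker count is to detect the available cores at run time and cap them at the 10 the analysis assumes. This is a sketch, not code from the original documents:

```r
# Detect available cores; detectCores() can return NA on some systems
n_cores <- parallel::detectCores()
if (is.na(n_cores)) {
    n_cores <- 1
}

# Use at most the 10 workers assumed by the analysis
n_workers <- min(10, n_cores)

# Then, in the relevant analysis files:
# bpparam <- BiocParallel::MulticoreParam(workers = n_workers)
```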

It is also possible to run individual stages of the analysis, either by providing the names of the files you want to run to workflowr::wflow_build() or by manually knitting the document (for example using the ‘Knit’ button in RStudio).
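For example, to rebuild just the first QC stage (any of the files listed above can be substituted; this assumes you are working inside the project directory):

```r
# Build a single analysis document and regenerate its HTML page
workflowr::wflow_build("analysis/01_Organoid123_QC.Rmd")
```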

Caching

To avoid having to repeatedly rerun long-running sections of the analysis we have turned on caching in the analysis documents. However, this comes with tradeoffs in disk space, usability and (potentially, although unlikely if you are careful) reproducibility. In most cases this should not be a problem but it is something to be aware of. In particular, there is an incompatibility between caching and workflowr that can cause images to not appear in the resulting HTML files (see this GitHub issue for more details). If you have already run part of the analysis (and therefore have a cache) and want to rerun a document, the safest option is to use the RStudio ‘Knit’ button.

This reproducible R Markdown analysis was created with workflowr 1.1.1