This page describes how to download the data and code used in this analysis, set up the project directory and rerun the analysis. We have used the workflowr package to organise the analysis and insert reproducibility information into the output documents.

Getting the code

All the code and outputs of the analysis are available from GitHub. If you want to replicate the analysis, you can either fork the repository and clone it, or download the repository as a zipped directory.

Once you have a local copy of the repository you should see the following directory structure:

Getting the data

In this project we have used data from several publicly available datasets. These include flow-sorted blood cell methylation data generated using Illumina HumanMethylationEPIC arrays, and normal kidney methylation data from The Cancer Genome Atlas (TCGA) kidney clear-cell carcinoma (KIRC) cohort, generated using Illumina HumanMethylation450 arrays. Both are downloaded automatically as part of the analysis, directly from the Bioconductor ExperimentHub.

We use a flow-sorted, blood cell RNAseq dataset, which can be downloaded from GEO at GSE107011 or SRA at SRP125125.

Once the RNAseq data has been downloaded it needs to be extracted, placed in the correct directories and quasi-mapped and quantified using Salmon. The approach we took is described here. The downstream analysis code assumes the following directory structure inside the data/ directory:

We use pre B-cell development Affymetrix gene expression array data, which can be downloaded from GEO at GSE45460.

For downstream analysis the CEL files for each sample are expected to be present in the following directory structure:

We use publicly available 450K data generated from developing human B-cells, which can be downloaded from GEO at GSE45459. Specifically, the GSE45459_Matrix_signal_intensities.txt.gz file should be downloaded, placed in the data/datasets directory and unzipped using gunzip GSE45459_Matrix_signal_intensities.txt.gz.
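The placement and unzipping step above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: `place_geo_matrix` is a hypothetical helper name, and it assumes the file has already been downloaded manually from GEO and that the second argument points at the project root.

```shell
# Sketch only: place_geo_matrix is a hypothetical helper, not provided by the project.
# Arguments: <path to downloaded .gz file> [project root, default: current directory]
place_geo_matrix() {
  local archive="$1"
  local project_root="${2:-.}"
  local target_dir="${project_root}/data/datasets"

  # Create data/datasets if it does not exist yet
  mkdir -p "${target_dir}"

  # Copy rather than move, so the original download is left untouched
  cp "${archive}" "${target_dir}/"

  # gunzip replaces the .gz copy with the uncompressed .txt file
  gunzip -f "${target_dir}/$(basename "${archive}")"
}
```

For example, from the project root: `place_geo_matrix ~/Downloads/GSE45459_Matrix_signal_intensities.txt.gz .` (the download location here is only an illustration).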

Some additional data files used during the analysis are provided as part of the repository.

Intermediate data files created during the analysis will be placed in:

These are used by later stages of the analysis so should not be moved, altered or deleted.

Running the analysis

The analysis directory contains the following analysis files:

 [1] "01_exploreArrayBiasEPIC.Rmd"     "02_exploreArrayBias450.Rmd"     
 [3] "03_fdrAnalysisBRCA.Rmd"          "03_fdrAnalysisKIRC.Rmd"         
 [5] "04_expressionGenesets.Rmd"       "04_expressionGenesetsBcells.Rmd"
 [7] "05_compareMethods.Rmd"           "05_compareMethodsBcells.Rmd"    
 [9] "06_runTimeComparison.Rmd"        "07_regionAnalysis.Rmd"          
[11] "07_regionAnalysisBcells.Rmd"     "08_methylGSAParamSweep.Rmd"     

As indicated by the numbering, they should be run in this order. If you want to rerun the entire analysis, this can be done easily using workflowr:

workflowr::wflow_build(republish = TRUE)

It is important to consider the computer and environment you are using before doing this. Running this analysis from scratch requires a considerable amount of time, disk space and memory. Some stages of the analysis need to be executed on an HPC to generate results required by downstream steps. If you do not have access to an HPC to perform these analyses using the code provided, you can download pre-computed RDS files containing the results from DOI.

To use the pre-computed RDS objects, after cloning or downloading the GitHub repository to your computer, please extract the outputs.tar.gz archive under the output directory, using tar -xvf outputs.tar.gz.
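As a rough sketch, the extraction step might look like this. `extract_outputs` is a hypothetical helper name, and the sketch assumes that the contents of the archive belong directly inside the output directory. The internal layout of the archive is not described here, so it may be worth listing it first with `tar -tzf outputs.tar.gz` before extracting.

```shell
# Sketch only: extract_outputs is a hypothetical helper, not provided by the project.
# Arguments: <path to outputs.tar.gz> [target directory, default: output]
extract_outputs() {
  local archive="$1"
  local output_dir="${2:-output}"

  # Make sure the target directory exists before extracting into it
  mkdir -p "${output_dir}"

  # -x extract, -z decompress with gzip, -f archive file, -C target directory
  tar -xzf "${archive}" -C "${output_dir}"
}
```

Run from the project root, `extract_outputs outputs.tar.gz output` would unpack the archive into output/ without changing the working directory.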

It is also possible to run individual stages of the analysis, either by providing the names of the files you want to run to workflowr::wflow_build() or by manually knitting the document (for example using the 'Knit' button in RStudio).
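The first of these options can also be driven from a terminal. The sketch below wraps the call in a small shell function. `run_stage` is a hypothetical helper name, and the sketch assumes that Rscript is on the PATH, that workflowr is installed, and that the function is called from the project root, where the analysis directory lives.

```shell
# Sketch only: run_stage is a hypothetical helper, not provided by the project.
# Argument: the file name of one analysis document, e.g. 01_exploreArrayBiasEPIC.Rmd
run_stage() {
  local rmd="analysis/$1"

  # Fail early if the file name is mistyped or we are in the wrong directory
  if [ ! -f "${rmd}" ]; then
    echo "Cannot find ${rmd}; run this from the project root." >&2
    return 1
  fi

  # Passing a file to wflow_build() builds only that document
  Rscript -e "workflowr::wflow_build('${rmd}')"
}
```

For example, `run_stage 01_exploreArrayBiasEPIC.Rmd` would build only the first stage. Later stages may still depend on intermediate files written by earlier ones, so the numbered order matters when running stages one at a time.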

Once all the analyses have been rerun, the manuscript figures can be generated using the code provided here.


