Last updated: 2021-04-21

Checks: 7 passed, 0 failed

Knit directory: methyl-geneset-testing/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it's best to always run the code in an empty environment.

The command set.seed(20200302) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
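As a minimal illustration of why the seed matters, the sketch below resets the same seed before a random draw and checks that the draw is reproduced exactly. The `draw_subsample` helper is hypothetical and is not part of the analysis code.

```r
# Hypothetical helper: draw a random subsample after fixing the seed.
draw_subsample <- function(seed, n = 5, pool = 1:100) {
  set.seed(seed)          # fix the random number generator state
  sample(pool, size = n)  # the draw now depends only on the seed
}

first  <- draw_subsample(20200302)
second <- draw_subsample(20200302)
identical(first, second)  # TRUE: same seed, same subsample
```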

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e3c4003. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:

Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/figures.nb.html
    Ignored:    code/.DS_Store
    Ignored:    code/.Rhistory
    Ignored:    code/.job/
    Ignored:    code/old/
    Ignored:    data/.DS_Store
    Ignored:    data/annotations/
    Ignored:    data/cache-intermediates/
    Ignored:    data/cache-region/
    Ignored:    data/cache-rnaseq/
    Ignored:    data/cache-runtime/
    Ignored:    data/datasets/.DS_Store
    Ignored:    data/datasets/GSE110554-data.RData
    Ignored:    data/datasets/GSE120854/
    Ignored:    data/datasets/GSE120854_RAW.tar
    Ignored:    data/datasets/GSE135446-data.RData
    Ignored:    data/datasets/GSE135446/
    Ignored:    data/datasets/GSE135446_RAW.tar
    Ignored:    data/datasets/GSE45459-data.RData
    Ignored:    data/datasets/GSE45459_Matrix_signal_intensities.txt
    Ignored:    data/datasets/GSE45460/
    Ignored:    data/datasets/GSE45460_RAW.tar
    Ignored:    data/datasets/GSE95460_RAW.tar
    Ignored:    data/datasets/GSE95460_RAW/
    Ignored:    data/datasets/GSE95462-data.RData
    Ignored:    data/datasets/GSE95462/
    Ignored:    data/datasets/GSE95462_RAW/
    Ignored:    data/datasets/SRP100803/
    Ignored:    data/datasets/SRP125125/.DS_Store
    Ignored:    data/datasets/SRP125125/SRR6298*/
    Ignored:    data/datasets/SRP125125/SRR_Acc_List.txt
    Ignored:    data/datasets/SRP125125/SRR_Acc_List_Full.txt
    Ignored:    data/datasets/SRP125125/SraRunTable.txt
    Ignored:    data/datasets/SRP125125/multiqc_data/
    Ignored:    data/datasets/SRP125125/multiqc_report.html
    Ignored:    data/datasets/SRP125125/quants/
    Ignored:    data/datasets/SRP166862/
    Ignored:    data/datasets/SRP217468/
    Ignored:    data/datasets/TCGA.BRCA.rds
    Ignored:    data/datasets/TCGA.KIRC.rds
    Ignored:    data/misc/
    Ignored:    output/--exclude
    Ignored:    output/.DS_Store
    Ignored:    output/FDR-analysis/
    Ignored:    output/compare-methods/
    Ignored:    output/figures/
    Ignored:    output/methylgsa-params/
    Ignored:    output/outputs-1.tar.gz
    Ignored:    output/outputs.tar.gz
    Ignored:    output/random-cpg-sims/

Untracked files:
    Untracked:  analysis/old/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gettingStarted.Rmd) and HTML (docs/gettingStarted.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd e3c4003 JovMaksimovic 2021-04-21 wflow_publish("analysis/gettingStarted.Rmd")
html 0f8d628 JovMaksimovic 2021-04-20 Build site.
Rmd 9d23572 JovMaksimovic 2021-04-20 wflow_publish(rownames(stat$status)[stat$status$modified == TRUE])
html d3675c5 JovMaksimovic 2020-08-28 Build site.
Rmd 562f140 JovMaksimovic 2020-08-28 wflow_publish(c("analysis/03_expressionGenesets.Rmd", "analysis/gettingStarted.Rmd",
html e66c39a JovMaksimovic 2020-08-14 Build site.
Rmd ad9e7be JovMaksimovic 2020-08-14 wflow_publish(c("analysis/figures.Rmd", "analysis/gettingStarted.Rmd",
html 555069b JovMaksimovic 2020-08-14 Build site.
Rmd 91699a8 JovMaksimovic 2020-08-14 wflow_publish("analysis/_site.yml", republish = TRUE, all = TRUE)
html e162725 JovMaksimovic 2020-07-27 Build site.
Rmd ea6f88d JovMaksimovic 2020-07-27 wflow_publish(c("analysis/index.Rmd", "analysis/gettingStarted.Rmd"))
html d439b32 JovMaksimovic 2020-07-27 Build site.
Rmd 6278674 JovMaksimovic 2020-07-27 wflow_publish(c("analysis/index.Rmd", "analysis/gettingStarted.Rmd"))

This page describes how to download the data and code used in this analysis, set up the project directory and rerun the analysis. We have used the workflowr package to organise the analysis and insert reproducibility information into the output documents.

Getting the code

All the code and outputs of the analysis are available from GitHub. If you want to replicate the analysis, you can either fork the repository and clone it, or download the repository as a zipped directory.

Once you have a local copy of the repository you should see the following directory structure:

Getting the data

In this project we have used data from several publicly available datasets. These include flow-sorted blood cell methylation data generated using Illumina HumanMethylationEPIC arrays, and normal kidney methylation data from The Cancer Genome Atlas (TCGA) kidney clear-cell carcinoma (KIRC) cohort, generated using Illumina HumanMethylation450 arrays. Both are downloaded automatically from the Bioconductor ExperimentHub as part of the analysis.

We use a flow-sorted, blood cell RNAseq dataset, which can be downloaded from GEO at GSE107011 or SRA at SRP125125.

Once the RNAseq data has been downloaded it needs to be extracted, placed in the correct directories and quasi-mapped and quantified using Salmon. The approach we took is described here. The downstream analysis code assumes the following directory structure inside the data/ directory:

We use pre B-cell development Affymetrix gene expression array data, which can be downloaded from GEO at GSE45460.

For downstream analysis the CEL files for each sample are expected to be present in the following directory structure:

We use publicly available 450K data generated from developing human B-cells, which can be downloaded from GEO at GSE45459. Specifically, the GSE45459_Matrix_signal_intensities.txt.gz file should be downloaded, placed in the data/datasets directory and unzipped using gunzip GSE45459_Matrix_signal_intensities.txt.gz.
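If gunzip is not available on your system, the file can also be decompressed from within R. The following is a minimal sketch using base R only. The `gunzip_file` helper is hypothetical (it is not part of the repository), and unlike gunzip it leaves the original .gz file in place.

```r
# Hypothetical helper: decompress a gzip file using base R connections.
gunzip_file <- function(src, dest = sub("\\.gz$", "", src), chunk = 1e6) {
  stopifnot(file.exists(src), !identical(src, dest))
  input <- gzfile(src, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- file(dest, open = "wb")
  on.exit(close(output), add = TRUE)
  repeat {
    # Read decompressed bytes in chunks to keep memory use bounded
    bytes <- readBin(input, what = "raw", n = chunk)
    if (length(bytes) == 0) break
    writeBin(bytes, output)
  }
  invisible(dest)
}

# Assumed location, following the instructions above (run from the project root):
# gunzip_file("data/datasets/GSE45459_Matrix_signal_intensities.txt.gz")
```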

Some additional data files used during the analysis are provided as part of the repository.

Intermediate data files created during the analysis will be placed in:

These are used by later stages of the analysis so should not be moved, altered or deleted.

Running the analysis

The analysis directory contains the following analysis files:

 [1] "01_exploreArrayBiasEPIC.Rmd"     "02_exploreArrayBias450.Rmd"     
 [3] "03_fdrAnalysisBRCA.Rmd"          "03_fdrAnalysisKIRC.Rmd"         
 [5] "04_expressionGenesets.Rmd"       "04_expressionGenesetsBcells.Rmd"
 [7] "05_compareMethods.Rmd"           "05_compareMethodsBcells.Rmd"    
 [9] "06_runTimeComparison.Rmd"        "07_regionAnalysis.Rmd"          
[11] "07_regionAnalysisBcells.Rmd"     "08_methylGSAParamSweep.Rmd"     

As indicated by the numbering, they should be run in this order. To rerun the entire analysis, use workflowr:

workflowr::wflow_build(republish = TRUE)

It is important to consider the computer and environment you are using before doing this. Running this analysis from scratch requires a considerable amount of time, disk space and memory. Some stages of the analysis need to be executed on an HPC to generate results required by downstream steps. If you do not have access to an HPC to perform these analyses using the code provided, you can download pre-computed RDS files containing the results from DOI.

To use the pre-computed RDS objects, after cloning or downloading the GitHub repository to your computer, please extract the outputs.tar.gz archive under the output directory, using tar -xvf outputs.tar.gz.
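The same extraction can be done from an R session. This is a sketch only: the `extract_outputs` helper is hypothetical, and its default path assumes the archive has been saved as output/outputs.tar.gz relative to the project root.

```r
# Hypothetical helper: unpack the pre-computed results archive in place.
extract_outputs <- function(archive = file.path("output", "outputs.tar.gz")) {
  if (!file.exists(archive)) {
    stop("Archive not found: ", archive)
  }
  # Extract next to the archive, i.e. under output/
  utils::untar(archive, exdir = dirname(archive))
  invisible(dirname(archive))
}

# Run from the project root once the archive has been downloaded:
# extract_outputs()
```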

It is also possible to run individual stages of the analysis, either by providing the names of the files you want to run to workflowr::wflow_build() or by manually knitting the document (for example using the 'Knit' button in RStudio).
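For example, building selected stages might look like the sketch below. The `stage_paths` and `build_stages` helpers are hypothetical, and the code assumes it is run from the project root with the Rmd files located in the analysis directory.

```r
# Hypothetical helper: turn stage file names into paths inside analysis/.
stage_paths <- function(stages, dir = "analysis") {
  stopifnot(all(grepl("\\.Rmd$", stages)))
  file.path(dir, sort(stages))  # numeric prefixes give the run order
}

# Hypothetical wrapper: build only the requested stages.
# wflow_build() is only called when this function is invoked,
# so workflowr and the project files must be available at that point.
build_stages <- function(stages) {
  workflowr::wflow_build(files = stage_paths(stages))
}

stage_paths(c("02_exploreArrayBias450.Rmd", "01_exploreArrayBiasEPIC.Rmd"))

# Run from the project root:
# build_stages("01_exploreArrayBiasEPIC.Rmd")
```

Keeping the path construction separate from the build call makes it possible to check which files would be built before starting a long-running stage.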

Once all the analyses have been rerun, the manuscript figures can be generated using the code provided here.


R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        whisker_0.4       knitr_1.31        magrittr_2.0.1   
 [5] here_1.0.1        R6_2.5.0          rlang_0.4.10      fansi_0.4.2      
 [9] stringr_1.4.0     tools_4.0.3       xfun_0.22         utf8_1.2.1       
[13] git2r_0.28.0      jquerylib_0.1.3   htmltools_0.5.1.1 ellipsis_0.3.1   
[17] rprojroot_2.0.2   yaml_2.2.1        digest_0.6.27     tibble_3.1.0     
[21] lifecycle_1.0.0   crayon_1.4.1      later_1.1.0.1     sass_0.3.1       
[25] vctrs_0.3.7       promises_1.2.0.1  fs_1.5.0          glue_1.4.2       
[29] evaluate_0.14     rmarkdown_2.7     stringi_1.5.3     bslib_0.2.4      
[33] compiler_4.0.3    pillar_1.5.1      jsonlite_1.7.2    httpuv_1.5.5     
[37] pkgconfig_2.0.3