Last updated: 2021-04-21

Checks: 7 0

Knit directory: methyl-geneset-testing/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it's best to always run the code in an empty environment.

Seed: set.seed(20200302)

The command set.seed(20200302) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: e3c4003

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e3c4003. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/figures.nb.html
    Ignored:    code/.DS_Store
    Ignored:    code/.Rhistory
    Ignored:    code/.job/
    Ignored:    code/old/
    Ignored:    data/.DS_Store
    Ignored:    data/annotations/
    Ignored:    data/cache-intermediates/
    Ignored:    data/cache-region/
    Ignored:    data/cache-rnaseq/
    Ignored:    data/cache-runtime/
    Ignored:    data/datasets/.DS_Store
    Ignored:    data/datasets/GSE110554-data.RData
    Ignored:    data/datasets/GSE120854/
    Ignored:    data/datasets/GSE120854_RAW.tar
    Ignored:    data/datasets/GSE135446-data.RData
    Ignored:    data/datasets/GSE135446/
    Ignored:    data/datasets/GSE135446_RAW.tar
    Ignored:    data/datasets/GSE45459-data.RData
    Ignored:    data/datasets/GSE45459_Matrix_signal_intensities.txt
    Ignored:    data/datasets/GSE45460/
    Ignored:    data/datasets/GSE45460_RAW.tar
    Ignored:    data/datasets/GSE95460_RAW.tar
    Ignored:    data/datasets/GSE95460_RAW/
    Ignored:    data/datasets/GSE95462-data.RData
    Ignored:    data/datasets/GSE95462/
    Ignored:    data/datasets/GSE95462_RAW/
    Ignored:    data/datasets/SRP100803/
    Ignored:    data/datasets/SRP125125/.DS_Store
    Ignored:    data/datasets/SRP125125/SRR6298*/
    Ignored:    data/datasets/SRP125125/SRR_Acc_List.txt
    Ignored:    data/datasets/SRP125125/SRR_Acc_List_Full.txt
    Ignored:    data/datasets/SRP125125/SraRunTable.txt
    Ignored:    data/datasets/SRP125125/multiqc_data/
    Ignored:    data/datasets/SRP125125/multiqc_report.html
    Ignored:    data/datasets/SRP125125/quants/
    Ignored:    data/datasets/SRP166862/
    Ignored:    data/datasets/SRP217468/
    Ignored:    data/datasets/TCGA.BRCA.rds
    Ignored:    data/datasets/TCGA.KIRC.rds
    Ignored:    data/misc/
    Ignored:    output/--exclude
    Ignored:    output/.DS_Store
    Ignored:    output/FDR-analysis/
    Ignored:    output/compare-methods/
    Ignored:    output/figures/
    Ignored:    output/methylgsa-params/
    Ignored:    output/outputs-1.tar.gz
    Ignored:    output/outputs.tar.gz
    Ignored:    output/random-cpg-sims/

Untracked files:
    Untracked:  analysis/old/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gettingStarted.Rmd) and HTML (docs/gettingStarted.html) files. If you've configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	e3c4003	JovMaksimovic	2021-04-21	wflow_publish("analysis/gettingStarted.Rmd")
html	0f8d628	JovMaksimovic	2021-04-20	Build site.
Rmd	9d23572	JovMaksimovic	2021-04-20	wflow_publish(rownames(stat$status)[stat$status$modified == TRUE])
html	d3675c5	JovMaksimovic	2020-08-28	Build site.
Rmd	562f140	JovMaksimovic	2020-08-28	wflow_publish(c("analysis/03_expressionGenesets.Rmd", "analysis/gettingStarted.Rmd",
html	e66c39a	JovMaksimovic	2020-08-14	Build site.
Rmd	ad9e7be	JovMaksimovic	2020-08-14	wflow_publish(c("analysis/figures.Rmd", "analysis/gettingStarted.Rmd",
html	555069b	JovMaksimovic	2020-08-14	Build site.
Rmd	91699a8	JovMaksimovic	2020-08-14	wflow_publish("analysis/_site.yml", republish = TRUE, all = TRUE)
html	e162725	JovMaksimovic	2020-07-27	Build site.
Rmd	ea6f88d	JovMaksimovic	2020-07-27	wflow_publish(c("analysis/index.Rmd", "analysis/gettingStarted.Rmd"))
html	d439b32	JovMaksimovic	2020-07-27	Build site.
Rmd	6278674	JovMaksimovic	2020-07-27	wflow_publish(c("analysis/index.Rmd", "analysis/gettingStarted.Rmd"))

This page describes how to download the data and code used in this analysis, set up the project directory and rerun the analysis. We have used the workflowr package to organise the analysis and insert reproducibility information into the output documents.

Getting the code

All the code and outputs of the analysis are available from GitHub at https://github.com/Oshlack/methyl-geneset-testing. To replicate the analysis, either fork the repository and clone it, or download the repository as a zipped directory.
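As a minimal sketch of the cloning route from within R, the helper below wraps `git2r::clone()`. The function name `clone_project` and the default destination directory are illustrative assumptions, not part of the project code; the sketch assumes the git2r package is installed and refuses to write into a path that already exists.

```r
# Hypothetical helper: clone the analysis repository from within R.
# Assumes the git2r package is installed; `dest` is an illustrative default.
clone_project <- function(dest = "methyl-geneset-testing",
                          url = "https://github.com/Oshlack/methyl-geneset-testing") {
  # Refuse to clone into a path that already exists.
  if (file.exists(dest)) {
    stop("Destination already exists: ", dest)
  }
  if (!requireNamespace("git2r", quietly = TRUE)) {
    stop("Package 'git2r' is required for this sketch")
  }
  git2r::clone(url = url, local_path = dest)
  invisible(dest)
}
```

Calling `clone_project()` from the directory where the project should live is intended to be equivalent to running `git clone` on the same URL from the command line.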

Once you have a local copy of the repository you should see the following directory structure:

analysis/ - Contains the R Markdown documents for the various stages of the analysis. These are numbered in the order in which they should be run.
data/ - This directory contains the data files used in the analysis with sub-directories for different data types (see Getting the data for details). Processed intermediate data files will also be placed here.
output/ - Directory for output files produced by the analysis. Each analysis step has its own sub-directory.
docs/ - This directory contains the analysis website hosted at http://oshlacklab.com/methyl-geneset-testing, including image files.
code/ - R scripts with custom functions used in some analysis stages. There are sub-directories for scripts associated with different steps in the analysis.
README.md - README describing the project.
.Rprofile - Custom R profile for the project including set up for workflowr.
.gitignore - Details of files and directories that are excluded from the repository.
_workflowr.yml - Workflowr configuration file.
methyl-geneset-testing.Rproj - RStudio project file.
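One way to confirm that a local copy matches the layout listed above is a quick existence check. This is only a sketch: `expected_entries` and `missing_entries()` are hypothetical names introduced here, and the list of entries is copied from the description above rather than from the repository itself.

```r
# Entries named in the directory structure description.
expected_entries <- c("analysis", "data", "output", "docs", "code",
                      "README.md", ".Rprofile", ".gitignore",
                      "_workflowr.yml", "methyl-geneset-testing.Rproj")

# Hypothetical helper: return the expected entries that are absent from `root`.
missing_entries <- function(root = ".", expected = expected_entries) {
  expected[!file.exists(file.path(root, expected))]
}
```

Running `missing_entries()` from the project root should return an empty character vector if everything listed is present.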

Getting the data

In this project we have used data from several publicly available datasets. We use flow-sorted blood cell methylation data generated using Illumina HumanMethylationEPIC arrays, and normal kidney methylation data from The Cancer Genome Atlas (TCGA) kidney clear-cell carcinoma (KIRC) cohort, which was generated using Illumina HumanMethylation450 arrays. Both are downloaded automatically from the Bioconductor ExperimentHub as part of the analysis.

We use a flow-sorted, blood cell RNAseq dataset, which can be downloaded from GEO at GSE107011 or SRA at SRP125125.

Once the RNAseq data has been downloaded it needs to be extracted, placed in the correct directories and quasi-mapped and quantified using Salmon. The approach we took is described here. The downstream analysis code assumes the following directory structure inside the data/ directory:

datasets
- SRP125125
  - quants
    - SRR6298258_quant
    - ...
    - ...
    - SRR6298376_quant
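Before running the downstream code, it may help to verify that the quantification directories are where the analysis expects them. The sketch below is not part of the project code: `find_quant_files()` is a hypothetical helper, and it assumes each `*_quant` directory contains Salmon's standard `quant.sf` output file.

```r
# Hypothetical helper: locate Salmon quantification files under data/.
# Assumes each `*_quant` directory holds Salmon's standard `quant.sf` file.
find_quant_files <- function(data_dir = "data") {
  quant_root <- file.path(data_dir, "datasets", "SRP125125", "quants")
  quant_dirs <- list.files(quant_root, pattern = "_quant$", full.names = TRUE)
  quant_dirs <- quant_dirs[dir.exists(quant_dirs)]

  # Name each path by its run accession, e.g. "SRR6298258".
  quant_files <- file.path(quant_dirs, "quant.sf")
  names(quant_files) <- sub("_quant$", "", basename(quant_dirs))

  # Warn about sample directories with no quantification file.
  ok <- file.exists(quant_files)
  if (any(!ok)) {
    warning("Missing quant.sf for: ",
            paste(names(quant_files)[!ok], collapse = ", "))
  }
  quant_files[ok]
}
```

The returned named character vector lists one `quant.sf` path per sample, and a warning identifies any sample directory where quantification appears incomplete.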

We use pre B-cell development Affymetrix gene expression array data, which can be downloaded from GEO at GSE45460.

For downstream analysis the CEL files for each sample are expected to be present in the following directory structure:

data
- datasets
  - GSE45460
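A similar check can be sketched for the CEL files. `list_cel_files()` is a hypothetical helper, and matching both plain and gzip-compressed files is an assumption about how the downloaded files are named.

```r
# Hypothetical helper: list the CEL files expected under data/datasets/GSE45460.
# Matches both plain and gzip-compressed files (an assumption about naming).
list_cel_files <- function(data_dir = "data") {
  cel_dir <- file.path(data_dir, "datasets", "GSE45460")
  list.files(cel_dir, pattern = "\\.cel(\\.gz)?$", ignore.case = TRUE,
             full.names = TRUE)
}
```

An empty result from `list_cel_files()` suggests the files have not yet been extracted into the expected directory.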

We use publicly available 450K data generated from developing human B-cells, which can be downloaded from GEO at GSE45459. Specifically, the GSE45459_Matrix_signal_intensities.txt.gz file should be downloaded, placed in the data/datasets directory and unzipped using gunzip GSE45459_Matrix_signal_intensities.txt.gz.
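If the `gunzip` command is not available (for example on Windows), the file can be decompressed with base R connections instead. The helper below is a sketch under that assumption; `gunzip_base()` is a hypothetical name and, unlike command-line `gunzip`, it leaves the compressed original in place.

```r
# Hypothetical helper: decompress a .gz file with base R connections,
# leaving the compressed original in place (unlike command-line gunzip).
gunzip_base <- function(gz_path, dest = sub("\\.gz$", "", gz_path)) {
  stopifnot(file.exists(gz_path), !identical(dest, gz_path))
  input <- gzfile(gz_path, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- file(dest, open = "wb")
  on.exit(close(output), add = TRUE)

  # Copy the decompressed bytes across in chunks to limit memory use.
  repeat {
    chunk <- readBin(input, what = "raw", n = 1e6)
    if (length(chunk) == 0) break
    writeBin(chunk, output)
  }
  invisible(dest)
}
```

For this dataset the call would be `gunzip_base("data/datasets/GSE45459_Matrix_signal_intensities.txt.gz")`, run from the project root.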

Some additional data files used during the analysis are provided as part of the repository.

genesets
- GO-immune-system-process.txt
- kegg-immune-related-pathways.csv
datasets
- SRP125125
  - SraRunTableFull.txt

Intermediate data files created during the analysis will be placed in:

annotations
cache-intermediates
cache-region
cache-rnaseq
cache-runtime

These are used by later stages of the analysis so should not be moved, altered or deleted.

Running the analysis

The analysis directory contains the following analysis files:

 [1] "01_exploreArrayBiasEPIC.Rmd"     "02_exploreArrayBias450.Rmd"     
 [3] "03_fdrAnalysisBRCA.Rmd"          "03_fdrAnalysisKIRC.Rmd"         
 [5] "04_expressionGenesets.Rmd"       "04_expressionGenesetsBcells.Rmd"
 [7] "05_compareMethods.Rmd"           "05_compareMethodsBcells.Rmd"    
 [9] "06_runTimeComparison.Rmd"        "07_regionAnalysis.Rmd"          
[11] "07_regionAnalysisBcells.Rmd"     "08_methylGSAParamSweep.Rmd"

As the numbering indicates, they should be run in this order. If you want to rerun the entire analysis, this can be done using workflowr:

workflowr::wflow_build(republish = TRUE)

It is important to consider the computer and environment you are using before doing this. Running this analysis from scratch requires a considerable amount of time, disk space and memory. Some stages of the analysis need to be executed on an HPC to generate results required by downstream steps. If you do not have access to an HPC to perform these analyses using the code provided, you can download pre-computed RDS files containing the results from .

To use the pre-computed RDS objects, after cloning or downloading the GitHub repository to your computer, please extract the outputs.tar.gz archive under the output directory, using tar -xvf outputs.tar.gz.
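The same extraction can be sketched in R with `utils::untar()`, which may be convenient if you are already working in an R session. `extract_outputs()` is a hypothetical helper, and it assumes the archive has been saved as `output/outputs.tar.gz`.

```r
# Hypothetical helper: unpack the pre-computed results into output/.
# Assumes the archive has been saved as output/outputs.tar.gz.
extract_outputs <- function(output_dir = "output",
                            archive = file.path(output_dir, "outputs.tar.gz")) {
  if (!file.exists(archive)) {
    stop("Archive not found: ", archive)
  }
  utils::untar(archive, exdir = output_dir)
  invisible(list.files(output_dir))
}
```

Calling `extract_outputs()` from the project root unpacks the archive contents alongside it in the output directory.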

It is also possible to run individual stages of the analysis, either by providing the names of the files you want to run to workflowr::wflow_build() or by manually knitting the document (for example using the 'Knit' button in RStudio).
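As an illustration of building a subset of stages, the sketch below sorts a few of the listed files by their numeric prefix and passes them to `workflowr::wflow_build()` via its `files` argument. The choice of files and the helper `stage_paths()` are hypothetical; `dry_run = TRUE` only lists what would be built, so remove it to actually knit the documents.

```r
# A subset of the file names listed in the analysis directory.
analysis_files <- c("05_compareMethods.Rmd", "01_exploreArrayBiasEPIC.Rmd",
                    "03_fdrAnalysisKIRC.Rmd")

# Hypothetical helper: sort files by their numeric prefix and prepend the
# analysis/ directory so they can be passed to workflowr::wflow_build().
stage_paths <- function(files, analysis_dir = "analysis") {
  prefix <- as.integer(sub("^([0-9]+)_.*$", "\\1", files))
  file.path(analysis_dir, files[order(prefix, files)])
}

paths <- stage_paths(analysis_files)

# Only attempt the build inside a workflowr project with workflowr installed.
if (requireNamespace("workflowr", quietly = TRUE) &&
    file.exists("_workflowr.yml")) {
  workflowr::wflow_build(files = paths, dry_run = TRUE)  # preview only
}
```

Keeping the numeric order matters because later stages read intermediate files written by earlier ones.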

Once all the analyses have been rerun, the manuscript figures can be generated using the code provided here.

devtools::session_info()

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        whisker_0.4       knitr_1.31        magrittr_2.0.1   
 [5] here_1.0.1        R6_2.5.0          rlang_0.4.10      fansi_0.4.2      
 [9] stringr_1.4.0     tools_4.0.3       xfun_0.22         utf8_1.2.1       
[13] git2r_0.28.0      jquerylib_0.1.3   htmltools_0.5.1.1 ellipsis_0.3.1   
[17] rprojroot_2.0.2   yaml_2.2.1        digest_0.6.27     tibble_3.1.0     
[21] lifecycle_1.0.0   crayon_1.4.1      later_1.1.0.1     sass_0.3.1       
[25] vctrs_0.3.7       promises_1.2.0.1  fs_1.5.0          glue_1.4.2       
[29] evaluate_0.14     rmarkdown_2.7     stringi_1.5.3     bslib_0.2.4      
[33] compiler_4.0.3    pillar_1.5.1      jsonlite_1.7.2    httpuv_1.5.5     
[37] pkgconfig_2.0.3
