Last updated: 2018-12-04
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180730)
The command set.seed(20180730)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 5dcfa12
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/cache.bak.20181031/
Ignored: analysis/cache.bak/
Ignored: analysis/cache.lind2.20181114/
Ignored: analysis/cache/
Ignored: data/Lindstrom2/
Ignored: data/processed.bak.20181031/
Ignored: data/processed.bak/
Ignored: data/processed.lind2.20181114/
Ignored: packrat/lib-R/
Ignored: packrat/lib-ext/
Ignored: packrat/lib/
Ignored: packrat/src/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 5dcfa12 | Luke Zappia | 2018-12-04 | Fix typo in methods |
Rmd | 858cf44 | Luke Zappia | 2018-12-04 | Add hFK podocyte DE |
html | 858cf44 | Luke Zappia | 2018-12-04 | Add hFK podocyte DE |
html | 292a1c6 | Luke Zappia | 2018-11-23 | Minor fixes to output |
Rmd | d79b8e7 | Luke Zappia | 2018-11-20 | Create DE signature and filter results |
html | d79b8e7 | Luke Zappia | 2018-11-20 | Create DE signature and filter results |
html | a61f9c9 | Luke Zappia | 2018-09-13 | Rebuild site |
Rmd | a99f69a | Luke Zappia | 2018-09-13 | Fix link to site |
html | ad10b21 | Luke Zappia | 2018-09-13 | Switch to GitHub |
Rmd | 57b7fc2 | Luke Zappia | 2018-09-13 | Add workflowr to methods |
Rmd | 1d8b5a6 | Luke Zappia | 2018-09-11 | Update methods |
Rmd | 7755ac7 | Luke Zappia | 2018-08-15 | Add methods document |
# Presentation
library("glue")
library("knitr")
# Paths
library("here")
# Output
library("jsonlite")
# Tidyverse
library("tidyverse")
write_bib(c("base", "SingleCellExperiment", "cowplot", "workflowr"),
file = here("data/packages.bib"))
docs <- list.dirs(here("output"), full.names = FALSE)[-1]
params <- map(docs, function(doc) {
out <- tryCatch(
{
table <- suppressWarnings(
read_json(here(glue("output/{doc}/parameters.json")),
simplifyVector = TRUE)
)
p.list <- table$Value %>% setNames(table$Parameter)
},
error = function(e) {
message(glue("{doc} parameters file not found"))
return(list())
}
)
return(out)
})
names(params) <- str_remove(docs, "[0-9A-Z]+_")
The Cell Ranger pipeline (v1.3.1) was used to perform sample demultiplexing, barcode processing and single-cell gene counting. Briefly, samples were demultiplexed to produce a pair of FASTQ files for each sample. Reads containing sequence information were aligned to the GRCh38 reference genome provided with Cell Ranger (v1.2.0). Cell barcodes were filtered to remove empty droplets and PCR duplicates were removed by selecting unique combinations of cell barcodes, unique molecular identifiers and gene ids with the final results being a gene expression matrix that was used for further analysis. The three samples in the first batch of organoids were aggregated using Cell Ranger with no normalisation and treated as a single dataset.
The R statistical programming language (v3.5.0) (R Core Team 2018) was used for further analysis. Count data for each experiment was read into R and used to construct a SingleCellExperiment object (v1.2.0) (A. Lun and Risso 2017).Gene annotation information was added from BioMart (Smedley et al. 2015) using the biomaRt package (v2.36.1) (Durinck et al. 2005) and cells were assigned cell cycle scores using the cyclone (Scialdone et al. 2015) function in the scran package (v1.8.2) (A. T. L. Lun, McCarthy, and Marioni 2016).
The scater package (v1.8.2) (McCarthy et al. 2017) was used to produce a series of diagnostic quality control plots. Cells with a high number of expressed genes (indicating potential doublets) were removed, as were cells with a high percentage of counts assigned to mitochondrial or ribosomal genes, or with low expression of the housekeeping genes GAPDH and ACTB.
Genes that had less than two total counts across a dataset, or were expressed in less than two individual cells, were removed. We also removed genes without an annotated HGNC symbol.
Following quality control the first organoid dataset consisted of 6649 cells and 18386 genes with a median of 2738 genes expressed per a cell, the fourth organoid had 1288 cells and 16885 genes with a median of 3248 genes expressed per cell and the Lindstrom dataset had 3178 cells and 16166 genes with a median of 1509.5 genes expressed per cell.
The two organoid datasets were integrated using the alignment method in the Seurat package (v2.3.1) (Satija et al. 2015; Butler et al. 2018). Briefly, highly variable genes were identified in each dataset and those that were present in both datasets (1156 genes) were selected. Canonical correlation analysis (Hotelling 1936; Hardoon, Szedmak, and Shawe-Taylor 2004) was then performed using the selected genes and 25 dimensions that represent the majority of variation were selected. The final step used dynamic time warping (Berndt and Clifford 1994) to align the datasets in the selected subspace.
To perform clustering Seurat constructs a shared nearest neighbour graph of cells in the aligned subspace and uses Louvain modularity optimisation (Blondel et al. 2008) to assign cells to clusters. The number of clusters produced using this method is controlled by a resolution parameter with higher values giving more clusters. We performed clustering over a range of resolutions from 0 to 1 in steps of 0.1 and used the clustree package (v0.2.2.9000) to produce clustering trees (Zappia and Oshlack 2018) showing the expression of known marker genes to select the appropriate resolution to use. We chose to use a resolution of 0.6 which produced 13 clusters.
Marker genes for each cluster were detected by testing for differential expression between cells in one cluster and all other cells using a Wilcoxon rank sum test (Bauer 1972). To reduce processing time only genes that were expressed in at least 10 percent of cells in one of these groups were tested. To identify conserved marker genes a similar process was performed on each dataset separately and the results combined using the maximum p-value method. We also tested for within cluster differential expression to identify differences between cells of the same type in different datasets.
Based on identified marker genes we determined clusters 2 and 9 represented the nephron lineage. The 1125 cells in these clusters were re-clustered at a resolution of 0.5 resulting in 5 clusters.
We also performed pseudotime trajectory analysis on the nephron cells using Monocle (v2.8.0) (Trapnell et al. 2014; Qiu et al. 2017). The intersection of the top 100 genes with the greatest absolute foldchange for each nephron cluster were selected for this analysis, giving a set of 455 genes used to order the cells.
The combined organoid and human fetal kidney analysis used the procedure described for the organoid only analysis, but with slightly different parameters. We identified 1368 variable genes present in all three datasets and selected the first 20 canonical correlation dimensions. For clustering we chose a resolution of 0.5 which produced 16 clusters. Clusters 6, 7, 10 and 15 were determined to be the nephron lineage and these 1964 cells were re-clustered at a resolution of 0.6 producing 8 clusters.
We also performed differential expression testing between the two datasets as a whole which was used to identify a signature of 374 genes that represent the main differences between them. To identify cell type specific differences between organoid and human fetal kidney we performed differential expression testing between cells within a cluster and removed genes from the overall differential expression signature. Cluster 7 in the combined nephron analysis was identified as a human fetal kidney specific podocyte cluster. To investigate the differences between these cells and other podocytes we compared gene expression in this cluster to the general podocyte cluster (CN0).
Figures shown here were produced using functions in the Seurat, Monocle and clustree packages. Additional plots and customisations were created using the ggplot2 (v3.0.0) (Wickham 2010) and cowplot (v0.9.3) (Wilke 2018) packages. The analysis project was managed using the workflowr (v1.1.1) (Blischak, Carbonetto, and Stephens 2018) package which was also used to produce the publicly available website displaying the analysis code, results and output.
Both organoid datasets are available from GEO accession number GSE114802 and the Lindstrom fetal kidney dataset is available from GEO accession GSE102596. A website showing reports produced during analysis including the exact software versions and parameters used can be accessed at http://oshlacklab.com/combes-organoid-paper and the analysis code is available at https://github.com/Oshlack/combes-organoid-paper.
Bauer, David F. 1972. “Constructing Confidence Sets Using Rank Statistics.” J. Am. Stat. Assoc. 67 (339). Taylor & Francis: 687–90. doi:10.1080/01621459.1972.10481279.
Berndt, Donald J, and James Clifford. 1994. “Using Dynamic Time Warping to Find Patterns in Time Series.” In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 359–70. AAAIWS’94. Seattle, WA: AAAI Press. http://dl.acm.org/citation.cfm?id=3000850.3000887.
Blischak, John, Peter Carbonetto, and Matthew Stephens. 2018. Workflowr: A Framework for Reproducible and Collaborative Data Science. https://CRAN.R-project.org/package=workflowr.
Blondel, Vincent D, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast unfolding of communities in large networks.” J. Stat. Mech. 2008 (10). IOP Publishing: P10008. doi:10.1088/1742-5468/2008/10/P10008.
Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating single-cell transcriptomic data across different conditions, technologies, and species.” Nat. Biotechnol., April. doi:10.1038/nbt.4096.
Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.” Bioinformatics 21 (16): 3439–40. doi:10.1093/bioinformatics/bti525.
Hardoon, David R, Sandor Szedmak, and John Shawe-Taylor. 2004. “Canonical correlation analysis: an overview with application to learning methods.” Neural Comput. 16 (12): 2639–64. doi:10.1162/0899766042321814.
Hotelling, Harold. 1936. “RELATIONS BETWEEN TWO SETS OF VARIATES.” Biometrika 28 (3-4). Oxford University Press: 321–77. doi:10.1093/biomet/28.3-4.321.
Lun, Aaron T L, Davis J McCarthy, and John C Marioni. 2016. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res. 5 (August): 2122. doi:10.12688/f1000research.9501.2.
Lun, Aaron, and Davide Risso. 2017. SingleCellExperiment: S4 Classes for Single Cell Data.
McCarthy, Davis J, Kieran R Campbell, Aaron T L Lun, and Quin F Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8): 1179–86. doi:10.1093/bioinformatics/btw777.
Qiu, Xiaojie, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah A Pliner, and Cole Trapnell. 2017. “Reversed graph embedding resolves complex single-cell trajectories.” Nat. Methods, August. doi:10.1038/nmeth.4402.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Satija, Rahul, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. 2015. “Spatial reconstruction of single-cell gene expression data.” Nat. Biotechnol. 33 (5). Nature Publishing Group: 495–502. doi:10.1038/nbt.3192.
Scialdone, Antonio, Kedar N Natarajan, Luis R Saraiva, Valentina Proserpio, Sarah A Teichmann, Oliver Stegle, John C Marioni, and Florian Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (September): 54–61. doi:10.1016/j.ymeth.2015.06.021.
Smedley, Damian, Syed Haider, Steffen Durinck, Luca Pandini, Paolo Provero, James Allen, Olivier Arnaiz, et al. 2015. “The BioMart community portal: an innovative alternative to large, centralized data repositories.” Nucleic Acids Res. 43 (W1): W589–98. doi:10.1093/nar/gkv350.
Trapnell, Cole, Davide Cacchiarelli, Jonna Grimsby, Prapti Pokharel, Shuqiang Li, Michael Morse, Niall J Lennon, Kenneth J Livak, Tarjei S Mikkelsen, and John L Rinn. 2014. “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.” Nat. Biotechnol. 32 (4): 381–86. doi:10.1038/nbt.2859.
Wickham, Hadley. 2010. ggplot2: Elegant Graphics for Data Analysis. Springer New York.
Wilke, Claus O. 2018. Cowplot: Streamlined Plot Theme and Plot Annotations for ’Ggplot2’. https://CRAN.R-project.org/package=cowplot.
Zappia, Luke, and Alicia Oshlack. 2018. “Clustering trees: a visualization for evaluating clusterings at multiple resolutions.” Gigascience 7 (7). doi:10.1093/gigascience/giy083.
devtools::session_info()
setting value
version R version 3.5.0 (2018-04-23)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz Australia/Melbourne
date 2018-12-04
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
base * 3.5.0 2018-06-18 local
bindr 0.1.1 2018-03-13 cran (@0.1.1)
bindrcpp 0.2.2 2018-03-29 cran (@0.2.2)
broom 0.5.0 2018-07-17 cran (@0.5.0)
cellranger 1.1.0 2016-07-27 CRAN (R 3.5.0)
cli 1.0.0 2017-11-05 CRAN (R 3.5.0)
colorspace 1.3-2 2016-12-14 cran (@1.3-2)
compiler 3.5.0 2018-06-18 local
crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
datasets * 3.5.0 2018-06-18 local
devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
digest 0.6.15 2018-01-28 CRAN (R 3.5.0)
dplyr * 0.7.6 2018-06-29 cran (@0.7.6)
evaluate 0.10.1 2017-06-24 CRAN (R 3.5.0)
forcats * 0.3.0 2018-02-19 CRAN (R 3.5.0)
ggplot2 * 3.0.0 2018-07-03 cran (@3.0.0)
git2r 0.21.0 2018-01-04 CRAN (R 3.5.0)
glue * 1.3.0 2018-07-17 cran (@1.3.0)
graphics * 3.5.0 2018-06-18 local
grDevices * 3.5.0 2018-06-18 local
grid 3.5.0 2018-06-18 local
gtable 0.2.0 2016-02-26 cran (@0.2.0)
haven 1.1.2 2018-06-27 CRAN (R 3.5.0)
here * 0.1 2017-05-28 CRAN (R 3.5.0)
hms 0.4.2 2018-03-10 CRAN (R 3.5.0)
htmltools 0.3.6 2017-04-28 CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
jsonlite * 1.5 2017-06-01 CRAN (R 3.5.0)
knitr * 1.20 2018-02-20 CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.0)
lazyeval 0.2.1 2017-10-29 cran (@0.2.1)
lubridate 1.7.4 2018-04-11 cran (@1.7.4)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods * 3.5.0 2018-06-18 local
modelr 0.1.2 2018-05-11 CRAN (R 3.5.0)
munsell 0.5.0 2018-06-12 cran (@0.5.0)
nlme 3.1-137 2018-04-07 CRAN (R 3.5.0)
pillar 1.3.0 2018-07-14 cran (@1.3.0)
pkgconfig 2.0.1 2017-03-21 cran (@2.0.1)
plyr 1.8.4 2016-06-08 cran (@1.8.4)
purrr * 0.2.5 2018-05-29 cran (@0.2.5)
R.methodsS3 1.7.1 2016-02-16 CRAN (R 3.5.0)
R.oo 1.22.0 2018-04-22 CRAN (R 3.5.0)
R.utils 2.6.0 2017-11-05 CRAN (R 3.5.0)
R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
Rcpp 0.12.18 2018-07-23 cran (@0.12.18)
readr * 1.1.1 2017-05-16 CRAN (R 3.5.0)
readxl 1.1.0 2018-04-20 CRAN (R 3.5.0)
rlang 0.2.1 2018-05-30 CRAN (R 3.5.0)
rmarkdown 1.10.2 2018-07-30 Github (rstudio/rmarkdown@18207b9)
rprojroot 1.3-2 2018-01-03 CRAN (R 3.5.0)
rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0)
rvest 0.3.2 2016-06-17 CRAN (R 3.5.0)
scales 0.5.0 2017-08-24 cran (@0.5.0)
stats * 3.5.0 2018-06-18 local
stringi 1.2.4 2018-07-20 cran (@1.2.4)
stringr * 1.3.1 2018-05-10 CRAN (R 3.5.0)
tibble * 1.4.2 2018-01-22 cran (@1.4.2)
tidyr * 0.8.1 2018-05-18 cran (@0.8.1)
tidyselect 0.2.4 2018-02-26 cran (@0.2.4)
tidyverse * 1.2.1 2017-11-14 CRAN (R 3.5.0)
tools 3.5.0 2018-06-18 local
utils * 3.5.0 2018-06-18 local
whisker 0.3-2 2013-04-28 CRAN (R 3.5.0)
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
workflowr 1.1.1 2018-07-06 CRAN (R 3.5.0)
xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
yaml 2.2.0 2018-07-25 cran (@2.2.0)
This reproducible R Markdown analysis was created with workflowr 1.1.1