Last updated: 2018-12-04

workflowr checks: (Click a bullet for more information)
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

  • Seed: set.seed(20180730)

    The command set.seed(20180730) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: 5dcfa12

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    Ignored files:
        Ignored:    .DS_Store
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    analysis/cache.bak.20181031/
        Ignored:    analysis/cache.bak/
        Ignored:    analysis/cache.lind2.20181114/
        Ignored:    analysis/cache/
        Ignored:    data/Lindstrom2/
        Ignored:    data/processed.bak.20181031/
        Ignored:    data/processed.bak/
        Ignored:    data/processed.lind2.20181114/
        Ignored:    packrat/lib-R/
        Ignored:    packrat/lib-ext/
        Ignored:    packrat/lib/
        Ignored:    packrat/src/
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Expand here to see past versions:
    File Version Author Date Message
    Rmd 5dcfa12 Luke Zappia 2018-12-04 Fix typo in methods
    Rmd 858cf44 Luke Zappia 2018-12-04 Add hFK podocyte DE
    html 858cf44 Luke Zappia 2018-12-04 Add hFK podocyte DE
    html 292a1c6 Luke Zappia 2018-11-23 Minor fixes to output
    Rmd d79b8e7 Luke Zappia 2018-11-20 Create DE signature and filter results
    html d79b8e7 Luke Zappia 2018-11-20 Create DE signature and filter results
    html a61f9c9 Luke Zappia 2018-09-13 Rebuild site
    Rmd a99f69a Luke Zappia 2018-09-13 Fix link to site
    html ad10b21 Luke Zappia 2018-09-13 Switch to GitHub
    Rmd 57b7fc2 Luke Zappia 2018-09-13 Add workflowr to methods
    Rmd 1d8b5a6 Luke Zappia 2018-09-11 Update methods
    Rmd 7755ac7 Luke Zappia 2018-08-15 Add methods document

# Presentation

# Paths

# Output

# Tidyverse
write_bib(c("base", "SingleCellExperiment", "cowplot", "workflowr"),
          file = here("data/packages.bib"))
docs <- list.dirs(here("output"), full.names = FALSE)[-1]

params <- map(docs, function(doc) {
    out <- tryCatch(
            table <- suppressWarnings(
                               simplifyVector = TRUE)
            p.list <- table$Value %>% setNames(table$Parameter)
        error = function(e) {
            message(glue("{doc} parameters file not found"))


names(params) <- str_remove(docs, "[0-9A-Z]+_")


The Cell Ranger pipeline (v1.3.1) was used to perform sample demultiplexing, barcode processing and single-cell gene counting. Briefly, samples were demultiplexed to produce a pair of FASTQ files for each sample. Reads containing sequence information were aligned to the GRCh38 reference genome provided with Cell Ranger (v1.2.0). Cell barcodes were filtered to remove empty droplets and PCR duplicates were removed by selecting unique combinations of cell barcodes, unique molecular identifiers and gene ids with the final results being a gene expression matrix that was used for further analysis. The three samples in the first batch of organoids were aggregated using Cell Ranger with no normalisation and treated as a single dataset.

Quality control

The R statistical programming language (v3.5.0) (R Core Team 2018) was used for further analysis. Count data for each experiment was read into R and used to construct a SingleCellExperiment object (v1.2.0) (A. Lun and Risso 2017).Gene annotation information was added from BioMart (Smedley et al. 2015) using the biomaRt package (v2.36.1) (Durinck et al. 2005) and cells were assigned cell cycle scores using the cyclone (Scialdone et al. 2015) function in the scran package (v1.8.2) (A. T. L. Lun, McCarthy, and Marioni 2016).

The scater package (v1.8.2) (McCarthy et al. 2017) was used to produce a series of diagnostic quality control plots. Cells with a high number of expressed genes (indicating potential doublets) were removed, as were cells with a high percentage of counts assigned to mitochondrial or ribosomal genes, or with low expression of the housekeeping genes GAPDH and ACTB.

Genes that had less than two total counts across a dataset, or were expressed in less than two individual cells, were removed. We also removed genes without an annotated HGNC symbol.

Following quality control the first organoid dataset consisted of 6649 cells and 18386 genes with a median of 2738 genes expressed per a cell, the fourth organoid had 1288 cells and 16885 genes with a median of 3248 genes expressed per cell and the Lindstrom dataset had 3178 cells and 16166 genes with a median of 1509.5 genes expressed per cell.

Clustering analysis


The two organoid datasets were integrated using the alignment method in the Seurat package (v2.3.1) (Satija et al. 2015; Butler et al. 2018). Briefly, highly variable genes were identified in each dataset and those that were present in both datasets (1156 genes) were selected. Canonical correlation analysis (Hotelling 1936; Hardoon, Szedmak, and Shawe-Taylor 2004) was then performed using the selected genes and 25 dimensions that represent the majority of variation were selected. The final step used dynamic time warping (Berndt and Clifford 1994) to align the datasets in the selected subspace.

To perform clustering Seurat constructs a shared nearest neighbour graph of cells in the aligned subspace and uses Louvain modularity optimisation (Blondel et al. 2008) to assign cells to clusters. The number of clusters produced using this method is controlled by a resolution parameter with higher values giving more clusters. We performed clustering over a range of resolutions from 0 to 1 in steps of 0.1 and used the clustree package (v0.2.2.9000) to produce clustering trees (Zappia and Oshlack 2018) showing the expression of known marker genes to select the appropriate resolution to use. We chose to use a resolution of 0.6 which produced 13 clusters.

Marker genes for each cluster were detected by testing for differential expression between cells in one cluster and all other cells using a Wilcoxon rank sum test (Bauer 1972). To reduce processing time only genes that were expressed in at least 10 percent of cells in one of these groups were tested. To identify conserved marker genes a similar process was performed on each dataset separately and the results combined using the maximum p-value method. We also tested for within cluster differential expression to identify differences between cells of the same type in different datasets.

Based on identified marker genes we determined clusters 2 and 9 represented the nephron lineage. The 1125 cells in these clusters were re-clustered at a resolution of 0.5 resulting in 5 clusters.

We also performed pseudotime trajectory analysis on the nephron cells using Monocle (v2.8.0) (Trapnell et al. 2014; Qiu et al. 2017). The intersection of the top 100 genes with the greatest absolute foldchange for each nephron cluster were selected for this analysis, giving a set of 455 genes used to order the cells.


The combined organoid and human fetal kidney analysis used the procedure described for the organoid only analysis, but with slightly different parameters. We identified 1368 variable genes present in all three datasets and selected the first 20 canonical correlation dimensions. For clustering we chose a resolution of 0.5 which produced 16 clusters. Clusters 6, 7, 10 and 15 were determined to be the nephron lineage and these 1964 cells were re-clustered at a resolution of 0.6 producing 8 clusters.

We also performed differential expression testing between the two datasets as a whole which was used to identify a signature of 374 genes that represent the main differences between them. To identify cell type specific differences between organoid and human fetal kidney we performed differential expression testing between cells within a cluster and removed genes from the overall differential expression signature. Cluster 7 in the combined nephron analysis was identified as a human fetal kidney specific podocyte cluster. To investigate the differences between these cells and other podocytes we compared gene expression in this cluster to the general podocyte cluster (CN0).

Visualisation and presentation

Figures shown here were produced using functions in the Seurat, Monocle and clustree packages. Additional plots and customisations were created using the ggplot2 (v3.0.0) (Wickham 2010) and cowplot (v0.9.3) (Wilke 2018) packages. The analysis project was managed using the workflowr (v1.1.1) (Blischak, Carbonetto, and Stephens 2018) package which was also used to produce the publicly available website displaying the analysis code, results and output.


Both organoid datasets are available from GEO accession number GSE114802 and the Lindstrom fetal kidney dataset is available from GEO accession GSE102596. A website showing reports produced during analysis including the exact software versions and parameters used can be accessed at and the analysis code is available at


Bauer, David F. 1972. “Constructing Confidence Sets Using Rank Statistics.” J. Am. Stat. Assoc. 67 (339). Taylor & Francis: 687–90. doi:10.1080/01621459.1972.10481279.

Berndt, Donald J, and James Clifford. 1994. “Using Dynamic Time Warping to Find Patterns in Time Series.” In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 359–70. AAAIWS’94. Seattle, WA: AAAI Press.

Blischak, John, Peter Carbonetto, and Matthew Stephens. 2018. Workflowr: A Framework for Reproducible and Collaborative Data Science.

Blondel, Vincent D, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast unfolding of communities in large networks.” J. Stat. Mech. 2008 (10). IOP Publishing: P10008. doi:10.1088/1742-5468/2008/10/P10008.

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating single-cell transcriptomic data across different conditions, technologies, and species.” Nat. Biotechnol., April. doi:10.1038/nbt.4096.

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.” Bioinformatics 21 (16): 3439–40. doi:10.1093/bioinformatics/bti525.

Hardoon, David R, Sandor Szedmak, and John Shawe-Taylor. 2004. “Canonical correlation analysis: an overview with application to learning methods.” Neural Comput. 16 (12): 2639–64. doi:10.1162/0899766042321814.

Hotelling, Harold. 1936. “RELATIONS BETWEEN TWO SETS OF VARIATES.” Biometrika 28 (3-4). Oxford University Press: 321–77. doi:10.1093/biomet/28.3-4.321.

Lun, Aaron T L, Davis J McCarthy, and John C Marioni. 2016. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res. 5 (August): 2122. doi:10.12688/f1000research.9501.2.

Lun, Aaron, and Davide Risso. 2017. SingleCellExperiment: S4 Classes for Single Cell Data.

McCarthy, Davis J, Kieran R Campbell, Aaron T L Lun, and Quin F Wills. 2017. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics 33 (8): 1179–86. doi:10.1093/bioinformatics/btw777.

Qiu, Xiaojie, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah A Pliner, and Cole Trapnell. 2017. “Reversed graph embedding resolves complex single-cell trajectories.” Nat. Methods, August. doi:10.1038/nmeth.4402.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Satija, Rahul, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. 2015. “Spatial reconstruction of single-cell gene expression data.” Nat. Biotechnol. 33 (5). Nature Publishing Group: 495–502. doi:10.1038/nbt.3192.

Scialdone, Antonio, Kedar N Natarajan, Luis R Saraiva, Valentina Proserpio, Sarah A Teichmann, Oliver Stegle, John C Marioni, and Florian Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (September): 54–61. doi:10.1016/j.ymeth.2015.06.021.

Smedley, Damian, Syed Haider, Steffen Durinck, Luca Pandini, Paolo Provero, James Allen, Olivier Arnaiz, et al. 2015. “The BioMart community portal: an innovative alternative to large, centralized data repositories.” Nucleic Acids Res. 43 (W1): W589–98. doi:10.1093/nar/gkv350.

Trapnell, Cole, Davide Cacchiarelli, Jonna Grimsby, Prapti Pokharel, Shuqiang Li, Michael Morse, Niall J Lennon, Kenneth J Livak, Tarjei S Mikkelsen, and John L Rinn. 2014. “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.” Nat. Biotechnol. 32 (4): 381–86. doi:10.1038/nbt.2859.

Wickham, Hadley. 2010. ggplot2: Elegant Graphics for Data Analysis. Springer New York.

Wilke, Claus O. 2018. Cowplot: Streamlined Plot Theme and Plot Annotations for ’Ggplot2’.

Zappia, Luke, and Alicia Oshlack. 2018. “Clustering trees: a visualization for evaluating clusterings at multiple resolutions.” Gigascience 7 (7). doi:10.1093/gigascience/giy083.

Session information

 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Australia/Melbourne         
 date     2018-12-04                  

 package     * version date       source                            
 assertthat    0.2.0   2017-04-11 CRAN (R 3.5.0)                    
 backports     1.1.2   2017-12-13 CRAN (R 3.5.0)                    
 base        * 3.5.0   2018-06-18 local                             
 bindr         0.1.1   2018-03-13 cran (@0.1.1)                     
 bindrcpp      0.2.2   2018-03-29 cran (@0.2.2)                     
 broom         0.5.0   2018-07-17 cran (@0.5.0)                     
 cellranger    1.1.0   2016-07-27 CRAN (R 3.5.0)                    
 cli           1.0.0   2017-11-05 CRAN (R 3.5.0)                    
 colorspace    1.3-2   2016-12-14 cran (@1.3-2)                     
 compiler      3.5.0   2018-06-18 local                             
 crayon        1.3.4   2017-09-16 CRAN (R 3.5.0)                    
 datasets    * 3.5.0   2018-06-18 local                             
 devtools      1.13.6  2018-06-27 CRAN (R 3.5.0)                    
 digest        0.6.15  2018-01-28 CRAN (R 3.5.0)                    
 dplyr       * 0.7.6   2018-06-29 cran (@0.7.6)                     
 evaluate      0.10.1  2017-06-24 CRAN (R 3.5.0)                    
 forcats     * 0.3.0   2018-02-19 CRAN (R 3.5.0)                    
 ggplot2     * 3.0.0   2018-07-03 cran (@3.0.0)                     
 git2r         0.21.0  2018-01-04 CRAN (R 3.5.0)                    
 glue        * 1.3.0   2018-07-17 cran (@1.3.0)                     
 graphics    * 3.5.0   2018-06-18 local                             
 grDevices   * 3.5.0   2018-06-18 local                             
 grid          3.5.0   2018-06-18 local                             
 gtable        0.2.0   2016-02-26 cran (@0.2.0)                     
 haven         1.1.2   2018-06-27 CRAN (R 3.5.0)                    
 here        * 0.1     2017-05-28 CRAN (R 3.5.0)                    
 hms           0.4.2   2018-03-10 CRAN (R 3.5.0)                    
 htmltools     0.3.6   2017-04-28 CRAN (R 3.5.0)                    
 httr          1.3.1   2017-08-20 CRAN (R 3.5.0)                    
 jsonlite    * 1.5     2017-06-01 CRAN (R 3.5.0)                    
 knitr       * 1.20    2018-02-20 CRAN (R 3.5.0)                    
 lattice       0.20-35 2017-03-25 CRAN (R 3.5.0)                    
 lazyeval      0.2.1   2017-10-29 cran (@0.2.1)                     
 lubridate     1.7.4   2018-04-11 cran (@1.7.4)                     
 magrittr      1.5     2014-11-22 CRAN (R 3.5.0)                    
 memoise       1.1.0   2017-04-21 CRAN (R 3.5.0)                    
 methods     * 3.5.0   2018-06-18 local                             
 modelr        0.1.2   2018-05-11 CRAN (R 3.5.0)                    
 munsell       0.5.0   2018-06-12 cran (@0.5.0)                     
 nlme          3.1-137 2018-04-07 CRAN (R 3.5.0)                    
 pillar        1.3.0   2018-07-14 cran (@1.3.0)                     
 pkgconfig     2.0.1   2017-03-21 cran (@2.0.1)                     
 plyr          1.8.4   2016-06-08 cran (@1.8.4)                     
 purrr       * 0.2.5   2018-05-29 cran (@0.2.5)                     
 R.methodsS3   1.7.1   2016-02-16 CRAN (R 3.5.0)                    
 R.oo          1.22.0  2018-04-22 CRAN (R 3.5.0)                    
 R.utils       2.6.0   2017-11-05 CRAN (R 3.5.0)                    
 R6            2.2.2   2017-06-17 CRAN (R 3.5.0)                    
 Rcpp          0.12.18 2018-07-23 cran (@0.12.18)                   
 readr       * 1.1.1   2017-05-16 CRAN (R 3.5.0)                    
 readxl        1.1.0   2018-04-20 CRAN (R 3.5.0)                    
 rlang         0.2.1   2018-05-30 CRAN (R 3.5.0)                    
 rmarkdown     1.10.2  2018-07-30 Github (rstudio/rmarkdown@18207b9)
 rprojroot     1.3-2   2018-01-03 CRAN (R 3.5.0)                    
 rstudioapi    0.7     2017-09-07 CRAN (R 3.5.0)                    
 rvest         0.3.2   2016-06-17 CRAN (R 3.5.0)                    
 scales        0.5.0   2017-08-24 cran (@0.5.0)                     
 stats       * 3.5.0   2018-06-18 local                             
 stringi       1.2.4   2018-07-20 cran (@1.2.4)                     
 stringr     * 1.3.1   2018-05-10 CRAN (R 3.5.0)                    
 tibble      * 1.4.2   2018-01-22 cran (@1.4.2)                     
 tidyr       * 0.8.1   2018-05-18 cran (@0.8.1)                     
 tidyselect    0.2.4   2018-02-26 cran (@0.2.4)                     
 tidyverse   * 1.2.1   2017-11-14 CRAN (R 3.5.0)                    
 tools         3.5.0   2018-06-18 local                             
 utils       * 3.5.0   2018-06-18 local                             
 whisker       0.3-2   2013-04-28 CRAN (R 3.5.0)                    
 withr         2.1.2   2018-03-15 CRAN (R 3.5.0)                    
 workflowr     1.1.1   2018-07-06 CRAN (R 3.5.0)                    
 xml2          1.2.0   2018-01-24 CRAN (R 3.5.0)                    
 yaml          2.2.0   2018-07-25 cran (@2.2.0)                     

This reproducible R Markdown analysis was created with workflowr 1.1.1