Single Cell RNA-seq Analysis of Human Colon

Abstract

Aim of this analysis is performing a single-cell RNA analysis of a human colon sample studied in “Structural Remodeling of the Human Colonic Mesenchyme in Inflammatory Bowel Disease” (Kinchen J. et al. 2018) to define how the colonic mesenchyme remodels to fuel inflammation and barrier dysfunction in IBD (Inflammatory Bowel Disease).

Data and methods

The dataset is collected in PanglaoDB, a database of single cell RNA sequencing experiments from mouse and human integrated in a unified framework. The data are stored under the accession number (SRA703206).

About the experimental details, the sample library was sequencing using as instrument Illumina HiSeq 4000 and the 10X chromium protocol.

The general workflow of the computational analysis here described is based on Seurat and follows the Seurat vignette.

These are the main libraries used:

# Importing libraries
library(Seurat)
library(ggplot2)
library(dplyr)

Loading data

Firstly, we download the data from PanglaoDB as RData containing the sparse matrix from PanglaoDB. IN the matrix, rows are genes and columns are cells, the counts are not normalized. After loading the data, we perform a “string manipulation” on the row.names in order to keep only the gene symbol and remove the ENSEMBL ID (ENSG_).

wd <- "/home/mariachiara/Desktop/University/transcript/project/sc/"
load(paste0(wd,"SRA703206_SRS3296611.sparse.RData"))
sc_data <- sm
row.names(sc_data) <- sapply(strsplit(row.names(sc_data), "\\_"), "[[", 1)
remove(sm)

Now, we trasform the count matrix into the SeuratObject, that is useful being a container for both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. In creating the SeuratObject we use as thresholds:

  • min.cells: the minimum number of cells in which a gene can be detected
  • min.features is the minimum number of genes that have to be expressed in a cell.

“Features” is the way to refer to genes. Genes/Cells not satisfying these constrains are discarded a priori.

colon <- CreateSeuratObject(counts = sc_data, project = "sc_colon", min.cells = 3, min.features = 200)
colon
## An object of class Seurat 
## 25052 features across 5997 samples within 1 assay 
## Active assay: RNA (25052 features, 0 variable features)

Our table we will use in the downstream analysis contains:

  • 25,052 genes
  • 5,997 cells

QC

As first step, we perform a quality control on cells. QC metrics commonly used include:

  • Number of unique genes (features) detected in each cell: cells with few genes may be low-quality cells or empty droplets, while the ones with an aberrantly high gene count may be doublets.
  • Number of reads (correlates strongly with unique genes)
  • Percentage of mitochondrial reads: cells with a too high percentage of mt reads often are low-quality or dying cells.

Mitochondrial QC

We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of genes (selected according to name). In this case, we use the set of all genes with name starting with MT- to identify mitochondrial genes.

colon[["percent.mt"]] <- PercentageFeatureSet(colon, pattern = "^MT-")
colon@meta.data

By looking at these violin plots, we can have a visualization of the QC metrics described:

VlnPlot(colon, 
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
        ncol = 3, 
        cols = "#3399FF",
        pt.size=0)

We can also observe the scatter of the features, typically used to visualize feature-feature relationships. The intercept in the plots shows the thresholds to carry out the removal of “low-quality cells”.

plot1 <- FeatureScatter(colon, feature1 = "nCount_RNA", feature2 = "percent.mt", cols = "#3399FF") +
  geom_hline(yintercept=25, colour="black")
plot2 <- FeatureScatter(colon, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",cols = "#3399FF") +
  geom_hline(yintercept=4000, colour="black") +
  geom_hline(yintercept=200, colour="black")
plot1 + plot2

From these scatter plots, we can notice a negative correlation between the number of reads and the percentage of mitochondrial genes, while a strong positive one (as said before in explaining the QC metrics) between the number of reads and the number of unique genes.

Overall, by looking at these plots, we can set as threshold for the parameters:

  • A number of unique genes expressed lower than 200 (low quality cells/empty droplets) or higher than 4000 (doublets);
  • A percentage of mitochondrial genes higher than 25.

and remove the “low quality cells”.

colon_hq <- subset(colon, subset = nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 25)
colon_hq
## An object of class Seurat 
## 25052 features across 5211 samples within 1 assay 
## Active assay: RNA (25052 features, 0 variable features)

After the removal of the “low quality cells”, we obtain 5,211 cells.

Normalizing the data

So, we proceed with the normalization of the data using NormalizeData() function. By default, Seurat normalizes the gene counts for each cell by the total counts for each cell, multiplies this by a scale factor (10,000 by default), and log-transforms the counts.

colon_hq <- NormalizeData(colon_hq, normalization.method = "LogNormalize", scale.factor = 10000)

Identification of highly variable features

Then, we have to select in our gene set the genes that show the highest cell-to-cell variation (measured by the relationship between variance and mean), the “most variable genes”. This can be done using the function FindVariableFeatures(). By default, the top 2000 variable genes are kept for all downstream analyses; we can take a different number of variable genes by changing the value of nfeatures parameter.

colon_hq <- FindVariableFeatures(colon_hq, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(colon_hq), 10)
paste("The top 10 most variable genes are:", paste(top10, collapse=", "))
## [1] "The top 10 most variable genes are: IGLC2, IGLC3, PLVAP, IGHA2, ACKR1, CCL19, CA4, RGS5, CLDN5, RGS1"

Here we can visualize in red the 2,000 most variable genes, among which the top 10 ones are labeled with the gene symbol.

plot3 <- VariableFeaturePlot(colon_hq)
plot3 <- LabelPoints(plot = plot3, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
plot3

Data scaling

By this step, all log-normalized counts are transformed so that they have mean 0 and variance 1 across all cells, regardless of the count values (high or low). This leads to a sort of “ternarization”, according to which for each cell a gene will be tend “up” (>0) “average” (=0) or “down” (<0) with very close values.

colon_hq <- ScaleData(colon_hq, features=rownames(colon_hq))

In performing the scaling, we can also regress out unwanted sources of variation that can be present. An example of source of variation between cells is the cell cycle effect. To evaluate its impact on our dataset, we have to assign each cell a score, that is based on its expression of G2/M and S phase marker genes; then, we plot the cells in a lower dimensionality space (running a PCA) to assess if cells cluster by the predicted cell cycle.

# Cell phase prediction
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

colon_hq <- CellCycleScoring(colon_hq, s.features=s.genes, g2m.features=g2m.genes, set.ident=TRUE)
head(colon_hq[[]])

Looking at the plot, it is no revelead a clear separation between cells, so we have not to regress out cell cycle scores during data scaling.

# Linear dimensionality reduction with PCA
colon_hq <- RunPCA(colon_hq, features=VariableFeatures(object=colon_hq))

# Visualization in a lower dimensionality space
DimPlot(colon_hq, reduction="pca")

Choosing dimensions

The next step is choosing how many principal components (dimensions) keep for the subsequent clustering, that can be done using the “elbow” plot. It is a ranking of principle components based on the percentage of variance explained by each one.

ElbowPlot(colon_hq, ndims = 50) + geom_vline(colour = "red", xintercept = 22)

The optimal number of PC seems to be 22 (red line), as after that number there is no significant variation in the explained variability.

Clustering the cells

Among the different ways to perform the clustering, Seurat uses a KNN graph-based method using the Euclidean distance as distance in PCA space. Ratio: two cells are “close” if they are close in the PCA space and if they share the most fo the closest cells. FindNeighbors() builds the KNN graph, the dims parameter can be set with the chosen number of PC. The FindClusters() function implements a modularity optimization technique (by default the Louvain algorithm) to dividing the cells into clusters. The resolution parameter can be set to change the granularity for the clusters. Then, to visualizing clusters of cells based on scRNA data the PCA plot is not the best method; instead, the t_SNE or especially UMAP, non linear dimensionality reduction methods, are the preferrable ones. In order to visualize the clusters obtained, we use the function RunUMAP(), from the package umap, in which we specify “umap” in reduction parameter.

Firstly, we perform a clustering using as value for dims and resolution parameters the same used in the Seurat vignette.

colon_hq <- FindNeighbors(colon_hq, dims = 1:10)
colon_hq <- FindClusters(colon_hq, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5211
## Number of edges: 167149
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8933
## Number of communities: 10
## Elapsed time: 0 seconds
colon_hq <- RunUMAP(colon_hq, dims = 1:10)
DimPlot(colon_hq, reduction = "umap", label = TRUE)

table(colon_hq@meta.data$seurat_clusters)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 1227 1208  713  655  345  293  293  230  218   29

According to the elbow previously observed, we perform also a clustering using a number of PC equal to 22 to get better results.

colon_hq <- FindNeighbors(colon_hq, dims = 1:22)
colon_hq <- FindClusters(colon_hq, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5211
## Number of edges: 190039
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8860
## Number of communities: 12
## Elapsed time: 0 seconds
colon_hq <- RunUMAP(colon_hq, dims = 1:22)
DimPlot(colon_hq, reduction = "umap", label = TRUE)

Finding markers

The final step aims to find which are the “marker genes” (expressed exclusively, or at least over-expressed) in each cluster with respect to the others, in order to identify the clusters obtained and assign the cell type.

The function FindAllMarkers() allows to find the differentially expressed genes for all the clusters with respect to the others. The parameters that can be tuned are:

  • min.pct: minimum fraction of cells in which genes must be express; in this case, we take gene expressed in at least 25% of the cells
  • logfc.threshold: log2 fold change of at least 0.25
colon.markers_22 <- FindAllMarkers(colon_hq, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

Here, we store in a table the top 10 marker genes found for each cluster and then visualize them with the heatmap in which is reported their expression.

top10_22 <- colon.markers_22 %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
top10_22
DoHeatmap(colon_hq, features = top10_22$gene)

Moreover, for each cluster we can assess the specificity of a marker gene to the cluster with two kind of plots:

  • the violin plots show the distribution of gene expression for each cell in each cluster
  • the UMAP plot integrated with a heatmap show how much a gene is expressed in each cluster.
VlnPlot(colon_hq, features = c("CCL13", "F3", "AREG", "MZB1", "MMP1",
    "PLVAP", "GPM6B", "OGN","RGS5","SOSTDC1","KLRB1"), pt.size = 0, ncol = 2)

FeaturePlot(colon_hq, features = c("CCL13", "F3", "AREG", "MZB1", "MMP1",
    "PLVAP", "GPM6B", "OGN","RGS5","SOSTDC1","KLRB1"), pt.size = 0, ncol = 2)

Here, we can investigate the expression of the marker genes in the cluster 0.

VlnPlot(colon_hq, features = c("CCL2","VMP1","NFKBIZ"), pt.size = 0.1, ncol = 2)

FeaturePlot(colon_hq, features =  c("CCL2","VMP1","NFKBIZ"), pt.size = 0, ncol = 2)

Assigning cell type identity

The cell type is then assigned to each of the clusters searching in PanglaoDB and in literature to what cell type the expression of the marker genes is associated. The final results of our analysis is represented in the UMAP plot, in which we specify the cell type found as cluster label.

new.cluster.ids <- c("Unknown", rep("Fibroblasts", 2), "Epithelial cells", "Plasma cells",
    "Fibroblasts", "Endothelial cells","Glia cells","Fibroblasts", rep("Smooth muscle cells",2), "T cells")

names(new.cluster.ids) <- levels(colon_hq)
colon_hq <- RenameIdents(colon_hq, new.cluster.ids)
DimPlot(colon_hq, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

table(colon_hq@meta.data$seurat_clusters)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11 
## 1389 1020  554  473  345  329  292  286  250  161   84   28

Session info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.0.7        ggplot2_3.3.5      SeuratObject_4.0.4 Seurat_4.1.0      
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15            colorspace_2.0-2      deldir_1.0-6         
##   [4] ellipsis_0.3.2        ggridges_0.5.3        rstudioapi_0.13      
##   [7] spatstat.data_2.1-2   farver_2.1.0          leiden_0.3.9         
##  [10] listenv_0.8.0         ggrepel_0.9.1         RSpectra_0.16-0      
##  [13] fansi_1.0.2           codetools_0.2-18      splines_4.1.2        
##  [16] knitr_1.37            polyclip_1.10-0       jsonlite_1.7.3       
##  [19] ica_1.0-2             cluster_2.1.0         png_0.1-7            
##  [22] uwot_0.1.11           shiny_1.7.1           sctransform_0.3.3    
##  [25] spatstat.sparse_2.1-0 compiler_4.1.2        httr_1.4.2           
##  [28] assertthat_0.2.1      Matrix_1.4-0          fastmap_1.1.0        
##  [31] lazyeval_0.2.2        limma_3.50.1          cli_3.1.1            
##  [34] later_1.3.0           htmltools_0.5.2       tools_4.1.2          
##  [37] igraph_1.2.11         gtable_0.3.0          glue_1.6.1           
##  [40] RANN_2.6.1            reshape2_1.4.4        Rcpp_1.0.8           
##  [43] scattermore_0.7       jquerylib_0.1.4       vctrs_0.3.8          
##  [46] nlme_3.1-152          lmtest_0.9-39         xfun_0.30            
##  [49] stringr_1.4.0         globals_0.14.0        mime_0.12            
##  [52] miniUI_0.1.1.1        lifecycle_1.0.1       irlba_2.3.5          
##  [55] goftest_1.2-3         future_1.23.0         MASS_7.3-53.1        
##  [58] zoo_1.8-9             scales_1.1.1          spatstat.core_2.3-2  
##  [61] promises_1.2.0.1      spatstat.utils_2.3-0  parallel_4.1.2       
##  [64] RColorBrewer_1.1-2    yaml_2.2.2            reticulate_1.24      
##  [67] pbapply_1.5-0         gridExtra_2.3         sass_0.4.0           
##  [70] rpart_4.1-15          stringi_1.7.6         highr_0.9            
##  [73] rlang_1.0.1           pkgconfig_2.0.3       matrixStats_0.61.0   
##  [76] evaluate_0.14         lattice_0.20-41       ROCR_1.0-11          
##  [79] purrr_0.3.4           tensor_1.5            labeling_0.4.2       
##  [82] patchwork_1.1.1       htmlwidgets_1.5.4     cowplot_1.1.1        
##  [85] tidyselect_1.1.1      parallelly_1.30.0     RcppAnnoy_0.0.19     
##  [88] plyr_1.8.6            magrittr_2.0.2        bookdown_0.25        
##  [91] R6_2.5.1              generics_0.1.2        DBI_1.1.2            
##  [94] withr_2.4.3           mgcv_1.8-33           pillar_1.7.0         
##  [97] fitdistrplus_1.1-6    survival_3.2-7        abind_1.4-5          
## [100] tibble_3.1.6          future.apply_1.8.1    crayon_1.4.2         
## [103] KernSmooth_2.23-18    utf8_1.2.2            spatstat.geom_2.3-1  
## [106] plotly_4.10.0         rmarkdown_2.13        grid_4.1.2           
## [109] data.table_1.14.2     rmdformats_1.0.3      digest_0.6.29        
## [112] xtable_1.8-4          tidyr_1.2.0           httpuv_1.6.5         
## [115] munsell_0.5.0         viridisLite_0.4.0     bslib_0.3.1