Single Cell RNA-seq Analysis of Human Colon
Abstract
Aim of this analysis is performing a single-cell RNA analysis of a human colon sample studied in “Structural Remodeling of the Human Colonic Mesenchyme in Inflammatory Bowel Disease” (Kinchen J. et al. 2018) to define how the colonic mesenchyme remodels to fuel inflammation and barrier dysfunction in IBD (Inflammatory Bowel Disease).
Data and methods
The dataset is collected in PanglaoDB, a database of single cell RNA sequencing experiments from mouse and human integrated in a unified framework. The data are stored under the accession number (SRA703206).
About the experimental details, the sample library was sequencing using as instrument Illumina HiSeq 4000 and the 10X chromium protocol.
The general workflow of the computational analysis here described is based on Seurat and follows the Seurat vignette.
These are the main libraries used:
# Importing libraries
library(Seurat)
library(ggplot2)
library(dplyr)
Loading data
Firstly, we download the data from PanglaoDB as RData containing the sparse matrix from PanglaoDB. IN the matrix, rows are genes and columns are cells, the counts are not normalized. After loading the data, we perform a “string manipulation” on the row.names in order to keep only the gene symbol and remove the ENSEMBL ID (ENSG_).
<- "/home/mariachiara/Desktop/University/transcript/project/sc/"
wd load(paste0(wd,"SRA703206_SRS3296611.sparse.RData"))
<- sm
sc_data row.names(sc_data) <- sapply(strsplit(row.names(sc_data), "\\_"), "[[", 1)
remove(sm)
Now, we trasform the count matrix into the SeuratObject, that is useful being a container for both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. In creating the SeuratObject we use as thresholds:
- min.cells: the minimum number of cells in which a gene can be detected
- min.features is the minimum number of genes that have to be expressed in a cell.
“Features” is the way to refer to genes. Genes/Cells not satisfying these constrains are discarded a priori.
<- CreateSeuratObject(counts = sc_data, project = "sc_colon", min.cells = 3, min.features = 200)
colon colon
## An object of class Seurat
## 25052 features across 5997 samples within 1 assay
## Active assay: RNA (25052 features, 0 variable features)
Our table we will use in the downstream analysis contains:
- 25,052 genes
- 5,997 cells
QC
As first step, we perform a quality control on cells. QC metrics commonly used include:
- Number of unique genes (features) detected in each cell: cells with few genes may be low-quality cells or empty droplets, while the ones with an aberrantly high gene count may be doublets.
- Number of reads (correlates strongly with unique genes)
- Percentage of mitochondrial reads: cells with a too high percentage of mt reads often are low-quality or dying cells.
Mitochondrial QC
We calculate mitochondrial QC metrics with the PercentageFeatureSet()
function, which calculates the percentage of counts originating from a set of genes (selected according to name). In this case, we use the set of all genes with name starting with MT- to identify mitochondrial genes.
"percent.mt"]] <- PercentageFeatureSet(colon, pattern = "^MT-")
colon[[@meta.data colon
By looking at these violin plots, we can have a visualization of the QC metrics described:
VlnPlot(colon,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3,
cols = "#3399FF",
pt.size=0)
We can also observe the scatter of the features, typically used to visualize feature-feature relationships. The intercept in the plots shows the thresholds to carry out the removal of “low-quality cells”.
<- FeatureScatter(colon, feature1 = "nCount_RNA", feature2 = "percent.mt", cols = "#3399FF") +
plot1 geom_hline(yintercept=25, colour="black")
<- FeatureScatter(colon, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",cols = "#3399FF") +
plot2 geom_hline(yintercept=4000, colour="black") +
geom_hline(yintercept=200, colour="black")
+ plot2 plot1
From these scatter plots, we can notice a negative correlation between the number of reads and the percentage of mitochondrial genes, while a strong positive one (as said before in explaining the QC metrics) between the number of reads and the number of unique genes.
Overall, by looking at these plots, we can set as threshold for the parameters:
- A number of unique genes expressed lower than 200 (low quality cells/empty droplets) or higher than 4000 (doublets);
- A percentage of mitochondrial genes higher than 25.
and remove the “low quality cells”.
<- subset(colon, subset = nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 25)
colon_hq colon_hq
## An object of class Seurat
## 25052 features across 5211 samples within 1 assay
## Active assay: RNA (25052 features, 0 variable features)
After the removal of the “low quality cells”, we obtain 5,211 cells.
Normalizing the data
So, we proceed with the normalization of the data using NormalizeData()
function. By default, Seurat normalizes the gene counts for each cell by the total counts for each cell, multiplies this by a scale factor (10,000 by default), and log-transforms the counts.
<- NormalizeData(colon_hq, normalization.method = "LogNormalize", scale.factor = 10000) colon_hq
Identification of highly variable features
Then, we have to select in our gene set the genes that show the highest cell-to-cell variation (measured by the relationship between variance and mean), the “most variable genes”. This can be done using the function FindVariableFeatures()
. By default, the top 2000 variable genes are kept for all downstream analyses; we can take a different number of variable genes by changing the value of nfeatures parameter.
<- FindVariableFeatures(colon_hq, selection.method = "vst", nfeatures = 2000)
colon_hq <- head(VariableFeatures(colon_hq), 10)
top10 paste("The top 10 most variable genes are:", paste(top10, collapse=", "))
## [1] "The top 10 most variable genes are: IGLC2, IGLC3, PLVAP, IGHA2, ACKR1, CCL19, CA4, RGS5, CLDN5, RGS1"
Here we can visualize in red the 2,000 most variable genes, among which the top 10 ones are labeled with the gene symbol.
<- VariableFeaturePlot(colon_hq)
plot3 <- LabelPoints(plot = plot3, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
plot3 plot3
Data scaling
By this step, all log-normalized counts are transformed so that they have mean 0 and variance 1 across all cells, regardless of the count values (high or low). This leads to a sort of “ternarization”, according to which for each cell a gene will be tend “up” (>0) “average” (=0) or “down” (<0) with very close values.
<- ScaleData(colon_hq, features=rownames(colon_hq)) colon_hq
In performing the scaling, we can also regress out unwanted sources of variation that can be present. An example of source of variation between cells is the cell cycle effect. To evaluate its impact on our dataset, we have to assign each cell a score, that is based on its expression of G2/M and S phase marker genes; then, we plot the cells in a lower dimensionality space (running a PCA) to assess if cells cluster by the predicted cell cycle.
# Cell phase prediction
<- cc.genes$s.genes
s.genes <- cc.genes$g2m.genes
g2m.genes
<- CellCycleScoring(colon_hq, s.features=s.genes, g2m.features=g2m.genes, set.ident=TRUE)
colon_hq head(colon_hq[[]])
Looking at the plot, it is no revelead a clear separation between cells, so we have not to regress out cell cycle scores during data scaling.
# Linear dimensionality reduction with PCA
<- RunPCA(colon_hq, features=VariableFeatures(object=colon_hq))
colon_hq
# Visualization in a lower dimensionality space
DimPlot(colon_hq, reduction="pca")
Choosing dimensions
The next step is choosing how many principal components (dimensions) keep for the subsequent clustering, that can be done using the “elbow” plot. It is a ranking of principle components based on the percentage of variance explained by each one.
ElbowPlot(colon_hq, ndims = 50) + geom_vline(colour = "red", xintercept = 22)
The optimal number of PC seems to be 22 (red line), as after that number there is no significant variation in the explained variability.
Clustering the cells
Among the different ways to perform the clustering, Seurat uses a KNN graph-based method using the Euclidean distance as distance in PCA space. Ratio: two cells are “close” if they are close in the PCA space and if they share the most fo the closest cells. FindNeighbors()
builds the KNN graph, the dims
parameter can be set with the chosen number of PC. The FindClusters()
function implements a modularity optimization technique (by default the Louvain algorithm) to dividing the cells into clusters. The resolution
parameter can be set to change the granularity for the clusters. Then, to visualizing clusters of cells based on scRNA data the PCA plot is not the best method; instead, the t_SNE or especially UMAP, non linear dimensionality reduction methods, are the preferrable ones. In order to visualize the clusters obtained, we use the function RunUMAP()
, from the package umap, in which we specify “umap” in reduction
parameter.
Firstly, we perform a clustering using as value for dims
and resolution
parameters the same used in the Seurat vignette.
<- FindNeighbors(colon_hq, dims = 1:10)
colon_hq <- FindClusters(colon_hq, resolution = 0.5) colon_hq
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 5211
## Number of edges: 167149
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8933
## Number of communities: 10
## Elapsed time: 0 seconds
<- RunUMAP(colon_hq, dims = 1:10)
colon_hq DimPlot(colon_hq, reduction = "umap", label = TRUE)
table(colon_hq@meta.data$seurat_clusters)
##
## 0 1 2 3 4 5 6 7 8 9
## 1227 1208 713 655 345 293 293 230 218 29
According to the elbow previously observed, we perform also a clustering using a number of PC equal to 22 to get better results.
<- FindNeighbors(colon_hq, dims = 1:22)
colon_hq <- FindClusters(colon_hq, resolution = 0.5) colon_hq
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 5211
## Number of edges: 190039
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8860
## Number of communities: 12
## Elapsed time: 0 seconds
<- RunUMAP(colon_hq, dims = 1:22)
colon_hq DimPlot(colon_hq, reduction = "umap", label = TRUE)
Finding markers
The final step aims to find which are the “marker genes” (expressed exclusively, or at least over-expressed) in each cluster with respect to the others, in order to identify the clusters obtained and assign the cell type.
The function FindAllMarkers()
allows to find the differentially expressed genes for all the clusters with respect to the others. The parameters that can be tuned are:
- min.pct: minimum fraction of cells in which genes must be express; in this case, we take gene expressed in at least 25% of the cells
- logfc.threshold: log2 fold change of at least 0.25
<- FindAllMarkers(colon_hq, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) colon.markers_22
Here, we store in a table the top 10 marker genes found for each cluster and then visualize them with the heatmap in which is reported their expression.
<- colon.markers_22 %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
top10_22 top10_22
DoHeatmap(colon_hq, features = top10_22$gene)
Moreover, for each cluster we can assess the specificity of a marker gene to the cluster with two kind of plots:
- the violin plots show the distribution of gene expression for each cell in each cluster
- the UMAP plot integrated with a heatmap show how much a gene is expressed in each cluster.
VlnPlot(colon_hq, features = c("CCL13", "F3", "AREG", "MZB1", "MMP1",
"PLVAP", "GPM6B", "OGN","RGS5","SOSTDC1","KLRB1"), pt.size = 0, ncol = 2)
FeaturePlot(colon_hq, features = c("CCL13", "F3", "AREG", "MZB1", "MMP1",
"PLVAP", "GPM6B", "OGN","RGS5","SOSTDC1","KLRB1"), pt.size = 0, ncol = 2)
Here, we can investigate the expression of the marker genes in the cluster 0.
VlnPlot(colon_hq, features = c("CCL2","VMP1","NFKBIZ"), pt.size = 0.1, ncol = 2)
FeaturePlot(colon_hq, features = c("CCL2","VMP1","NFKBIZ"), pt.size = 0, ncol = 2)
Assigning cell type identity
The cell type is then assigned to each of the clusters searching in PanglaoDB and in literature to what cell type the expression of the marker genes is associated. The final results of our analysis is represented in the UMAP plot, in which we specify the cell type found as cluster label.
<- c("Unknown", rep("Fibroblasts", 2), "Epithelial cells", "Plasma cells",
new.cluster.ids "Fibroblasts", "Endothelial cells","Glia cells","Fibroblasts", rep("Smooth muscle cells",2), "T cells")
names(new.cluster.ids) <- levels(colon_hq)
<- RenameIdents(colon_hq, new.cluster.ids)
colon_hq DimPlot(colon_hq, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
table(colon_hq@meta.data$seurat_clusters)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 1389 1020 554 473 345 329 292 286 250 161 84 28
Session info
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.0.7 ggplot2_3.3.5 SeuratObject_4.0.4 Seurat_4.1.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 colorspace_2.0-2 deldir_1.0-6
## [4] ellipsis_0.3.2 ggridges_0.5.3 rstudioapi_0.13
## [7] spatstat.data_2.1-2 farver_2.1.0 leiden_0.3.9
## [10] listenv_0.8.0 ggrepel_0.9.1 RSpectra_0.16-0
## [13] fansi_1.0.2 codetools_0.2-18 splines_4.1.2
## [16] knitr_1.37 polyclip_1.10-0 jsonlite_1.7.3
## [19] ica_1.0-2 cluster_2.1.0 png_0.1-7
## [22] uwot_0.1.11 shiny_1.7.1 sctransform_0.3.3
## [25] spatstat.sparse_2.1-0 compiler_4.1.2 httr_1.4.2
## [28] assertthat_0.2.1 Matrix_1.4-0 fastmap_1.1.0
## [31] lazyeval_0.2.2 limma_3.50.1 cli_3.1.1
## [34] later_1.3.0 htmltools_0.5.2 tools_4.1.2
## [37] igraph_1.2.11 gtable_0.3.0 glue_1.6.1
## [40] RANN_2.6.1 reshape2_1.4.4 Rcpp_1.0.8
## [43] scattermore_0.7 jquerylib_0.1.4 vctrs_0.3.8
## [46] nlme_3.1-152 lmtest_0.9-39 xfun_0.30
## [49] stringr_1.4.0 globals_0.14.0 mime_0.12
## [52] miniUI_0.1.1.1 lifecycle_1.0.1 irlba_2.3.5
## [55] goftest_1.2-3 future_1.23.0 MASS_7.3-53.1
## [58] zoo_1.8-9 scales_1.1.1 spatstat.core_2.3-2
## [61] promises_1.2.0.1 spatstat.utils_2.3-0 parallel_4.1.2
## [64] RColorBrewer_1.1-2 yaml_2.2.2 reticulate_1.24
## [67] pbapply_1.5-0 gridExtra_2.3 sass_0.4.0
## [70] rpart_4.1-15 stringi_1.7.6 highr_0.9
## [73] rlang_1.0.1 pkgconfig_2.0.3 matrixStats_0.61.0
## [76] evaluate_0.14 lattice_0.20-41 ROCR_1.0-11
## [79] purrr_0.3.4 tensor_1.5 labeling_0.4.2
## [82] patchwork_1.1.1 htmlwidgets_1.5.4 cowplot_1.1.1
## [85] tidyselect_1.1.1 parallelly_1.30.0 RcppAnnoy_0.0.19
## [88] plyr_1.8.6 magrittr_2.0.2 bookdown_0.25
## [91] R6_2.5.1 generics_0.1.2 DBI_1.1.2
## [94] withr_2.4.3 mgcv_1.8-33 pillar_1.7.0
## [97] fitdistrplus_1.1-6 survival_3.2-7 abind_1.4-5
## [100] tibble_3.1.6 future.apply_1.8.1 crayon_1.4.2
## [103] KernSmooth_2.23-18 utf8_1.2.2 spatstat.geom_2.3-1
## [106] plotly_4.10.0 rmarkdown_2.13 grid_4.1.2
## [109] data.table_1.14.2 rmdformats_1.0.3 digest_0.6.29
## [112] xtable_1.8-4 tidyr_1.2.0 httpuv_1.6.5
## [115] munsell_0.5.0 viridisLite_0.4.0 bslib_0.3.1