본문 바로가기
AI 및 Data Analysis/Code

GSE86982 Analysis Summary

by doraemin_dev 2025. 4. 30.

GSE86982 smartseq TPM

Single-cell RNA-seq data analysis reveals functionally relevant biomarkers of early brain development and their regulatory footprints in human embryonic stem cells (hESCs)

https://academic.oup.com/bib/article/25/3/bbae230/7670713

DATA AVAILABILITY

All data analyzed in this study were published previously [8] and available in NCBI Gene Expression Omnibus (accession no: GSE86982) at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86982.

GSE86982 smartseq TPM

23045 obs. of 2365 variables

METHODS AND MATERIALS_DEGs identification

  • The expression matrix was imported and converted into a Seurat object using the function ‘CreateSeuratObject’ from the Seurat package (Seurat v3.0).
# 23045 obs. of 2365 variables
tpm_data <- read.csv("C:\\Users\\LG\\Documents\\MJU\\연구실\\24.09.20_scRNA 데이터 분석\\GSE86982_smartseq.tpm.csv", row.names = 1, header = TRUE)

# "26" 또는 "54"가 포함된 열 이름을 찾기
d26_d54_cells  <- grepl("^X26D|^X54D", colnames(tpm_data))
# 23045 obs. of 1561 variables (26D:695, 54D:866)
tpm_d26_d54 <- tpm_data[, d26_d54_cells] 

# Seurat 객체 생성
library(Seurat)
seurat_object <- CreateSeuratObject(counts = as.matrix(tpm_d26_d54), project = "TPM_D26_D54", min.cells = 3, min.features = 200)
  • Then, quality control, normalization, feature selection, and clustering of marker genes of cells from scRNA-seq data were performed. We filtered out the cells with unique feature counts exceeding 10 000 or falling below 200.

Figure S1. Violin plots show quality control (QC) metrics before (A, B) and after (C, D) QC for the data at D26 and D54, respectively. We filtered out cells with unique feature counts over 10,000 or less than 200 and cells with mitochondrial counts greater than 5%.

< Before QC >

< After QC >

# 미토콘드리아 유전자 비율 계산 (MT로 시작하는 유전자 이름 패턴 사용)
seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT")

seurat_object$stage <- ifelse(grepl("26D", colnames(tpm_d26_d54)), "D26", 
                              ifelse(grepl("54", colnames(tpm_d26_d54)), "D54", NA))
VlnPlot(seurat_object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), group.by = "stage", ncol = 3)

# QC 기준 적용
seurat_object_qc <- subset(seurat_object, subset = nFeature_RNA > 200 & nFeature_RNA < 10000 & percent.mt < 5)

# QC 적용 후 바이올린 플롯 (y축 범위 설정)
VlnPlot(seurat_object_qc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), group.by = "stage", ncol = 3) 
  • We also removed cells with mitochondrial and ribosomal counts greater than 20%.
    The ‘LogNormalize’ global-scaling normalization method was used to standardise individual cells’ feature expression measurements. Then, selection.method = ‘vst’ was used to find feature variables. Each gene's expression level was normalized by considering the total number of unique molecular identifiers (UMI) within each cell, followed by applying a natural logarithm to the UMI counts.
seurat_object[["percent.rb"]] <- PercentageFeatureSet(seurat_object, pattern = "^RPS|^RPL")
seurat_object <- subset(seurat_object, subset = percent.mt < 20 & percent.rb < 20)

seurat_object <- NormalizeData(seurat_object, normalization.method = "LogNormalize", scale.factor = 10000)
seurat_object <- FindVariableFeatures(seurat_object, selection.method = "vst", nfeatures = 2000)
seurat_object <- ScaleData(seurat_object)
  • Finally, we used three methods, MAST [24], limma [25], and DESeq2 [26], to remove the technical limitation of a single method to identify DEGs between D26 and D54.
  • The thresholds for the final DEG set were FDR-corrected P-value <0.05 and absolute log2 fold change value >1.5.
Idents(seurat_object) <- seurat_object$stage

# MAST를 사용한 D26 vs D54 차등 발현 유전자(DEGs) 분석
library(MAST)
degs_mast <- FindMarkers(seurat_object, ident.1 = "D26", ident.2 = "D54", test.use = "MAST")

degs_mast_filtered <- degs_mast[degs_mast$p_val_adj < 0.05 & abs(degs_mast$avg_log2FC) > 1.5, ]

nrow(degs_mast_filtered) # [1] 849
# limma 분석을 위한 raw counts 데이터 추출
counts_data <- GetAssayData(seurat_object, layer = "counts")

# Create a design matrix based on your cell groups (D26 vs D54)
group <- seurat_object$stage
design <- model.matrix(~ group)

# Apply voom transformation and limma analysis
library(limma)
voom_data <- voom(counts_data, design)
fit <- lmFit(voom_data, design)
fit <- eBayes(fit)
degs_limma <- topTable(fit, number = Inf)

degs_limma_filtered <- degs_limma[degs_limma$adj.P.Val < 0.05 & abs(degs_limma$logFC) > 1.5, ]
nrow(degs_limma_filtered)  # [1] 449
# Deseq2
library(DESeq2)
# Convert the Seurat object to a DESeq2 dataset
# Seurat 객체에서 count 데이터 추출 (유전자 x 세포)
# 데이터를 정수형으로 변환
counts_data <- round(as.matrix(GetAssayData(seurat_object, layer = "counts")))

# 세포(stage) 정보 가져오기
meta_data <- data.frame(stage = factor(seurat_object$stage))

# DESeq2 객체 생성 및 분석
dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = meta_data, design = ~ stage)
dds <- DESeq(dds)

# D26과 D54 간의 차등 발현 유전자 추출
degs_deseq2 <- results(dds, contrast = c("stage", "D26", "D54"))

# NA 값을 제거하고 FDR < 0.05 및 abs(log2 fold change) > 1.5로 필터링
degs_deseq2_filtered <- degs_deseq2[!is.na(degs_deseq2$padj) & degs_deseq2$padj < 0.05 & 
                                      !is.na(degs_deseq2$log2FoldChange) & abs(degs_deseq2$log2FoldChange) > 1.5, ]

nrow(degs_deseq2_filtered) # [1] 1299

RESULTS_ Identification of DEGs between D26 and D54

  • DEG analyses using MAST, limma and DESeq2 identified 824, 2753 and 1535 DEGs, respectively. We used the common 539 DEGs identified by all three methods for further analysis in our study.
  • Among the 539 DEGs, 348 were down-regulated and 191 were up-regulated, as depicted in the volcano plot.
nrow(degs_mast_filtered) # [1] 849
nrow(degs_limma_filtered)  # [1] 449
nrow(degs_deseq2_filtered) # [1] 1299

# MAST, limma, DESeq2에서 추출한 DEG의 rownames (유전자 이름)을 각각 가져옵니다.
degs_mast_genes <- rownames(degs_mast_filtered)  # MAST에서 필터링된 DEG
degs_limma_genes <- rownames(degs_limma_filtered)  # limma에서 필터링된 DEG
degs_deseq2_genes <- rownames(degs_deseq2_filtered)  # DESeq2에서 필터링된 DEG

# 세 가지 방법에서 공통된 DEG 찾기
common_degs <- Reduce(intersect, list(degs_mast_genes, degs_limma_genes, degs_deseq2_genes))
length(common_degs) # [1] 184

# 공통된 DEGs에 대한 정보를 가져오기
common_degs_data <- degs_deseq2_filtered[common_degs, ]

# Volcano plot 그리기
library(ggplot2)
ggplot(common_degs_data, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(aes(color = log2FoldChange > 0), size = 1.5) +
  scale_color_manual(values = c("blue", "red")) + 
  theme_minimal() +
  labs(title = "Volcano Plot of Common DEGs",
       x = "Log2 (Fold Change)",
       y = "-log10 (Adjusted P-Value)") +
  theme(plot.title = element_text(hjust = 0.5))

up_regulated <- sum(common_degs_data$log2FoldChange > 0)
down_regulated <- sum(common_degs_data$log2FoldChange < 0)

cat("Up-regulated genes:", up_regulated, "\n") # Up-regulated genes: 120 
cat("Down-regulated genes:", down_regulated, "\n") # Down-regulated genes: 64 
  • Hierarchical clustering was conducted to visualize the similarities and differences in gene expression patterns between the two-time points (Figure 3A).

# 필요한 패키지 설치 및 로드
if (!requireNamespace("pheatmap", quietly = TRUE)) {
  install.packages("pheatmap")
}
library(pheatmap)

# 공통된 DEGs에 대한 발현 데이터 추출 (D26과 D54 세포만)
expression_data <- counts_data[common_degs, ]  # 공통 DEGs의 발현 데이터

# Seurat 객체에서 D26과 D54 세포의 메타데이터(stage 정보)만 추출
stage_data <- seurat_object$stage

# D26과 D54 시점의 세포들만 필터링
d26_d54_cells <- which(stage_data %in% c("D26", "D54"))
expression_data <- expression_data[, d26_d54_cells]

# 발현 데이터 로그 변환
expression_data_log <- log2(expression_data + 1)

# stage 정보를 다시 저장
cell_stage <- stage_data[d26_d54_cells]

# 히트맵 그리기 (로그 변환된 데이터 사용)
pheatmap(expression_data_log, 
         annotation_col = data.frame(Stage = cell_stage),  # D26과 D54 시점 표시
         clustering_method = "ward.D2", 
         color = colorRampPalette(c("blue", "white", "red"))(100),  # 색상 팔레트 설정
         show_rownames = FALSE, 
         show_colnames = FALSE, 
         main = "Hierarchical Clustering of Common DEGs (Log Transformed)")
  • To further validate the expression changes of DEGs, we randomly selected several DEG (Figure 3B). These findings suggest that these genes have distinct roles at different time points during neuron development.

  • Additionally, we observed that the distribution of up- and down-regulated DEGs at D26 and D54 spanned across all 22 chromosomes of the human genome, as illustrated in the Circos plot (Figure 3C). Our findings imply extensive transcriptional changes across diverse genomic regions.

RESULTS_ Revealing the timely evolution of gene expression profiling through single-cell analysis at D26 and D54

Figure 4) UMAP plot of the selected DEGs. (A-B) UMAP visualization of some selected genes at D26 and D54, respectively. These crucial genes act as master regulators, guiding neurogenesis, neuronal differentiation, and tissue development in the complex orchestration of brain formation.

  • Heatmap results showed a clear difference in gene expression levels between the two-time points using the different neuron development-related DEGs (Figure S2)

자세한 코드

[Seurat] Single cell Analysis
[Seurat] [MAST] DEGs 분석
[Seurat] [limma] DEGs Analysis
[Seurat] [DESeq2] DEGs Analysis
[MAST], [limma], [DESeq2] Common DEGs Analysis