Transcriptome analysis based on machine learning reveals a role for autoinflammatory genes of chronic nonbacterial osteomyelitis (CNO)

Materials and methods

Dataset download and preprocessing

Only two eligible RNA-seq datasets (GSE133378 and GSE187429) were found in GEO.

 

https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133378/

 


NCBI GEO (Gene Expression Omnibus) is a public repository for gene expression and other functional genomics data. The link above lists four directories, each providing the data in a different format and for a different purpose:

1. **matrix/**:
   - **Purpose**: Holds the data matrix.
   - **Contents**: Mainly series matrix files (expression matrices). These are tabular text files in which rows are typically genes or probes and columns are samples.

2. **miniml/**:
   - **Purpose**: Holds files in MINiML format.
   - **Contents**: Metadata and experimental data expressed in MINiML (MIAME Notation in Markup Language). MINiML is an XML-based format that records experimental conditions, sample information, data processing methods, and so on.

3. **soft/**:
   - **Purpose**: Holds files in SOFT format.
   - **Contents**: Data files expressed in SOFT (Simple Omnibus Format in Text). SOFT is GEO's text-based format and carries metadata alongside the data. It is easy for both people and programs to read.

4. **suppl/**:
   - **Purpose**: Holds supplementary files.
   - **Contents**: Additional data files needed for analysis. These can include original image files, additional raw data, processed data files, and other supplementary material.

These directories let researchers download only what they need. Which one to use depends on the data format required and the analysis planned.
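As a small illustration of the directory layout above, the sketch below builds the FTP URLs for a series accession. It assumes the path pattern seen in the link above (the parent folder replaces the last three digits of the accession with `nnn`); the function names are made up for this note and nothing is downloaded.

```python
import re

GEO_FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"
SUBDIRS = ("matrix", "miniml", "soft", "suppl")


def geo_series_url(accession: str) -> str:
    """Return the GEO FTP directory URL for a series accession such as GSE133378."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GSE accession: {accession!r}")
    digits = match.group(1)
    # The parent folder masks the last three digits with "nnn"
    # (GSE133378 -> GSE133nnn; accessions below 1000 -> GSEnnn).
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_FTP_ROOT}/{stub}/{accession}/"


def geo_subdir_urls(accession: str) -> dict[str, str]:
    """Map each of the four subdirectory names to its URL."""
    base = geo_series_url(accession)
    return {name: f"{base}{name}/" for name in SUBDIRS}


for name, url in geo_subdir_urls("GSE133378").items():
    print(name, url)
```

The same helper works for the second dataset, e.g. `geo_subdir_urls("GSE187429")`.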

 

Weighted gene coexpression network analysis

First, the gene expression matrix was loaded into R to check for missing values and identify outliers. Second, the block module function and module division analysis were used to identify gene co-expression modules.

→ Results

The soft-thresholding power for constructing the scale-free network was set to six, at a scale-free fit of R^2 = 0.85 (Fig. 2A).
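To make the role of the power concrete: in WGCNA the soft threshold turns a correlation into a network weight by raising it to a power. The sketch below assumes the unsigned form, a_ij = |cor(i, j)|^6, and uses made-up expression values (not from GSE133378); the paper does not state which network type was used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def adjacency(expr, power=6):
    """Unsigned adjacency a_ij = |cor(gene_i, gene_j)| ** power for every gene pair."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** power
        for g in genes
        for h in genes
        if g != h
    }


# Hypothetical expression values across five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # almost proportional to geneA
    "geneC": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to geneA
}
adj = adjacency(expr, power=6)
print(adj[("geneA", "geneB")], adj[("geneA", "geneC")])
```

Raising to the sixth power keeps strong correlations close to 1 and pushes weak ones toward 0, which is what moves the network toward a scale-free topology.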

The genes within each module can represent the overall gene expression level of that module (Fig. 2B).

Dendrogram: a diagram that depicts the hierarchical structure of clusters.

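The module with the highest correlation with CNO was blue, r = 0.46, P = 1e-09 (Fig. 2C).

This correlation coefficient describes the overall relationship between CNO and the blue module. That is, it evaluates how the expression pattern of the genes in the blue module, taken together, is associated with CNO.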

The correlation between genes in the blue module and CNO genes was cor = 0.58, p = 5.2e-68; a total of 229 genes (Supplementary Table 1) most associated with CNO were screened from this module using thresholds of gene significance (GS) = 0.2 and module membership (MM) = 0.5 (Fig. 2D).

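This correlation coefficient describes the relationship between the individual genes in the blue module and CNO. That is, it evaluates how each gene's expression pattern within the module relates to CNO; in a typical WGCNA workflow this is the correlation between MM and GS across the module's genes.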
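The GS/MM screening step can be sketched as a simple filter. This assumes the cutoffs are applied to absolute values with a strict "greater than", which the excerpt does not spell out; the gene names and numbers are placeholders.

```python
def select_module_genes(stats, gs_cutoff=0.2, mm_cutoff=0.5):
    """Keep genes whose |GS| and |MM| both exceed the cutoffs (assumed rule)."""
    return sorted(
        gene
        for gene, (gs, mm) in stats.items()
        if abs(gs) > gs_cutoff and abs(mm) > mm_cutoff
    )


# Hypothetical (GS, MM) pairs for four genes in one module.
stats = {
    "gene1": (0.45, 0.82),   # passes both cutoffs
    "gene2": (0.10, 0.90),   # GS too low
    "gene3": (0.30, 0.40),   # MM too low
    "gene4": (-0.35, 0.75),  # negative GS, still strongly associated
}
selected = select_module_genes(stats)
print(selected)
```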

Identification of DEGs and CNO-related genes

(DEGs: differentially expressed genes)

First, the expression matrix and grouping data were obtained in R from the GSE133378 expression matrices and series matrix files.

Subsequently, DEGs between CNO and normal tissues were identified with the “DESeq2” package [19].

Finally, those genes that were found to belong to both the DEGs and the set of CNO-related module genes were considered genes related to CNO.

→ Results

In total, 368 DEGs, consisting of 224 upregulated and 144 downregulated genes, were identified by the “DESeq2” package.

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

The volcano plot of all DEGs and the heatmap of the top 25 upregulated genes and the top 25 downregulated genes were visualized (Fig. 3A,B).

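A volcano plot has the log-scaled fold change on the x-axis and the log-scaled P-value (usually -log10) on the y-axis. The fold change tells how many times higher a gene's mean expression is in the experimental group than in the control group, and the P-value tells whether the difference in mean expression between the two groups is statistically significant.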
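As a quick worked example of the two axes, the sketch below converts one gene's group means and P-value into volcano-plot coordinates. It assumes log2 for the fold change and -log10 for the P-value, the usual convention; the input numbers are invented.

```python
import math


def volcano_point(mean_case, mean_control, p_value):
    """Return (log2 fold change, -log10 P-value) for one gene."""
    log2_fc = math.log2(mean_case / mean_control)
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p


# Hypothetical gene: 4x higher in the experimental group, P = 0.001.
x, y = volcano_point(mean_case=80.0, mean_control=20.0, p_value=0.001)
print(x, y)
```

A gene expressed four times higher lands at x = 2, and a smaller P-value pushes the point higher, so strongly changed and significant genes end up in the upper left and upper right corners.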

Finally, eighty genes belonging jointly to these two groups of genes were considered to be associated with CNO (Fig. 3C).

WGCNA: 229 genes = 80 shared + 149 module-only. DEGs: 368 genes = 80 shared + 288 DEG-only.
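The counts above are just a set intersection. A minimal check with placeholder gene IDs (the real lists are in the paper's supplementary tables and are not reproduced here):

```python
# Placeholder IDs sized to match the reported counts.
shared = {f"shared_{i}" for i in range(80)}
wgcna_genes = shared | {f"wgcna_only_{i}" for i in range(149)}
deg_genes = shared | {f"deg_only_{i}" for i in range(288)}

# Genes present in both the WGCNA module list and the DEG list.
cno_related = wgcna_genes & deg_genes

print(len(wgcna_genes), len(deg_genes), len(cno_related))
print(len(wgcna_genes | deg_genes))  # genes in either list
```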

In addition, as shown in Fig. 3D, the results of the principal component analysis (PCA) based on the “ComplexHeatmap” package indicated that eighty genes could effectively distinguish between normal samples and CNO.

GO and KEGG pathway enrichment analysis of the CNO-related genes

GO and KEGG enrichment analyses were performed using the enrichGO and enrichKEGG functions of the “clusterProfiler” package, respectively.

GO analysis consists of three main sets of terms, biological processes (BP), molecular functions (MF), and cellular components (CC).

GO is easiest to understand as a collection of groups of genes with similar functions.
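Enrichment of a GO term is usually tested with an over-representation (hypergeometric) test, which is the kind of test enrichGO performs. The sketch below computes that P-value by hand; the background size, term size, and overlap are invented numbers, not values from the paper.

```python
from math import comb


def hypergeom_pvalue(background, term_size, query_size, overlap):
    """P(X >= overlap) when drawing query_size genes from the background."""
    upper = min(term_size, query_size)
    favourable = sum(
        comb(term_size, i) * comb(background - term_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background, query_size)


# Hypothetical: 20,000 background genes, 200 annotated to the term,
# 80 query genes of which 8 carry the annotation.
p = hypergeom_pvalue(background=20000, term_size=200, query_size=80, overlap=8)
print(p)
```

With these numbers fewer than one overlapping gene is expected by chance (80 × 200 / 20000 = 0.8), so an overlap of 8 gives a very small P-value.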

→ Results

Enrichment of the three GO categories (BP, CC, MF) within the DEGs (Fig. 4A,B,C) indicated that they were mainly associated with regulation of actin filament organization, cell‒cell junction organization, gamma-catenin binding, and actin binding (Supplementary Table 3).

As shown in Fig. 4D, the enriched pathways were mainly involved in adherens junction (hsa04520), viral carcinogenesis (hsa05203), systemic lupus erythematosus (hsa05322), viral myocarditis (hsa05416), and phagosome (hsa04145) (Supplementary Table 3) [23].

PPI network construction and module analysis

The protein‒protein interaction (PPI) network of the proteins encoded by CNO-related genes was constructed with the Search Tool for the Retrieval of Interacting Genes (STRING).

Cytoscape, an application for visualizing molecular interaction networks, was used to construct the PPI network and identify the hub CNO genes.

A protein-protein interaction (PPI) network consists of a small number of proteins with many interactions, called hubs, and a large number of proteins with few interactions. Several recent studies report that hub proteins are more likely than non-hub proteins to be essential to the interaction network.
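The hub idea can be illustrated by counting interactions per protein. Note that this sketch ranks by plain degree, the simplest hub measure; the paper itself used the bottleneck algorithm in cytoHubba, which is a different score. The edge list is a toy example.

```python
from collections import Counter


def node_degrees(edges):
    """Count how many interactions each protein takes part in."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Toy undirected PPI edges; protein names are placeholders.
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P1", "P5"), ("P2", "P3"), ("P5", "P6"),
]
degree = node_degrees(edges)
top_hub, top_degree = degree.most_common(1)[0]
print(top_hub, top_degree)
```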

→ Results

After the isolated nodes were removed by Cytoscape 3.7.1, the PPI network of CNO-related genes was generated (Fig. 5A).

A subnetwork of the PPI network containing 40 hub CNO genes was identified by the bottleneck algorithm with the cytoHubba plugin (Fig. 5B).

Then, the key module containing 8 genes was identified via the MCODE plugin (Fig. 5C).

Identifying and validating autoinflammatory genes

Using the R package “glmnet,” the least absolute shrinkage and selection operator (LASSO) machine learning model was applied to the hub CNO genes obtained from a subnetwork of the PPI network based on cytoHubba (a Cytoscape plugin) to screen for key genes.
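Why LASSO works as a gene filter: its L1 penalty shrinks coefficients and sets the small ones exactly to zero. The sketch below shows this with the soft-thresholding operator, which is the exact LASSO solution only in the simplified case of orthonormal predictors and a squared-error loss. It is not how the paper's model was fitted (glmnet handles the general case by coordinate descent), and the coefficients and penalty are invented.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients for four genes.
unpenalized = {"gene1": 1.5, "gene2": -0.2, "gene3": 0.05, "gene4": -0.9}
lam = 0.3  # hypothetical penalty strength

lasso_coef = {g: soft_threshold(b, lam) for g, b in unpenalized.items()}
key_genes = sorted(g for g, b in lasso_coef.items() if b != 0.0)
print(lasso_coef)
print(key_genes)
```

Genes whose coefficient survives the shrinkage are the ones kept; a larger penalty keeps fewer genes.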

The autoinflammatory genes were intersected with the key genes, and the intersecting genes were recognized as autoinflammatory genes associated with CNO.

→ Results

We identified twenty key genes by applying the validated machine learning algorithms (LASSO) to the forty hub CNO genes (Fig. 6A1,A2).

(Fig. 6B). The AUC value of the LASSO model was 0.64, and we considered it the optimal CNO prediction model.

According to the set of autoinflammatory genes in the GeneCards database, there were two genes that were associated with autoinflammation among the twenty key genes related to CNO.

The AUC values of UTS2 and MPO were 0.61 and 0.60, respectively, which were both greater than or equal to 0.60, and these genes were therefore identified as CNO-related hub genes (Fig. 6D).
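For reference when reading these values, AUC is the probability that a randomly chosen CNO sample scores higher than a randomly chosen control, so 0.5 is chance level and 1.0 is perfect separation. A minimal sketch using that pairwise definition, with invented scores and labels:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical expression-based scores; label 1 = CNO, 0 = control.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

Here 8 of the 9 case-control pairs are ordered correctly, giving an AUC of about 0.89.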

In addition, Fig. 6E illustrates that there was a positive correlation between the two genes.

Analysis of autoinflammatory genes

Small clusters were not allowed, and the cluster number k was set from 2 to 8. The cumulative distribution function (CDF) and area under the CDF curve were used to confirm the optimal cluster number.

Then, we performed PCA to verify the clustering results.

Ultimately, we used NetworkAnalyst [22] to construct a gene regulation network consisting of the lncRNA‒miRNA–mRNA interactions and a protein–chemical interaction network.

→ Results

We elected to separate our data into 2 groups.

We concluded that grouping by CNO-related autoinflammatory gene expression was appropriate (k = 2).

The starBase database (https://starbase.sysu.edu.cn/) was employed with both low-stringency (≥ 1) and high-stringency (≥ 3) criteria to predict lncRNAs based on hsa-mir-147a belonging to two genes in common, as well as to construct a ceRNA network (Fig. 8A).

We used the online tool NetworkAnalyst to predict compounds targeting these two genes and to construct an mRNA–chemical network (Fig. 8B).

Discussion

In this study, GSE133378 gene expression profiles were downloaded and used for WGCNA and differential expression analysis. Ultimately, eighty genes closely related to CNO were identified according to the specific threshold set.

We identified two autoinflammatory genes (UTS2 and MPO) associated with CNO by combining an autoinflammatory response-related gene set with the LASSO machine learning algorithm. Both genes were highly expressed in CNO patients. The AUC values under the ROC curves for UTS2 and MPO were 0.61 and 0.60, respectively, indicating that both genes are closely associated with autoinflammation.

Conclusion

We identified two hub genes (UTS2 and MPO) that are closely associated with autoinflammation in CNO and can differentiate CNO patients from controls; they are thus potential autoinflammation-related biomarkers for disease diagnosis and therapeutic monitoring.
