[Hierarchical MIL] scRNA Analysis

Paper

Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data
https://www.biorxiv.org/content/10.1101/2025.02.10.637389v1.full.pdf

Paper summary

2025.03.22 - [AI and Data Analysis/Paper] - [Hierarchical MIL] Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data

GitHub

https://github.com/minhchaudo/hier-mil

Data overview

2025.03.22 - [AI and Data Analysis/Code] - [Hierarchical MIL] Exploratory Data (Summary)

Code analysis

[Hierarchical MIL] Code ; Train.py

Analysis

✅ STEP 0 : Setup

git clone https://github.com/minhchaudo/hier-mil
cd hier-mil
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

✅ STEP 1: Data preprocessing

Run the covid.py script to create the covid.h5ad file.

Download the data files and place them in hier-mil/data/covid.

# Download the following files from https://singlecell.broadinstitute.org/single_cell/study/SCP1289/ :
# 1. 20210220_NasalSwab_RawCounts.txt
# 2. 20210220_NasalSwab_NormCounts.txt
# 3. 20210701_NasalSwab_MetaData.txt

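* Right-click the link and choose 'Save link as...' to download the file.

I first copy-pasted the file contents and only got part of the data, which ended up confusing everyone. Download the files instead.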

covid.py changes (lines 11, 30 and 32)

# covid.py changes
# (line 11)
# adata = sc.AnnData(df)  # removed
adata = sc.AnnData(df.T)  # added

# (line 30) Taking a copy is safer than working on a view.
# adata = adata[adata.obs["label"] != -1]  # removed
adata = adata[adata.obs["label"] != -1].copy()  # added


# (line 32) There is no "cell_type__ontology_label" column; take the annotation from the "Detailed_Cell_Annotations" column instead.
# adata.obs.rename({"donor_id":"patient", "cell_type__ontology_label":"cell_type_annotation"}, inplace=True)  # removed
adata.obs["cell_type_annotation"] = adata.obs["Detailed_Cell_Annotations"]  # added
adata.obs["patient"] = adata.obs["donor_id"]  # added
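Why the transpose on line 11 matters: AnnData treats rows as cells (observations) and columns as genes (variables), so a counts table with genes as rows has to be flipped first. Below is a minimal pandas-only sketch with made-up gene and cell names. It assumes the downloaded file is laid out genes × cells, which is what the fix implies.

```python
import pandas as pd

# Toy counts table laid out the way the downloaded text file is assumed to be:
# one row per gene, one column per cell barcode (all names are made up).
df = pd.DataFrame(
    [[0, 3, 1], [5, 0, 2]],
    index=["GENE_A", "GENE_B"],
    columns=["cell_1", "cell_2", "cell_3"],
)

# AnnData expects observations (cells) as rows and variables (genes) as columns.
cells_by_genes = df.T

print(df.shape)              # genes x cells
print(cells_by_genes.shape)  # cells x genes
```

If the shape printed for the real table has far more rows than columns where you expected cells to outnumber genes (or the reverse), the orientation is the first thing to check.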

Run covid.py

cd data/covid
python covid.py

In the same way, run the icb.py script to create the icb.h5ad file.

Download the data file and unzip it,
then place the Rshinydata_singlecell folder in hier-mil/data/icb.

# Download Rshinydata_singlecell-20231219T155916Z-001.zip from https://zenodo.org/records/10407126
# and unzip the folder

icb.py changes (lines 26 and 32)

# icb.py changes (line 26)
# ct = pd.read_csv("singler_icb_pre.csv", index_col=0)  # removed
ct = pd.read_csv("singler_icb.csv", index_col=0)  # added

# new line 32
adata.obs = adata.obs.astype(str)  # added
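The post doesn't say why the new line 32 is needed. One plausible reason, stated here as an assumption, is that metadata columns mixing several value types are awkward to write to .h5ad, and casting everything to strings makes each column uniform. A pandas-only sketch with made-up metadata:

```python
import pandas as pd

# Made-up metadata: the "response" column mixes a string, an int and a float.
obs = pd.DataFrame(
    {"patient": ["P1", "P2", "P3"], "response": ["R", 0, 1.5]},
    index=["cell_1", "cell_2", "cell_3"],
)

mixed_types = {type(v).__name__ for v in obs["response"]}

# Same cast as the added line: every value becomes a string.
obs = obs.astype(str)
uniform_types = {type(v).__name__ for v in obs["response"]}

print(sorted(mixed_types))    # several types before the cast
print(sorted(uniform_types))  # only 'str' afterwards
```

The cast also turns numeric columns into strings, so anything that needs numbers later has to be converted back.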

Run icb.py

cd ../icb
python icb.py

Run the cardio.py script to create the cardio.h5ad file.

# Download the following files from https://singlecell.broadinstitute.org/single_cell/study/SCP1303/

# 1. DCM_HCM_Expression_Matrix_raw_counts_V1.mtx
# 2. DCM_HCM_Expression_Matrix_genes_V1.tsv
# 3. DCM_HCM_Expression_Matrix_barcodes_V1.tsv
# 4. DCM_HCM_MetaData_V1.txt

Download the data locally, then send it to the server

2025.03.24 - [dev-setup] - [Server] Send(Copy) local data to Server

cardio.py changes

# cardio.py changes (line 20)
# adata = sc.AnnData(data)  # removed
adata = sc.AnnData(data.T.tocsr())  # added: transpose, convert COO to CSR, then build the AnnData

Run cardio.py

cd ../cardio
python cardio.py

A "Killed" message means the Linux/Unix system forcibly terminated the process.

Cause: the data is too large, so the process ran out of memory (Out Of Memory, OOM).
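A rough back-of-the-envelope check shows why a dense copy of a matrix like this gets killed. The shape and sparsity below are hypothetical placeholders, not the real dimensions of the SCP1303 matrix.

```python
def dense_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Memory needed to hold every entry, zeros included."""
    return n_rows * n_cols * bytes_per_value / 2**30


def csr_gib(n_rows: int, nnz: int, value_bytes: int = 4, index_bytes: int = 4) -> float:
    """CSR keeps one value and one column index per non-zero, plus row pointers."""
    return (nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes) / 2**30


# Hypothetical sizes, for illustration only.
n_cells, n_genes = 500_000, 30_000
nnz = int(n_cells * n_genes * 0.05)  # assume 5% of the entries are non-zero

print(f"dense float64: {dense_gib(n_cells, n_genes):.1f} GiB")
print(f"CSR float32:   {csr_gib(n_cells, nnz):.1f} GiB")
```

Keeping the matrix sparse and in float32 is the cheapest way to bring the footprint down before resorting to dropping genes.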

So the script had to be heavily modified...

# Download the following files from https://singlecell.broadinstitute.org/single_cell/study/SCP1303/
# 1. DCM_HCM_Expression_Matrix_raw_counts_V1.mtx
# 2. DCM_HCM_Expression_Matrix_genes_V1.tsv
# 3. DCM_HCM_Expression_Matrix_barcodes_V1.tsv
# 4. DCM_HCM_MetaData_V1.txt


from scipy.io import mmread
import scanpy as sc
import pandas as pd

# Load the counts (genes x cells) and convert COO to CSR so the matrix can be sliced.
data = mmread("DCM_HCM_Expression_Matrix_raw_counts_V1.mtx").tocsr()

genes = pd.read_csv("DCM_HCM_Expression_Matrix_genes_V1.tsv", sep="\t", header=None).iloc[:, 1].tolist()

# Keep only the first 100 genes to stay within memory.
genes = genes[:100]
data = data[:100, :].astype("float32")

with open("DCM_HCM_Expression_Matrix_barcodes_V1.tsv") as f:
    barcodes = f.read().strip().split("\n")

# Row 0 of the metadata file is dropped, as in the original script.
meta = pd.read_csv("DCM_HCM_MetaData_V1.txt", sep="\t").drop(index=0).reset_index(drop=True)

# ✅ Use the 'NAME' column as the index and keep only barcodes that have metadata.
meta.set_index("NAME", inplace=True)
known = set(meta.index)
keep = [i for i, bc in enumerate(barcodes) if bc in known]  # positions in the ORIGINAL barcode order
barcodes = [barcodes[i] for i in keep]
meta = meta.loc[barcodes]  # ✅ metadata rows in barcode order
data = data[:, keep]       # ✅ matching matrix columns

# ✅ Build the AnnData object (cells x genes).
adata = sc.AnnData(data.T.tocsr(), obs=meta)

adata.obs.index = barcodes
adata.var.index = genes

sc.pp.filter_genes(adata, min_cells=5)

sc.pp.normalize_total(adata, target_sum=1e4)

sc.pp.log1p(adata)

# adata.obs["label"] = adata.obs["disease__ontology_label"].apply(lambda x: 0 if x=="normal" else 1 if x=="hypertrophic cardiomyopathy" else 2)
label_map = {
    "normal": 0,
    "hypertrophic cardiomyopathy": 1,
    "dilated cardiomyopathy": 2
}
# Map first, drop unmapped cells, and only then cast to int (NaN cannot be cast to int).
adata.obs["label"] = adata.obs["disease__ontology_label"].map(label_map)
adata = adata[adata.obs["label"].notna()].copy()
adata.obs["label"] = adata.obs["label"].astype(int)

# columns= is required; without it rename() targets the row index.
adata.obs.rename(columns={"donor_id": "patient", "cell_type__ontology_label": "cell_type_annotation"}, inplace=True)

# Extract the cell embeddings using the scGPT model (scgpt.tasks.embed_data) with the pretrained weights from the whole-human checkpoint. See https://github.com/bowang-lab/scGPT for instructions.
# Store the embeddings in a new adata object and include the metadata of the old adata. Write the new adata object to the file cardio.h5ad .
#
# ✅ Save
adata.write_h5ad("../cardio.h5ad")
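The barcode alignment is the step that is easiest to get wrong: the column positions have to be computed from the original barcode order, before the barcode list itself is filtered. A minimal sketch with toy data (barcode names and matrix values are made up), assuming numpy is available:

```python
import numpy as np

# Toy genes x cells matrix; column j belongs to barcodes[j].
barcodes = ["bc0", "bc1", "bc2", "bc3"]
data = np.array([[10, 11, 12, 13],
                 [20, 21, 22, 23]])

# Pretend the metadata only covers two of the barcodes.
meta_index = {"bc1", "bc3"}

# Positions are taken from the ORIGINAL barcode order.
keep = [j for j, bc in enumerate(barcodes) if bc in meta_index]
kept_barcodes = [barcodes[j] for j in keep]
aligned = data[:, keep]

print(keep)
print(kept_barcodes)
print(aligned)
```

Enumerating the already-filtered list instead would yield positions 0 and 1, which silently selects the wrong cells without raising any error.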

✅ STEP 2 : Run the experiments (run.py or run.sh)

Now that the .h5ad files are ready, run the experiments.

Option 1) Run run.sh (runs every experiment automatically)

bash run.sh

This automatically runs the different experiments (tasks 2 to 6)
on all datasets (covid, cardio, icb).

Option 2) Run run.py on its own

Example: to run the repeated_k_fold experiment (task 2) on covid.h5ad:

python run.py --data_path covid.h5ad --task 2

(For the other options, refer to the ones used in run.sh and add them as needed.)
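To run a chosen set of tasks across all three datasets without going through run.sh, a small driver can build the commands. This is only a sketch: it uses just the two flags shown above (--data_path and --task), the DRY_RUN switch and the build_commands helper are my own names, and the .h5ad paths assume the files sit where run.py can find them.

```python
import subprocess
import sys

DATASETS = ["covid.h5ad", "icb.h5ad", "cardio.h5ad"]
TASKS = [2, 3, 4, 5, 6]  # the tasks run.sh is described as using
DRY_RUN = True           # hypothetical switch: print the commands instead of running them


def build_commands(datasets, tasks):
    """Return one run.py command (as an argv list) per dataset/task pair."""
    return [
        [sys.executable, "run.py", "--data_path", path, "--task", str(task)]
        for path in datasets
        for task in tasks
    ]


commands = build_commands(DATASETS, TASKS)
for cmd in commands:
    if DRY_RUN:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)  # stop at the first failing experiment
```

Leave DRY_RUN on first to check the command lines, then switch it off from inside the hier-mil directory.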

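In machine learning experiments, a single performance number is not enough.
To see whether a model really works, it has to be tested under a range of conditions.

So the experiments here are organized into the following tasks, numbered 0 to 6:

What each task means (per run.py)

Task no.	Experiment name	Purpose
0	train_and_tune	Train the model and tune hyperparameters
1	predict_and_save	Save predictions from the trained model
2	repeated_k_fold	Average model performance over k-fold cross-validation repeated 10 times
3	vary_train_size	Compare performance while shrinking the training set (0.25, 0.5, 0.75)
4	vary_cell_count	Analyze how performance changes as the number of cells is reduced
5	randomize_cell_annot	Randomly shuffle the cell type labels to check how much the model depends on them
6	get_p_val_cell_type	Find important cell types with a permutation test (for biological insight)

Looking at run.sh, only tasks 2 to 6 from the list above are used: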

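📌 1) repeated_k_fold (task 2)

Shuffle the data several times, split it into k folds each time, and compute the mean and standard deviation of the performance.
Used to get a reliable performance estimate.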
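The idea can be sketched in plain Python. This is not the code in run.py: k=5, the patient IDs and the evaluate stand-in (which returns a fake score rather than training anything) are all assumptions made for illustration.

```python
import random
import statistics


def repeated_k_fold(items, k, n_repeats, seed=0):
    """Yield (train, test) lists: k folds per repeat, reshuffled on every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        shuffled = items[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test


def evaluate(train, test, rng):
    """Stand-in for 'fit on train, score on test'; returns a fake score."""
    return rng.uniform(0.7, 0.9)


patients = [f"P{i}" for i in range(10)]  # made-up patient IDs
splits = list(repeated_k_fold(patients, k=5, n_repeats=10))

score_rng = random.Random(1)
scores = [evaluate(train, test, score_rng) for train, test in splits]
print(len(splits), statistics.mean(scores), statistics.stdev(scores))
```

Splitting at the patient level, as here, keeps all cells of one patient on the same side of the split.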

📌 2) vary_train_size (task 3)

Check how much the model is affected as the amount of training data shrinks.
This evaluates whether the model can still learn from a small dataset.
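A minimal sketch of the subsampling step, using the three fractions from the task table. The helper name and patient IDs are made up, and it assumes only the training patients are thinned while the test set stays as it is.

```python
import random


def subsample_train(train_ids, fraction, seed=0):
    """Keep a random `fraction` of the training patients (at least one)."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(train_ids) * fraction))
    return rng.sample(train_ids, n_keep)


train_patients = [f"P{i}" for i in range(20)]  # made-up patient IDs
sizes = []
for fraction in (0.25, 0.5, 0.75):
    subset = subsample_train(train_patients, fraction)
    sizes.append(len(subset))
    print(fraction, len(subset))
```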

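📌 3) vary_cell_count (task 4)

See what happens to performance as the number of cells goes down.
This checks how much the amount of information within a single sample matters.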
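One way to picture this is to cap each patient's bag of cells at a fixed size. The sketch below uses made-up cell IDs; how run.py actually reduces the cell count is not something I have verified.

```python
import random


def subsample_cells(cells_by_patient, max_cells, seed=0):
    """Cap every patient's bag at `max_cells` randomly chosen cells."""
    rng = random.Random(seed)
    return {
        patient: rng.sample(cells, min(max_cells, len(cells)))
        for patient, cells in cells_by_patient.items()
    }


# Made-up bags: patient -> list of cell IDs.
bags = {
    "P1": [f"P1_c{i}" for i in range(50)],
    "P2": [f"P2_c{i}" for i in range(8)],
}
for max_cells in (5, 20):
    reduced = subsample_cells(bags, max_cells)
    print(max_cells, {p: len(c) for p, c in reduced.items()})
```

Patients with fewer cells than the cap simply keep everything they have.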

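📌 4) randomize_cell_annot (task 5)

Measure how much performance drops when the cell type labels (cell_type_annotation) are randomly shuffled.
In other words, this experiment checks how well the model makes use of the cell type information.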
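A tiny sketch of what shuffling the annotations means. The labels are made up, and whether run.py shuffles across the whole dataset or within each patient is not stated in the post; this version shuffles one flat list.

```python
import random


def shuffle_annotations(cell_types, seed=0):
    """Return the same labels in a random order, breaking the cell-to-type link."""
    rng = random.Random(seed)
    shuffled = cell_types[:]
    rng.shuffle(shuffled)
    return shuffled


cell_types = ["T cell"] * 4 + ["B cell"] * 3 + ["Epithelial"] * 3  # made-up labels
randomized = shuffle_annotations(cell_types)

# Label frequencies are unchanged; only the assignment to cells is scrambled.
print(sorted(randomized) == sorted(cell_types))
```

Because the label frequencies stay the same, any drop in performance comes from losing the link between each cell and its type, not from a change in composition.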

📌 5) identify key cell types (task 6)

Use a permutation test to analyze which cell types have the largest influence on the label (e.g. COVID vs Normal).
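For readers who haven't met permutation tests, here is a generic one on the difference in group means. It is a textbook sketch, not get_p_val_cell_type itself: the statistic, the made-up scores and the number of permutations are all assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group membership at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (hits + 1) / (n_permutations + 1)


# Made-up per-patient scores for one cell type in the two label groups.
cases = [0.82, 0.79, 0.91, 0.85, 0.88]
controls = [0.41, 0.38, 0.52, 0.47, 0.44]
p_value = permutation_p_value(cases, controls)
print(p_value)
```

A small p-value says the gap between the groups is rarely matched when the labels are assigned at random, which is the sense in which a cell type is flagged as important.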

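📊 Why run so many different experiments?

The goal is to evaluate performance, interpretability and real-world usability together.
"High AUC = good model" is not the whole story!
- Does it still work when there is less data?
- Does the model really make use of the cell type information?
- Which cell types matter most for classifying the disease?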
