Downloading the four datasets introduced in the scMILD paper
1. Lupus
Related paper [7]. Igor Mandric, Tommer Schwarz, Arunabha Majumdar, Kangcheng Hou, Leah Briscoe, Richard Perez, Meena Subramaniam, Christoph Hafemeister, Rahul Satija, Chun Jimmie Ye, et al. Optimized design of single-cell RNA sequencing experiments for cell-type-specific eQTL analysis. Nature Communications, 11(1):5504, 2020.
2. COVID-19 Infection
Related paper [25]. Carly GK Ziegler, Vincent N Miao, Anna H Owings, Andrew W Navia, Ying Tang, Joshua D Bromley, Peter Lotfy, Meredith Sloan, Hannah Laird, Haley B Williams, et al. Impaired local intrinsic immunity to SARS-CoV-2 infection in severe COVID-19. Cell, 184(18):4713–4733, 2021.
https://www.cell.com/cell/fulltext/S0092-8674(21)00882-5
Data and Code Availability
- Single-cell RNA-seq data is publicly available for download and visualization via the Single Cell Portal: https://singlecell.broadinstitute.org/single_cell/study/SCP1289/. This paper also analyzes existing, publicly available data. Accession numbers and links are listed in the key resources table. Interim data was also deposited in a single-cell data resource for COVID-19 studies: https://www.covid19cellatlas.org (Ballestar et al., 2020). Custom reference FASTA and GTF for SARS-CoV-2 is available for download: https://github.com/ShalekLab/SARSCoV2-genome-reference. Additional Supplemental Items are available from Mendeley Data at https://doi.org/10.17632/pjr7b8sbf8.1.
- All original code has been deposited at GitHub (https://github.com/ShalekLab) and is publicly available as of the date of publication.
- Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
- https://singlecell.broadinstitute.org/single_cell/study/SCP1289
- (genes × cells) : (32871, 32588)
- [Hierarchical MIL] scRNA Analysis
- https://www.covid19cellatlas.org/index.patient.html
- I concluded that this is the data from "Human SARS-CoV-2 challenge resolves local and systemic response dynamics".
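import anndata as ad

# Hypothetical file name: replace with the .h5ad file downloaded from covid19cellatlas.org
adata = ad.read_h5ad("covid19_challenge.h5ad")
print(adata)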
"""
AnnData object with n_obs × n_vars = 371892 × 33696
obs: 'patient_id', 'time_point', 'covid_status', 'sex', 'cell_state_wVDJ', 'cell_state', 'cell_state_woIFN', 'cell_type', 'cell_compartment', 'sequencing_library', 'Institute', 'ObjectCreateDate'
var: 'name'
obsm: 'X_umap_harmony_rna_wvdj_30pcs_6000hvgs'
"""
print("==== columns: 33696 genes x annotations ====")
print(adata.var.shape) # (33696, 1)
print(adata.var.head()) # per-gene annotations
"""
name
MIR1302-2HG MIR1302-2HG
FAM138A FAM138A
OR4F5 OR4F5
AL627309.1 AL627309.1
AL627309.3 AL627309.3
"""
print("==== rows: 371892 cells x 12 metadata fields ====")
print(adata.obs.shape) # (371892, 12) -> metadata for each cell
print(adata.obs.columns) # metadata columns
"""
Index(['patient_id', 'time_point', 'covid_status', 'sex', 'cell_state_wVDJ',
'cell_state', 'cell_state_woIFN', 'cell_type', 'cell_compartment',
'sequencing_library', 'Institute', 'ObjectCreateDate'],
dtype='object')
"""
print(adata.obs["patient_id"].nunique()) # 16 patients
print(adata.obs["cell_type"].nunique()) # 27 cell types
print(adata.obs["covid_status"].nunique()) # 3 labels
label_counts = adata.obs["covid_status"].value_counts()
print(label_counts)
"""
covid_status
Abortive infection 155563
Sustained infection 135919
Transient infection 80410
Name: count, dtype: int64
"""
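The counts above are per cell, whereas a multiple-instance setup needs one label per patient (bag). The following is a minimal sketch of that conversion, assuming every `patient_id` carries exactly one `covid_status`; that assumption has not been checked against the real data, so the helper raises an error if it fails. `summarize_bags` and the toy lists are hypothetical names introduced for illustration only.

```python
from collections import Counter, defaultdict


def summarize_bags(patient_ids, labels):
    """Group cell-level labels by patient.

    Returns {patient_id: (n_cells, label)} and raises ValueError when a
    patient has more than one label, since a bag needs a single label.
    """
    per_patient = defaultdict(Counter)
    for pid, label in zip(patient_ids, labels):
        per_patient[pid][label] += 1

    summary = {}
    for pid, counts in per_patient.items():
        if len(counts) != 1:
            raise ValueError(f"patient {pid!r} has mixed labels: {dict(counts)}")
        label, n_cells = next(iter(counts.items()))
        summary[pid] = (n_cells, label)
    return summary


# Toy stand-ins for adata.obs["patient_id"] and adata.obs["covid_status"].
toy_patients = ["P1", "P1", "P2", "P3", "P3", "P3"]
toy_status = (
    ["Abortive infection"] * 2
    + ["Sustained infection"]
    + ["Transient infection"] * 3
)

bags = summarize_bags(toy_patients, toy_status)
bag_label_counts = Counter(label for _, label in bags.values())
print(bags)
print(bag_label_counts)
```

With the real object, the two arguments would be `adata.obs["patient_id"]` and `adata.obs["covid_status"]`; the resulting per-patient label counts are what matter for class balance at the bag level.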
3. COVID-19 Hosp.
Related paper [17]. Yapeng Su, Daniel Chen, Dan Yuan, Christopher Lausted, Jongchan Choi, Chengzhen L Dai, Valentin Voillet, Venkata R Duvvuri, Kelsey Scherler, Pamela Troisch, et al. Multi-omics resolves a sharp disease-state shift between mild and moderate COVID-19. Cell, 183(6):1479–1495, 2020.
https://www.cell.com/cell/fulltext/S0092-8674(20)31444-6
Data and Code Availability
All blood scRNA-seq data used in this study can be accessed by Array Express under the accession number: E-MTAB-9357.
Additional Supplemental Items, including the metabolomic and proteomic datasets, are available from Mendeley Data at http://dx.doi.org/10.17632/tzydswhhb5.5.
https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9357?query=E-MTAB-9357
- Processed Data: 1,340 files
  - tcr = T Cell Receptor Sequences: 540 files
  - bcr = B Cell Receptor Sequences: 260 files
  - pro = Surface Protein Expression (CITE-seq or TotalSeq): 270 files
    - Surface protein abundance measured by antibody tagging
    - Usually stored as ADT (Antibody-Derived Tag) values
    - Purpose: multi-omics analysis that combines RNA expression with protein expression
  - gex = Gene Expression: 270 files
    - scRNA-seq gene expression data
    - Expression matrix in cells × genes format
A1BG A1BG-AS1 A2M A2M-AS1 A2ML1 ...
AAACCTGAGAATTGTG-1-1:1_1:1-1 0.0 0.0 0.0 0.0 0.0 ...
AAACCTGAGCAGCGTA-1-1:1_1:1-1 0.0 0.0 0.0 0.0 0.0 ...
AAACCTGAGCTACCGC-1-1:1_1:1-1 0.0 0.0 0.0 0.0 0.0 ...
...
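Since only some of the 1,340 processed files are gene expression, it helps to separate the four groups by file name before loading anything. The sketch below assumes that each file name contains its modality token (`gex`, `pro`, `tcr` or `bcr`) as a field delimited by `_`, `-` or `.`. The example names are invented for illustration and are not copied from E-MTAB-9357, and `group_by_modality` is a hypothetical helper.

```python
import re
from collections import defaultdict

MODALITIES = ("gex", "pro", "tcr", "bcr")


def group_by_modality(file_names):
    """Bucket file names by the first modality token found in each name."""
    groups = defaultdict(list)
    for name in file_names:
        tokens = re.split(r"[._\-]", name.lower())
        modality = next((t for t in tokens if t in MODALITIES), "other")
        groups[modality].append(name)
    return dict(groups)


# Invented example names; the real archive may use a different scheme.
example_files = [
    "sample01_gex.txt.gz",
    "sample01_pro.txt.gz",
    "sample01_tcr.txt.gz",
    "sample01_bcr.txt.gz",
    "sample02_gex.txt.gz",
    "README.txt",
]

groups = group_by_modality(example_files)
for modality, names in sorted(groups.items()):
    print(modality, len(names))
```

If the naming assumption holds on the real download, the four modality counts should add up to 1,340 (540 + 260 + 270 + 270); a non-empty `other` bucket is a sign that the pattern needs adjusting.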
4. UC (Ulcerative Colitis)
Related paper [16]. Christopher S Smillie, Moshe Biton, Jose Ordovas-Montanes, Keri M Sullivan, Grace Burgin, Daniel B Graham, Rebecca H Herbst, Noga Rogel, Michal Slyper, Julia Waldman, et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell, 178(3):714–730, 2019.
https://www.cell.com/cell/fulltext/S0092-8674(19)30732-9
Data and Code Availability
The accession number for the processed data reported in this paper is Single Cell Portal: SCP259. Raw data will be available for download from the controlled-access data repository, Broad DUOS. Code used in this study will be available at https://www.github.com/cssmillie/ulcerative_colitis.
Single-cell transcriptome data were generated from 366,650 cells taken from the colon mucosa.
The data are split into three major cell fractions:
Metadata
- all.meta2.txt
Epithelial data – epithelial cell lineage
- Epi.genes.tsv (151 KB)
- Epi.barcodes2.tsv (3.0 MB)
- gene_sorted-Epi.matrix.mtx (2.2 GB)
Stromal data – stromal cell lineage (including connective tissue)
- Fib.genes.tsv (141 KB)
- Fib.barcodes2.tsv (778 KB)
- gene_sorted-Fib.matrix.mtx (492 MB)
Immune data – immune cell lineage
- Imm.genes.tsv (156 KB)
- Imm.barcodes2.tsv (5.11 MB)
- gene_sorted-Imm.matrix.mtx (2.27 GB)
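Each fraction ships as a triplet: a gene list, a barcode list and a sparse count matrix in Matrix Market coordinate format. The sketch below shows how the three pieces fit together, using only the standard library so that the file format stays visible. It assumes the matrix rows are genes and the columns are cells, which the `gene_sorted` prefix suggests but which I have not verified. Short inline values stand in for the real files, and both function names are hypothetical. For files of this size, `scipy.io.mmread` together with `anndata` is the more practical route.

```python
def parse_matrix_market(text):
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries).

    entries maps (row, col) -> value using 0-based indices.
    """
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in lines[0].split())
    entries = {}
    for ln in lines[1:]:
        r, c, v = ln.split()
        entries[(int(r) - 1, int(c) - 1)] = float(v)  # file indices are 1-based
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the header")
    return n_rows, n_cols, entries


def to_cells_by_genes(genes, barcodes, mtx_text):
    """Return {barcode: {gene: count}}, i.e. the matrix viewed as cells x genes."""
    n_rows, n_cols, entries = parse_matrix_market(mtx_text)
    if (n_rows, n_cols) != (len(genes), len(barcodes)):
        raise ValueError("matrix shape does not match genes x barcodes")
    cells = {bc: {} for bc in barcodes}
    for (g, c), value in entries.items():
        cells[barcodes[c]][genes[g]] = value
    return cells


# Tiny stand-ins for Epi.genes.tsv, Epi.barcodes2.tsv and gene_sorted-Epi.matrix.mtx.
genes = ["A1BG", "A2M", "A2ML1"]
barcodes = ["cell_1", "cell_2"]
mtx_text = """%%MatrixMarket matrix coordinate integer general
3 2 3
1 1 4
2 2 1
3 1 7
"""

cells = to_cells_by_genes(genes, barcodes, mtx_text)
print(cells)
```

The shape check is the useful part in practice: if the gene and barcode lists do not match the matrix header, the orientation assumption is wrong and the matrix should be read as cells × genes instead.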
<<epi>>
AnnData object with n_obs × n_vars = 123006 × 20028
obs: 'Cluster', 'nGene', 'nUMI', 'Subject', 'Health', 'Location', 'Sample'
<<fib>>
AnnData object with n_obs × n_vars = 31872 × 19076
obs: 'Cluster', 'nGene', 'nUMI', 'Subject', 'Health', 'Location', 'Sample'
<<imm>>
AnnData object with n_obs × n_vars = 210614 × 20529
obs: 'Cluster', 'nGene', 'nUMI', 'Subject', 'Health', 'Location', 'Sample'

# <<epi>> # <<fib>> # <<imm>>
print(adata.obs["Cluster"].nunique()) # 15 # 13 # 23
print(adata.obs["nGene"].nunique()) # 6108 # 3368 # 3355
print(adata.obs["nUMI"].nunique()) # 25407 # 8365 # 18064
print(adata.obs["Subject"].nunique()) # 30 # 30 # 30
print(adata.obs["Location"].nunique()) # 2 # 2 # 2
print(adata.obs["Sample"].nunique()) # 131 # 77 # 132
print(adata.obs["Health"].nunique()) # 3 # 3 # 3
print(adata.obs["Health"].unique())
# ['Non-inflamed', 'Inflamed', 'Healthy']
# Categories (3, object): ['Healthy', 'Inflamed', 'Non-inflamed']