
[scMILD] Datasets Download

by doraemin_dev 2025. 4. 3.

How to download the four datasets introduced in the scMILD paper.

 


 

 

1. Lupus

Related paper [7]. Igor Mandric, Tommer Schwarz, Arunabha Majumdar, Kangcheng Hou, Leah Briscoe, Richard Perez, Meena Subramaniam, Christoph Hafemeister, Rahul Satija, Chun Jimmie Ye, et al. Optimized design of single-cell RNA sequencing experiments for cell-type-specific eQTL analysis. Nature Communications, 11(1):5504, 2020.


2. COVID-19 Infection

Related paper [25]. Carly GK Ziegler, Vincent N Miao, Anna H Owings, Andrew W Navia, Ying Tang, Joshua D Bromley, Peter Lotfy, Meredith Sloan, Hannah Laird, Haley B Williams, et al. Impaired local intrinsic immunity to SARS-CoV-2 infection in severe COVID-19. Cell, 184(18):4713–4733, 2021.

https://www.cell.com/cell/fulltext/S0092-8674(21)00882-5

 

Data and Code Availability 
- Single-cell RNA-seq data is publicly available for download and visualization via the Single Cell Portal: https://singlecell.broadinstitute.org/single_cell/study/SCP1289/. This paper also analyzes existing, publicly available data. Accession numbers and links are listed in the key resources table. Interim data was also deposited in a single-cell data resource for COVID-19 studies: https://www.covid19cellatlas.org (Ballestar et al., 2020). Custom reference FASTA and GTF for SARS-CoV-2 is available for download: https://github.com/ShalekLab/SARSCoV2-genome-reference. Additional Supplemental Items are available from Mendeley Data at https://doi.org/10.17632/pjr7b8sbf8.1.
- All original code has been deposited at GitHub (https://github.com/ShalekLab) and is publicly available as of the date of publication.
- Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

 

AnnData object with n_obs × n_vars = 371892 × 33696
    obs: 'patient_id', 'time_point', 'covid_status', 'sex', 'cell_state_wVDJ', 'cell_state', 'cell_state_woIFN', 'cell_type', 'cell_compartment', 'sequencing_library', 'Institute', 'ObjectCreateDate'
    var: 'name'
    obsm: 'X_umap_harmony_rna_wvdj_30pcs_6000hvgs'
# `adata` is the AnnData object loaded from the downloaded dataset.
print(adata)
"""
AnnData object with n_obs × n_vars = 371892 × 33696
    obs: 'patient_id', 'time_point', 'covid_status', 'sex', 'cell_state_wVDJ', 'cell_state', 'cell_state_woIFN', 'cell_type', 'cell_compartment', 'sequencing_library', 'Institute', 'ObjectCreateDate'
    var: 'name'
    obsm: 'X_umap_harmony_rna_wvdj_30pcs_6000hvgs'
"""
    
    
print("==== columns: 33696 genes x gene info ====")
print(adata.var.shape)      # (33696, 1)
print(adata.var.head())     # per-gene information
"""
                    name
MIR1302-2HG  MIR1302-2HG
FAM138A          FAM138A
OR4F5              OR4F5
AL627309.1    AL627309.1
AL627309.3    AL627309.3
"""
    
print("==== rows: 371892 cells x 12 metadata fields ====")
print(adata.obs.shape)          # (371892, 12)  → per-cell metadata
print(adata.obs.columns)        # metadata column names
"""
Index(['patient_id', 'time_point', 'covid_status', 'sex', 'cell_state_wVDJ',
       'cell_state', 'cell_state_woIFN', 'cell_type', 'cell_compartment',
       'sequencing_library', 'Institute', 'ObjectCreateDate'],
      dtype='object')
"""

print(adata.obs["patient_id"].nunique())                # 16 patients
print(adata.obs["cell_type"].nunique())                 # 27 cell types

print(adata.obs["covid_status"].nunique())              # 3 labels
label_counts = adata.obs["covid_status"].value_counts()
print(label_counts)
"""
covid_status
Abortive infection     155563
Sustained infection    135919
Transient infection     80410
Name: count, dtype: int64
"""
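
The counts above are per cell. To see how many patients carry each label, the same `obs` table can be collapsed to one row per `patient_id`. The following is a minimal sketch on a toy table standing in for `adata.obs`; the patient IDs and rows are made up, and it assumes each patient has exactly one `covid_status`, which the sketch checks rather than takes for granted.

```python
import pandas as pd

# Toy stand-in for adata.obs (the real table has 371,892 rows).
# Patient IDs here are illustrative, not taken from the dataset.
obs = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "covid_status": (
        ["Abortive infection"] * 3
        + ["Sustained infection"] * 2
        + ["Transient infection"]
    ),
})

# Check the assumption: every patient maps to a single label.
labels_per_patient = obs.groupby("patient_id", observed=True)["covid_status"].nunique()
if not (labels_per_patient == 1).all():
    raise ValueError("some patients carry more than one covid_status")

# One row per patient: its label and how many cells it contributes.
patient_table = obs.groupby("patient_id", observed=True).agg(
    covid_status=("covid_status", "first"),
    n_cells=("covid_status", "size"),
)
print(patient_table)

# Patients per label (as opposed to cells per label).
print(patient_table["covid_status"].value_counts())
```

With the real data, replace the toy `obs` with `adata.obs`.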

3. COVID-19 Hosp.

Related paper [17]. Yapeng Su, Daniel Chen, Dan Yuan, Christopher Lausted, Jongchan Choi, Chengzhen L Dai, Valentin Voillet, Venkata R Duvvuri, Kelsey Scherler, Pamela Troisch, et al. Multi-omics resolves a sharp disease-state shift between mild and moderate COVID-19. Cell, 183(6):1479–1495, 2020.

https://www.cell.com/cell/fulltext/S0092-8674(20)31444-6

Data and Code Availability
All blood scRNA-seq data used in this study can be accessed through ArrayExpress under the accession number E-MTAB-9357.
Additional Supplemental Items, including the metabolomic and proteomic datasets, are available from Mendeley Data at http://dx.doi.org/10.17632/tzydswhhb5.5.

 

https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-9357?query=E-MTAB-9357

 


  • Processed Data: 1340 files
    • tcr = T Cell Receptor Sequences: 540 files
    • bcr = B Cell Receptor Sequences: 260 files
    • pro = Surface Protein Expression (CITE-seq or TotalSeq): 270 files
      • Surface protein expression measured by antibody tagging
      • Usually stored as ADT (Antibody-Derived Tag) values
      • Purpose: multi-omics analysis that combines RNA and protein expression
    • gex = Gene Expression: 270 files
      • scRNA-seq-based gene expression data
      • Expression matrix in cell × gene format
				A1BG	A1BG-AS1	A2M	A2M-AS1	A2ML1 ...
AAACCTGAGAATTGTG-1-1:1_1:1-1	0.0	0.0	0.0	0.0	0.0 ...
AAACCTGAGCAGCGTA-1-1:1_1:1-1	0.0	0.0	0.0	0.0	0.0 ...
AAACCTGAGCTACCGC-1-1:1_1:1-1	0.0	0.0	0.0	0.0	0.0 ...
...
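
A gex file shaped like the preview above (tab-separated, cell barcodes in the first column, gene symbols in the header) can be read with pandas. This is a rough sketch on an in-memory toy table; the values are invented and `read_gex` is a hypothetical helper name, not part of any library.

```python
import io

import pandas as pd

# Toy stand-in for one gex file. Layout follows the preview above:
# empty first header cell, then gene symbols; one row per cell barcode.
# The expression values are illustrative only.
toy_gex = (
    "\tA1BG\tA1BG-AS1\tA2M\n"
    "AAACCTGAGAATTGTG-1-1:1_1:1-1\t0.0\t0.0\t1.2\n"
    "AAACCTGAGCAGCGTA-1-1:1_1:1-1\t0.0\t0.7\t0.0\n"
)


def read_gex(path_or_handle):
    """Read a cell x gene table: rows are barcodes, columns are gene symbols."""
    return pd.read_csv(path_or_handle, sep="\t", index_col=0)


gex = read_gex(io.StringIO(toy_gex))
print(gex.shape)  # (n_cells, n_genes)

# Fraction of zero entries, a quick check of how sparse the table is.
zero_fraction = float((gex == 0).to_numpy().mean())
print(zero_fraction)
```

For a downloaded file, pass its path instead of the `StringIO` object. `pd.read_csv` infers compression from the file extension by default, so a gzipped file should need no extra arguments.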

 


4. UC (Ulcerative Colitis)

Related paper [16]. Christopher S Smillie, Moshe Biton, Jose Ordovas-Montanes, Keri M Sullivan, Grace Burgin, Daniel B Graham, Rebecca H Herbst, Noga Rogel, Michal Slyper, Julia Waldman, et al. Intra- and inter-cellular rewiring of the human colon during ulcerative colitis. Cell, 178(3):714–730, 2019.

https://www.cell.com/cell/fulltext/S0092-8674(19)30732-9

 

Data and Code Availability
The accession number for the processed data reported in this paper is Single Cell Portal: SCP259. Raw data will be available for download from the controlled-access data repository, Broad DUOS. Code used in this study will be available at https://www.github.com/cssmillie/ulcerative_colitis.
https://singlecell.broadinstitute.org/single_cell/study/SCP259
 


Single-cell transcriptome data were generated from 366,650 cells isolated from the colon mucosa.

The data are split into three major cell compartments (fractions):

 

Metadata

  • all.meta2.txt

Epithelial data: epithelial cell compartment

  • Epi.genes.tsv (151 KB)
  • Epi.barcodes2.tsv (3.0 MB)
  • gene_sorted-Epi.matrix.mtx (2.2 GB)

Stromal data: stromal cell compartment (including connective tissue)

  • Fib.genes.tsv (141 KB)
  • Fib.barcodes2.tsv (778 KB)
  • gene_sorted-Fib.matrix.mtx (492 MB)

Immune data: immune cell compartment

  • Imm.genes.tsv (156 KB)
  • Imm.barcodes2.tsv (5.11 MB)
  • gene_sorted-Imm.matrix.mtx (2.27 GB)
<<epi>>
AnnData object with n_obs × n_vars = 123006 × 20028
    obs: 'Cluster', 'nGene', 'nUMI', 'Subject', 'Health', 'Location', 'Sample'

<<fib>>
AnnData object with n_obs × n_vars = 31872 × 19076
    obs: 'Cluster', 'nGene', 'nUMI', 'Subject', 'Health', 'Location', 'Sample'

<<imm>>
AnnData object with n_obs × n_vars = 210614 × 20529
    obs: 'Cluster', 'nGene', 'nUMI', 'Subject', 'Health', 'Location', 'Sample'
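
One way to turn each genes/barcodes/matrix triplet into a cells × genes matrix like the ones summarized above is sketched here. It assumes the `.mtx` files are in Matrix Market format and that the `.tsv` files list one name per line (first tab-separated field). `load_compartment` and `read_first_column` are hypothetical helper names, the toy files are made up, and the per-cell metadata in `all.meta2.txt` is not handled.

```python
import os
import tempfile

import numpy as np
from scipy import io as spio
from scipy import sparse


def read_first_column(path):
    """Return the first tab-separated field of every non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n").split("\t")[0] for line in handle if line.strip()]


def load_compartment(mtx_path, genes_path, barcodes_path):
    """Load one compartment as a (cells x genes) CSR matrix plus its labels."""
    genes = read_first_column(genes_path)
    barcodes = read_first_column(barcodes_path)
    matrix = sparse.csr_matrix(spio.mmread(mtx_path))
    if matrix.shape == (len(genes), len(barcodes)):
        matrix = matrix.T.tocsr()  # stored genes x cells: flip to cells x genes
    elif matrix.shape != (len(barcodes), len(genes)):
        raise ValueError(f"matrix shape {matrix.shape} does not match the label files")
    return matrix, genes, barcodes


# Toy files standing in for e.g. Epi.genes.tsv, Epi.barcodes2.tsv and
# gene_sorted-Epi.matrix.mtx; names and counts are invented.
tmpdir = tempfile.mkdtemp()
genes_path = os.path.join(tmpdir, "toy.genes.tsv")
barcodes_path = os.path.join(tmpdir, "toy.barcodes.tsv")
mtx_path = os.path.join(tmpdir, "toy.matrix.mtx")

with open(genes_path, "w", encoding="utf-8") as handle:
    handle.write("GENE_A\nGENE_B\nGENE_C\n")
with open(barcodes_path, "w", encoding="utf-8") as handle:
    handle.write("CELL_1\nCELL_2\n")
toy_counts = np.array([[0, 2], [1, 0], [0, 3]])  # genes x cells
spio.mmwrite(mtx_path, sparse.coo_matrix(toy_counts))

matrix, genes, barcodes = load_compartment(mtx_path, genes_path, barcodes_path)
print(matrix.shape)  # (n_cells, n_genes)
```

The shape check accepts either orientation of the stored matrix, so the sketch does not depend on knowing in advance whether a file is genes × cells or cells × genes. It cannot tell the two apart if the gene and cell counts happen to be equal.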
 

 


 

                                        # <<epi>>    # <<fib>>   # <<imm>>
print(adata.obs["Cluster"].nunique())   # 15        # 13        # 23
print(adata.obs["nGene"].nunique())     # 6108      # 3368      # 3355
print(adata.obs["nUMI"].nunique())      # 25407     # 8365      # 18064
print(adata.obs["Subject"].nunique())   # 30        # 30        # 30
print(adata.obs["Location"].nunique())  # 2         # 2         # 2
print(adata.obs["Sample"].nunique())    # 131       # 77        # 132

print(adata.obs["Health"].nunique())    # 3         # 3         # 3
print(adata.obs["Health"].unique())   
# ['Non-inflamed', 'Inflamed', 'Healthy']
# Categories (3, object): ['Healthy', 'Inflamed', 'Non-inflamed']