ScRAT 데이터에 대한 사항은 :
[ScRAT Dataset] compare in cellxgene
[ScRAT Dataset] compare in cellxgene
Paper : Phenotype prediction from single-cell RNA-seq data using attention-based neural networks https://academic.oup.com/bioinformatics/article/40/2/btae067/7613064https://github.com/yuzhenmao/ScRATScRAT 데이터에 대해 살펴보자.COMBAT와 Haniffa
doraemin.tistory.com
1. COMBAT Dataset
총 세포 수: 836,148개
총 환자 수: 124명
2. Haniffa Datasets
총 세포 수: 647,366개
총 환자 수: 130명 (또는 143명)
3.SC4
총 세포 수: 1,462,702개
총 환자 수: 196명
CellFM 학습에 사용된 데이터들
https://github.com/biomed-AI/CellFM
GitHub - biomed-AI/CellFM
Contribute to biomed-AI/CellFM development by creating an account on GitHub.
github.com
https://zenodo.org/records/15138665
CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells
The datasets used in CellFM.
zenodo.org
CellFM_data.zip
GO_data
BP top10_data
func_dict.json170 Bytes processed_test.csv6.9 kB processed_train.csv1.9 MB processed_valid.csv13.0 kB
CC top10_data
func_dict.json170 Bytes processed_test.csv133.2 kB processed_train.csv3.3 MB processed_valid.csv13.5 kB
MF top10_data
func_dict.json170 Bytes processed_test.csv2.4 kB processed_train.csv810.9 kB processed_valid.csv5.9 kB
BCC_GSE123813.h5ad1.2 GB # 🔹 Patient: 11 unique : ['su001', 'su002', 'su003', 'su004', 'su005', ..., 'su007', 'su008', 'su009', 'su010', 'su012'] Length: 11
Cell_Lines.h5ad73.9 MB
DC.h5ad13.3 MB
Gene_classification.h5ad30.0 MB
Heart.h5ad1.9 GB # 🔹 donor_id: 13 unique : ['alexsc', 'BRC2252', 'BRC2256', 'BRC2260', 'BRC2262', ..., 'D4', 'D5', 'D6', 'D7', 'BRC2251']Length: 13
HumanPBMC.h5ad61.4 MB
Immune.h5ad215.1 MB # 🔹 sample_ID: 4 unique : ['0', '1', '2', '3'] # 🔹 study: 4 unique : ['Oetjen', 'Freytag', '10X', 'Sun']
LIHC_GSE140228.h5ad625.6 MB # 🔹 Patient: 5 unique : ['D20171109', 'D20171215', 'D20180108', 'D20180110', 'D20180116']
Liver.h5ad135.2 MB
Lung.h5ad47.5 MB # 🔹 GEO_Sample: 4 unique : ['GSM3732850', 'GSM3732852', 'GSM3732854', 'GSM3732848']
MCA.h5ad29.8 MB
Myeloid.h5ad18.9 MB
PBMC.h5ad80.8 MB
PBMC_10K.h5ad50.5 MB
PBMC_368K.h5ad41.1 MB
Pancrm.h5ad144.1 MB
Skin.h5ad392.8 MB # 🔹 donor_id: 5 unique : ['S2', 'S3', 'S4', 'S5', 'S1']
adamson.h5ad89.5 MB
hPancreas_test.h5ad30.9 MB
hPancreas_train.h5ad39.7 MB
norman.h5ad106.9 MB
데이터별 AnnData object 내용 >
"""
adamson.h5ad >
AnnData object with n_obs × n_vars = 47795 × 1069
obs: 'condition', 'pert_type', 'cell_type', 'cell_type_condition', 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells', 'gene', 'dose_val', 'control', 'condition_name'
var: 'gene_name', 'index', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'hvg', 'log1p', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'
BCC_GSE123813.h5ad >
AnnData object with n_obs × n_vars = 52884 × 20638
obs: 'UMAP_1', 'UMAP_2', 'Celltype (malignancy)', 'Celltype (major-lineage)', 'Celltype (minor-lineage)', 'Celltype (original)', 'Cluster', 'TimePoint', 'Sort', 'Celltype', 'Response', 'Patient', 'Source', 'Age', 'Gender', 'Stage', 'TNMstage', 'Treatment', 'cell_type'
var: 'gene_ids', 'gene_names', 'feature_types', 'genome'
🔹 Patient: 11 unique : ['su001', 'su002', 'su003', 'su004', 'su005', ..., 'su007', 'su008', 'su009', 'su010', 'su012']
Length: 11
Cell_Lines.h5ad >
AnnData object with n_obs × n_vars = 9531 × 32738
obs: 'cell_type', 'batch'
DC.h5ad >
AnnData object with n_obs × n_vars = 576 × 26593
obs: 'cell_type', 'batch'
Gene_classification.h5ad >
AnnData object with n_obs × n_vars = 35248 × 26390
var: 'gene_label', 'dose_cond', 't1', 't2', 't3', 'train_t1', 'train_t2', 'train_t3', 'gene', 'old_name', 'new_name', 'origin'
Heart.h5ad >
AnnData object with n_obs × n_vars = 60668 × 27411
obs: 'nCount_RNA', 'nFeature_RNA', 'age_group', 'cell_source', 'cell_states', 'sample', 'age.order', 'age.days.GA', 'size.CRL', 'size.NRL', 'stage', 'integration.groups', 'integrated_snn_res.0.1', 'clusters.low.res', 'clusters.high.res', 'clusters.res.2', 'clusters.res.3', 'condition', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'donor_id', 'suspension_type', 'is_primary_data', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid', 'batch'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
obsm: 'X_umap'
🔹 donor_id: 13 unique : ['alexsc', 'BRC2252', 'BRC2256', 'BRC2260', 'BRC2262', ..., 'D4', 'D5', 'D6', 'D7', 'BRC2251']
Length: 13
hPancreas_test.h5ad >
AnnData object with n_obs × n_vars = 4218 × 3000
obs: 'Celltype'
var: 'Gene Symbol'
obsm: 'X_umap'
hPancreas_train.h5ad >
AnnData object with n_obs × n_vars = 10600 × 3000
obs: 'Celltype'
var: 'Gene Symbol'
hPancreas_test.h5ad >
AnnData object with n_obs × n_vars = 15476 × 33694
obs: 'cell_type', 'batch'
Immune.h5ad >
AnnData object with n_obs × n_vars = 32484 × 12303
obs: 'batch', 'chemistry', 'data_type', 'dpt_pseudotime', 'cell_type', 'mt_frac', 'n_counts', 'n_genes', 'sample_ID', 'size_factors', 'species', 'study', 'tissue'
var: 'n_cells', 'gene', 'old_name', 'new_name'
layers: 'counts'
🔹 sample_ID: 4 unique : ['0', '1', '2', '3']
🔹 study: 4 unique : ['Oetjen', 'Freytag', '10X', 'Sun']
LIHC_GSE140228.h5ad >
AnnData object with n_obs × n_vars = 61690 × 22143
obs: 'UMAP_1', 'UMAP_2', 'Celltype (malignancy)', 'Celltype (major-lineage)', 'Celltype (minor-lineage)', 'Celltype (original)', 'Cluster', 'Source', 'Celltype_sub', 'Celltype_global', 'Sample', 'Histology', 'Tissue_sub', 'MajorCluster', 'Patient', 'Gender', 'Stage', 'TNMstage', 'cell_type'
var: 'gene_ids', 'gene_names', 'feature_types', 'genome'
🔹 Patient: 5 unique : ['D20171109', 'D20171215', 'D20180108', 'D20180110', 'D20180116']
Liver.h5ad >
AnnData object with n_obs × n_vars = 8444 × 20007
obs: 'batch', 'Cell#', 'Cluster#', 'cell_type', 'n_genes'
var: 'n_cells'
Lung.h5ad >
AnnData object with n_obs × n_vars = 10360 × 16327
obs: 'nGene', 'nUMI', 'orig.ident', 'percent.mito', 'location', 'celltype', 'ID', 'GEO_Sample', 'cell_type', 'batch'
var: 'feature'
🔹 GEO_Sample: 4 unique : ['GSM3732850', 'GSM3732852', 'GSM3732854', 'GSM3732848']
MCA.h5ad >
AnnData object with n_obs × n_vars = 6954 × 15006
obs: 'batch', 'cell_type'
Myeloid.h5ad >
AnnData object with n_obs × n_vars = 13178 × 3000
obs: 'cell_type', 'cancer_type', 'batch'
obsm: 'X_pca', 'X_umap'
norman.h5ad >
AnnData object with n_obs × n_vars = 80506 × 1049
obs: 'condition', 'pert_type', 'cell_type', 'guide_identity', 'read_count', 'UMI_count', 'coverage', 'gemgroup', 'good_coverage', 'number_of_cells', 'n_genes', 'cell_type_condition', 'lane', 'dose_val', 'control', 'condition_name', 'total_count'
var: 'gene_name', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'hvg', 'log1p', 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'
Pancrm.h5ad >
AnnData object with n_obs × n_vars = 14767 × 15558
obs: 'cell_type', 'batch'
PBMC_10K.h5ad>
adata :
AnnData object with n_obs × n_vars = 11990 × 3346
obs: 'n_counts', 'batch', 'labels', 'str_labels', 'cell_type', 'train'
var: 'n_counts-0', 'n_counts-1', 'n_counts'
obsm: 'design', 'normalized_qc', 'qc_pc', 'raw_qc'
PBMC_368K.h5ad >
AnnData object with n_obs × n_vars = 4638 × 14236
obs: 'batch', 'celltype', 'cell_type'
PBMC.h5ad >
AnnData object with n_obs × n_vars = 18868 × 6998
obs: 'condition', 'n_counts', 'n_genes', 'mt_frac', 'cell_type', 'batch'
var: 'gene_symbol', 'n_cells'
obsm: 'X_pca', 'X_tsne', 'X_umap'
Skin.h5ad >
AnnData object with n_obs × n_vars = 15457 × 30867
obs: 'age', 'tissue_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'organism_ontology_term_id', 'is_primary_data', 'donor_id', 'suspension_type', 'Cluster', 'Celltype', 'tissue_type', 'sample_id', 'library_id', 'library_preparation_batch', 'library_sequencing_run', 'alignment_software', 'manner_of_death', 'sample_source', 'sample_collection_method', 'institute', 'sampled_site_condition', 'sample_preservation_method', 'sequenced_fragment', 'reference_genome', 'cell_enrichment', 'gene_annotation_version', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid', 'batch'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
obsm: 'X_pca', 'X_umap'
🔹 donor_id: 5 unique : ['S2', 'S3', 'S4', 'S5', 'S1']
"""
Data File Download

$ wget -O CellFM_data.zip "https://zenodo.org/records/15138665/files/CellFM_data.zip?download=1"
# 아래 명령어로 바로 압축을 해제할 수 있습니다 :
$ unzip CellFM_data.zip -d extracted_files'AI & Data Analysis > Deep Learning' 카테고리의 다른 글
| scRNA-seq Analysis Insight (0) | 2025.08.05 |
|---|---|
| [AMIL] Cell Type Annotation Procedure (COVID, Cardio, Parkinson) (0) | 2025.07.30 |
| [ScRAT Dataset] compare in cellxgene (0) | 2025.07.23 |
| [GAT] self-attention (0) | 2025.07.17 |
| Graph Attention Network (GAT) (0) | 2025.07.09 |