[Hierarchical MIL] Exploratory Data (Summary)

doraemin_dev 2025. 3. 22. 16:33

Paper

Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data
https://www.biorxiv.org/content/10.1101/2025.02.10.637389v1.full.pdf
 


 

Paper summary

2025.03.22 - [AI and Data Analysis/Paper] - [Hierarchical MIL] Incorporating Hierarchical Information into Multiple Instance Learning for Patient Phenotype Prediction with scRNA-seq Data

 



GitHub

https://github.com/minhchaudo/hier-mil

 



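DATA (3 datasets)

1. Cardio: multi-class classification on cardiomyopathy patient data (three classes: DCM, HCM, normal)

2. COVID: binary classification predicting COVID-19 infection status

3. ICB: binary classification predicting response to immunotherapy (immune checkpoint blockade)

COVID dataset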
https://singlecell.broadinstitute.org/single_cell/study/SCP1289/impaired-local-intrinsic-immunity-to-sars-cov-2-infection-in-severe-covid-19

 

Impaired local intrinsic immunity to SARS-CoV-2 infection in severe COVID-19 - Single Cell Portal


 

 

Columns : 27

 

Total number of cells (rows): 32,588
Total number of patients (donor_id): 58

  • SARSCoV2_PCR_Status
    • pos    18073
    • neg    14515
  • disease__ontology_label
    • COVID-19                   18073
    • normal                           8874
    • respiratory failure     3335
    • long COVID-19           2306
  • Coarse_Cell_Annotations (18 types)
    • Ciliated Cells                           10059
    • Squamous Cells                            5250
    • Developing Ciliated Cells                 3854
    • Secretory Cells                           3633
    • Goblet Cells                              2807
    • Basal Cells                               1691
    • T Cells                                   1475
    • Erythroblasts                              986
    • Macrophages                                903
    • Deuterosomal Cells                         583
    • Developing Secretory and Goblet Cells      406
    • Ionocytes                                  399
    • Mitotic Basal Cells                        266
    • Dendritic Cells                            142
    • B Cells                                     71
    • Enteroendocrine Cells                       41
    • Plasmacytoid DCs                            13
    • Mast Cells                                   9
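
The counts above are per cell, while the prediction target is per patient. Below is a minimal sketch of rolling cell-level metadata up to one row per donor. It uses a small made-up table in place of the real metadata: the column names `donor_id` and `SARSCoV2_PCR_Status` come from the file, but the values and the helper name `patient_table` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the cell-level metadata (values are made up).
cells = pd.DataFrame({
    "donor_id": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "SARSCoV2_PCR_Status": ["pos", "pos", "pos", "neg", "neg", "pos"],
})

def patient_table(df, id_col="donor_id", label_col="SARSCoV2_PCR_Status"):
    """Collapse cell-level rows into one row per patient (hypothetical helper)."""
    grouped = df.groupby(id_col)[label_col]
    # A patient-level label only makes sense if it is constant within a donor.
    if (grouped.nunique() > 1).any():
        raise ValueError("some donors carry more than one label")
    return pd.DataFrame({
        "n_cells": grouped.size(),   # number of cells per donor
        "label": grouped.first(),    # the single label shared by that donor's cells
    })

patients = patient_table(cells)
print(patients)
print(patients["label"].value_counts())
```

On the real file the same call would be `patient_table(df_data)`, which should return 58 rows if every donor carries a single PCR status.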

MetaData.txt

import pandas as pd

# Load the raw metadata (tab-separated)
df = pd.read_csv("20210701_NasalSwab_MetaData.txt", sep="\t")

# The first row after the header holds column types, not data; split it off
column_types = df.iloc[0]       # row 0 → column types
df_data = df.iloc[1:].copy()    # the real data starts at row 1
df_data.reset_index(drop=True, inplace=True)

print(f" 🔍 Sample data: {df_data.head(3).to_dict()}")


# Convert the columns marked "numeric" to numbers (unparseable values become NaN)
for col in column_types[column_types == "numeric"].index:
    df_data[col] = pd.to_numeric(df_data[col], errors="coerce")

print("Total cells:", len(df_data), "\n")
# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"Unique patients (donors): {df_data['donor_id'].nunique()}\n")
print("Unique biosamples:", df_data["biosample_id"].nunique(), "\n")

# Label distributions
print(f"🦠 COVID infection status distribution: {df_data['SARSCoV2_PCR_Status'].value_counts()}\n")
print(f"Disease code : {df_data['disease'].value_counts()} \n")
print(f"Disease name : {df_data['disease__ontology_label'].value_counts()} \n")

print(f"🧪 (Coarse_Cell_Annotations) types: {len(df_data['Coarse_Cell_Annotations'].unique())}. \n{df_data['Coarse_Cell_Annotations'].value_counts()}")


# Columns that contain missing values
missing = df_data.isnull().sum()
print("📉 Columns with missing values:")
print(missing[missing > 0])

print(f"📋 Full list of column names: {len(df_data.columns)} columns")
for col in df_data.columns:
    print(col)

 

Output

kim89@ailab-System-Product-Name:~/hier-mil/data_original$ python data_MetaData.py
 🔍 Sample data: {'NAME': {0: 'GTCGGGGGGTGG_Control_Participant7', 1: 'CAAATCAATTAT_Control_Participant7', 2: 'ATACAATTGACA_Control_Participant7'}, 'donor_id': {0: 'Control_Participant7', 1: 'Control_Participant7', 2: 'Control_Participant7'}, 'Peak_Respiratory_Support_WHO_Score': {0: '0', 1: '0', 2: '0'}, 'Bloody_Swab': {0: 'No', 1: 'No', 2: 'No'}, 'Percent_Mitochondrial': {0: '11.00478469', 1: '34.21052632', 2: '7.068223725'}, 'SARSCoV2_PCR_Status': {0: 'neg', 1: 'neg', 2: 'neg'}, 'SARSCoV2_PCR_Status_and_WHO_Score': {0: 'neg_0', 1: 'neg_0', 2: 'neg_0'}, 'Cohort_Disease_WHO_Score': {0: 'Control_WHO_0', 1: 'Control_WHO_0', 2: 'Control_WHO_0'}, 'biosample_id': {0: 'WHO_0_Control_Participant7', 1: 'WHO_0_Control_Participant7', 2: 'WHO_0_Control_Participant7'}, 'SingleCell_SARSCoV2_RNA_Status': {0: 'neg', 1: 'neg', 2: 'neg'}, 'SARSCoV2_Unspliced_TRS_Total_Corrected': {0: '0', 1: '0', 2: '0'}, 'SARSCoV2_Spliced_TRS_Total_Corrected': {0: '0', 1: '0', 2: '0'}, 'SARSCoV2_NegativeStrand_Total_Corrected': {0: '0', 1: '0', 2: '0'}, 'SARSCoV2_PositiveStrand_Total_Corrected': {0: '0', 1: '0', 2: '0'}, 'SARSCoV2_Total_Corrected': {0: '0', 1: '0', 2: '0'}, 'species': {0: 'NCBITaxon_9606', 1: 'NCBITaxon_9606', 2: 'NCBITaxon_9606'}, 'species__ontology_label': {0: 'Homo sapiens', 1: 'Homo sapiens', 2: 'Homo sapiens'}, 'sex': {0: 'male', 1: 'male', 2: 'male'}, 'disease': {0: 'PATO_0000461', 1: 'PATO_0000461', 2: 'PATO_0000461'}, 'disease__ontology_label': {0: 'normal', 1: 'normal', 2: 'normal'}, 'organ': {0: 'UBERON_0001728', 1: 'UBERON_0001728', 2: 'UBERON_0001728'}, 'organ__ontology_label': {0: 'nasopharynx', 1: 'nasopharynx', 2: 'nasopharynx'}, 'library_preparation_protocol': {0: 'EFO_0008919', 1: 'EFO_0008919', 2: 'EFO_0008919'}, 'library_preparation_protocol__ontology_label': {0: 'Seq-Well', 1: 'Seq-Well', 2: 'Seq-Well'}, 'age': {0: '50-59', 1: '50-59', 2: '50-59'}, 'Coarse_Cell_Annotations': {0: 'Developing Ciliated Cells', 1: 'Developing Ciliated Cells', 2: 'Developing Ciliated Cells'}, 'Detailed_Cell_Annotations': {0: 'Developing Ciliated Cells', 1: 'Developing Ciliated Cells', 2: 'Developing Ciliated Cells'}}
Total cells: 32588 

Unique patients (donors): 58

Unique biosamples: 58 

🦠 COVID infection status distribution: SARSCoV2_PCR_Status
pos    18073
neg    14515
Name: count, dtype: int64

Disease code : disease
MONDO_0100096    18073
PATO_0000461      8874
MONDO_0021113     3335
MONDO_0100233     2306
Name: count, dtype: int64 

Disease name : disease__ontology_label
COVID-19               18073
normal                  8874
respiratory failure     3335
long COVID-19           2306
Name: count, dtype: int64 

🧪 (Coarse_Cell_Annotations) types: 18. 
Coarse_Cell_Annotations
Ciliated Cells                           10059
Squamous Cells                            5250
Developing Ciliated Cells                 3854
Secretory Cells                           3633
Goblet Cells                              2807
Basal Cells                               1691
T Cells                                   1475
Erythroblasts                              986
Macrophages                                903
Deuterosomal Cells                         583
Developing Secretory and Goblet Cells      406
Ionocytes                                  399
Mitotic Basal Cells                        266
Dendritic Cells                            142
B Cells                                     71
Enteroendocrine Cells                       41
Plasmacytoid DCs                            13
Mast Cells                                   9
Name: count, dtype: int64
📉 Columns with missing values:
Series([], dtype: int64)
📋 Full list of column names: 27 columns
NAME
donor_id
Peak_Respiratory_Support_WHO_Score
Bloody_Swab
Percent_Mitochondrial
SARSCoV2_PCR_Status
SARSCoV2_PCR_Status_and_WHO_Score
Cohort_Disease_WHO_Score
biosample_id
SingleCell_SARSCoV2_RNA_Status
SARSCoV2_Unspliced_TRS_Total_Corrected
SARSCoV2_Spliced_TRS_Total_Corrected
SARSCoV2_NegativeStrand_Total_Corrected
SARSCoV2_PositiveStrand_Total_Corrected
SARSCoV2_Total_Corrected
species
species__ontology_label
sex
disease
disease__ontology_label
organ
organ__ontology_label
library_preparation_protocol
library_preparation_protocol__ontology_label
age
Coarse_Cell_Annotations
Detailed_Cell_Annotations
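
The metadata file carries a second header line with column types, which is the row the script above peels off with `iloc[0]`. As an alternative sketch, `pd.read_csv` can skip that line with `skiprows=[1]` so that pandas infers numeric dtypes directly. The inline text below is a made-up three-column excerpt, not the real file, and the `TYPE`/`group` markers are assumed from the Single Cell Portal metadata convention (only `numeric` is confirmed by the script above).

```python
import io
import pandas as pd

# Made-up excerpt mimicking the layout: header line, type line, then data.
raw = (
    "NAME\tdonor_id\tPercent_Mitochondrial\n"
    "TYPE\tgroup\tnumeric\n"
    "cell_1\tControl_Participant7\t11.0\n"
    "cell_2\tControl_Participant7\t34.2\n"
)

# skiprows=[1] drops the second physical line (the type row) and keeps the header.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=[1])

print(df.dtypes)
print(len(df))
```

One caveat with this shortcut: a column typed as `group` that happens to hold only digits would also be inferred as numeric, so passing an explicit `dtype=` for such columns may be needed.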

Preprocessed icb.h5ad dataset

  • AnnData object with n_obs × n_vars = 9292 × 824
  • Total cells: 9,292. Per-cell information is stored in .obs, a (9292, 197) DataFrame.
  • Genes: 824. Gene information is stored in .var, whose index is an Index of shape (824,).
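
To make the layout concrete, here is a toy stand-in built with plain NumPy and pandas rather than the anndata API: an expression matrix with one row per cell and one column per gene, a per-cell table that shares the row index, and a gene index that shares the column order. The sizes and values are made up; only the three gene names are taken from the real file.

```python
import numpy as np
import pandas as pd

n_cells, genes = 4, ["HAVCR2", "CTLA4", "PDCD1"]
cell_ids = [f"cell_{i}" for i in range(n_cells)]

X = np.zeros((n_cells, len(genes)))                           # plays the role of adata.X
obs = pd.DataFrame({"label": [1, 1, 0, 0]}, index=cell_ids)   # plays the role of adata.obs
var = pd.DataFrame(index=pd.Index(genes))                     # plays the role of adata.var (no columns)

X[2, 1] = 3.1  # made-up expression of CTLA4 in cell_2

# Rows of X follow obs.index; columns of X follow var.index.
expr = pd.DataFrame(X, index=obs.index, columns=var.index)
print(expr.shape, obs.shape, var.shape)
print(expr.loc["cell_2", "CTLA4"])
```

The empty `var` frame mirrors the `(824, 0)` shape in the real object: the gene names live in the index, and there are no additional per-gene columns.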

import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("icb.h5ad")
print(adata)                # summary of the whole object
print(adata.X.shape)        # (9292, 824)
# Assumes adata.X is a dense array; a sparse matrix would need .toarray() first
df_X = pd.DataFrame(adata.X, columns=adata.var.index)
print(df_X.head())

print("==== columns: 824 genes ====")
print(adata.var.shape)      # (824, 0)     → gene info (0 columns when only the names are stored)
print(adata.var.index)      # all gene names: Index(['HAVCR2', 'CTLA4', 'PDCD1', 'IDO1', 'CXCL10', ...], dtype='object')
print(adata.var.head())     # first 5 genes (an empty frame, index only)

print("==== rows: 197 metadata fields for 9292 cells ====")
print(adata.obs.shape)      # (9292, 197)  → per-cell metadata
print(adata.obs.columns[:50])       # metadata columns 0-49
print(adata.obs.columns[50:100])    # metadata columns 50-99
print(adata.obs.columns[100:150])   # metadata columns 100-149
print(adata.obs.columns[150:])      # metadata columns 150 onward
print(adata.obs.head())     # sample rows

The output is shown below.

(venv) kim89@ailab-System-Product-Name:~/hier-mil$ python data/icb/icb_h5ad_analysis.py
AnnData object with n_obs × n_vars = 9292 × 824
    obs: 'sample_id', 'cell_id', 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'biosample_id', 'species', 'species__ontology_label', 'disease', 'disease__ontology_label', 'organ', 'organ__ontology_label', 'library_preparation_protocol', 'library_preparation_protocol__ontology_label', 'sex', 'cell.type', 'flow', 'X', 'Y', 'Gender', 'Primary Location', 'Immunotherapy #1', 'Immunotherapy #2', 'Immunotherapy #3', 'Immunotherapy #4', 'Targeted Therapy (dates)', 'CNS metastasic sites', 'Systemic sites of metastasis', 'SNaPshot Mutations', 'Location of surgery #1', 'Surgery #1 Single-cell ID', 'Location of surgery #2', 'Surgery #2 Single-cell ID', 'Location of surgery #3', 'Surgery #3 Single-cell ID', 'Location of surgery #4', 'Surgery #4 Single-cell ID', 'Pre/post ICI', 'outcome', 'Presence of necrosis on H&E', 'Var.41', '.1', '.2', '.3', '.4', '.5', '.6', 'donor_id_prepost', 'donor_id_responder', 'enough_cells', 'donor_id_prepost_responder', 'pre_post', 'Study_name', 'Cancer_type', 'Primary_or_met', 'sample_id_pre_post', 'total_cell_per_patient', 'cell_type_for_count', 'total_T_Cell', 'normalized_CD8_totalcells', 'RNA_snn_res.0.8', 'seurat_clusters', 'treatment', 'sort', 'cluster', 'UMAP1', 'UMAP2', 'Tumor.Type', 'Treatment', 'Ongoing.Vismodegib.treatment', 'Prior.treatment', 'Response', 'Best...change', 'scRNA.pre.site', 'scRNA.days.pre.treatment', 'scRNA.post.site', 'scRNA.days.post.treatment', 'Adaptive.pre.site', 'Adaptive.days.pre.treatment', 'Adaptive.post.site', 'Adaptive.days.post.treatment', 'PBMC.Adaptive.days.pre.treatment', 'PBMC.Adaptive.days.post.treatment', 'Exome.pre.site', 'Exome.days.pre.treatment', 'Exome.post.site', 'Exome.days.post.treatment', 'epi', 'sample_id_outcome', 'cell_type_for_count.x', 'cell_type_for_count.y', 'total_T_Cell_only', 'normalized_CD8_actual_totalcells', 'cell.types', 'treatment.group', 'Cohort', 'no.of.genes', 'no.of.reads', 'NAME', 'LABELS', 'tumor', 'immune_outcome', 'Immune_resistance.up', 'Immune_resistance.down', 
'OE.Immune_resistance', 'OE.Immune_resistance.up', 'OE.Immune_resistance.down', 'no.genes', 'log.no.reads', 'technology', 'n_cells', 'patient', 'age', 'smoking_status', 'PY', 'diagnosis_recurrence', 'disease_extent', 'AJCC_T', 'AJCC_N', 'AJCC_M', 'AJCC_stage', 'sample_primary_met', 'size', 'site', 'histology', 'genetic_hormonal_features', 'grade', 'KI67', 'chemotherapy_exposed', 'chemotherapy_response', 'targeted_rx_exposed', 'targeted_rx_response', 'ICB_exposed', 'ICB_response', 'ET_exposed', 'ET_response', 'time_end_of_rx_to_sampling', 'post_sampling_rx_exposed', 'post_sampling_rx_response', 'PFS_DFS', 'OS', 'total_T.CD8', 'timepoint', 'cellType', 'cohort', 'treatment_info', 'Cancer_type_pre_post', 'sample', 'id', 'Type', 'No.a', 'Sex', 'Age..Years.b', 'Race', 'Diagnosis', 'Stage', 'Etiology', 'Biopsy.Timingc', 'Treatmentd', 'Mode.of.Actione', 'set', 'Sample', 'Source', 'Stage.y', 'Mode.of.Actione_2', 'sample_id_Mode.of.Actione_2', 'ICB_Exposed', 'ICB_Response', 'TKI_Exposed', 'Initial_Louvain_Cluster', 'Lineage', 'InferCNV', 'FinalCellType', 'sex.x', 'cancer_type', 'sex.y', 'treated_naive', 'Cancer_type_update', 'Outcome', 'Combined_outcome', 'Malignant_clusters', 'patient_ID', 'pre_post_outcome', 'percent.mito', 'percent.ribo', 'pANN_0.25_0.21_50', 'DoubletFinder', 'pANN_0.25_0.21_642', 'pANN_0.25_0.21_61', 'pANN_0.25_0.21_7', 'pANN_0.25_0.21_18', 'pANN_0.25_0.21_94', 'pANN_0.25_0.21_6', 'pANN_0.25_0.21_35', 'Study_name_cancer', 'label', 'cell_type_annotation'
(9292, 824)
   HAVCR2  CTLA4  PDCD1  IDO1  CXCL10     CXCL9   HLA-DRA     STAT1  IFNG  CD3E  GZMK  CD2  CXCL13  IL2RG  ...  WT1  TET2  ZRSR2  PTPN11  EZH2  TP53      CALR  STAG2  CEBPA  CUX1  U2AF1  EP300  PHF6  KRAS
0     0.0    0.0    0.0   0.0     0.0  0.000000  0.000000  0.000000   0.0   0.0   0.0  0.0     0.0    0.0  ...  0.0   0.0    0.0     0.0   0.0   0.0  0.000000    0.0    0.0   0.0    0.0    0.0   0.0   0.0
1     0.0    0.0    0.0   0.0     0.0  0.000000  0.000000  0.000000   0.0   0.0   0.0  0.0     0.0    0.0  ...  0.0   0.0    0.0     0.0   0.0   0.0  0.000000    0.0    0.0   0.0    0.0    0.0   0.0   0.0
2     0.0    0.0    0.0   0.0     0.0  0.000000  3.111702  3.111702   0.0   0.0   0.0  0.0     0.0    0.0  ...  0.0   0.0    0.0     0.0   0.0   0.0  0.000000    0.0    0.0   0.0    0.0    0.0   0.0   0.0
3     0.0    0.0    0.0   0.0     0.0  2.892357  0.000000  2.892357   0.0   0.0   0.0  0.0     0.0    0.0  ...  0.0   0.0    0.0     0.0   0.0   0.0  2.892357    0.0    0.0   0.0    0.0    0.0   0.0   0.0
4     0.0    0.0    0.0   0.0     0.0  0.000000  0.000000  0.000000   0.0   0.0   0.0  0.0     0.0    0.0  ...  0.0   0.0    0.0     0.0   0.0   0.0  3.905236    0.0    0.0   0.0    0.0    0.0   0.0   0.0

[5 rows x 824 columns]
==== columns: 824 genes ====
(824, 0)
Index(['HAVCR2', 'CTLA4', 'PDCD1', 'IDO1', 'CXCL10', 'CXCL9', 'HLA-DRA',
       'STAT1', 'IFNG', 'CD3E',
       ...
       'EZH2', 'TP53', 'CALR', 'STAG2', 'CEBPA', 'CUX1', 'U2AF1', 'EP300',
       'PHF6', 'KRAS'],
      dtype='object', length=824)
Empty DataFrame
Columns: []
Index: [HAVCR2, CTLA4, PDCD1, IDO1, CXCL10]
==== rows: 197 metadata fields for 9292 cells ====
(9292, 197)
Index(['sample_id', 'cell_id', 'orig.ident', 'nCount_RNA', 'nFeature_RNA',
       'biosample_id', 'species', 'species__ontology_label', 'disease',
       'disease__ontology_label', 'organ', 'organ__ontology_label',
       'library_preparation_protocol',
       'library_preparation_protocol__ontology_label', 'sex', 'cell.type',
       'flow', 'X', 'Y', 'Gender', 'Primary Location', 'Immunotherapy #1',
       'Immunotherapy #2', 'Immunotherapy #3', 'Immunotherapy #4',
       'Targeted Therapy (dates)', 'CNS metastasic sites',
       'Systemic sites of metastasis', 'SNaPshot Mutations',
       'Location of surgery #1', 'Surgery #1 Single-cell ID',
       'Location of surgery #2', 'Surgery #2 Single-cell ID',
       'Location of surgery #3', 'Surgery #3 Single-cell ID',
       'Location of surgery #4', 'Surgery #4 Single-cell ID', 'Pre/post ICI',
       'outcome', 'Presence of necrosis on H&E', 'Var.41', '.1', '.2', '.3',
       '.4', '.5', '.6', 'donor_id_prepost', 'donor_id_responder',
       'enough_cells'],
      dtype='object')
Index(['donor_id_prepost_responder', 'pre_post', 'Study_name', 'Cancer_type',
       'Primary_or_met', 'sample_id_pre_post', 'total_cell_per_patient',
       'cell_type_for_count', 'total_T_Cell', 'normalized_CD8_totalcells',
       'RNA_snn_res.0.8', 'seurat_clusters', 'treatment', 'sort', 'cluster',
       'UMAP1', 'UMAP2', 'Tumor.Type', 'Treatment',
       'Ongoing.Vismodegib.treatment', 'Prior.treatment', 'Response',
       'Best...change', 'scRNA.pre.site', 'scRNA.days.pre.treatment',
       'scRNA.post.site', 'scRNA.days.post.treatment', 'Adaptive.pre.site',
       'Adaptive.days.pre.treatment', 'Adaptive.post.site',
       'Adaptive.days.post.treatment', 'PBMC.Adaptive.days.pre.treatment',
       'PBMC.Adaptive.days.post.treatment', 'Exome.pre.site',
       'Exome.days.pre.treatment', 'Exome.post.site',
       'Exome.days.post.treatment', 'epi', 'sample_id_outcome',
       'cell_type_for_count.x', 'cell_type_for_count.y', 'total_T_Cell_only',
       'normalized_CD8_actual_totalcells', 'cell.types', 'treatment.group',
       'Cohort', 'no.of.genes', 'no.of.reads', 'NAME', 'LABELS'],
      dtype='object')
Index(['tumor', 'immune_outcome', 'Immune_resistance.up',
       'Immune_resistance.down', 'OE.Immune_resistance',
       'OE.Immune_resistance.up', 'OE.Immune_resistance.down', 'no.genes',
       'log.no.reads', 'technology', 'n_cells', 'patient', 'age',
       'smoking_status', 'PY', 'diagnosis_recurrence', 'disease_extent',
       'AJCC_T', 'AJCC_N', 'AJCC_M', 'AJCC_stage', 'sample_primary_met',
       'size', 'site', 'histology', 'genetic_hormonal_features', 'grade',
       'KI67', 'chemotherapy_exposed', 'chemotherapy_response',
       'targeted_rx_exposed', 'targeted_rx_response', 'ICB_exposed',
       'ICB_response', 'ET_exposed', 'ET_response',
       'time_end_of_rx_to_sampling', 'post_sampling_rx_exposed',
       'post_sampling_rx_response', 'PFS_DFS', 'OS', 'total_T.CD8',
       'timepoint', 'cellType', 'cohort', 'treatment_info',
       'Cancer_type_pre_post', 'sample', 'id', 'Type'],
      dtype='object')
Index(['No.a', 'Sex', 'Age..Years.b', 'Race', 'Diagnosis', 'Stage', 'Etiology',
       'Biopsy.Timingc', 'Treatmentd', 'Mode.of.Actione', 'set', 'Sample',
       'Source', 'Stage.y', 'Mode.of.Actione_2', 'sample_id_Mode.of.Actione_2',
       'ICB_Exposed', 'ICB_Response', 'TKI_Exposed', 'Initial_Louvain_Cluster',
       'Lineage', 'InferCNV', 'FinalCellType', 'sex.x', 'cancer_type', 'sex.y',
       'treated_naive', 'Cancer_type_update', 'Outcome', 'Combined_outcome',
       'Malignant_clusters', 'patient_ID', 'pre_post_outcome', 'percent.mito',
       'percent.ribo', 'pANN_0.25_0.21_50', 'DoubletFinder',
       'pANN_0.25_0.21_642', 'pANN_0.25_0.21_61', 'pANN_0.25_0.21_7',
       'pANN_0.25_0.21_18', 'pANN_0.25_0.21_94', 'pANN_0.25_0.21_6',
       'pANN_0.25_0.21_35', 'Study_name_cancer', 'label',
       'cell_type_annotation'],
      dtype='object')
                                                    sample_id                           cell_id orig.ident  nCount_RNA  ...  pANN_0.25_0.21_35 Study_name_cancer label cell_type_annotation
Row.names                                                                                                               ...                                                                
Breast_previous_Breast_BIOKEY_10_Pre_AAAGCAAAGC...  BIOKEY_10  BIOKEY_10_Pre_AAAGCAAAGCGTCTAT-1     BIOKEY         407  ...                NaN       Bassez:TNBC     1      Mesangial cells
Breast_previous_Breast_BIOKEY_10_Pre_AAATGCCGTT...  BIOKEY_10  BIOKEY_10_Pre_AAATGCCGTTAGGGTG-1     BIOKEY         411  ...                NaN       Bassez:TNBC     1                  HSC
Breast_previous_Breast_BIOKEY_10_Pre_AACTCTTGTA...  BIOKEY_10  BIOKEY_10_Pre_AACTCTTGTAACGTTC-1     BIOKEY         466  ...                NaN       Bassez:TNBC     1      Mesangial cells
Breast_previous_Breast_BIOKEY_10_Pre_AACTGGTAGT...  BIOKEY_10  BIOKEY_10_Pre_AACTGGTAGTACATGA-1     BIOKEY         587  ...                NaN       Bassez:TNBC     1              B-cells
Breast_previous_Breast_BIOKEY_10_Pre_AACTTTCAGG...  BIOKEY_10  BIOKEY_10_Pre_AACTTTCAGGATCGCA-1     BIOKEY         411  ...                NaN       Bassez:TNBC     1           Adipocytes

[5 rows x 197 columns]
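
The 197 `.obs` columns look like they were merged from several studies, so many of them are likely empty for any given cell (the `pANN_0.25_0.21_35` column is `NaN` in the rows above). Below is a small sketch for ranking metadata columns by how often they are filled, shown on a made-up table. `coverage` is a hypothetical helper; on the real data it would be called as `coverage(adata.obs)`.

```python
import numpy as np
import pandas as pd

def coverage(obs: pd.DataFrame) -> pd.Series:
    """Fraction of non-missing values per column, highest first (hypothetical helper)."""
    return obs.notna().mean().sort_values(ascending=False)

# Made-up per-cell metadata with columns of varying completeness.
toy_obs = pd.DataFrame({
    "label": [1, 1, 0, 0],
    "Study_name_cancer": ["A", "A", "B", None],
    "pANN_0.25_0.21_35": [np.nan, np.nan, np.nan, 0.2],
})

cov = coverage(toy_obs)
print(cov)
print(cov[cov == 1.0].index.tolist())  # columns populated for every cell
```

Columns with full coverage are the safest candidates for patient identifiers and labels; which of them the hier-mil code actually reads is not shown in this post.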