Generative AI, and specifically Large Language Models (LLMs), hold immense potential for predictive analysis in both genomic medicine and single-cell analysis.
In Genomic Medicine:
• Identifying disease-causing variants: LLMs trained on vast datasets of genomes and associated phenotypes can scan individual genomes and predict the likelihood of harboring disease-causing variants. This can improve early diagnosis and personalized treatment decisions.
• Drug discovery and target identification: By analyzing large-scale genomic data and relevant scientific literature, LLMs can suggest novel drug targets or predict potential drug-gene interactions, accelerating the drug discovery process.
• Stratifying patients for clinical trials: LLMs can analyze patient genomic data and clinical information to identify subpopulations most likely to benefit from specific clinical trials, leading to more targeted and effective therapeutic development.
• Understanding complex biological processes: LLMs can process and analyze vast amounts of data on gene expression, protein interactions, and cellular pathways, aiding in the discovery of complex biological mechanisms underlying diseases.
In Single-Cell Analysis:
• Cell type identification and classification: LLMs trained on single-cell RNA-seq data can accurately classify different cell types within a complex tissue, revealing new cell populations and their roles in health and disease.
• Identifying cell-cell interactions: LLMs can analyze single-cell data to infer communication networks between different cell types, providing insights into tissue organization and function.
• Predicting cellular responses to stimuli: By learning from single-cell responses to various stimuli, LLMs can predict how individual cells or cell populations might react to drugs, environmental changes, or disease progression.
• Generating synthetic single-cell data: LLMs can be used to generate realistic simulations of single-cell data, facilitating the development and testing of new computational tools and analysis methods.
scGPT (single-cell Generative Pre-trained Transformer):
An example of scGPT applied to a liver scRNA-seq dataset is shown below.
scGPT is a Python package for single-cell multi-omic data analysis using pretrained foundation models. The model adapts the GPT approach to single-cell data: it learns representations from the gene expression matrix, which enables tasks such as cell type annotation, differential expression analysis, and even the generation of synthetic single-cell data.
scGPT can be optimized to achieve superior performance across diverse downstream applications, including:
- cell-type annotation,
- multi-batch integration,
- multi-omic integration,
- genetic perturbation prediction, and
- gene network inference
Example 1
Figure 1A: Zero-shot single-cell analysis with the continually pre-trained scGPT model. This scRNA-seq analysis requires no further training of scGPT. The scRNA-seq data are taken from the CELLxGENE Tabula Sapiens liver dataset.
Figure 1B: Embedding visualization
Steps:
- Downloaded the Tabula Sapiens liver dataset and divided it into train (reference, 80%) and test (query, 20%) sets
- Preprocessed the data
- Generated scGPT embeddings for each cell in the reference and query datasets
- Transferred annotations from the reference to the query dataset
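The split and annotation-transfer steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the results here: the embeddings are synthetic 2-D points standing in for scGPT embeddings, the cell-type names are placeholders, and the transfer rule (majority vote among the k nearest reference cells) is an assumption, since the exact transfer method is not specified above. The helper names `split_reference_query` and `transfer_labels` are hypothetical.

```python
import math
import random
from collections import Counter


def split_reference_query(n_cells, reference_fraction=0.8, seed=0):
    """Shuffle cell indices and split them into reference and query sets."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    cut = int(n_cells * reference_fraction)
    return indices[:cut], indices[cut:]


def transfer_labels(ref_emb, ref_labels, query_emb, k=10):
    """Give each query cell the majority label of its k nearest reference cells.

    Assumption: Euclidean distance in embedding space with a simple majority vote.
    """
    predictions = []
    for q in query_emb:
        nearest = sorted(
            range(len(ref_emb)), key=lambda i: math.dist(q, ref_emb[i])
        )[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Synthetic stand-in for scGPT embeddings: three well-separated clusters.
# Real scGPT embeddings are high-dimensional; 2-D points keep the sketch small.
rng = random.Random(42)
centers = {
    "hepatocyte": (0.0, 0.0),
    "endothelial cell": (5.0, 5.0),
    "T cell": (-5.0, 5.0),
}
embeddings, labels = [], []
for cell_type, (cx, cy) in centers.items():
    for _ in range(100):
        embeddings.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        labels.append(cell_type)

# 80% reference (labels known), 20% query (labels held out for evaluation).
ref_idx, query_idx = split_reference_query(len(embeddings), reference_fraction=0.8)
ref_emb = [embeddings[i] for i in ref_idx]
ref_labels = [labels[i] for i in ref_idx]
query_emb = [embeddings[i] for i in query_idx]
query_labels = [labels[i] for i in query_idx]

predicted = transfer_labels(ref_emb, ref_labels, query_emb, k=10)
agreement = sum(p == t for p, t in zip(predicted, query_labels)) / len(query_labels)
print(f"{len(ref_idx)} reference cells, {len(query_idx)} query cells")
print(f"fraction of query cells labelled correctly: {agreement:.3f}")
```

Because the foundation model is only used to produce embeddings, no scGPT weights are updated in this workflow, which is what makes it zero-shot. On real data, an approximate nearest-neighbour index would usually replace the brute-force search shown here.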
Performance Evaluation
'accuracy': 0.9310861423220974,
'precision': 0.8325796185787914,
'recall': 0.7874799806240997,
'macro_f1': 0.8021677680969674
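For readers who want to see how metrics of this kind are derived, the sketch below computes accuracy together with macro-averaged precision, recall, and F1 from true and predicted labels. It assumes that the precision and recall reported above are macro-averaged in the same way as the F1 score, which is not stated explicitly. The function name `classification_metrics` and the toy labels are hypothetical and unrelated to the liver results.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1.

    Macro averaging computes each score per class and then takes the
    unweighted mean, so rare cell types count as much as abundant ones.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n_classes = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "macro_f1": sum(f1_scores) / n_classes,
    }


# Toy labels for illustration only.
y_true = ["hepatocyte", "hepatocyte", "hepatocyte", "T cell", "T cell", "monocyte"]
y_pred = ["hepatocyte", "hepatocyte", "T cell", "T cell", "T cell", "monocyte"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

A gap between accuracy and the macro-averaged scores, like the one in the results above, typically indicates that abundant cell types are predicted well while some rarer types are misclassified more often.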