LLMs

Understanding the Language of Disease Biology with LCM and CAG in Place of Human Language


Understanding the language of disease biology involves moving away from human language in multimodal oncology knowledge abstraction and integration models, using Large Concept Models (LCMs) in place of traditional Large Language Models (LLMs). With over 200 human languages, the focus shifts to the language of disease biology. The recent release of LCM by Meta emphasizes knowledge abstraction based on Concept, Content, and Language (CCL). Unlike LLMs, which process text token by token, LCMs reason at the sentence level: each sentence is embedded as a whole, and the model predicts the next sentence embedding rather than the next token.

Additionally, Cache-Augmented Generation (CAG) is proposed as a replacement for Retrieval-Augmented Generation (RAG). RAG can suffer from retrieval delays and retrieval errors. CAG instead preloads the relevant knowledge into the model's context and reuses the cached state at query time, so no retrieval step runs during inference. This can make responses faster and more consistent, provided the knowledge fits within the model's context window. CAG is therefore an attractive alternative to RAG for building knowledge bases and graph models to understand cellular-level interactions in disease versus normal states.
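To make the CAG idea concrete, here is a minimal sketch in plain Python. `ToyModel` and `CacheAugmentedGenerator` are hypothetical stand-ins, not a real library API: the "cache" is just a list of already-encoded tokens standing in for a precomputed key-value cache, and the example only shows that the knowledge is encoded once while each query pays only for its own tokens.

```python
class ToyModel:
    """Stand-in for an LLM encoder; it only counts the tokens it has to process."""

    def __init__(self):
        self.tokens_encoded = 0

    def encode(self, text):
        tokens = text.lower().split()
        self.tokens_encoded += len(tokens)
        return tokens


class CacheAugmentedGenerator:
    """Encodes the knowledge base once and reuses it for every query."""

    def __init__(self, model, documents):
        self.model = model
        # Preload step: in a real system this would be a precomputed KV cache.
        self.cached_context = model.encode(" ".join(documents))

    def build_input(self, question):
        # No retrieval at query time: cached knowledge plus the new question.
        return self.cached_context + self.model.encode(question)


# Placeholder knowledge snippets, for illustration only.
documents = [
    "LSECs regulate hepatic vascular pressure.",
    "HSCs produce extracellular matrix components.",
]

model = ToyModel()
cag = CacheAugmentedGenerator(model, documents)
preload_cost = model.tokens_encoded  # paid once, up front

cag.build_input("Which cells regulate vascular pressure?")
cag.build_input("Which cells produce extracellular matrix?")
query_cost = model.tokens_encoded - preload_cost  # paid per query

print(f"preload tokens: {preload_cost}, query tokens: {query_cost}")
```

The trade-off this sketch hides is capacity: everything in `documents` has to fit in the model's context, which is why CAG suits bounded, curated knowledge bases better than open-ended corpora.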

Industry experts predict that 2025 will be a transformative year for GenAI in healthcare. Let's harness the power of GenAI, integrate multimodal data, and deepen our understanding of disease biology through collaboration and team building.

Architecting LLM Embeddings for Knowledge Graph Construction Using Multimodal Spatial Omics Data

The architecture for integrating multimodal spatial omics data, including imaging and single-cell data, with PubMed database text involves several steps: data integration, feature extraction, and embedding in a high-dimensional space using large language models (LLMs). Initially, tools like SpaTalk, Seurat, and Scanpy are employed to integrate diverse data types such as CODEX, TILs, scRNA, nuclei, and text. Following this, feature extraction tools like Morph and STUtility recognize patterns and align features spatially and temporally. These features are then embedded using LLMs and transformers into vectors that capture semantic relationships and interactions among nodes, such as genes and cells. Knowledge graph embedding models like TransE and RotatE (both available in the PyKEEN library), along with biomedical language models like BioBERT, are used to refine these embeddings further. The process culminates in constructing knowledge graphs, stored in a graph database like Neo4j or expressed in RDF, and in training graph neural networks (GNNs) on them, enabling visualization and analysis of complex data interactions.

This approach is pivotal in drug discovery and targeted immunotherapy, especially in understanding the tumor microenvironment (TME). By integrating and analyzing multimodal data, researchers can uncover intricate patterns and correlations that might otherwise be overlooked. With LLMs and foundation models, the embeddings capture the latent space of these interactions, providing a comprehensive view of biological processes. Knowledge graphs and GNNs built from these embeddings allow for detailed mapping of interactions within the TME, identifying potential drug targets and biomarkers. The no-code/less-code paradigm, facilitated by APIs and libraries, simplifies the construction of this pipeline, enabling efficient data processing and integration. Ultimately, this method can enhance our understanding of complex biological systems and help improve patient outcomes through personalized medicine and precision therapies.
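The knowledge graph embedding step mentioned above names TransE. As a rough illustration of what such a model scores, the sketch below implements the standard TransE scoring function, the negative distance between head + relation and tail, in plain Python. The entity names, the relation, and the 2-dimensional vectors are hypothetical values chosen for readability. In practice these embeddings would be learned, for example with PyKEEN.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 norm of (head + relation - tail).

    Scores closer to zero mean the triple fits the embedding space better.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical, hand-picked embeddings (real ones are learned from the graph).
entity = {
    "CD8_T_cell": [0.0, 1.0],
    "tumor_cell": [1.0, 1.0],
    "fibroblast": [3.0, -2.0],
}
relation = {"interacts_with": [1.0, 0.0]}

plausible = transe_score(
    entity["CD8_T_cell"], relation["interacts_with"], entity["tumor_cell"]
)
implausible = transe_score(
    entity["CD8_T_cell"], relation["interacts_with"], entity["fibroblast"]
)

print(f"plausible triple score:   {plausible:.3f}")
print(f"implausible triple score: {implausible:.3f}")
```

Ranking candidate tails by this score is how link prediction proposes new edges, such as candidate cell-cell interactions, for a knowledge graph.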


MultiOmics RAG/GraphRAG Architecture and Data Embedding Feature Space 

The MultiOmics RAG/GraphRAG architecture is a framework designed to process and analyze spatial omics data. It comprises five stages: Data Collection & Integration, utilizing tools like MISO, VISTA-2D, Mcadet, Pubget, and named-entity recognition (NER); Feature Extraction (LLMs), with tools such as LangChain, spaCy, and NLTK; Chunking & Summarizing; Embedding, with the resulting vectors held in vector databases such as Milvus, Weaviate, and Pinecone; and Vector Store/RAG Query, backed by databases and vector stores like PostgreSQL, MariaDB, SQLite, KDB.AI, Qdrant, and Vectorize. This structured approach facilitates the efficient management and retrieval of complex spatial omics data.


In the realm of drug discovery, this architecture enables the integration and analysis of diverse multimodal data, leading to the identification of novel drug targets and therapeutic pathways. By leveraging spatial omics data, researchers can uncover predictive biomarkers for targeted immunotherapy, enabling more precise and effective treatments for patients. The ability to process and analyze large-scale data with the MultiOmics RAG/GraphRAG architecture accelerates the development of new drugs and personalized therapies, ultimately improving patient outcomes. 
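The chunking, embedding, and vector-store query stages of the architecture can be sketched end to end in a few lines. This is a toy under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, `InMemoryVectorStore` is a hypothetical class standing in for a vector database, and the text is placeholder content rather than real data.

```python
import math
from collections import Counter


def chunk(text, max_words=9):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: a bag-of-words count vector (a real pipeline would use a model)."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.rows = []

    def add(self, text):
        self.rows.append((embed(text), text))

    def query(self, question, top_k=1):
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


# Placeholder text, for illustration only.
abstract = (
    "Knowledge graph clusters map back to spatial transcriptomics images. "
    "Tumor infiltrating lymphocytes are analyzed in colorectal cancer tissue."
)

store = InMemoryVectorStore()
for piece in chunk(abstract, max_words=9):
    store.add(piece)

top = store.query("Where are tumor infiltrating lymphocytes analyzed?")
print(top[0])
```

In a RAG query the retrieved chunk would then be placed in the LLM prompt as context. GraphRAG adds a step that follows knowledge graph edges from the retrieved entities before building that prompt.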


Mapping Contributing Features to Spatial Distribution of Cell Localization 

The knowledge graph (KG) identifies clusters of cells based on their features, such as gene expression profiles, cell types, and spatial relationships. These clusters are then mapped back to the spatial transcriptomics images, allowing researchers to visualize the spatial distribution of cells and their contributing features. This mapping helps in understanding the localization and interaction of different cell types within the tumor microenvironment, providing insights into the underlying biological processes and potential therapeutic targets. 

The image above presents a detailed workflow integrating multi-modality imaging data (histopathology and spatial omics) with large language models (LLMs) and knowledge graphs to analyze tumor-infiltrating lymphocytes (TILs) in the tumor microenvironment (TME) of colorectal cancer (CRC). The process involves identifying clusters of cells in the knowledge graph, then mapping these clusters back to spatial transcriptomics images to understand the contributing attributes (features) and their spatial localization (distribution).
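The cluster-to-image mapping step can be illustrated with a small sketch. It assumes each cell already carries a cluster label from the knowledge graph, spatial coordinates from the image, and a feature value. The cell records, coordinates, and expression values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cells: knowledge graph cluster label, image coordinates, one marker gene.
cells = [
    {"id": "c1", "cluster": "TIL", "x": 10.0, "y": 12.0, "CD8A": 5.0},
    {"id": "c2", "cluster": "TIL", "x": 14.0, "y": 16.0, "CD8A": 7.0},
    {"id": "c3", "cluster": "tumor", "x": 40.0, "y": 42.0, "CD8A": 0.0},
    {"id": "c4", "cluster": "tumor", "x": 44.0, "y": 38.0, "CD8A": 1.0},
]


def summarize_clusters(cells, feature):
    """Group cells by cluster; report where each cluster sits and its mean feature value."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell["cluster"]].append(cell)
    return {
        label: {
            "centroid": (mean(c["x"] for c in members), mean(c["y"] for c in members)),
            "mean_" + feature: mean(c[feature] for c in members),
            "n_cells": len(members),
        }
        for label, members in grouped.items()
    }


summary = summarize_clusters(cells, "CD8A")
for label, stats in summary.items():
    print(label, stats)
```

Overlaying the centroids, or the member cells themselves, on the tissue image is what turns a knowledge graph cluster into a statement about spatial localization.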

The red and green bars in the image represent the relative importance of different features in the analysis. The red bars indicate features that contribute negatively to the model's prediction, while the green bars indicate features that contribute positively. This helps in understanding which features are driving the model's decision-making process.
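The text does not say how the bar values were computed, so the following is only one possible reading: a sketch that assumes a linear model, where each feature's contribution is its weight times its deviation from a baseline value. The weights, feature values, and baselines are hypothetical.

```python
def signed_contributions(weights, values, baseline):
    """Contribution of each feature in a linear model: weight * (value - baseline)."""
    return {name: weights[name] * (values[name] - baseline[name]) for name in weights}


def split_by_sign(contributions):
    """Separate features that raise the prediction (green) from those that lower it (red)."""
    green = {name: v for name, v in contributions.items() if v > 0}
    red = {name: v for name, v in contributions.items() if v < 0}
    return green, red


# Hypothetical weights and feature values, for illustration only.
weights = {"CD8A": 0.8, "COL1A1": -0.5, "cell_density": 0.2}
values = {"CD8A": 3.0, "COL1A1": 4.0, "cell_density": 1.0}
baseline = {"CD8A": 1.0, "COL1A1": 2.0, "cell_density": 1.0}

contributions = signed_contributions(weights, values, baseline)
green, red = split_by_sign(contributions)

# List features by absolute contribution, largest first, as a bar chart would.
for name, value in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    color = "green" if value > 0 else "red" if value < 0 else "none"
    print(f"{name:<12} {value:+.2f} {color}")
```

For nonlinear models, attribution methods such as SHAP produce the same kind of signed, per-feature values, which is why this bar layout is common in model explanation plots.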

LLMs in Biomedicine (Clinical AI) 

LLMs can enhance various aspects of clinical trials, from data analysis and patient recruitment to generating insights and improving diagnostics. They streamline processes, improve accuracy, and ultimately contribute to more efficient and effective clinical research.


Advanced Computational Approaches in GenAI for MASH Liver Diagnostics: Integrating In-Context Learning (ICL) and Large Language Models (LLMs) 

In-Context Learning (ICL) is a capability of large language models, popularized by GPT-3, rather than a separate model, and it offers significant advantages in machine learning and natural language processing. Unlike traditional fine-tuning methods, ICL allows a model to adapt to new tasks quickly by conditioning on examples provided in the prompt, without any parameter updates. This agility and efficiency make ICL a powerful tool for prompt engineering, where the quality of the prompt and its examples directly impacts the model's performance. By combining ICL with fine-tuning, one can leverage the flexibility of ICL and the specialized expertise of fine-tuned models, resulting in a robust and versatile approach to various tasks.
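In practice, ICL comes down to how the prompt is assembled. The sketch below builds a few-shot prompt from labeled examples. The helper name `build_few_shot_prompt`, the "Input:/Label:" layout, and the example findings and labels are all illustrative assumptions, and the resulting string would be sent to whichever LLM is in use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an in-context learning prompt: instruction, labeled examples, then the query."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts += [f"Input: {text}", f"Label: {label}", ""]
    # The final label is left open for the model to complete.
    parts += [f"Input: {query}", "Label:"]
    return "\n".join(parts)


# Hypothetical labeled examples, for illustration only.
examples = [
    ("Collagen deposition around activated hepatic stellate cells.", "fibrosis"),
    ("Immune cell infiltrates within the liver lobule.", "inflammation"),
]

prompt = build_few_shot_prompt(
    "Classify each finding as fibrosis or inflammation.",
    examples,
    "Extracellular matrix accumulation in the mid-lobular zone.",
)
print(prompt)
```

No model weights change here: swapping the examples changes the task, which is what distinguishes ICL from fine-tuning.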


Metabolic dysfunction-associated steatohepatitis (MASH) is a liver disease characterized by inflammation and fibrosis. Understanding liver zonation, particularly the mid-lobular zone, is crucial for studying the onset of MASH, as this area is key to liver regeneration and to the triggering of fibrosis. This makes MASH a particularly relevant test case for ICL. Liver Sinusoidal Endothelial Cells (LSECs) play a vital role in liver zonation, regulating hepatic vascular pressure and exhibiting anti-inflammatory and anti-fibrotic functions. Hepatic Stellate Cells (HSCs), on the other hand, are central to fibrosis and inflammation, producing extracellular matrix components that contribute to fibrosis. The interplay between LSECs and HSCs is essential for understanding the progression of MASH. According to the study "Spatial Computational Hepatic Molecular Biomarker Reveals LSECs Role in Mid Lobular Liver Zonation Fibrosis in DILI and NASH Induced Liver Injury," LSECs in the mid-lobular zone are crucial for early fibrosis detection and liver regeneration.


Integrating spatial transcriptomics data, single-cell RNA-seq data, and histopathology H&E liver images using ICL and large language models (LLMs) in prompt engineering can significantly enhance our understanding of MASH. By building a knowledge graph that incorporates these diverse data types, researchers can develop computational diagnostic biomarkers that provide insights into the spatial distribution and heterogeneity of cell types and subtypes in MASH liver. This approach, powered by Generative AI (GenAI), is currently being tested on more datasets and is at the validation stage. It could improve diagnostic accuracy and offer new perspectives on disease mechanisms, ultimately aiding in the development of targeted therapies.
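One simple way to picture a knowledge graph that combines these data types is a set of triples, each tagged with the modality it came from. The sketch below is a minimal in-memory version. The `KnowledgeGraph` class is hypothetical, and the triples and their modality tags are placeholders rather than findings.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal triple store that records which data modality supports each edge."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, subject, predicate, obj, modality):
        self.edges[subject].add((predicate, obj, modality))

    def facts(self, subject, modality=None):
        """Return (predicate, object, modality) tuples, optionally filtered by modality."""
        return sorted(
            edge for edge in self.edges[subject]
            if modality is None or edge[2] == modality
        )


# Placeholder triples; the modality assignments are illustrative assumptions.
kg = KnowledgeGraph()
kg.add("LSEC", "located_in", "mid_lobular_zone", "spatial_transcriptomics")
kg.add("LSEC", "interacts_with", "HSC", "scRNA-seq")
kg.add("HSC", "produces", "extracellular_matrix", "histopathology")

for fact in kg.facts("LSEC"):
    print("LSEC", *fact)
```

Keeping the modality on each edge makes it possible to ask which claims are supported by more than one data type, a natural starting point for a computational biomarker.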

https://lnkd.in/e2fxMJqk 

Therapeutic targets: MASH (LLM-based multimodal data integration analysis)