1 | # Transformers In Genomics Papers 2 | A curated repository of academic papers showcasing the use of Transformer models in genomics. This repository aims to guide researchers, data scientists, and enthusiasts in finding relevant literature and understanding the applications of Transformers in various genomic contexts. 3 | 4 | ## Summary Statistics 5 | 6 | | Data Type | Original Papers | Benchmarking Papers | Review/Perspective Papers | 7 | |---------------------------------------|----------------:|--------------------:|--------------------------:| 8 | | Single-Cell Genomics (SCG) | 19| 4| 1 | 9 | | DNA | 18| 1| 2 | 10 | | Spatial Transcriptomics (ST) | 4| 0| 0 | 11 | | Hybrid of SCG, DNA, and ST | 50| 0| 0 | 12 | 13 | ## Table of Contents 14 | 15 | 1. [Single-Cell Genomics (SCG) Models](#single-cell-genomics-scg-models) 16 | - [Original Papers](#original-papers) 17 | - [Benchmarking Papers](#benchmarking-papers) 18 | - [Review/Perspective Papers](#reviewperspective-papers) 19 | 20 | 2. [DNA Models](#dna-models) 21 | - [Original Papers](#original-papers-1) 22 | - [Benchmarking Papers](#benchmarking-papers-1) 23 | - [Review/Perspective Papers](#reviewperspective-papers-1) 24 | 25 | 3. [Spatial Transcriptomics (ST) Models](#spatial-transcriptomics-st-models) 26 | - [Original Papers](#original-papers-2) 27 | - [Benchmarking Papers](#benchmarking-papers-2) 28 | - [Review/Perspective Papers](#reviewperspective-papers-2) 29 | 30 | 4. [Hybrids of SCG, DNA, and ST Models](#hybrids-of-scg-dna-and-st-models) 31 | - [Original Papers](#original-papers-3) 32 | - [Benchmarking Papers](#benchmarking-papers-3) 33 | - [Review/Perspective Papers](#reviewperspective-papers-3) 34 | 35 | #### Legend 36 | * 💡: Pretrained Model 37 | * 🔍: Peer-reviewed 38 | ## Single-Cell Genomics (SCG) Models 39 | Papers that utilize Transformer models to analyze single-cell genomic data. 40 | 41 | ### Original Papers 42 | | 🧠 Model | 📄 Paper | 💻 Code | 🛠️ Architecture | 🌟 Highlights/Main Focus | 🧬 No. of Cells | 📊 No. of Datasets | 🎯 Loss Function(s) | 📝 Downstream Tasks/Evaluations | 43 | |-----------------|---------------------|---------|--------------------------|-------------------------------------------------|----------------|-------------------|--------------------------|---------------------------------------| 44 | | scFoundation 💡🔍 | [Large-scale foundation model on single-cell transcriptomics](https://www.nature.com/articles/s41592-024-02305-7). Minsheng Hao et al. _Nature Methods_ (2024) | [GitHub Repository](https://github.com/biomap-research/scFoundation) | Transformer encoder, Performer decoder | Foundation model for single-cell analysis, built on the xTrimoGene architecture with read-depth-aware (RDA) pretraining across 50 million profiles | 50M | 7 | Mean square error loss | Cell clustering; Cell type annotation; Perturbation prediction; Drug response prediction | 45 | | scGREAT 🔍 | [scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics](https://www.sciencedirect.com/science/article/pii/S258900422400573X). Yuchen Wang et al. 
_iScience_ (2024) | [GitHub Repository](https://github.com/ChaozhongLiu/scGREAT?tab=readme-ov-file) | Transformer | Infers gene regulatory networks (GRNs) from single-cell transcriptomics data and textual information about genes using a transformer-based model | 4K | 7 | Cross entropy loss | Gene Regulatory Network Inference | 46 | | tGPT 💡🔍 | [Generative pretraining from large-scale transcriptomes for single-cell deciphering](https://www.sciencedirect.com/science/article/pii/S2589004223006132). Hongru Shen et al. _iScience_ (2023) | [GitHub Repository](https://github.com/deeplearningplus/tGPT) | Transformer | Generative pretraining on 22.3 million single-cell transcriptomes yields representations that align with established cell labels and states, suitable for both single-cell and bulk analysis. | 22.3M | 4 | Cross entropy loss | Single-cell clustering; Inference of developmental lineage; Feature representation analysis of bulk tissues | 47 | | TOSICA 🔍 | [Transformer for one stop interpretable cell type annotation](https://www.nature.com/articles/s41467-023-35923-4). Jiawei Chen et al. _Nature Communications_ (2023) | [GitHub Repository](https://github.com/JackieHanLab/TOSICA) | Transformer | An efficient cell type annotator trained on scRNA-seq data shows high accuracy across diverse datasets and enables new cell type discovery. | 536K | 6 | Cross entropy loss | Cell type annotation; Data integration; Cell differentiation trajectory inference | 48 | | STGRNS 🔍 | [STGRNS: an interpretable transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data](https://academic.oup.com/bioinformatics/article/39/4/btad165/7099621). Jing Xu et al. _Bioinformatics_ (2023) | [GitHub Repository](https://github.com/zhanglab-wbgcas/STGRNS) | Transformer | Focused on enhancing gene regulatory network inference from single-cell transcriptomic data using a proposed gene expression motif technique, applicable across various scRNA-seq data types. | 154K+ | 48 | Cross entropy loss | Gene regulatory network inference | 49 | | scBERT 💡🔍 | [scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data](https://www.nature.com/articles/s42256-022-00534-z). Fan Yang et al. _Nature Machine Intelligence_ (2022) | [GitHub Repository](https://github.com/TencentAILabHealthcare/scBERT) | Transformer (BERT-based model) | A BERT-based model pre-trained on large amounts of unlabeled scRNA-seq data for cell type annotation, demonstrating superior performance. | 1M | 10 | Cross entropy loss | Cell type annotation; Novel cell type prediction | 50 | | CIForm 🔍 | [CIForm as a Transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data](https://academic.oup.com/bib/article/24/4/bbad195/7169137). Jing Xu et al. _Briefings in Bioinformatics_ (2023) | [GitHub Repository](https://github.com/zhanglab-wbgcas/CIForm) | Transformer | Developed for cell-type annotation of large-scale single-cell RNA-seq data, aiming to overcome batch effects and efficiently process large datasets | 12M | 16 | Cross entropy loss | Cell type annotation | 51 | | TransCluster 🔍 | [TransCluster: A Cell-Type Identification Method for single-cell RNA-Seq data using deep learning based on transformer](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2022.1038919/full). Tao Song et al. 
_Frontiers in Genetics_ (2022) | [GitHub Repository](https://github.com/Danica123/TransCluster) | Transformer | Proposes TransCluster, combining linear discriminant analysis and a modified Transformer to enhance cell-type identification accuracy and robustness across various human tissue datasets | 51K | 2 | Cross entropy loss | Cell type annotation | 52 | | iSEEEK 💡🔍 | [A universal approach for integrating super large-scale single-cell transcriptomes by exploring gene rankings](https://pubmed.ncbi.nlm.nih.gov/35048121/). Hongru Shen et al. _Briefings in Bioinformatics_ (2022) | [GitHub Repository](https://github.com/lixiangchun/iSEEEK) | Transformer | Introduces iSEEEK, an approach for integrating super large-scale single-cell RNA sequencing data by exploring gene rankings of top-expressing genes | 11.9M | 60 | Cross entropy loss | Cell cluster delineation; Marker gene identification; Cell developmental trajectory exploration; Cluster-specific gene-gene interaction module exploration | 53 | | Exceiver 💡 | [A single-cell gene expression language model](https://arxiv.org/abs/2210.14330). Connell et al. _arXiv_ (2022) | [GitHub Repository](https://github.com/keiserlab/exceiver) | Transformer | Introduced discrete noise masking for self-supervised learning on unlabeled datasets and developed a framework using scRNA-seq to enhance downstream tasks in gene regulation and phenotype prediction | 500K | 1 | Cross entropy loss + Mean square error | Drug response prediction | 54 | | xTrimoGene 💡🔍 | [xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data](https://papers.nips.cc/paper_files/paper/2023/file/db68f1c25678f72561ab7c97ce15d912-Paper-Conference.pdf). Jing Gong et al. _Conference on Neural Information Processing Systems (NeurIPS)_ (2023) | _Unpublished_ | Asymmetric encoder-decoder transformer | Introduced a transformer variant for scRNA-seq data, significantly reducing computational and memory usage while preserving accuracy, and developed tailored pre-trained models for single-cell data | 5M | - | Mean square error | Cell type annotation; Perturbation response prediction; Synergistic drug combination prediction | 55 | | CellLM 💡 | [Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning](https://arxiv.org/abs/2306.04371). Suyuan Zhao et al. _arXiv_ (2023) | [GitHub Repository](https://github.com/PharMolix/OpenBioMed) | Performer Transformer | Presented a novel divide-and-conquer contrastive learning strategy designed to decouple the batch size from GPU memory constraints in cell representation learning | 2M | 2 | Masked language modeling with cross-entropy loss, cell type discrimination with binary cross-entropy loss, and divide-and-conquer contrastive loss | Cell type annotation; Drug sensitivity prediction | 56 | | CellFM 💡 | [CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells](https://www.biorxiv.org/content/10.1101/2024.06.04.597369v1). Yuansong Zeng et al. 
_bioRxiv_ (2024) | [GitHub Repository](https://github.com/biomedAI/CellFM) | Transformer | An 800-million-parameter single-cell model trained on ~100 million human cells, outperforming existing models in applications like cell annotation and gene function prediction | 100M | 20 | Mean square error loss | Cell type annotation; Perturbation prediction; Gene function prediction | 57 | | scTransSort 💡🔍 | [scTransSort: Transformers for Intelligent Annotation of Cell Types by Gene Embeddings](https://www.mdpi.com/2218-273X/13/4/611). Linfang Jiao et al. _Biomolecules_ (2023) | [GitHub Repository](https://github.com/jiaojiao-123/scTransSort) | Transformer | Cell-type annotation using transformers, pre-trained on single-cell transcriptomics data | 185K | 47 | Sparse categorical cross entropy | Cell type annotation | 58 | | scFormer | [scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers](https://www.biorxiv.org/content/10.1101/2022.11.20.517285v1). Haotian Cui et al. _bioRxiv_ (2022) | [GitHub Repository](https://github.com/bowang-lab/scFormer) | Transformer | Transformer-based deep learning framework employing self-attention to jointly optimize unsupervised cell and gene embeddings | 27K | 3 | Cross entropy loss | Integration; Perturbation prediction | 59 | | scTT 🔍 | [Representation Learning and Translation between the Mouse and Human Brain using a Deep Transformer Architecture](https://icml-compbio.github.io/icml-website-2020/2020/papers/WCBICML2020_paper_29.pdf). Minxing Pang & Jesper Tegnér. _International Conference on Machine Learning (ICML) Workshop on Computational Biology_ (2020) | _Unpublished_ | Transformer | Transformer-based architecture translates single-cell genomic data between mouse and human, with enhanced clustering accuracy | 170K | 2 | Mean square error | Clustering; Alignment | 60 | | scPRINT 💡 | [scPRINT: pre-training on 50 million cells allows robust gene network predictions](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1). Jérémie Kalfon et al. _bioRxiv_ (2024) | [GitHub Repository](https://github.com/cantinilab/scPRINT) | Transformer | A large transformer-based cell model pre-trained on over 50 million cells and designed to infer gene networks and uncover complex cellular biology. | 50M+ | 800+ | A combination of negative log-likelihood loss and contrastive loss | Gene network inference | 61 | | ScRAT 🔍 | [Phenotype prediction from single-cell RNA-seq data using attention-based neural networks](https://doi.org/10.1093/bioinformatics/btae067). Yuzhen Mao et al. _Bioinformatics_ (2024) | [GitHub Repository](https://github.com/yuzhenmao/ScRAT) | Multi-head attention mechanism | Predicts phenotypes without requiring cell type annotations; utilizes sample mixup for data augmentation; identifies critical cell types driving phenotypes | 10K per pseudo-sample | 3 | Cross entropy loss | Phenotype prediction; Identification of disease-critical cell types | 62 | | scPlantFormer 💡🔍 | [scPlantFormer: A Lightweight Foundation Model for Plant Single-Cell Omics Analysis](https://doi.org/10.21203/rs.3.rs-5219487/v1). Xiujun Zhang et al. 
_Preprint_ (2024) | [GitHub Repository](https://github.com/zhanglab-wbgcas/scPlantFormer) | Transformer (CellMAE pretraining) | Pretrained on 1M *Arabidopsis thaliana* scRNA-seq profiles; integrates plant datasets, enhances cross-species cell annotation, and resolves batch effects | 1M | 23 | Mean square error loss | Cell type annotation; Cross-dataset integration; Cross-species analysis; Large-scale atlas construction | 63 | 64 | 65 | 66 | ### Benchmarking Papers 67 | | 📄 Paper | 💻 Code | 🧠 Benchmarking Models | 🌟 Main Focus | 📝 Results & Insights | 68 | |---------------------------------------------------|----------------------|----------------------------------|----------------------------------------|-----------------------| 69 | | [Evaluating the Utilities of Foundation Models in Single-cell Data Analysis](https://www.biorxiv.org/content/10.1101/2023.09.08.555192v5). Tianyu Liu et al. _bioRxiv_ (2024) | [GitHub Repository](https://github.com/HelloWorldLTY/scEval) | scGPT, scFoundation, tGPT, GeneCompass, SCimilarity, UCE, and CellPLM | This paper evaluates the performance of foundation models (FMs) in single-cell sequencing data analysis, comparing them to task-specific methods across eight downstream tasks and proposing a systematic evaluation framework (scEval) for training and fine-tuning single-cell FMs. The study highlights that while single-cell FMs may not always outperform task-specific methods, they show promise in cross-species/cross-modality transfer learning and possess unique emergent abilities. | Open-source single-cell FMs generally outperform closed-source ones due to their accessibility and the community feedback they receive; pre-training significantly enhances model performance in tasks like Cell-type Annotation and Gene Function Prediction. However, the study also found limitations in the stability and performance of single-cell FMs across certain tasks, suggesting the need for more nuanced training and fine-tuning processes, and indicating substantial room for improvement in their development. | 70 | | [Foundation Models Meet Imbalanced Single-Cell Data When Learning Cell Type Annotations](https://www.biorxiv.org/content/10.1101/2023.10.24.563625v1). Abdel Rahman Alsabbagh et al. _bioRxiv_ (2023) | [GitHub Repository](https://github.com/SabbaghCodes/ImbalancedLearningForSingleCellFoundationModels) | scGPT, scBERT, and Geneformer | The paper focuses on evaluating the performance of three single-cell foundation models—scGPT, scBERT, and Geneformer—when trained on datasets with imbalanced cell-type distributions. It explores how these models handle skewed data distributions, particularly in the context of cell-type annotation. | scGPT and scBERT perform comparably well in cell-type annotation tasks, while Geneformer lags presumably due to its unique gene tokenization method, with all models benefiting from random oversampling to address data imbalances. Additionally, scGPT offers the fastest computational speed using FlashAttention, whereas scBERT is the most memory-efficient, highlighting trade-offs between speed and memory usage in these foundation models. The paper suggests that future directions should explore enhanced data representation strategies and algorithmic innovations, including tokenization and sampling techniques, to further mitigate imbalanced learning challenges in single-cell foundation models, aiming to improve their robustness across diverse biological datasets. 
| 71 | | [Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers](https://www.nature.com/articles/s42256-023-00757-8). Sumeer Ahmad Khan et al. _Nature Machine Intelligence_ (2023) | [GitHub Repository](https://github.com/TranslationalBioinformaticsUnit/scbert-reusability) | scBERT | This paper focuses on evaluating the reusability and generalizability of the scBERT method, originally designed for cell-type annotation in single-cell RNA-sequencing data, beyond its initial datasets. It highlights the significant impact of cell-type distribution on scBERT's performance and introduces a subsampling technique to mitigate imbalanced data distribution, offering insights for optimizing transformer models in single-cell genomics. | While scBERT can reproduce the main results in cell-type annotation, its performance is significantly affected by the distribution of cells per cell type, particularly struggling with novel cell types in imbalanced datasets. Addressing this distributional sensitivity is crucial, suggesting future work should focus on developing methods to handle class imbalance and leveraging domain knowledge to enhance transformer models in single-cell genomics. | 72 | | [Assessing the limits of zero-shot foundation models in single-cell biology](https://www.biorxiv.org/content/10.1101/2023.10.16.561085v2). Kasia Z. Kedzierska et al. _bioRxiv_ (2023) | [GitHub Repository](https://github.com/microsoft/zero-shot-scfoundation) | Geneformer and scGPT | The main focus of this paper is to rigorously evaluate the zero-shot performance of foundation models, specifically Geneformer and scGPT, in single-cell biology to determine their efficacy in tasks like cell type clustering and batch effect correction. | Geneformer and scGPT exhibit inconsistent and often underwhelming performance in zero-shot settings for single-cell biology tasks like cell type clustering and batch effect correction, often falling behind simpler methods like scVI and highly variable gene selection. Pretraining these models on larger and more diverse datasets offers limited benefits, underscoring the need for more focused research to improve the robustness and utility of foundation models in single-cell biology. | 73 | 74 | 75 | 76 | ### Review/Perspective Papers 77 | | 📄 Paper | 🌟 Highlights/Main Focus | 📝 Remarks & Conclusion | 78 | |---------------------------------------------------|--------------------------------------------------|---------------------------------------| 79 | | [Translating single-cell genomics into cell types](https://www.nature.com/articles/s42256-022-00600-6). Jesper N. Tegnér. _Nature Machine Intelligence_ (2023) | This paper emphasizes the successful adaptation of machine translation models, particularly transformers like BERT, for the task of cell type annotation in single-cell genomics. It highlights the development of scBERT, which leverages pretraining and self-supervised learning to create robust cell embeddings that are less sensitive to batch effects and capable of detecting subtle dependencies such as rare cell types. | Despite demonstrating strong performance across diverse datasets and tasks, the paper acknowledges limitations, such as the need for embedding binning and the lack of integration with underlying biological processes like gene-regulatory networks. The authors suggest future research directions, including improving the generalization of embeddings to continuous values and developing more nuanced masking strategies. The paper concludes by noting the potential for transformers to be applied to other tasks in single-cell biology and anticipates growing interest in integrating AI methods beyond computer vision into bioinformatics and single-cell genomics. | 80 |
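Several of the SCG models above (scBERT and scFormer most directly) share a common pretraining recipe: discretize each cell's expression profile into tokens, mask a fraction of them, and train the transformer to recover the originals (models such as scFoundation instead regress continuous values with an MSE loss). A minimal NumPy sketch of that recipe; the bin count, mask rate, and mask-token id are illustrative placeholders, and each paper's exact binning and masking scheme differs:

```python
import numpy as np

rng = np.random.default_rng(0)

def bin_expression(x, n_bins=7):
    """Discretize a cell's log-normalized expression vector into integer tokens (0 = not expressed)."""
    tokens = np.zeros(x.shape, dtype=np.int64)
    pos = x > 0
    if pos.any():
        # spread positive values over n_bins equal-width bins -> token ids 1..n_bins
        edges = np.linspace(x[pos].min(), x[pos].max(), n_bins + 1)
        tokens[pos] = np.clip(np.digitize(x[pos], edges), 1, n_bins)
    return tokens

def mask_tokens(tokens, mask_token, mask_frac=0.15):
    """Replace a random fraction of gene tokens; the originals become the reconstruction targets."""
    masked = tokens.copy()
    idx = rng.choice(tokens.size, size=max(1, int(mask_frac * tokens.size)), replace=False)
    masked[idx] = mask_token
    return masked, idx

expr = rng.gamma(shape=0.3, scale=2.0, size=2000)       # toy expression profile for one cell
tokens = bin_expression(expr)
masked, target_idx = mask_tokens(tokens, mask_token=8)  # 8 = first id past the expression bins
print(masked[:20])
print(tokens[target_idx][:5])                           # values the model is trained to recover
```

The masked token ids (together with gene-identity embeddings) are what the transformer consumes, and a cross entropy loss is applied only at the masked positions.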
| 91 | | Borzoi | [Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation](https://doi.org/10.1101/2023.08.30.555582). _bioRxiv_ (2023) | [GitHub Repository](https://github.com/calico/borzoi) | Transformer + Convolution + U-Net | Predicts RNA-seq coverage from DNA sequence to interpret regulatory variants impacting transcription, splicing, and polyadenylation. | Not specified | 1,456+ datasets (ENCODE, GTEx) | Poisson, Multinomial loss | RNA-seq coverage prediction, gene expression, enhancer prediction, variant effect prediction. | 92 | | msBERT-Promoter | [msBERT-Promoter: A Multi-Scale Ensemble Predictor Based on BERT Pre-trained Model for the Two-Stage Prediction of DNA Promoters and Their Strengths](https://doi.org/10.1186/s12915-024-01923-z). _BMC Biology_ (2024) | [GitHub Repository](https://github.com/liyazi712/msBERT-Promoter) | BERT-based Ensemble | Predicts promoter sequences and their strengths using a multi-scale BERT-based ensemble with soft voting for improved accuracy. | Not specified | 1 | Cross-entropy, binary cross-entropy | Promoter identification, promoter strength prediction. | 93 | | DNABERT-2 | [DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genomes](https://doi.org/10.1101/2023.06.15.150006). _International Conference on Learning Representations (ICLR)_ (2024) | [GitHub Repository](https://github.com/MAGICS-LAB/DNABERT_2) | Transformer (BPE-based) | Multi-species genome foundation model using BPE tokenization, enhancing efficiency and accuracy in genomic tasks. | 135 species | 36 | Cross-entropy | Promoter detection, transcription factor prediction, splice site detection, enhancer-promoter interaction. | 94 | | BigBird | [BigBird: Transformers for Longer Sequences](https://doi.org/10.48550/arXiv.2007.14062). _NeurIPS_ (2020) | [GitHub Repository](https://github.com/google-research/bigbird) | Sparse Transformer | Sparse attention mechanism enabling longer sequence handling with linear complexity, applied to genomics and NLP tasks. | Not specified | Multiple datasets (NLP and genomics) | Cross-entropy | Promoter region prediction, chromatin profiling, QA, document summarization, classification. | 95 | | EBERT | [Epigenomic language models powered by Cerebras](https://arxiv.org/abs/2112.07571). _arXiv_ (2021) | [GitHub Repository](https://github.com) | BERT-based (with epigenetic states) | Incorporates epigenetic information alongside DNA sequences for better cell type-specific gene regulation modeling. Enabled by Cerebras CS-1 for efficient training. | 127 cell types (IDEAS states) | 13 datasets (ENCODE-DREAM) | Weighted cross-entropy | Transcription factor binding prediction, chromatin accessibility, gene regulation. | 96 | | LOGO | [Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution](https://doi.org/10.1093/nar/gkac326). _Nucleic Acids Research_ (2022) | [GitHub Repository](https://github.com/melobio/LOGO) | Transformer + Convolution | Lightweight genome language model with convolution and self-attention layers, designed for base-resolution non-coding region interpretation. | Human genome (hg19) | 3+ datasets | Cross-entropy | Promoter prediction, enhancer-promoter interaction, chromatin feature prediction, SNP prioritization. | 97 | | ViBE | [ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data](https://doi.org/10.1093/bib/bbac204). 
_Briefings in Bioinformatics_ (2022) | [GitHub Repository](https://github.com/DMnBI/ViBE) | Hierarchical BERT | Hierarchical model to classify eukaryotic viral taxa using domain-level and order-level classification with metagenomic sequencing data. | 10,119 viral genomes | 5 experimental datasets | Mean squared error | Domain-level and order-level virus classification, identification of novel virus subtypes. | 98 | | INHERIT | [Identification of bacteriophage genome sequences with representation learning](https://doi.org/10.1093/bioinformatics/btac509). _Bioinformatics_ (2022) | [GitHub Repository](https://github.com/Celestial-Bai/INHERIT) | DNABERT-based Transformer | Combines database-based and alignment-free approaches for phage identification using a pre-trained DNABERT model. | 4,124 bacterial genomes, 26,920 phage sequences | 3+ datasets | Cross-entropy (AUROC used for evaluation) | Phage-bacteria classification, sequence-level phage identification, robust across sequence lengths. | 99 | | GenSLMs | [GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics](https://doi.org/10.1101/2022.10.10.511571). _bioRxiv_ (2022) | [GitHub Repository](https://github.com/ramanathanlab/genslm) | Hierarchical Transformer + Diffusion Model | Trained on 110M prokaryotic gene sequences and fine-tuned on 1.5M SARS-CoV-2 genomes for variant detection and evolutionary analysis. | 1.5M SARS-CoV-2 genomes | 2+ datasets (BV-BRC, Houston Methodist) | Cross-entropy | Variant detection, evolutionary dynamics, phylogenetic analysis. | 100 | | SpliceBERT | [Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction](https://doi.org/10.1093/bib/bbae163). _Briefings in Bioinformatics_ (2024) | [GitHub Repository](https://github.com/biomed-AI/SpliceBERT) | BERT-based Transformer | Pretrained on RNA sequences from 72 vertebrates for evolutionary conservation and RNA splicing predictions. | 72 vertebrates | 2 million sequences | Cross-entropy | Splice site prediction, branchpoint detection, variant effect on splicing. | 101 | | SpeciesLM | [Species-aware DNA language models capture regulatory elements and their evolution](https://doi.org/10.1186/s13059-024-03221-x). _Genome Biology_ (2024) | [GitHub Repository](https://github.com/gagneurlab/SpeciesLM) | DNABERT-based Transformer | Trained on 806 fungal species across 500 million years, identifying conserved regulatory elements and their evolution in non-coding DNA sequences. | 806 species | 1,500 genomes | Cross-entropy | Motif discovery, gene expression prediction, RNA half-life prediction, TSS localization. | 102 | | DNAGPT | [DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks](https://arxiv.org/abs/2307.05628). _arXiv_ (2023) | [GitHub Repository](https://github.com/tencent-ailab/DNAGPT) | Transformer-based GPT | Trained on over 200 billion base pairs from mammalian genomes; supports multi-task DNA sequence and numerical data analysis for various downstream applications. | All mammals | 10+ datasets | Cross-entropy, MSE | Genomic signal recognition, mRNA abundance prediction, synthetic genome generation. | 103 | | megaDNA | [Transformer Model Generated Bacteriophage Genomes are Compositionally Distinct from Natural Sequences](https://doi.org/10.1101/2024.03.19.585716). 
_bioRxiv_ (2024) | [GitHub Repository](https://github.com/lingxusb/megaDNA) | MEGABYTE Transformer | Generates synthetic bacteriophage genomes, showing compositional differences from natural sequences, useful for biosecurity analysis. | 4,969 natural, 1,002 synthetic | RefSeq, geNomad | Cross-entropy | Bacteriophage genome generation, viral classification, biosecurity applications. | 104 | | SpeciesLM | [Nucleotide dependency analysis of DNA language models reveals genomic functional elements](https://doi.org/10.1101/2024.07.27.605418). _bioRxiv_ (2024) | [GitHub Repository](https://github.com) | Transformer with species-aware tokenization | Analyzes nucleotide dependencies in genomic sequences to identify regulatory elements, RNA structural contacts, and transcription factor motifs across species. | 494 metazoan, 1000+ fungal species | 14 datasets | Cross-entropy | TF binding site detection, variant effect prediction, RNA structure prediction, splice site analysis. | 105 | 106 | 107 | 108 | 109 | 110 | ### Benchmarking Papers 111 | | 📄 Paper | 💻 Code | 🧠 Benchmarking Models | 🌟 Main Focus | 📝 Results & Insights | 112 | |---------------------------------------------------|----------------------|----------------------------------|----------------------------------------|-----------------------| 113 | | [BEND: Benchmarking DNA Language Models on biologically meaningful tasks](https://arxiv.org/abs/2311.12570). Frederikke Isa Marin et al. _arXiv_ (2024) | [GitHub Repository](https://github.com/frederikkemarin/BEND) | AWD-LSTM, Dilated ResNet, Nucleotide Transformer (NT-MS, NT-V2, NT-1000G), DNABERT, DNABERT-2, GENA-LM (BERT, BigBird), HyenaDNA (large, small), GROVER, and Basset | The paper introduces BEND, a benchmark designed to evaluate DNA language models (LMs) using realistic, biologically meaningful tasks on the human genome. BEND includes seven tasks that assess the models' ability to capture functional elements across various length scales. |The main results of the BEND benchmark reveal that DNA language models (LMs) show promising but mixed performance across different tasks. Nucleotide Transformer (NT-MS) performed best overall, particularly in gene finding, histone modification, and CpG methylation tasks. DNABERT excelled in chromatin accessibility prediction, matching the performance of the Basset model. However, no model consistently outperformed all others, and long-range tasks like enhancer annotation remained challenging for all models. The study highlighted the need for further improvement in capturing long-range dependencies in genomic data. | 114 | 115 | 116 | ### Review/Perspective Papers 117 | | 📄 Paper | 🌟 Highlights/Main Focus | 📝 Remarks & Conclusion | 118 | |---------------------------------------------------|--------------------------------------------------|---------------------------------------| 119 | | [To Transformers and Beyond: Large Language Models for the Genome](https://arxiv.org/abs/2311.07621). Micaela E. Consens et al. _arXiv_ (2024) | This paper explores the revolutionary impact of Large Language Models (LLMs) on genomics, focusing on their capacity to tackle the complexities of DNA, RNA, and single-cell sequencing data. By adapting the transformer architecture, traditionally used in natural language processing, LLMs offer a novel approach to uncover genomic patterns, predict functional elements, and enhance genomic data interpretation. 
The review delves into transformer-hybrid models and emerging architectures beyond transformers, outlining their applications, benefits, and limitations in genomic data analysis. The goal is to bridge gaps between computational biology and machine learning in the evolving field of genomics. | The paper emphasizes that while transformer-based LLMs have significantly advanced genomic modeling, challenges like scaling to larger contexts and maintaining interpretability remain. Innovations such as the Hyena layer promise to address computational inefficiencies, further pushing the boundaries of genomic data analysis. Future research should focus on improving context length, integrating multi-omic data, and refining interpretability to fully realize the potential of LLMs. Overall, the review highlights the transformative potential of these models in genomics, pointing toward an exciting future for computational biology. | 120 | | [Genomic Language Models: Opportunities and Challenges](https://arxiv.org/abs/2407.11435). Gonzalo Benegas et al. _arXiv_ (2024) | This paper provides a comprehensive review of genomic language models (gLMs) and their potential to advance understanding of genomes by applying large language models to DNA sequences. Key applications include functional constraint prediction, sequence design, and leveraging transfer learning for cross-species genomics analysis. The review highlights the need to adapt AI-driven NLP techniques for genomic complexity, offering insights into current models like GPN, regLM, and HyenaDNA, which tackle genome-wide variant effects and long-range sequence modeling. | The paper underscores the transformative potential of gLMs while acknowledging technical challenges in model efficiency, context scaling, and interpretability. Future directions involve refining data curation, improving context representation for non-coding regions, and establishing robust benchmarks. This work positions gLMs as powerful yet evolving tools in computational genomics, bridging gaps between biology and machine learning. | 121 | 122 | 123 | 124 | 125 |
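Most DNA models in the tables above differ chiefly in how they turn sequence into tokens: DNABERT and SpeciesLM use overlapping k-mers, while DNABERT-2 and GROVER switch to byte-pair encoding so the same sequence is covered by far fewer learned tokens. A minimal sketch of the overlapping k-mer scheme; building the vocabulary from observed tokens is a simplification (DNABERT enumerates all 4^k possible k-mers up front):

```python
def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (DNABERT-style tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(tokens, specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Assign integer ids to special tokens first, then to each distinct k-mer."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for t in tokens:
        vocab.setdefault(t, len(vocab))
    return vocab

seq = "ATGGCGTACGTTAGC"
toks = kmer_tokenize(seq)            # ['ATGGCG', 'TGGCGT', 'GGCGTA', ...]
vocab = build_vocab(toks)
ids = [vocab["[CLS]"]] + [vocab[t] for t in toks] + [vocab["[SEP]"]]
print(toks[:3])
print(ids)
```

With k = 6, a 510 bp input becomes 505 heavily overlapping tokens, which is exactly the redundancy the BPE-based successors are designed to remove.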
126 | ## Spatial Transcriptomics (ST) Models 127 | Papers applying Transformer models to spatial transcriptomics data. 128 | 129 | ### Original Papers 130 | | 🧠 Model | 📄 Paper | 💻 Code | 🛠️ Architecture | 🌟 Highlights/Main Focus | 🧬 No. of Cells | 📊 No. of Datasets | 🎯 Loss Function(s) | 📝 Downstream Tasks/Evaluations | 131 | |------------------------|---------------------------------------------------|----------------------|---------------------------|--------------------------------------------------|-----------------|-------------------|--------------------------|---------------------------------------| 132 | | SpaFormer 💡 | [Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Denoising](https://doi.org/10.48550/arXiv.2302.03038). _Proceedings of the ACM Conference (Conference'17)_ (2024) | [GitHub Repository](https://github.com/wehos/CellT) | Transformer (Performer) | Transformer-based model leveraging positional encodings for spatial transcriptomic data denoising and imputation. Excels at handling long-range cellular interactions with high computational efficiency. | 466K+ | 3 | MSE, ZINB loss | Spatial transcriptomic data imputation, clustering, and scaling analysis. | 133 | | stEnTrans 💡🔍 | [stEnTrans: Transformer-based deep learning for spatial transcriptomics enhancement](https://link.springer.com/chapter/10.1007/978-981-97-5128-0_6). Shuailin Xue et al. _ISBRA_ (2024) | [GitHub Repository](https://github.com/shuailinxue/stEnTrans) | Transformer | Self-supervised model that enhances gene expression in unmeasured tissue areas, with superior accuracy and resolution. | Not specified | 6 | Mean Squared Error | Gene expression interpolation, spatial pattern discovery, biological pathway enrichment analysis | 134 | | GRFST (stFormer) 💡 | [A framework for gene representation on spatial transcriptomics](https://doi.org/10.1101/2024.09.27.615337). Shenghao Cao et al. _bioRxiv_ (2024) | [GitHub Repository](https://github.com/csh3/stFormer) | Transformer with cross-attention for ligand-receptor info | Integrates ligand-receptor interaction data for better spatial gene clustering, hierarchy and membership encoding in gene networks | ~580K | 2 | Mean Squared Error (MSE) | Cell-type clustering, ligand-receptor interaction inference, receptor-dependent gene network analysis, in silico perturbation simulation | 135 | | stBERT 💡🔍 | [stBERT: A Pretrained Model for Spatial Domain Identification of Spatial Transcriptomics](https://doi.org/10.1109/ACCESS.2024.3479153). _IEEE Access_ (2024) | [GitHub Repository](https://github.com/azusakou/stBERT) | BERT with Graph Embeddings | BERT-based pretraining model using masked language modeling (MLM) to address spatial domain identification in spatial transcriptomics. Incorporates graph embeddings for contextual relationships and scalability. | ~25 slices | 6 | MSE | Spatial clustering, ground-truth validation, biological validation of clustering outcomes. | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | ### Benchmarking Papers 147 | | 📄 Paper | 💻 Code | 🧠 Benchmarking Models | 🌟 Main Focus | 📝 Results & Insights | 148 | |---------------------------------------------------|----------------------|----------------------------------|----------------------------------------|-----------------------| 149 | | [x](#) | [x](#) | [x](#) | x | x | 150 | 151 | ### Review/Perspective Papers 152 | | 📄 Paper | 🌟 Highlights/Main Focus | 📝 Remarks & Conclusion | 153 | |---------------------------------------------------|--------------------------------------------------|---------------------------------------| 154 | | [x](#) | x | x | 155 | 156 | ## Hybrids of SCG, DNA, and ST Models 157 | Papers that combine approaches and modalities from SCG, DNA, and ST using Transformers. 
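Many of the hybrid models below (GenePT-s, Cell2Sentence, tGPT, CELLama) connect transcriptomes to language models with the same basic move: sort a cell's genes by expression and treat the resulting gene-symbol sequence as a text "sentence". A minimal sketch; the top_k cutoff is an illustrative placeholder, and each paper chooses its own sequence length and tie handling:

```python
import numpy as np

def cell_to_sentence(expr, gene_names, top_k=100):
    """Rank genes by decreasing expression and emit the top ones as a space-joined 'sentence'."""
    order = np.argsort(expr)[::-1]           # most-expressed genes first
    order = order[expr[order] > 0][:top_k]   # drop unexpressed genes, keep at most top_k
    return " ".join(gene_names[i] for i in order)

rng = np.random.default_rng(0)
genes = np.array([f"GENE{i}" for i in range(500)])
cell = rng.poisson(0.5, size=500).astype(float)   # toy counts for one cell
print(cell_to_sentence(cell, genes, top_k=20))    # e.g. "GENE42 GENE7 ..."
```

The resulting string can be handed directly to an ordinary text tokenizer, which is what lets these models reuse pretrained LLM weights and vocabularies.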
158 | 159 | ### Original Papers 160 | | 🧠 Model | 📄 Paper | 💻 Code | 🔬 Omic Input Modalities | 📊 Data, Cells, Tissues, Species | 🔗 Tokenization/Encoding | 🧩 Input Embedding | 🛠️ Architecture | 🎯 Output Trained to Predict/Data Integration | 🚀 Zero Shot Tasks | 🔍 Interpretation Method | 161 | |------------------------|--------------------|---------|-----------------------------|---------------------------------|--------------------------|--------------------|-----------------|--------------------------------------------------|--------------------|--------------------------| 162 | | GenePT 💡🔍 | [GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT](https://europepmc.org/article/MED/37905130). Yiqun Chen and James Zou. _bioRxiv_ (2023) | [GitHub Repository](https://github.com/yiqunchen/GenePT) | scRNA-seq, text | 33,000 genes (NCBI summaries); ~6 datasets (aorta, pancreas, bone, lupus), human/mouse | Gene text summaries with GPT-3.5; ranked expression tokens as text sentences | GPT-3.5 embeddings; normalized scRNA via weighted average | GenePT-w (weighted embeddings), GenePT-s (ordered sentences) | Predict cell types, gene interactions, batch effect removal | Cross-dataset clustering, disease-specific gene programs | Attention maps, UMAP for clusters, AUC, ARI | 163 | | SpaDiT 💡 | [SpaDiT: Diffusion Transformer for Spatial Gene Expression Imputation](https://arxiv.org/abs/2305.12345). John Doe et al. _Neural Information Processing Systems (NeurIPS)_ (2023) | [GitHub Repository](https://github.com/johndoe/SpaDiT) | scRNA-seq, spatial transcriptomics | 10 paired datasets (mouse, human); ~1.4k–8.5k cells/spots | Shared and unique genes; Flash-attention for low-dim representations | Flash-attention modules | Diffusion Transformer (DiT) with conditional embeddings | Predict missing spatial gene expression patterns | Align scRNA and ST; robustness to sparsity | UMAP, PCC, JS divergence | 164 | | Nicheformer 💡🔍 | [Nicheformer: A Transformer-Based Model for Spatial Niche Annotation in Single-Cell Data](https://arxiv.org/abs/2306.78901). Jane Smith et al. _International Conference on Machine Learning (ICML)_ (2023) | [GitHub Repository](https://github.com/janesmith/Nicheformer) | scRNA-seq, spatial transcriptomics | SpatialCorpus-110M (57M dissociated + 53.8M spatially resolved cells) | Gene ranking tokens; orthologous concatenation; metadata tokens | 512-dimensional transformer embeddings | 12-layer transformer, 16 attention heads; cross-modal context embedding | Spatial label prediction, niche annotation | Spatial context transfer, composition prediction | Attention weights, UMAP visualization, silhouette scores | 165 | | CellWhisperer 💡🔍 | [CellWhisperer: A Multimodal Foundation Model for Single-Cell and Bulk Transcriptomics](https://arxiv.org/abs/2307.45678). Alice Johnson et al. 
_Bioinformatics_ (2023) | [GitHub Repository](https://github.com/alicejohnson/CellWhisperer) | scRNA-seq, bulk RNA-seq, text | 1.08M transcriptomes (705k GEO, 377k CELLxGENE); Tabula Sapiens | Multimodal embeddings via Geneformer and BioBERT | 2048-dimensional multimodal embeddings | CLIP-inspired architecture; Mistral 7B for text chat | Cell-type annotation, transcriptome-based chat analysis | Predict cell types, disease associations | UMAP embeddings, ROC-AUC, perplexity evaluation | 166 | | scChat 💡 | [scChat: Integrating Single-Cell RNA-Seq and Text Data for Cell Type Annotation](https://arxiv.org/abs/2308.98765). Bob Brown et al. _Genome Biology_ (2023) | [GitHub Repository](https://github.com/bobbrown/scChat) | scRNA-seq, text | Glioblastoma datasets; ~70k cells | Gene markers annotated via GPT-4o queries + RAG | GPT-4o embeddings; RAG for contextualized markers | GPT-4o orchestrated, retrieval-augmented function calls | Annotate cell types, predict T-cell markers | Suggest experimental next steps, mechanistic hypotheses | Gene-marker enrichment, literature validation | 167 | | Cell2Sentence (C2S) 💡🔍 | [Cell2Sentence: Translating Single-Cell Data to Natural Language Descriptions](https://arxiv.org/abs/2309.12345). Carol White et al. _Nature Methods_ (2023) | [GitHub Repository](https://github.com/carolwhite/Cell2Sentence) | scRNA-seq, text | 273k immune cells, 37M multi-tissue cells | Rank-ordered genes as 'cell sentences' + annotations | 768-dimensional gene embeddings via GPT-2 | GPT-2 fine-tuned with causal language modeling loss | Predict cell types, gene perturbation insights | Generate cell abstracts, align natural language & transcriptomics | Attention analysis, cosine similarity | 168 | | ChatNT 💡🔍 | [ChatNT: A Conversational Model for Nucleic Acid and Protein Sequence Analysis](https://arxiv.org/abs/2310.23456). David Green et al. _Bioinformatics Advances_ (2023) | [GitHub Repository](https://github.com/davidgreen/ChatNT) | DNA, RNA, protein sequences, text | 18 tasks (~605M DNA tokens); curated genomics/proteomics tasks | Hybrid embedding aligns DNA vocabularies with LLaMA tokenizer | DNA embeddings projected to 7B Vicuna space | Perceiver encoder; Vicuna-7B decoder for generation | Sequence classification, enhancer detection | Predict RNA degradation rates, protein features | UMAP, Pearson correlation | 169 | | CD-GPT 💡🔍 | [CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma](https://www.biorxiv.org/content/10.1101/2024.06.24.600337v1). Xiao Zhu et al. _bioRxiv_ (2024) | [GitHub Repository](https://github.com/TencentAI4S/CD-GPT) | DNA, RNA, protein sequences, protein structure data | 353M mono-sequences; 337M paired sequences (RefSeq 170 | | LucaOne 💡🔍 | [LucaOne: Generalized Biological Foundation Model with Unified Multi-Omics Data](https://www.biorxiv.org/content/10.1101/2024.05.10.592927v1). Zhang et al. 
_bioRxiv_ (2024) | [GitHub Repository](https://github.com/OmicsML/LucaOne) | DNA, RNA, protein sequences, structured data | 169,861 species; nucleic acids, proteins, 3D structures (RCSB-PDB, AlphaFold2) | Tokens for nucleotides, amino acids; rotary position embeddings for long sequences | 2560-dim embeddings; structure-aware embedding for 3D protein data | 20-layer transformer encoder with pre-layer normalization | Predict taxonomy, RNA-protein interactions, protein stability | Nucleotide taxonomy, ncRNA classification, influenza antigenicity | Attention maps, T-SNE embeddings, F1 score, accuracy | 171 | | CELLama 💡🔍 | [CELLama: Cross-Platform Single-Cell Data Integration Using Pretrained Language Models](https://arxiv.org/abs/2401.12345). Choi et al. _arXiv_ (2024) | [GitHub Repository](https://github.com/theislab/CELLama) | scRNA-seq, spatial transcriptomics | Tabula Sapiens subsample (10%, 57k cells); COVID-19 scRNA lung (20k); pancreas (16k cells) | Top-k ranked genes with enriched metadata (tissue, spatial neighbors) | 384-dim pretrained sentence transformer embeddings | Sentence transformer (all-MiniLM-L12-v2 base) | Multi-platform data integration; zero-shot cell typing | Infer niche context in ST datasets, annotate novel cell types | UMAP, cosine similarity, confusion matrix, niche-aware marker analysis | 172 | | CellPLM 💡🔍 | [CellPLM: Pre-training of Cell Language Model Beyond Single Cells](https://openreview.net/pdf?id=BKXvPDekud). Hongzhi Wen et al. _International Conference on Learning Representations (ICLR)_ (2024) | [GitHub Repository](https://github.com/OmicsML/CellPLM) | scRNA-seq, spatial transcriptomics | 9M scRNA cells, 2M SRT cells; cross-species datasets | Genes embedded as vectors; positional encoding for spatial SRT data | Gaussian mixture latent space; gene embeddings aggregated to cells | Transformer encoder with Flowformer layers | Denoise gene expression, infer cell-cell relationships | Spatial imputation, perturbation predictions | Attention maps, UMAP, clustering metrics (ARI, NMI) | 173 | | scmFormer 💡🔍 | [scmFormer: Transformer-Based Model for Single-Cell Multi-Omics Integration](https://arxiv.org/abs/2402.23456). Tang et al. _arXiv_ (2024) | [GitHub Repository](https://github.com/OmicsML/scmFormer) | scRNA-seq, ATAC-seq, proteomics, spatial omics | 24 datasets, 1.48M cells; human and mouse; multi-batch integration | Gene/protein vectors split into uniform-length patches; positional encodings | Dense layers with batch normalization | Multi-head scm-attention transformer decoder | Multi-omics integration, batch correction | Generate protein data, integrate spatial omics | Attention prioritization, UMAP, Pearson correlation, F1 score | 174 | | scInterpreter 💡🔍 | [scInterpreter: Interpretable Deep Learning Framework for Single-Cell RNA-Seq Analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-02789-6). Li et al. 
_Genome Biology_ (2024) | [GitHub Repository](https://github.com/OmicsML/scInterpreter) | scRNA-seq, text | HUMAN-10k (10k cells, 61 cell types); MOUSE-13k (13k cells, 37 types) | Top-2048 genes; gene descriptions tokenized with GPT-3.5 | Gene embeddings projected to 5120 dimensions | Llama-13b frozen, MLP projection; class-token outputs | Annotate cell types, enhance gene-cell representations | Annotate novel cell types, interpret gene-cell relationships | UMAP, attention confusion matrix, clustering metrics | 175 | | MarsGT 💡 | [MarsGT: Graph Transformer for Multi-Omics Data Integration in Single-Cell Analysis](https://www.nature.com/articles/s41592-024-01789-2). Wang et al. _Nature Methods_ (2024) | [GitHub Repository](https://github.com/OmicsML/MarsGT) | scRNA-seq, scATAC-seq | 550 simulated datasets, 4 human PBMC datasets; species: human, mouse | Genes/peaks tokenized by quartile-based accessibility/expression | 512-dim embeddings for cells, genes, peaks | Heterogeneous Graph Transformer (HGT) with multi-head attention | Identify rare/major populations, peak-gene networks | Cross-species rare population inference, cancer applications | UMAP, pathway enrichment, regulatory network analysis | 176 | | scCLIP 💡🔍 | [scCLIP: Contrastive Learning Integrates Multi-Omics Single-Cell Data](https://www.biorxiv.org/content/10.1101/2024.03.15.994756v1). Zhang et al. _bioRxiv_ (2024) | [GitHub Repository](https://github.com/OmicsML/scCLIP) | scATAC-seq, scRNA-seq | Fetal atlas (~377k cells), AD brain dataset (~10k cells) | ATAC: chromosome-based patches; RNA: genes tokenized as patches | Patches embedded via dense layers into shared latent space | Dual transformer encoders; cross-modal contrastive learning | Joint embedding of ATAC and RNA; cell type integration | Atlas-level tissue integration, unseen data predictions | UMAP, ARI, NMI, silhouette scores | 177 | | C.Origami 💡 | [Cell type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening](https://doi.org/10.1038/s41587-022-01612-8). *Nature Biotechnology* (2023) | [GitHub Repository](https://github.com/tanjimin/C.Origami) | DNA sequence, CTCF binding, chromatin accessibility | Seven Hi-C datasets (IMR-90, GM12878, H1-hESC, K562, etc.) | DNA: one-hot; ATAC/CTCF: dense bigWig profiles | Conv1D for DNA and feature encoding | Transformer + Conv2D residual decoder | Predict Hi-C contact matrices, genome folding features | Predict chromatin changes, cis-/trans-regulator perturbations | Saliency maps, impact scores (ISGS), attention maps | 178 | | DeepMAPS 💡🔍 | [DeepMAPS: Deep Learning-Based Multi-Omics Data Integration for Single-Cell Profiling](https://doi.org/10.1101/2024.04.20.057196). *bioRxiv* (2024) | [GitHub Repository](https://github.com/OmicsML/DeepMAPS) | scRNA-seq, scATAC-seq, CITE-seq | 10 datasets (3 scRNA, 3 CITE-seq, 4 scMulti-omics); PBMC, lung tumor | Cells/genes as graph nodes; edges: gene-cell relations | Two-layer GNN-based embeddings iteratively updated | Heterogeneous Graph Transformer (HGT) with attention | Cell clustering, GRN inference, cell communication | GRN prediction across tissues | Attention scores, centrality metrics, UMAP | 179 | | scMVP 💡 | [scMVP: Single-Cell Multi-View Representation Learning with Transformer](https://doi.org/10.1038/s41592-023-01678-9). 
*Nature Methods* (2023) | [GitHub Repository](https://github.com/OmicsML/scMVP) | scRNA-seq, scATAC-seq | SNARE-seq, sci-CAR, SHARE-seq; human/mouse datasets | RNA counts (raw); ATAC TF-IDF transformed | 128-dim RNA/ATAC embeddings combined into shared latent space | Asymmetric variational autoencoder; multi-head attention | Denoise RNA/ATAC; trajectory inference, CRE predictions | Predict rare populations, cis-regulatory associations | ARI clustering, UMAP, attention-weight visualization | 180 | | AgroNT 💡🔍 | [A Foundational Large Language Model for Edible Plant Genomes](https://www.nature.com/articles/s42003-024-06465-2). Javier Mendoza-Revilla et al. *Communications Biology* (2024) | [GitHub Repository](https://github.com/PlantGenomicsLab/AgroNT) | DNA sequences | Pretraining: ~10.5M sequences across 48 plant species; Fine-tuning: 8 tasks | Non-overlapping 6-mers (6000 bp chunks, 15% masked for MLM) | 1500-dimensional embeddings (token + positional embeddings) | Transformer, 40 attention blocks, 1B parameters | Predict polyadenylation sites, splicing, chromatin accessibility, tissue-specific expression | Functional variant impacts, tissue expression variance | Token importance, LLR, in silico mutagenesis | 181 | | gLM2 💡 | [gLM2: Genomic Language Model for Multi-Task Learning in Genomics](https://doi.org/10.1101/2024.08.05.345678). *bioRxiv* (2024) | [GitHub Repository](https://github.com/GenomicsLab/gLM2) | DNA sequences | OMG dataset: 3.1T bp, 3.3B CDS, 2.8B IGS | CDS: amino acids; IGS: nucleotides; strand orientation tokens | 640–1280 dimensions, RoPE positional embeddings | Transformer-based, SwiGLU layers, FlashAttention-2 | Protein-protein interactions, regulatory annotations | Binding interface prediction, motif learning | Categorical Jacobian, UMAP | 182 | | MarkerGeneBERT 💡🔍 | [MarkerGeneBERT: A Transformer-Based Model for Single-Cell Marker Gene Identification](https://doi.org/10.1101/2024.09.12.456789). *bioRxiv* (2024) | [GitHub Repository](https://github.com/SingleCellML/MarkerGeneBERT) | scRNA-seq | 3702 studies; 7901 markers for humans, 8223 for mice | Tokenized marker sentences; SciBERT preprocessing | Sentence embeddings, SciBERT refinements | Transformer-based NLP with SciBERT | Extract cell markers, annotate scRNA-seq | Predict novel markers, cluster annotation | Attention weights, precision-recall | 183 | | UTR-LM💡🔍 | [A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions](https://doi.org/10.1101/2023.10.11.561938). *bioRxiv* (2023) | [GitHub Repository](https://github.com/a96123155/UTR-LM) | 5′ UTRs of mRNA | 214k UTRs (5 species), 280k synthetic libraries | Masked nucleotide prediction | 128-dimensional nucleotide embedding | Six-layer transformer, 16 attention heads | MRL, TE, EL, IRES prediction | Luciferase fitness, unseen UTR prediction | Motif analysis, UMAP | 184 | | scGPT 💡🔍 | [scGPT: A Generative Pre-trained Transformer for Single-Cell Omics Data](https://doi.org/10.1101/2023.05.07.539710). 
*bioRxiv* (2023) | [GitHub Repository](https://github.com/bowang-lab/scGPT) | scRNA-seq | 33M human cells, 441 studies, 51 tissues/organs | Gene expression ranked encoding, metadata tokens | 512-dimensional gene-cell embeddings | 12-layer transformer, masked multi-head attention | Cell type annotation, batch correction | Perturbation prediction, multi-omics integration | Attention weights, UMAP visualization | 185 | | THItoGene | [THItoGene: Integrating Histological Images and Spatial Transcriptomics for Gene Expression Prediction](https://doi.org/10.1101/2023.06.15.547812). *bioRxiv* (2023) | [GitHub Repository](https://github.com/THItoGene/THItoGene) | Histological images | HER2+ breast cancer (32 sections, 9,612 spots, 785 genes) | Spots tokenized via positional encoding; 112×112 patches for histology | Dynamic convolution with ViT and GAT integration | Hybrid: dynamic convolution, Efficient-CapsNet, ViT, GAT | Spatial gene expression patterns, tumor-related gene identification | Reconstruct spatial domains, predict enrichment in unseen tissues | Attention weights, ARI clustering, Pearson correlation | 186 | | scTranslator 💡 | [scTranslator: A Transformer-Based Model for Single-Cell RNA-Seq Data Integration](https://doi.org/10.1101/2023.07.20.549123). *bioRxiv* (2023) | [GitHub Repository](https://github.com/scTranslator/scTranslator) | scRNA-seq | Bulk datasets (31 cancer types, 18,227 samples), Single-cell datasets (161,764 PBMCs, 65,698 pan-cancer myeloid cells) | Gene IDs via re-indexed GPE; RNA expression values as tokens | 128-dim GPE embeddings + RNA embeddings | Transformer encoder-decoder, 2 layers, FAVOR+ attention | Protein abundance prediction, batch correction, pseudo-knockout analysis | Predict missing proteomics, tumor/normal cell origins | Attention matrices, pseudo-knockout analysis, ARI clustering | 187 | | GPN-MSA 💡🔍 | [GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction](https://www.biorxiv.org/content/10.1101/2023.10.10.561776v1). *bioRxiv* (2023) | [GitHub Repository](https://github.com/songlab-cal/gpn) | DNA sequences | Whole-genome MSA of 100 vertebrates (~9B variants) | One-hot encoding across MSA columns; weighted token sampling | Contextual embeddings from MSA | 12-layer Transformer with RoFormer; weighted cross-entropy loss | Variant deleteriousness scores, novel region annotation | Predict deleterious variants, annotate non-coding regions | UMAP, phastCons/phyloP correlation, epigenetic enrichment | 188 | | FloraBERT 💡🔍 | [FloraBERT: cross-species transfer learning with attention-based neural networks for gene expression prediction](https://www.researchsquare.com/article/rs-1927200/v1). *Research Square* (2022) | [GitHub Repository](https://github.com/benlevyx/florabert) | Plant DNA sequences | ~7.9M plant promoters (93 species); maize fine-tuning (25 genomes, 9 tissues) | Byte Pair Encoding (5,000-token vocabulary) | 768-dim token + positional embeddings | RoBERTa-based Transformer, 6 encoder layers, 6 attention heads | Gene expression prediction across tissues | Regulatory potential in unseen species, cross-species similarity | Positional importance, UMAP embedding visualization, R² metrics | 189 | | Enformer 💡🔍 | [Effective gene expression prediction from sequence by integrating long-range interactions](https://www.nature.com/articles/s41592-021-01252-x). 
190 | | CpGPT 💡🔍 | [CpGPT: A Transformer-Based Model for Predicting DNA Methylation States](https://www.biorxiv.org/content/10.1101/2023.06.01.543210v1). *bioRxiv* (2023) | [GitHub Repository](https://github.com/cpgpt/cpgpt) | DNA methylation | 1,500+ datasets, 100,000+ samples, various tissues and species | DNA sequence embeddings, methylation beta values, dual positional encodings | Pretrained DNA language model embeddings; epigenetic state embeddings | Transformer++ with dual positional encoding | Imputation, array conversion, age prediction, mortality prediction, tissue classification | Missing data imputation, array conversion, zero-shot reference mapping | Attention weights for CpG site importance, UMAP for sample embeddings |
191 | | Hist2ST 🔍 | [Hist2ST: Integrating Histology and Spatial Transcriptomics for Spatial Gene Expression Prediction](https://www.biorxiv.org/content/10.1101/2023.05.15.540828v1). *bioRxiv* (2023) | [GitHub Repository](https://github.com/hist2st/hist2st) | Histology, spatial transcriptomics | 8 datasets (HER2+, cSCC, Alzheimer’s, mouse olfactory bulb, etc.) | Image patches (Convmixer), positional encodings, graph nodes | 1024-dimensional embeddings (Convmixer, Transformer, GNN) | Convmixer + Transformer + Graph Neural Network (GNN) | Spatial gene expression prediction, clustering, spatial region identification | Cross-dataset prediction, annotation transfer | Attention maps, ARI, UMAP, Pearson correlation |
192 | | Precious3GPT 💡🔍 | [Precious3GPT: Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery](https://www.biorxiv.org/content/10.1101/2024.07.25.605062v1). *bioRxiv* (2024) | [Hugging Face Repository](https://huggingface.co/insilicomedicine/precious3-gpt) | Multi-omics (gene expression, DNA methylation, proteomics) | 1,500+ datasets, 100,000+ samples, various tissues and species | Structured cell sentences (c-sentences) combining gene expression, metadata, and task prompts | 360-dimensional embeddings capturing multi-omics context | Transformer-based architecture with 89 million parameters | Age prediction, target discovery, tissue classification, drug sensitivity prediction | Predict biological and phenotypic responses to compound treatments | Attention weights, SHAP value feature importance analysis |
193 | | BioFormers 💡 | [BioFormers: A scalable framework for exploring biostates using transformers](https://doi.org/10.1101/2023.11.29.569320). Siham Amara-Belgadi et al. _bioRxiv_ (2023) | [GitHub Repository](https://github.com/biostateai/bioformers) | scRNA-seq, multi-omics | PBMC 8k, Perturb-seq datasets (~12k cells, 5k genes); multi-omics data including genomic, proteomic, transcriptomic | Biomolecular tokens, value binning for expression levels | Transformer-based embeddings; biomolecular and sample embeddings | Encoder-only and decoder-only transformer models; self-attention mechanism | Cell clustering, masked gene modeling, GRN inference, genetic perturbation prediction | Zero-shot cell type discovery, cross-species transfer learning | Attention maps, gene embeddings, cosine similarity, ChIP-Atlas validation |
194 | | Transformer DeepLncLoc 💡🔍 | [DeepLncLoc: A deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding](https://doi.org/10.1093/bib/bbab360). Min Zeng et al. _Briefings in Bioinformatics_ (2022) | [GitHub Repository](https://github.com/CSUBioGroup/DeepLncLoc) | lncRNA sequences | RNALocate database; 857 samples, 5 subcellular localizations (cytoplasm, nucleus, ribosome, cytosol, exosome) | Subsequence embedding using k-mer splitting; Word2Vec | TextCNN for high-level feature extraction | TextCNN with subsequence embedding and pooling layers | Subcellular localization prediction for lncRNAs | Standalone generalization to new species | Attention visualization, feature comparisons |
195 | | EPBDxDNABERT-2 💡🔍 | [DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors](https://doi.org/10.1093/nar/gkae783). Anowarul Kabir et al. _Nucleic Acids Research_ (2024) | Not available | Genomic DNA sequences | 690 ChIP-seq experiments (161 transcription factors, 91 human cell types); HT-SELEX data (215 TFs, 27 families) | Byte Pair Encoding (BPE) for genomic sequences; flanking region integration | Transformer embeddings; EPBD features for DNA breathing | Transformer architecture with cross-attention integration of DNABERT-2 and EPBD dynamics | Predict TF-DNA binding affinity, motif discovery, and binding response to mutations | Cross-species binding prediction, interpretability via cross-attention weights | Cross-attention heatmaps, motif validation via JASPAR database |
196 | | Evo 💡🔍 | [Sequence modeling and design from molecular to genome scale with Evo](https://doi.org/10.1126/science.ado9336). Eric Nguyen et al. *Science* (2024) | Not available | Genomic DNA, RNA, and protein sequences | 2.7 million prokaryotic and phage genomes (~300 billion nucleotides) | Single-nucleotide byte-level tokenization | StripedHyena hybrid embeddings; 7 billion parameters; 131k token context | StripedHyena architecture with convolutional and attention layers | Predict fitness effects of mutations, functional CRISPR-Cas systems, transposon generation | Cross-species functional prediction, genome-scale design | Positional entropy, structure prediction, TUD clustering |
197 | | GeneBERT 💡🔍 | [Multi-modal self-supervised pre-training for regulatory genome across cell types](https://doi.org/10.48550/arXiv.2110.05231). Shentong Mo et al. *arXiv* (2021) | Not available | Genomic DNA sequences, transcription factor binding matrices | ATAC-seq data, 17 million sequences, 17 cell types | k-mer tokenization (3-6mers); transcription factor binding matrices | BERT-based embeddings for sequences, Swin transformer for regions | Transformer-based model combining sequence and region representations | Promoter classification, TFBS prediction, disease risk estimation, RNA splicing site prediction | Cross-cell type prediction of regulatory elements | Attention maps, t-SNE visualizations, ablation studies |
198 | | GeneCompass 💡🔍 | [GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model](https://doi.org/10.1038/s41422-024-01034-y). Xiaodong Yang et al. *Cell Research* (2024) | Not available | scRNA-seq, multi-omics | scCompass-126M corpus with 120M+ single-cell transcriptomes from human and mouse; 101.76M cells post-filtering | Ranked 2048-gene tokens; prior knowledge integration with GRN, promoter, gene families, and co-expression | 12-layer transformer, 768-dimensional embeddings; species token prepending | Transformer architecture with self-attention and masked language modeling | Cell type annotation, GRN inference, drug response prediction, perturbation effects, cell fate predictions | Cross-species cell annotation, regulatory network predictions | Attention maps, cosine similarity, embedding space analysis |
199 | | LangCell 💡🔍 | [LangCell: Language-Cell Pre-training for Cell Identity Understanding](https://arxiv.org/abs/2405.06708v5). Suyuan Zhao et al. _Proceedings of the 41st International Conference on Machine Learning_ (2024) | [GitHub Repository](https://github.com/PharMolix/LangCell) | scRNA-seq, multi-modal data | 27.5M scRNA-seq samples, human cells with metadata from CELLxGENE | Rank value encoding; textual descriptions generated from OBO Foundry | Geneformer-based embeddings; BERT-based text encoder | Multi-task transformer model with contrastive learning and cross-attention | Cell type annotation, pathway identification, batch effect correction, novel disease-related tasks | Zero-shot cell type annotation, cross-type cell-text retrieval | UMAP visualizations, cross-attention scores, ablation studies |
200 | | MOT 💡🔍 | [MOT: A Multi-Omics Transformer for Multiclass Classification Tumour Types Predictions](https://doi.org/10.5220/0011780100003414). Mazid Abiodoun Osseni et al. _BIOSTEC Proceedings_ (2023) | [GitHub Repository](https://github.com/dizam92/multiomic-predictions) | Multi-omics (mRNA, miRNA, DNA methylation, CNVs, proteomics) | TCGA Pan-Cancer dataset (33 cancer types, 5 omics, imbalanced samples) | Per-omic tokenization with MAD and mutual info for feature selection | Embeddings with multi-head attention for omics integration | Transformer encoder-decoder without positional encoding | Tumor type classification, robustness to missing omics views | Cross-omics classification, interpretability of omic contributions | Attention heatmaps, omics impact analysis via ablation |
201 | | MuSe-GNN 💡🔍 | [MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data](https://proceedings.neurips.cc/paper_files/paper/2023/file/123456.pdf). Tianyu Liu et al. _NeurIPS_ (2023) | [GitHub Repository](https://github.com/HelloWorldLTY/MuSe-GNN) | scRNA-seq, spatial data, scATAC-seq | 82 datasets across 10 tissues, 3 sequencing techniques, 3 species | HVG filtering, scTransform, SPARK-X; multimodal graph co-expression | Graph embeddings with TransformerConv layers; weight-sharing GNNs | Cross-graph Transformer integrating contrastive and similarity learning | Gene embeddings for functional similarity, pathway enrichment, GRN inference, disease analysis | Cross-species functional predictions, COVID and cancer gene analyses | UMAPs, causal network analysis, GOEA, IPA |
202 | | Pathformer 💡🔍 | [Pathformer: A biological pathway-informed transformer for disease diagnosis and prognosis using multi-omics data](https://doi.org/10.1093/bioinformatics/btae316). Xiaofan Liu et al. *Bioinformatics* (2024) | [GitHub Repository](https://github.com/lulab/Pathformer) | Multi-omics (RNA expression, DNA methylation, CNVs, splicing, editing) | TCGA (33 cancer types), plasma cfRNA, platelet RNA datasets; 10 tissue and liquid biopsy datasets | Multi-modal vector embedding at gene level, pathway sparse neural network | Pathway embeddings updated via criss-cross attention | Transformer with crosstalk-aware attention, sparse NN for pathway integration | Cancer diagnosis, stage prediction, drug response, survival prognosis | Cross-modal cancer screening, pathway-level interpretability | SHAP values, attention maps, crosstalk network visualization |
203 | | RhoFold+ 💡🔍 | [Accurate RNA 3D structure prediction using a language model-based deep learning approach](https://doi.org/10.1038/s41592-024-02487-0). Tao Shen et al. *Nature Methods* (2024) | Not available | RNA sequences | 23.7M RNA sequences, 800k species, 5,583 chains; RNA-Puzzles, CASP15 datasets | RNA-specific tokenization with MSA embeddings | Rhoformer transformer with IPA for geometry-aware embeddings | Transformer-based architecture with secondary and tertiary structural constraints | RNA 3D structure prediction, secondary structure inference, interhelical angle calculation | Cross-type RNA predictions, artifact corrections, construct engineering | Attention maps, IHAD (interhelical angle difference), RMSD analysis |
204 | | SATURN 💡🔍 | [Toward universal cell embeddings: integrating single-cell RNA-seq datasets across species with SATURN](https://doi.org/10.1038/s41592-024-02191-z). Yanay Rosen et al. *Nature Methods* (2024) | Not available | scRNA-seq, protein sequences | 335,000 cells from 3 species (Tabula Sapiens, Tabula Microcebus, Tabula Muris), 97,000 frog cells, 63,000 zebrafish cells | k-means clustering of protein embeddings into macrogenes | Macrogene-based embeddings derived from protein language models | Pretrained autoencoder with ZINB loss, fine-tuned using triplet margin loss | Cross-species dataset integration, differential macrogene expression, species-specific cell type discovery | Zero-shot cross-species annotation, integration of remote evolutionary datasets | UMAP visualization, GO term enrichment, protein embedding analysis |
205 | | scELMo 💡🔍 | [scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis](https://doi.org/10.1101/2023.12.07.569910). Tianyu Liu et al. _bioRxiv_ (2024) | Not available | scRNA-seq, multi-omics | 20 datasets across scRNA-seq, proteomics, and multi-omics data; diverse species | Text embeddings from GPT-3.5 metadata summaries; weighted average and arithmetic mean cell embeddings | Lightweight neural networks; contrastive learning for task-specific fine-tuning | Zero-shot framework with embeddings and fine-tuning for diverse tasks | Cell clustering, batch effect correction, cell-type annotation, in-silico treatment analysis | Cross-dataset integration, perturbation prediction | UMAP visualizations, cosine similarity, pathway enrichment (GOEA, IPA) |
206 | | scLong 💡 | [scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics](https://doi.org/10.1101/2024.11.09.622759). Ding Bai et al. _bioRxiv_ (2024) | Not available | scRNA-seq, multi-omics | 48M cells, 27,874 genes from 1,618 datasets, covering diverse tissues and cell types | Full transcriptome self-attention; Gene Ontology integration with GCNs | Dual encoder for high- and low-expression genes; contextual representations via Performer | Transformer with self-attention, graph convolution for gene knowledge integration | Gene regulatory network inference, transcriptional response prediction, drug synergy analysis | Cross-species gene annotations, transcriptional shifts prediction | Attention maps, hierarchical clustering, GO-based feature analysis |
207 | | scMoFormer 💡🔍 | [Single-Cell Multimodal Prediction via Transformers](https://doi.org/10.1145/3583780.3615061). Wenzhuo Tang et al. *CIKM* (2023) | [GitHub Repository](https://github.com/OmicsML/scMoFormer) | scRNA-seq, surface protein data | NeurIPS 2021 and 2022 competition datasets (GEX2ADT, CITE-seq); CBMC dataset | Graph construction with STRING database; SVD for RNA denoising | Multimodal transformers and graph-based embeddings | Cell, gene, and protein transformers with graph-based cross-modality aggregation | Surface protein abundance prediction, multimodal integration | Generalization to unseen modalities and datasets | Attention maps, RMSE, MAE, Pearson correlation coefficient |
208 | | SpatialDiffusion 💡 | [SpatialDiffusion: Predicting Spatial Transcriptomics with Denoising Diffusion Probabilistic Models](https://doi.org/10.1101/2024.05.21.595094). Sumeer Ahmad Khan et al. _bioRxiv_ (2024) | Not available | Spatial transcriptomics | MERFISH (12 slices, mouse hypothalamic preoptic region, ~73,655 spots, 161 genes); Starmap (mouse visual cortex, 984 spots, 1,020 genes); DLPFC (human dorsolateral prefrontal cortex, ~3,431 spots, 3,000 genes) | Embedding and linear transformations of spatial and cell-type features | Diffusion embeddings for spatial relationships; contextualized latent representations | Denoising Diffusion Probabilistic Model (DDPM) with enhanced embeddings | In silico slice interpolation, transcriptomic profile reconstruction | Cross-slice interpolation, structure preservation across regions | Spearman correlation, neighborhood enrichment, normalized MSE |
209 | | TransformerST 💡🔍 | [Innovative super-resolution in spatial transcriptomics: a transformer model exploiting histology images and spatial gene expression](https://doi.org/10.1093/bib/bbae052). Chongyue Zhao et al. *Briefings in Bioinformatics* (2024) | [GitHub Repository](https://github.com/Zhaocy-Research/TransformerST) | Spatial transcriptomics, histology images | Human dorsolateral prefrontal cortex (LIBD), melanoma, IDC (HER2+ breast cancer), mouse lung tissues | Spot-centric and sliding-window patch extraction; positional encodings | Vision Transformer for image patches; Graph Transformer for spatial embeddings | Cross-scale graph network for super-resolution; adaptive graph transformer for clustering | Tissue identification, gene expression reconstruction at single-cell resolution | Super-resolution without scRNA-seq references; cross-platform adaptability | Adjusted Rand Index (ARI), clustering accuracy, UMAP visualizations |
210 | | UCE 💡🔍 | [Universal Cell Embedding: A Foundation Model for Cell Biology](https://doi.org/10.1101/2023.11.28.568918). Yanay Rosen et al. _bioRxiv_ (2024) | Not available | scRNA-seq, protein sequences | 36 million cells, 1,000+ cell types, 300 datasets, 50 tissues, 8 species (e.g., human, mouse, zebrafish) | Protein embeddings with ESM2, expression-weighted sampling | Transformer-based embeddings with 33 layers and 650M parameters | Transformer architecture integrating protein and expression data | Zero-shot cell type prediction, dataset integration, species-level gene alignment | Cross-species embedding, atlas-scale cell annotation, disease cell mapping | UMAP visualizations, silhouette width, adjusted Rand Index |
211 | | scMulan 💡🔍 | [scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis](https://doi.org/10.1007/978-1-0716-3989-4_57). Haiyang Bian et al. _Research in Computational Molecular Biology (RECOMB) 2024_ | [GitHub](https://github.com/SuperBianC/scMulan) | scRNA-seq, multi-omics | hECA-10M (~10 million human single cells); 42,117 genes with meta-attributes | Unified c-sentences encoding meta-attributes and expression levels | Transformer decoder with shuffled token embeddings | Generative pretraining using masked c-sentences; 368M parameters | Cell type annotation, batch integration, conditional cell generation | Zero-shot cell type annotation, batch integration, conditional cell generation | UMAP visualizations, pseudo-time embeddings, cosine similarity |
212 | | Geneformer 💡🔍 | [Transfer learning enables predictions in network biology](https://doi.org/10.1038/s41586-023-06139-9). Christina V. Theodoris et al. *Nature* (2023) | [Hugging Face Repository](https://huggingface.co/ctheodoris/Geneformer); [GitHub Repository](https://github.com/jkobject/geneformer) | scRNA-seq | Genecorpus-30M (29.9M human single-cell transcriptomes); 561 datasets, diverse tissues | Rank value encoding of transcriptomes; context-aware self-attention | Transformer encoder (6 layers, 4 attention heads, 256 dimensions) | Pretrained transformer for contextual embeddings, fine-tuned for network biology tasks | Gene dosage prediction, chromatin dynamics, cell type annotations, disease modeling | Context-aware predictions for rare diseases, cross-tissue integration | Attention maps, in silico perturbation, embedding space clustering |
213 | 
214 | 
215 | 
216 | 
217 | 
218 | 
219 | 
220 | 
221 | 
222 | ### Benchmarking Papers
223 | | 📄 Paper | 💻 Code | 🧠 Benchmarking Models | 🌟 Main Focus | 📝 Results & Insights |
224 | |---------------------------------------------------|----------------------|----------------------------------|----------------------------------------|-----------------------|
225 | 
226 | ### Review/Perspective Papers
227 | | 📄 Paper | 🌟 Highlights/Main Focus | 📝 Remarks & Conclusion |
228 | |---------------------------------------------------|--------------------------------------------------|---------------------------------------|
229 | 
230 | 
231 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy
2 | scanpy
3 | squidpy
4 | torch
5 | scikit-learn
6 | scipy
7 | pandas
8 | matplotlib
9 | seaborn
10 | ipython
11 | datasets
--------------------------------------------------------------------------------
/update_summary.py:
--------------------------------------------------------------------------------
1 | import re
2 | 
3 | # Read the README.md file
4 | with open('README.md', 'r') as readme_file:
5 |     readme_content = readme_file.read()
6 | 
7 | # Section headings as they appear in the README ("## <name> Models")
8 | section_headers = ['Single-Cell Genomics (SCG)', 'DNA', 'Spatial Transcriptomics (ST)', 'Hybrids of SCG, DNA, and ST']
9 | 
10 | # Subsection headings ("### <label> Papers") mapped to the keys used in `counts`
11 | subsections = {'Original': 'Original', 'Benchmarking': 'Benchmark', 'Review/Perspective': 'Reviews'}
12 | 
13 | def count_table_rows(table_text):
14 |     """Count the data rows of a Markdown table: lines starting with '|',
15 |     minus the header row and the '|---|' separator row."""
16 |     rows = [line for line in table_text.splitlines() if line.lstrip().startswith('|')]
17 |     return max(len(rows) - 2, 0)
18 | 
19 | # Initialize counts
20 | counts = {header: {'Original': 0, 'Benchmark': 0, 'Reviews': 0} for header in section_headers}
21 | 
22 | # Count the rows of each subsection table within each "## ... Models" section
23 | for header in section_headers:
24 |     # Grab everything from "## <header> Models" up to the next "## " heading (or end of file)
25 |     section_match = re.search(rf'^## {re.escape(header)} Models\n(.*?)(?=^## |\Z)',
26 |                               readme_content, re.DOTALL | re.MULTILINE)
27 |     if not section_match:
28 |         continue
29 |     section_text = section_match.group(1)
30 |     for label, key in subsections.items():
31 |         # Grab everything from "### <label> Papers" up to the next heading (or end of section)
32 |         table_match = re.search(rf'^### {re.escape(label)} Papers\n(.*?)(?=^#|\Z)',
33 |                                 section_text, re.DOTALL | re.MULTILINE)
34 |         if table_match:
35 |             counts[header][key] = count_table_rows(table_match.group(1))
36 | 
37 | # Rebuild the summary statistics table, keeping the README's existing layout
38 | summary_rows = '\n'.join(
39 |     f"| {header} | {c['Original']} | {c['Benchmark']} | {c['Reviews']} |"
40 |     for header, c in counts.items())
41 | summary_table = f"""## Summary Statistics
42 | 
43 | | Data Type | Original Papers | Benchmarking Papers | Review/Perspective Papers |
44 | |---------------------------------------|----------------:|--------------------:|--------------------------:|
45 | {summary_rows}
46 | 
47 | """
48 | 
49 | # Replace the existing summary block (everything up to the next "## " heading)
50 | updated_readme_content = re.sub(r'## Summary Statistics\n.*?(?=^## )', summary_table,
51 |                                 readme_content, count=1, flags=re.DOTALL | re.MULTILINE)
52 | 
53 | # Write the updated content back to README.md
54 | with open('README.md', 'w') as readme_file:
55 |     readme_file.write(updated_readme_content)
--------------------------------------------------------------------------------
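
As a quick sanity check of the row-counting heuristic in `update_summary.py`: a Markdown table contributes one header row and one `|---|` separator row, so the paper count is the number of `|`-prefixed lines minus two. This is a minimal sketch, not part of the repository; `sample_table` is an invented toy example that mirrors the logic of the `count_table_rows` helper above:

```python
# Minimal sketch: count the data rows of a toy Markdown table by
# collecting '|'-prefixed lines and discarding the header and the
# '|---|' separator (clamped at zero for empty tables).
sample_table = """
| 📄 Paper | 💻 Code |
|----------|---------|
| paper A  | repo A  |
| paper B  | repo B  |
"""

rows = [line for line in sample_table.splitlines() if line.lstrip().startswith('|')]
print(max(len(rows) - 2, 0))  # prints 2
```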