├── 2_BaselineModel ├── README.md └── Baseline_model_DNABERT2.ipynb ├── 4_Presentation ├── README.md └── Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf ├── CoverImage └── cover_image.png ├── README.md ├── 3_Model_fine_tuning ├── fine_tuing_Data │ ├── dev.csv │ ├── test.csv │ └── train.csv ├── fine_tuning_script.sh ├── README.md └── finetune │ ├── scripts │ ├── run_dnabert2.sh │ ├── run_dnabert1.sh │ └── run_nt.sh │ └── train.py ├── 0_LiteratureReview └── README.md ├── 1_DatasetCharacteristics └── README.md └── LICENSE /2_BaselineModel/README.md: -------------------------------------------------------------------------------- 1 | # Baseline Model 2 | 3 | **[Notebook](Baseline_model_DNABERT2.ipynb)** 4 | -------------------------------------------------------------------------------- /4_Presentation/README.md: -------------------------------------------------------------------------------- 1 | # Presentation 2 | 3 | **[Slides](./Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf)** 4 | -------------------------------------------------------------------------------- /CoverImage/cover_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AammarTufail/fine_tuning_LLM_for_genome_understanding_of_covid_19/HEAD/CoverImage/cover_image.png -------------------------------------------------------------------------------- /4_Presentation/Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AammarTufail/fine_tuning_LLM_for_genome_understanding_of_covid_19/HEAD/4_Presentation/Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # **Genome Understanding using LLMs** 2 | 3 | ## Repository Link 4 | 5 | [https://github.com/AammarTufail/fine_tuning_LLM_for_genome_understanding_of_covid_19] 6 | 7 | ## Description 8 | 9 | This project aims to apply large language models (LLMs) to the problem of understanding genomic sequences, specifically FASTA files. The primary goal is to leverage transformer-based architectures such as DNABERT, DNABERT-2, and the Nucleotide Transformer to analyze and interpret DNA sequences. This approach seeks to improve the prediction of genomic elements (e.g., promoters, splice sites, transcription factor binding sites) and to explore the potential of pre-trained models in genomics for bioinformatics, personalized medicine, and biotechnology applications. 10 | 11 | ### Task Type 12 | 13 | Fine-Tuning LLMs for Genome Understanding 14 | 15 | ### Results Summary 16 | 17 | - **Best Model:** DNABERT-2 18 | - **Evaluation Metric:** None reported; training on this dataset took too long, so I stopped the run after a few hours. 19 | - **Result:** The training could not be completed, but the model appeared to be learning the patterns in the data. 20 | 21 | ## Documentation 22 | 23 | 1. **[Literature Review](0_LiteratureReview/README.md)** 24 | 2. **[Dataset Characteristics](1_DatasetCharacteristics/README.md)** 25 | 3. **[Baseline Model](2_BaselineModel/Baseline_model_DNABERT2.ipynb)** 26 | 4. **[Fine-Tuning LLM](3_Model_fine_tuning/README.md)** 27 | 5.
**[Presentation](4_Presentation/README.md)** 28 | 29 | ## Cover Image 30 | 31 | ![Project Cover Image](CoverImage/cover_image.png) 32 | 33 | ---- 34 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuing_Data/dev.csv: -------------------------------------------------------------------------------- 1 | sequence,label 2 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 3 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 4 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 5 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 6 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 7 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 8 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 9 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 10 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 11 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 12 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 13 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 14 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 15 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 16 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 17 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuing_Data/test.csv: -------------------------------------------------------------------------------- 1 | sequence,label 2 | GTTCCCATGGGAGAGTTCTGTAGGACTATAGGTTGGTCGTATATTGCTGAGTCAGCACTTATGGAGCACTACCAAAAGAATCATTATCATGCATGCATTAG,1 3 | AGATGCAGTCAGCAGGAAATGTCAAGATTTTATGCAAGTGGCTGTCACACCATCTTGTGAGCTCACATTGAGTATGTAATTTTCACTCAACATGGGGTCAA,0 4 | ACATGTCTTGTGCCATGAGCCCAAAGAGACCATGAGTCAGCACACCGACCTCAGACATCATTACACAGCAGCCACGTCAGAGTGTTAGCCAAGTAGATAAA,1 5 | TATCTCCCACATCCATCCAAGTTTTGTGTTTTTCTCAGCCCATTTCATTTTTTTGTCTCGTTAAGCTTGTACTGTCGTTTTTGATTCAGTCATTAACATGT,0 6 | CGCATCTCTCAGGAAAAAAGACTCAGCAAAGCCCACACCAATTTGTTTCTTTGTTTATTCACCTACTTATATTAACAGCTCCCACCATTGTGAACCGCAGA,0 7 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 8 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 9 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 10 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 11 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 12 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 13 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 14 | 
ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 15 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 16 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 17 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuing_Data/train.csv: -------------------------------------------------------------------------------- 1 | sequence,label 2 | AAAAAGCCTGTGAAGCACAGAGAGCAGCCAGCCAGAGCTGATGCTCAATGGCAGAAACTGCTTAGTCACGCTGAAAGGGAGCCAAGGCAATAGCAGAGTGG,1 3 | ACCTGCTAACAATTAAGGCCTCCAGGTCTACCCTGCAGCTGGGCCTGAGGAGGTCCTCTTGAAAGGAGTGGGTAACAGCGCACTATTGAGGGCCTGTGAAG,0 4 | ACTGACCATGTGCATCCTCACTGATACCAGTCTTGCCACAGTGTGCCTTGGAAACTCTTTCACAGGCAGTTATGGTCCCTACAGATAGGGGGCAGAGTATG,0 5 | ATATTACTCAACCGCCTAACAGAACAAAAGCATTCTTGGCTTGATCTCTAGAGTCCCTTTGAACAATTGGGACGATGTTCACCGAACTCTGATAAGCTAGC,0 6 | CAACCATCCTACTTGCTCGTGGGCTAGCTGCGGGCGCGTCGCGAGCTCGTGAAGCTGACATGGCTTTCCGAGGGCACAACACGAGAACTGAATCTTGCCTT,0 7 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 8 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 9 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 10 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 11 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 12 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 13 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 14 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 15 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 16 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 17 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuning_script.sh: -------------------------------------------------------------------------------- 1 | cd finetune 2 | 3 | export DATA_PATH=../GUE # I used here only virus dataset, but you can use any dataset in this format to fine-tune the model 4 | export MAX_LENGTH=100 # Please set the number as 0.25 * your sequence length. 
5 | # e.g., set it as 250 if your DNA sequences have 1000 nucleotide bases 6 | # This is because the tokenizer will reduce the sequence length by about 5 times 7 | export LR=3e-5 # Learning rate 8 | 9 | # Training using DataParallel; checkpoints are saved every 200 steps, and the fine-tuned model and logs are written to output/dnabert2 10 | python train.py \ 11 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 12 | --data_path ${DATA_PATH} \ 13 | --kmer -1 \ 14 | --run_name DNABERT2_${DATA_PATH} \ 15 | --model_max_length ${MAX_LENGTH} \ 16 | --per_device_train_batch_size 8 \ 17 | --per_device_eval_batch_size 16 \ 18 | --gradient_accumulation_steps 1 \ 19 | --learning_rate ${LR} \ 20 | --num_train_epochs 5 \ 21 | --fp16 \ 22 | --save_steps 200 \ 23 | --output_dir output/dnabert2 \ 24 | --evaluation_strategy steps \ 25 | --eval_steps 200 \ 26 | --warmup_steps 50 \ 27 | --logging_steps 100 \ 28 | --overwrite_output_dir True \ 29 | --log_level info \ 30 | --find_unused_parameters False 31 | 32 | # Training using DistributedDataParallel (more efficient) 33 | export num_gpu=32 # please change the value based on your setup 34 | 35 | torchrun --nproc-per-node=${num_gpu} train.py \ 36 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 37 | --data_path ${DATA_PATH} \ 38 | --kmer -1 \ 39 | --run_name DNABERT2_${DATA_PATH} \ 40 | --model_max_length ${MAX_LENGTH} \ 41 | --per_device_train_batch_size 8 \ 42 | --per_device_eval_batch_size 16 \ 43 | --gradient_accumulation_steps 1 \ 44 | --learning_rate ${LR} \ 45 | --num_train_epochs 5 \ 46 | --fp16 \ 47 | --save_steps 200 \ 48 | --output_dir output/dnabert2 \ 49 | --evaluation_strategy steps \ 50 | --eval_steps 200 \ 51 | --warmup_steps 50 \ 52 | --logging_steps 100 \ 53 | --overwrite_output_dir True \ 54 | --log_level info \ 55 | --find_unused_parameters False -------------------------------------------------------------------------------- /3_Model_fine_tuning/README.md: -------------------------------------------------------------------------------- 1 | # Fine-Tuning LLM for Genome Understanding 2 | 3 | ## To fine-tune an LLM for genome understanding, we need to follow these steps: 4 | 5 | First, I generated 3 CSV files from the dataset: `train.csv`, `dev.csv`, and `test.csv`. During training, the model is trained on `train.csv` and evaluated on the `dev.csv` file. After the training is finished, the checkpoint with the smallest loss on the `dev.csv` file is loaded and evaluated on `test.csv`. 6 | 7 | > `Note:` If anyone using this repository does not have a validation set, please just make `dev.csv` and `test.csv` the same. Please see the `fine_tuing_Data` folder for the data format. Each file should be in the same format, with the first row as the header `sequence,label`. Each following row should contain a DNA sequence and a numerical label separated by a comma (e.g., `ACGTCAGTCAGCGTACGT,1`). 8 | 9 | 10 | I followed the steps below to fine-tune the model, using the shell scripts (`.sh`) and Python scripts described in the [GitHub repository](https://github.com/MAGICS-LAB/DNABERT_2?tab=readme-ov-file#62-fine-tune-dnabert2-on-your-own-datasets) of the DNABERT-2 model: 11 | 12 | Here is the [shell script](./fine_tuning_script.sh) used to fine-tune the model, which is also shown here: 13 | 14 | ```bash 15 | cd finetune 16 | 17 | export DATA_PATH=../GUE # I used here only virus dataset, but you can use any dataset in this format to fine-tune the model 18 | export MAX_LENGTH=100 # Please set the number as 0.25 * your sequence length.
19 | # e.g., set it as 250 if your DNA sequences have 1000 nucleotide bases 20 | # This is because the tokenizer will reduce the sequence length by about 5 times 21 | export LR=3e-5 # Learning rate 22 | 23 | # Training using DataParallel; checkpoints are saved every 200 steps, and the fine-tuned model and logs are written to output/dnabert2 24 | python train.py \ 25 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 26 | --data_path ${DATA_PATH} \ 27 | --kmer -1 \ 28 | --run_name DNABERT2_${DATA_PATH} \ 29 | --model_max_length ${MAX_LENGTH} \ 30 | --per_device_train_batch_size 8 \ 31 | --per_device_eval_batch_size 16 \ 32 | --gradient_accumulation_steps 1 \ 33 | --learning_rate ${LR} \ 34 | --num_train_epochs 5 \ 35 | --fp16 \ 36 | --save_steps 200 \ 37 | --output_dir output/dnabert2 \ 38 | --evaluation_strategy steps \ 39 | --eval_steps 200 \ 40 | --warmup_steps 50 \ 41 | --logging_steps 100 \ 42 | --overwrite_output_dir True \ 43 | --log_level info \ 44 | --find_unused_parameters False 45 | 46 | # Training using DistributedDataParallel (more efficient) 47 | export num_gpu=32 # please change the value based on your setup 48 | 49 | torchrun --nproc-per-node=${num_gpu} train.py \ 50 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 51 | --data_path ${DATA_PATH} \ 52 | --kmer -1 \ 53 | --run_name DNABERT2_${DATA_PATH} \ 54 | --model_max_length ${MAX_LENGTH} \ 55 | --per_device_train_batch_size 8 \ 56 | --per_device_eval_batch_size 16 \ 57 | --gradient_accumulation_steps 1 \ 58 | --learning_rate ${LR} \ 59 | --num_train_epochs 5 \ 60 | --fp16 \ 61 | --save_steps 200 \ 62 | --output_dir output/dnabert2 \ 63 | --evaluation_strategy steps \ 64 | --eval_steps 200 \ 65 | --warmup_steps 50 \ 66 | --logging_steps 100 \ 67 | --overwrite_output_dir True \ 68 | --log_level info \ 69 | --find_unused_parameters False 70 | ``` 71 | 72 | 73 | 74 | -------------------------------------------------------------------------------- /0_LiteratureReview/README.md: -------------------------------------------------------------------------------- 1 | # **Literature Review - Genome Understanding using LLMs** 2 | **Author:** `Muhammad Aammar Tufail (Ph.D.)` 3 | 4 | ## `Overview` 5 | 6 | The purpose of this literature review is to investigate the use of large language models (LLMs) for understanding genomic sequences, specifically `FASTA` sequences. As genomics is a data-rich field, the ability to effectively analyze and interpret these sequences is essential for advances in `Bioinformatics`, `Biotechnology`, and `Personalized Medicine`. This review explores the commonly used models for analyzing genomic data, the format requirements of training data, the amount of data typically used in related studies, and whether there are existing pretrained models suitable for this problem. 7 | 8 | ### `Key Questions to Address` 9 | - Which models are commonly used for genomic sequence analysis? 10 | - What format must the training data have? 11 | - How much training data is typically used for similar problems? 12 | - Are there pretrained models available for genomic sequence analysis using LLMs? 13 | 14 | The following summarizes approaches and solutions that have been tried before on similar projects.
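
As a concrete illustration of the tokenization step these works revolve around, here is a minimal Python sketch (illustrative code written for this review, not taken from the papers) of the overlapping k-mer tokenization used by DNABERT; DNABERT-2 replaces this scheme with learned BPE subwords:

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA string into overlapping k-mer tokens (DNABERT uses k = 3-6)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A sequence of length L yields L - k + 1 tokens, i.e. roughly one token per base,
# which is why k-mer models need a model_max_length close to the raw sequence length.
print(kmer_tokenize("ACGTAGCATCGG", k=6))
# ['ACGTAG', 'CGTAGC', 'GTAGCA', 'TAGCAT', 'AGCATC', 'GCATCG', 'CATCGG']
```

By contrast, DNABERT-2's BPE tokenizer merges frequent base combinations into single tokens and shortens inputs by roughly a factor of five, which is why the fine-tuning script in this repository sets `MAX_LENGTH` to about 0.25 * the sequence length.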
15 | 16 | **Summary of Each Work**: 17 | 18 | - **Source 1**: `DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome` 19 | 20 | - **[Link](https://academic.oup.com/bioinformatics/article/37/15/2112/6128680?login=false)** 21 | - **Objective**: 22 | - To adapt large language models (LLMs) to effectively interpret and analyze DNA sequences from genomic FASTA files, enhancing their ability to predict and understand genomic elements. 23 | - **Methods**: 24 | - `Tokenization:` DNA sequences are converted into k-mer tokens, which capture contextual information by representing overlapping sequences. 25 | - `Pre-training:` The model learns DNA syntax and semantics through self-supervision, using masked language modeling on sequences from the human genome. 26 | - `Attention Mechanism:` DNABERT uses a multi-head self-attention mechanism to capture global contextual information from the entire input sequence 27 | - `Fine-tuning:` The pre-trained model is fine-tuned with task-specific data to predict genomic elements like promoters and splice sites. 28 | - **Outcomes**: 29 | - `Improved Prediction Accuracy:` DNABERT achieves state-of-the-art performance in predicting genomic elements such as promoters, splice sites, and transcription factor binding sites, even with limited task-specific labeled data 30 | - `Cross-Organism Applicability:` The pre-trained DNABERT model, initially trained on the human genome, can be effectively applied to other organisms, demonstrating exceptional performance across different species 31 | - **Relation to the Project**: 32 | - This approach aligns with the project's aim to leverage advanced LLM pretraining techniques for genomic analysis, providing a robust framework for understanding complex DNA sequences and facilitating cross-organism genomic studies. 33 | 34 | - **Source 2**: `DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes` 35 | 36 | - **[Link](https://arxiv.org/html/2306.15006v2)** 37 | - **Objective**: 38 | - To enhance the understanding of genome sequences by fine-tuning large language models (LLMs) using efficient tokenization and model adaptation techniques. 39 | - **Methods**: 40 | - Replace traditional k-mer tokenization with Byte Pair Encoding (BPE) to improve computational efficiency and overcome sample inefficiencies. 41 | - Utilize Attention with Linear Biases (ALiBi) and Low-Rank Adaptation (LoRA) to handle long sequences and optimize model parameters for genomic data. 42 | - Implement standard fine-tuning with specific hyperparameters like learning rate and batch size for models like DNABERT and DNABERT-2. 43 | - **Outcomes**: 44 | - DNABERT-2 achieves comparable performance to state-of-the-art models with significantly fewer parameters and reduced GPU time, demonstrating efficiency in genome understanding tasks. 45 | - **Relation to the Project**: 46 | - The methods and outcomes provide a framework for fine-tuning LLMs to effectively process and understand DNA sequences in FASTA files, aligning with the project's goal of improving genome analysis. 47 | 48 | - **Source 3**: `The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics` 49 | 50 | - **[Link](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v3)** 51 | - **Objective**: 52 | - To bridge the gap between genetic information and observable traits by predicting molecular phenotypes from DNA sequences using foundation models pre-trained on DNA sequences. 
53 | - **Methods**: 54 | - Utilize transformer models, named the Nucleotide Transformer, pre-trained on a large dataset of DNA sequences from diverse genomes. 55 | - Fine-tune these models in low-data settings to solve various genomics applications, focusing on key genomic elements like enhancers. 56 | - **Outcomes**: 57 | - Achieved accurate molecular phenotype predictions even with limited data. 58 | - Improved prioritization of functional genetic variants using model representations. 59 | - **Relation to the Project**: 60 | - The study demonstrates the potential of using pre-trained models for genomic data, which can be applied to understand and analyze DNA sequences in FASTA files effectively. 61 | 62 | - **Source 4**: `BioGPT: generative pre-trained transformer for biomedical text generation and mining` 63 | 64 | - **[Link](https://academic.oup.com/bib/article/23/6/bbac409/6713511)** 65 | - **Objective**: 66 | - The paper discusses the development of BioGPT, a generative pre-trained transformer model specifically designed for biomedical text generation and mining. While it does not directly address genomic data, the objective is to enhance the model's ability to handle domain-specific tasks in the biomedical field. 67 | - **Methods**: 68 | - BioGPT is pre-trained on large-scale biomedical literature, which involves using a vast corpus of domain-specific texts to fine-tune the model's understanding and generation capabilities. This approach could be adapted for genomic data by using a corpus of DNA sequences in FASTA format for pre-training. 69 | - **Outcomes**: 70 | - The model demonstrates superior performance on various biomedical NLP tasks, such as relation extraction and question answering, indicating its potential effectiveness in understanding complex domain-specific data. 71 | - **Relation to the Project**: 72 | - While the paper does not specifically address genomic data, the methods and outcomes of BioGPT suggest a framework that could be adapted for fine-tuning LLMs to understand genomic sequences by using a similar pre-training approach with genomic data. 73 | 74 | 75 | ---------------------------------------------------------------------------------------------------------------------------- 76 | > ## Summary of Key Findings 77 | >1. **Common Models**: The most common models used for genomic sequence analysis are transformer-based architectures, including DNABERT1, DNABERT2, NT, and BioGPT. These models, when adapted to biological data, show significant promise in understanding complex genomic structures. 78 | >2. **Training Data Format**: Genomic data must be preprocessed and tokenized into formats that these LLMs can handle. FASTA sequences are often converted into tokenized sequences of nucleotides, which serve as input data for the models. 79 | >3. **Amount of Training Data**: Similar problems typically require large amounts of training data, often millions of sequences, to ensure that the models can capture the diverse and complex patterns found in genomic data. 80 | >4. **Pretrained Models**: While there are pretrained models such as BioGPT and DNABERT, these models often require additional fine-tuning on specific genomic tasks to achieve optimal performance. Researchers may need to train their own models if domain-specific tasks are highly specialized. 81 | 82 | > ## Conclusion 83 | >The use of large language models in genomics, specifically for analyzing FASTA sequences, represents an exciting frontier in computational biology. 
Pretrained models such as DNABERT and BioGPT provide a strong starting point, but substantial training data and fine-tuning are often required for specific applications. The ability of LLMs to learn complex patterns within genomic sequences offers significant potential for advancing personalized medicine and biotechnological research. 84 | --- 85 | 86 | 87 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/finetune/scripts/run_dnabert2.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_path=$1 4 | lr=3e-5 5 | 6 | echo "The provided data_path is $data_path" 7 | 8 | for seed in 42 9 | do 10 | for data in H3 H3K14ac H3K36me3 H3K4me1 H3K4me2 H3K4me3 H3K79me3 H3K9ac H4 H4ac 11 | do 12 | python train.py \ 13 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 14 | --data_path $data_path/GUE/EMP/$data \ 15 | --kmer -1 \ 16 | --run_name DNABERT2_${vocab}_${lr}_EMP_${data}_seed${seed} \ 17 | --model_max_length 128 \ 18 | --per_device_train_batch_size 8 \ 19 | --per_device_eval_batch_size 16 \ 20 | --gradient_accumulation_steps 1 \ 21 | --learning_rate ${lr} \ 22 | --num_train_epochs 3 \ 23 | --fp16 \ 24 | --save_steps 200 \ 25 | --output_dir output/dnabert2 \ 26 | --evaluation_strategy steps \ 27 | --eval_steps 200 \ 28 | --warmup_steps 50 \ 29 | --logging_steps 100000 \ 30 | --overwrite_output_dir True \ 31 | --log_level info \ 32 | --find_unused_parameters False 33 | done 34 | 35 | 36 | 37 | for data in prom_core_all prom_core_notata 38 | do 39 | python train.py \ 40 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 41 | --data_path $data_path/GUE/prom/$data \ 42 | --kmer -1 \ 43 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 44 | --model_max_length 20 \ 45 | --per_device_train_batch_size 8 \ 46 | --per_device_eval_batch_size 16 \ 47 | --gradient_accumulation_steps 1 \ 48 | --learning_rate ${lr} \ 49 | --num_train_epochs 4 \ 50 | --fp16 \ 51 | --save_steps 400 \ 52 | --output_dir output/dnabert2 \ 53 | --evaluation_strategy steps \ 54 | --eval_steps 400 \ 55 | --warmup_steps 50 \ 56 | --logging_steps 100000 \ 57 | --overwrite_output_dir True \ 58 | --log_level info \ 59 | --find_unused_parameters False 60 | done 61 | 62 | 63 | for data in prom_core_tata 64 | do 65 | python train.py \ 66 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 67 | --data_path $data_path/GUE/prom/$data \ 68 | --kmer -1 \ 69 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 70 | --model_max_length 20 \ 71 | --per_device_train_batch_size 8 \ 72 | --per_device_eval_batch_size 16 \ 73 | --gradient_accumulation_steps 1 \ 74 | --learning_rate ${lr} \ 75 | --num_train_epochs 10 \ 76 | --fp16 \ 77 | --save_steps 200 \ 78 | --output_dir output/dnabert2 \ 79 | --evaluation_strategy steps \ 80 | --eval_steps 200 \ 81 | --warmup_steps 50 \ 82 | --logging_steps 100000 \ 83 | --overwrite_output_dir True \ 84 | --log_level info \ 85 | --find_unused_parameters False 86 | done 87 | 88 | for data in prom_300_all prom_300_notata 89 | do 90 | python train.py \ 91 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 92 | --data_path $data_path/GUE/prom/$data \ 93 | --kmer -1 \ 94 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 95 | --model_max_length 70 \ 96 | --per_device_train_batch_size 8 \ 97 | --per_device_eval_batch_size 16 \ 98 | --gradient_accumulation_steps 1 \ 99 | --learning_rate ${lr} \ 100 | --num_train_epochs 4 \ 101 | --fp16 \ 102 | --save_steps 400 \ 103 | 
--output_dir output/dnabert2 \ 104 | --evaluation_strategy steps \ 105 | --eval_steps 400 \ 106 | --warmup_steps 50 \ 107 | --logging_steps 100000 \ 108 | --overwrite_output_dir True \ 109 | --log_level info \ 110 | --find_unused_parameters False 111 | done 112 | 113 | 114 | 115 | for data in prom_300_tata 116 | do 117 | python train.py \ 118 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 119 | --data_path $data_path/GUE/prom/$data \ 120 | --kmer -1 \ 121 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 122 | --model_max_length 70 \ 123 | --per_device_train_batch_size 8 \ 124 | --per_device_eval_batch_size 16 \ 125 | --gradient_accumulation_steps 1 \ 126 | --learning_rate ${lr} \ 127 | --num_train_epochs 10 \ 128 | --fp16 \ 129 | --save_steps 200 \ 130 | --output_dir output/dnabert2 \ 131 | --evaluation_strategy steps \ 132 | --eval_steps 200 \ 133 | --warmup_steps 50 \ 134 | --logging_steps 100000 \ 135 | --overwrite_output_dir True \ 136 | --log_level info \ 137 | --find_unused_parameters False 138 | done 139 | 140 | 141 | for data in reconstructed 142 | do 143 | python train.py \ 144 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 145 | --data_path $data_path/GUE/splice/$data \ 146 | --kmer -1 \ 147 | --run_name DNABERT2_${vocab}_${lr}_splice_${data}_seed${seed} \ 148 | --model_max_length 80 \ 149 | --per_device_train_batch_size 8 \ 150 | --per_device_eval_batch_size 16 \ 151 | --gradient_accumulation_steps 1 \ 152 | --learning_rate ${lr} \ 153 | --num_train_epochs 5 \ 154 | --fp16 \ 155 | --save_steps 200 \ 156 | --output_dir output/dnabert2 \ 157 | --evaluation_strategy steps \ 158 | --eval_steps 200 \ 159 | --warmup_steps 50 \ 160 | --logging_steps 100000 \ 161 | --overwrite_output_dir True \ 162 | --log_level info \ 163 | --find_unused_parameters False 164 | done 165 | 166 | 167 | 168 | for data in covid 169 | do 170 | python train.py \ 171 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 172 | --data_path $data_path/GUE/virus/$data \ 173 | --kmer -1 \ 174 | --run_name DNABERT2_${vocab}_${lr}_virus_${data}_seed${seed} \ 175 | --model_max_length 256 \ 176 | --per_device_train_batch_size 32 \ 177 | --per_device_eval_batch_size 32 \ 178 | --gradient_accumulation_steps 1 \ 179 | --learning_rate ${lr} \ 180 | --num_train_epochs 8 \ 181 | --fp16 \ 182 | --save_steps 200 \ 183 | --output_dir output/dnabert2 \ 184 | --evaluation_strategy steps \ 185 | --eval_steps 200 \ 186 | --warmup_steps 50 \ 187 | --logging_steps 100000 \ 188 | --overwrite_output_dir True \ 189 | --log_level info \ 190 | --find_unused_parameters False 191 | done 192 | 193 | for data in 0 1 2 3 4 194 | do 195 | python train.py \ 196 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 197 | --data_path $data_path/GUE/mouse/$data \ 198 | --kmer -1 \ 199 | --run_name DNABERT2_${vocab}_${lr}_mouse_${data}_seed${seed} \ 200 | --model_max_length 30 \ 201 | --per_device_train_batch_size 8 \ 202 | --per_device_eval_batch_size 64 \ 203 | --gradient_accumulation_steps 1 \ 204 | --learning_rate ${lr} \ 205 | --num_train_epochs 5 \ 206 | --max_steps 1000 \ 207 | --fp16 \ 208 | --save_steps 200 \ 209 | --output_dir output/dnabert2 \ 210 | --evaluation_strategy steps \ 211 | --eval_steps 200 \ 212 | --warmup_steps 30 \ 213 | --logging_steps 100000 \ 214 | --overwrite_output_dir True \ 215 | --log_level info \ 216 | --find_unused_parameters False 217 | done 218 | 219 | 220 | for data in 0 1 2 3 4 221 | do 222 | python train.py \ 223 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 224 | --data_path 
$data_path/GUE/tf/$data \ 225 | --kmer -1 \ 226 | --run_name DNABERT2_${vocab}_${lr}_tf_${data}_seed${seed} \ 227 | --model_max_length 30 \ 228 | --per_device_train_batch_size 8 \ 229 | --per_device_eval_batch_size 64 \ 230 | --gradient_accumulation_steps 1 \ 231 | --learning_rate ${lr} \ 232 | --num_train_epochs 3 \ 233 | --fp16 \ 234 | --save_steps 200 \ 235 | --output_dir output/dnabert2 \ 236 | --evaluation_strategy steps \ 237 | --eval_steps 200 \ 238 | --warmup_steps 30 \ 239 | --logging_steps 100000 \ 240 | --overwrite_output_dir True \ 241 | --log_level info \ 242 | --find_unused_parameters False 243 | done 244 | done 245 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/finetune/scripts/run_dnabert1.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # This is your argument 4 | data_path=$1 5 | kmer=$2 6 | 7 | echo "The provided kmer is: $kmer, data_path is $data_path" 8 | 9 | # sh scripts/run_dna1.sh 3 ; sh scripts/run_dna1.sh 4 ; sh scripts/run_dna1.sh 5 ; sh scripts/run_dna1.sh 6 10 | 11 | for seed in 42 12 | do 13 | for data in H3 H3K14ac H3K36me3 H3K4me1 H3K4me2 H3K4me3 H3K79me3 H3K9ac H4 H4ac 14 | do 15 | python train.py \ 16 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 17 | --data_path ${data_path}/GUE/EMP/$data \ 18 | --kmer ${kmer} \ 19 | --run_name DNABERT1_${kmer}_EMP_${data}_seed${seed} \ 20 | --model_max_length 512 \ 21 | --per_device_train_batch_size 8 \ 22 | --per_device_eval_batch_size 16 \ 23 | --gradient_accumulation_steps 1 \ 24 | --learning_rate 3e-5 \ 25 | --num_train_epochs 3 \ 26 | --fp16 \ 27 | --save_steps 200 \ 28 | --output_dir output/dnabert1_${kmer} \ 29 | --evaluation_strategy steps \ 30 | --eval_steps 200 \ 31 | --warmup_steps 50 \ 32 | --logging_steps 100000 \ 33 | --overwrite_output_dir True \ 34 | --log_level info \ 35 | --seed ${seed} \ 36 | --find_unused_parameters False 37 | done 38 | 39 | 40 | for data in prom_core_all prom_core_notata 41 | do 42 | python train.py \ 43 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 44 | --data_path ${data_path}/GUE/prom/$data \ 45 | --kmer ${kmer} \ 46 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 47 | --model_max_length 80 \ 48 | --per_device_train_batch_size 8 \ 49 | --per_device_eval_batch_size 16 \ 50 | --gradient_accumulation_steps 1 \ 51 | --learning_rate 3e-5 \ 52 | --num_train_epochs 4 \ 53 | --fp16 \ 54 | --save_steps 400 \ 55 | --output_dir output/dnabert1_${kmer} \ 56 | --evaluation_strategy steps \ 57 | --eval_steps 400 \ 58 | --warmup_steps 50 \ 59 | --logging_steps 100000 \ 60 | --overwrite_output_dir True \ 61 | --log_level info \ 62 | --seed ${seed} \ 63 | --find_unused_parameters False 64 | done 65 | 66 | 67 | for data in prom_core_tata 68 | do 69 | python train.py \ 70 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 71 | --data_path ${data_path}/GUE/prom/$data \ 72 | --kmer ${kmer} \ 73 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 74 | --model_max_length 80 \ 75 | --per_device_train_batch_size 8 \ 76 | --per_device_eval_batch_size 16 \ 77 | --gradient_accumulation_steps 1 \ 78 | --learning_rate 3e-5 \ 79 | --num_train_epochs 10 \ 80 | --fp16 \ 81 | --save_steps 200 \ 82 | --output_dir output/dnabert1_${kmer} \ 83 | --evaluation_strategy steps \ 84 | --eval_steps 200 \ 85 | --warmup_steps 50 \ 86 | --logging_steps 100000 \ 87 | --overwrite_output_dir True \ 88 | --log_level info \ 89 | --seed ${seed} \ 90 | --find_unused_parameters False 91 | 
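    # Note on --model_max_length for DNABERT-1: k-mer tokenization yields overlapping
    # k-mers, so a sequence of length L produces roughly L tokens. The values in this
    # script therefore track the raw sequence length (80 for the short promoter-core
    # sets here, 310 for the 300 bp promoter sets below, 512 for the histone tasks above),
    # whereas DNABERT-2's BPE tokenizer needs only about 0.25 x the sequence length.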
done 92 | 93 | for data in prom_300_all prom_300_notata 94 | do 95 | python train.py \ 96 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 97 | --data_path ${data_path}/GUE/prom/$data \ 98 | --kmer ${kmer} \ 99 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 100 | --model_max_length 310 \ 101 | --per_device_train_batch_size 8 \ 102 | --per_device_eval_batch_size 16 \ 103 | --gradient_accumulation_steps 1 \ 104 | --learning_rate 3e-5 \ 105 | --num_train_epochs 4 \ 106 | --fp16 \ 107 | --save_steps 400 \ 108 | --output_dir output/dnabert1_${kmer} \ 109 | --evaluation_strategy steps \ 110 | --eval_steps 400 \ 111 | --warmup_steps 50 \ 112 | --logging_steps 100000 \ 113 | --overwrite_output_dir True \ 114 | --log_level info \ 115 | --seed ${seed} \ 116 | --find_unused_parameters False 117 | done 118 | 119 | 120 | for data in prom_300_tata 121 | do 122 | python train.py \ 123 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 124 | --data_path ${data_path}/GUE/prom/$data \ 125 | --kmer ${kmer} \ 126 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 127 | --model_max_length 310 \ 128 | --per_device_train_batch_size 8 \ 129 | --per_device_eval_batch_size 16 \ 130 | --gradient_accumulation_steps 1 \ 131 | --learning_rate 3e-5 \ 132 | --num_train_epochs 10 \ 133 | --fp16 \ 134 | --save_steps 200 \ 135 | --output_dir output/dnabert1_${kmer} \ 136 | --evaluation_strategy steps \ 137 | --eval_steps 200 \ 138 | --warmup_steps 50 \ 139 | --logging_steps 100000 \ 140 | --overwrite_output_dir True \ 141 | --log_level info \ 142 | --seed ${seed} \ 143 | --find_unused_parameters False 144 | done 145 | 146 | for data in reconstructed 147 | do 148 | python train.py \ 149 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 150 | --data_path ${data_path}/GUE/splice/$data \ 151 | --kmer ${kmer} \ 152 | --run_name DNABERT1_${kmer}_splice_${data}_seed${seed} \ 153 | --model_max_length 410 \ 154 | --per_device_train_batch_size 8 \ 155 | --per_device_eval_batch_size 16 \ 156 | --gradient_accumulation_steps 1 \ 157 | --learning_rate 3e-5 \ 158 | --num_train_epochs 5 \ 159 | --fp16 \ 160 | --save_steps 200 \ 161 | --output_dir output/dnabert1_${kmer} \ 162 | --evaluation_strategy steps \ 163 | --eval_steps 200 \ 164 | --warmup_steps 50 \ 165 | --logging_steps 100000 \ 166 | --overwrite_output_dir True \ 167 | --log_level info \ 168 | --seed ${seed} \ 169 | --find_unused_parameters False 170 | done 171 | 172 | 173 | for data in covid 174 | do 175 | python train.py \ 176 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 177 | --data_path ${data_path}/GUE/virus/$data \ 178 | --kmer ${kmer} \ 179 | --run_name DNABERT1_${kmer}_virus_${data}_seed${seed} \ 180 | --model_max_length 1024 \ 181 | --per_device_train_batch_size 8 \ 182 | --per_device_eval_batch_size 8 \ 183 | --gradient_accumulation_steps 4 \ 184 | --learning_rate 3e-5 \ 185 | --num_train_epochs 9 \ 186 | --fp16 \ 187 | --save_steps 200 \ 188 | --output_dir output/dnabert1_${kmer} \ 189 | --evaluation_strategy steps \ 190 | --eval_steps 200 \ 191 | --warmup_steps 50 \ 192 | --logging_steps 100000 \ 193 | --overwrite_output_dir True \ 194 | --log_level info \ 195 | --seed ${seed} \ 196 | --find_unused_parameters False 197 | done 198 | 199 | 200 | for data in 0 1 2 3 4 201 | do 202 | python train.py \ 203 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 204 | --data_path ${data_path}/GUE/mouse/$data \ 205 | --kmer ${kmer} \ 206 | --run_name DNABERT1_${kmer}_mouse_${data}_seed${seed} \ 207 | --model_max_length 110 \ 208 | 
--per_device_train_batch_size 8 \ 209 | --per_device_eval_batch_size 64 \ 210 | --gradient_accumulation_steps 1 \ 211 | --learning_rate 3e-5 \ 212 | --num_train_epochs 5 \ 213 | --max_steps 1000 \ 214 | --fp16 \ 215 | --save_steps 200 \ 216 | --output_dir output/dnabert1_${kmer} \ 217 | --evaluation_strategy steps \ 218 | --eval_steps 200 \ 219 | --warmup_steps 30 \ 220 | --logging_steps 100000 \ 221 | --overwrite_output_dir True \ 222 | --log_level info \ 223 | --seed ${seed} \ 224 | --find_unused_parameters False 225 | done 226 | 227 | 228 | for data in 0 1 2 3 4 229 | do 230 | python train.py \ 231 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 232 | --data_path ${data_path}/GUE/tf/$data \ 233 | --kmer ${kmer} \ 234 | --run_name DNABERT1_${kmer}_tf_${data}_seed${seed} \ 235 | --model_max_length 110 \ 236 | --per_device_train_batch_size 8 \ 237 | --per_device_eval_batch_size 64 \ 238 | --gradient_accumulation_steps 1 \ 239 | --learning_rate 3e-5 \ 240 | --num_train_epochs 3 \ 241 | --fp16 \ 242 | --save_steps 200 \ 243 | --output_dir output/dnabert1_${kmer} \ 244 | --evaluation_strategy steps \ 245 | --eval_steps 200 \ 246 | --warmup_steps 30 \ 247 | --logging_steps 100000 \ 248 | --overwrite_output_dir True \ 249 | --log_level info \ 250 | --seed ${seed} \ 251 | --find_unused_parameters False 252 | done 253 | done -------------------------------------------------------------------------------- /1_DatasetCharacteristics/README.md: -------------------------------------------------------------------------------- 1 | # Dataset Characteristics for Genomic Data (DNA sequences) 2 | ## Project: `Genome Understanding using LLMs` 3 | **Author:** `Muhammad Aammar Tufail (Ph.D.)` 4 | 5 | ## `Overview` 6 | 7 | Genomic data, particularly DNA sequences, are fundamental to understanding the genetic basis of life. The characteristics of genomic datasets play a crucial role in determining the success of large language models (LLMs), in analyzing and interpreting these sequences. This document explores the key characteristics of genomic datasets, focusing on DNA sequences in FASTA format, to provide insights into the data requirements for training and fine-tuning LLMs for genomics applications. 8 | 9 | There are several key characteristics of genomic datasets that influence the performance and applicability of LLMs for genomic analysis. These characteristics include the format of the data, the length and complexity of the sequences, the presence of repetitive elements, the distribution of genomic elements, and the availability of labeled data for supervised learning tasks. Understanding these characteristics is essential for effectively leveraging LLMs to analyze genomic data and extract meaningful insights. 10 | 11 | The following table summarizes the key characteristics of genomic datasets and their implications for training and fine-tuning LLMs for genomic analysis and also includes the challenges of handling DNA sequence data compared to NLP data for LLM model training or fine-tuning: 12 | 13 | | **Characteristic** | **Normal NLP Data** | **Genomic Data** | **Challenges in Handling DNA Sequences** | 14 | |-----------------------------|-----------------------------------------------------------|--------------------------------------------------------------|------------------------------------------| 15 | | **Data Units** | Words, sentences, paragraphs | Nucleotide bases (A, T, C, G), codons, genes, sequences | Need to handle very limited alphabet size and repetitive sequences effectively. 
| 16 | | **Vocabulary Size** | Large (10,000 - 100,000+ words in various languages) | Small (4 bases: A, T, C, G) | Smaller vocabulary leads to potential difficulties in capturing complex patterns. | 17 | | **Length of Sequences** | Varies (e.g., tweets: ~20-50 words; books: thousands) | Varies (e.g., gene: hundreds to thousands of base pairs; genome: billions) | Extremely long sequences create memory and computational challenges for LLMs. | 18 | | **Data Structure** | Hierarchical (words → sentences → paragraphs → documents) | Linear (base pairs forming genes, regions, chromosomes) | Linear data structure lacks hierarchical organization, requiring new techniques for long-range dependencies. | 19 | | **Semantic Meaning** | Contextual (based on grammar, syntax, and meaning) | Contextual but biological (based on functional regions, regulatory elements, mutations) | Lack of clear, interpretable "semantic" structure akin to human language; meaning is highly domain-specific. | 20 | | **Context Dependencies** | Grammar rules, language syntax, domain-specific knowledge | Biological rules (e.g., codon usage, regulatory elements, gene expression) | Complex, non-intuitive dependencies such as epigenetic markers or chromatin interactions. | 21 | | **Annotations** | Part-of-speech tags, named entities, sentiment labels | Gene annotations, exon/intron boundaries, mutation labels, functional regions | Proper annotation is sparse and requires expert knowledge, making supervised learning harder. | 22 | | **Multimodal Connections** | Text linked with images, videos, or speech data | DNA linked with RNA, proteins, epigenetic marks, phenotypic data | Multimodal data integration requires specialized techniques for biological data. | 23 | | **Language Dynamics** | Constantly evolving with new words, phrases, slang | Relatively stable (evolution over long periods; mutations occur slowly) | Slow evolution of sequences provides fewer data changes, limiting adaptive learning opportunities. | 24 | | **Data Diversity** | Different languages, dialects, writing styles | Different species, populations, tissue types, genetic diversity | Species and population-specific variations require model generalization across highly diverse datasets. | 25 | | **Noise and Ambiguity** | Typos, slang, grammatical errors | Sequencing errors, ambiguous nucleotide calls (e.g., N bases) | High precision required due to biological significance of each base; sequencing errors can heavily impact outcomes. | 26 | | **Sequence Alignment** | Not relevant; words are mostly independent | Highly relevant; sequences must be aligned for comparative analysis (e.g., multiple sequence alignment) | Proper alignment is computationally expensive and critical for correct downstream interpretation. | 27 | | **Interpretation Requirements** | Requires cultural, syntactic, and contextual understanding | Requires biological and evolutionary understanding; interpretation of functional regions | Interpretation relies on complex biological phenomena, often requiring specialized knowledge not inherent to LLMs. | 28 | | **Size of Typical Dataset** | Varies (from a few hundred samples to billions of words) | Large (from a few sequences to terabytes of genomic data for entire populations) | Huge data sizes require advanced storage and processing infrastructure. 
| 29 | | **Contextual Meaning** | Based on co-occurrence, syntactic roles, and pragmatic context | Based on biological function, codon usage, and sequence conservation | Capturing functional meaning is difficult because it requires integrating biological rules rather than linguistic ones. | 30 | | **Specialized Tasks** | Sentiment analysis, translation, summarization | Gene prediction, variant calling, functional annotation, sequence classification | LLMs need to be retrained or fine-tuned for specialized biological tasks that are highly domain-specific. | 31 | | **Data Privacy Concerns** | Sensitive when handling personal texts or private documents | Sensitive due to implications for personal genetic information and health data | Privacy regulations (e.g., GDPR, HIPAA) impose strict controls on handling genomic data, complicating model access and use. | 32 | | **Data Representation** | Tokenized words and sentences | Tokenized as base pairs or k-mers (short nucleotide sequences) | Representing long sequences effectively while retaining functional information is challenging. | 33 | | **Error Tolerance** | Can tolerate minor errors (e.g., typos, incorrect grammar) | Low tolerance for errors due to the importance of each base in determining biological function | Errors in sequence prediction or handling can have significant downstream effects, e.g., on mutation detection or drug response prediction. | 34 | 35 | ### **Summary of Challenges**: 36 | - **Scale**: Genomic datasets are significantly larger and more complex than typical NLP datasets, which creates memory, processing, and storage challenges. 37 | - **Sequence Complexity**: The biological rules governing DNA sequences are more complex and less intuitive than the linguistic rules governing natural language. 38 | - **Interpretation**: Unlike human language, where meaning is derived from word co-occurrences and grammar, genomic data requires biological context and functional knowledge for accurate interpretation. 39 | - **Precision**: DNA sequences require much greater precision, as errors can have profound biological consequences, unlike minor errors in NLP that typically do not change the overall meaning of the text. 40 | 41 | These challenges necessitate specialized approaches to model design, training techniques, and data handling for DNA sequence data compared to standard NLP tasks. 42 | 43 | ## `Relation to Existing Literature` 44 | 45 | The characteristics of genomic datasets discussed here align with findings from existing literature on the application of LLMs in genomics. For example, studies on [DNABERT](https://academic.oup.com/bioinformatics/article/37/15/2112/6128680?login=false), [DNABERT-2](https://arxiv.org/html/2306.15006v2), and the [Nucleotide Transformer](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v3) highlight the importance of handling DNA sequences effectively, including tokenization, model adaptation, and fine-tuning for genomic tasks. These studies emphasize the need for specialized models and training techniques to address the unique characteristics of genomic data and achieve optimal performance in genomics applications. 46 | 47 | ## `Example of DNA Sequences in FASTA Format` 48 | Here you may see a sample DNA sequence in FASTA format. 
49 | 50 | ```plaintext 51 | >sequence_1 52 | ATCGTACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG 53 | >sequence_2 54 | GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC 55 | >sequence_3 56 | TACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG 57 | ``` 58 | `Note`: The above sequences are for illustrative purposes only and do not represent actual genomic data. 59 | This sequence file contains three sequences, each labeled with a unique identifier (e.g., sequence_1, sequence_2, sequence_3) and the corresponding DNA sequence. The sequences consist of combinations of four nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). Each sequence is represented as a linear string of nucleotide bases, with each base corresponding to a specific position in the DNA sequence. 60 | 61 | The Gene features of Bacteroides fragillis genome are in [this file](./genome_Bacteroides_fragillis.fasta). 62 | 63 | ## `Conclusion` 64 | 65 | Understanding the key characteristics of genomic datasets is essential for effectively training and fine-tuning large language models for genomics applications. By recognizing the unique properties of DNA sequences, researchers can develop specialized models, data processing pipelines, and training strategies to optimize the performance of LLMs in genomics tasks. The challenges posed by genomic data, such as sequence complexity, interpretational requirements, and precision demands, necessitate tailored approaches to model design and training to ensure accurate and meaningful analysis of DNA sequences. 66 | 67 | --- 68 | 69 | 70 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/finetune/scripts/run_nt.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_path=$1 4 | m=$2 5 | 6 | 7 | if [ "$m" -eq 0 ]; then 8 | model=InstaDeepAI/nucleotide-transformer-500m-1000g 9 | run_name=NT_500_1000g 10 | elif [ "$m" -eq 1 ]; then 11 | model=InstaDeepAI/nucleotide-transformer-500m-human-ref 12 | run_name=NT_500_human 13 | elif [ "$m" -eq 2 ]; then 14 | model=InstaDeepAI/nucleotide-transformer-2.5b-1000g 15 | run_name=NT_2500_1000g 16 | elif [ "$m" -eq 3 ]; then 17 | model=InstaDeepAI/nucleotide-transformer-2.5b-multi-species 18 | run_name=NT_2500_multi 19 | else 20 | echo "Wrong argument" 21 | exit 1 22 | fi 23 | echo "Use: $model" 24 | 25 | 26 | for seed in 42 27 | do 28 | for data in H3 H3K14ac H3K36me3 H3K4me1 H3K4me2 H3K4me3 H3K79me3 H3K9ac H4 H4ac 29 | do 30 | python train.py \ 31 | --model_name_or_path ${model} \ 32 | --data_path ${data_path}/GUE/EMP/$data \ 33 | --kmer -1 \ 34 | --run_name ${run_name}_EMP_${data}_seed${seed} \ 35 | --model_max_length 100 \ 36 | --use_lora \ 37 | --lora_target_modules 'query,value,key,dense' \ 38 | --per_device_train_batch_size 8 \ 39 | --per_device_eval_batch_size 16 \ 40 | --gradient_accumulation_steps 1 \ 41 | --lora_alpha 16 \ 42 | --learning_rate 1e-4 \ 43 | --num_train_epochs 3 \ 44 | --fp16 \ 45 | --save_steps 200 \ 46 | --output_dir output/nt_${run_name} \ 47 | --evaluation_strategy steps \ 48 | --eval_steps 200 \ 49 | --warmup_steps 50 \ 50 | --logging_steps 100000 \ 51 | --overwrite_output_dir True \ 52 | --log_level info \ 53 | --seed ${seed} \ 54 | --find_unused_parameters False 55 | done 56 | 57 | 58 | 59 | for data in prom_core_all prom_core_notata 60 | do 61 | python train.py \ 62 | --model_name_or_path ${model} \ 63 | --data_path ${data_path}/GUE/prom/$data \ 64 | --kmer -1 \ 65 | 
--run_name ${run_name}_prom_${data}_seed${seed} \ 66 | --model_max_length 20 \ 67 | --use_lora \ 68 | --lora_target_modules 'query,value,key,dense' \ 69 | --per_device_train_batch_size 8 \ 70 | --per_device_eval_batch_size 16 \ 71 | --gradient_accumulation_steps 1 \ 72 | --lora_alpha 16 \ 73 | --learning_rate 1e-4 \ 74 | --num_train_epochs 4 \ 75 | --fp16 \ 76 | --save_steps 400 \ 77 | --output_dir output/nt_${run_name} \ 78 | --evaluation_strategy steps \ 79 | --eval_steps 400 \ 80 | --warmup_steps 50 \ 81 | --logging_steps 100000 \ 82 | --overwrite_output_dir True \ 83 | --log_level info \ 84 | --seed ${seed} \ 85 | --find_unused_parameters False 86 | done 87 | 88 | 89 | for data in prom_core_tata 90 | do 91 | python train.py \ 92 | --model_name_or_path ${model} \ 93 | --data_path ${data_path}/GUE/prom/$data \ 94 | --kmer -1 \ 95 | --run_name ${run_name}_prom_${data}_seed${seed} \ 96 | --model_max_length 20 \ 97 | --use_lora \ 98 | --lora_target_modules 'query,value,key,dense' \ 99 | --per_device_train_batch_size 8 \ 100 | --per_device_eval_batch_size 16 \ 101 | --gradient_accumulation_steps 1 \ 102 | --lora_alpha 16 \ 103 | --learning_rate 1e-4 \ 104 | --num_train_epochs 10 \ 105 | --fp16 \ 106 | --save_steps 200 \ 107 | --output_dir output/nt_${run_name} \ 108 | --evaluation_strategy steps \ 109 | --eval_steps 200 \ 110 | --warmup_steps 50 \ 111 | --logging_steps 100000 \ 112 | --overwrite_output_dir True \ 113 | --log_level info \ 114 | --seed ${seed} \ 115 | --find_unused_parameters False 116 | done 117 | 118 | for data in prom_300_all prom_300_notata 119 | do 120 | python train.py \ 121 | --model_name_or_path ${model} \ 122 | --data_path ${data_path}/GUE/prom/$data \ 123 | --kmer -1 \ 124 | --run_name ${run_name}_prom_${data}_seed${seed} \ 125 | --model_max_length 70 \ 126 | --use_lora \ 127 | --lora_target_modules 'query,value,key,dense' \ 128 | --per_device_train_batch_size 8 \ 129 | --per_device_eval_batch_size 16 \ 130 | --gradient_accumulation_steps 1 \ 131 | --lora_alpha 16 \ 132 | --learning_rate 1e-4 \ 133 | --num_train_epochs 4 \ 134 | --fp16 \ 135 | --save_steps 400 \ 136 | --output_dir output/nt_${run_name} \ 137 | --evaluation_strategy steps \ 138 | --eval_steps 400 \ 139 | --warmup_steps 50 \ 140 | --logging_steps 100000 \ 141 | --overwrite_output_dir True \ 142 | --log_level info \ 143 | --seed ${seed} \ 144 | --find_unused_parameters False 145 | done 146 | 147 | 148 | for data in prom_300_tata 149 | do 150 | python train.py \ 151 | --model_name_or_path ${model} \ 152 | --data_path ${data_path}/GUE/prom/$data \ 153 | --kmer -1 \ 154 | --run_name ${run_name}_prom_${data}_seed${seed} \ 155 | --model_max_length 70 \ 156 | --use_lora \ 157 | --lora_target_modules 'query,value,key,dense' \ 158 | --per_device_train_batch_size 8 \ 159 | --per_device_eval_batch_size 16 \ 160 | --gradient_accumulation_steps 1 \ 161 | --lora_alpha 16 \ 162 | --learning_rate 1e-4 \ 163 | --num_train_epochs 10 \ 164 | --fp16 \ 165 | --save_steps 200 \ 166 | --output_dir output/nt_${run_name} \ 167 | --evaluation_strategy steps \ 168 | --eval_steps 200 \ 169 | --warmup_steps 50 \ 170 | --logging_steps 100000 \ 171 | --overwrite_output_dir True \ 172 | --log_level info \ 173 | --seed ${seed} \ 174 | --find_unused_parameters False 175 | done 176 | 177 | 178 | for data in reconstructed 179 | do 180 | python train.py \ 181 | --model_name_or_path ${model} \ 182 | --data_path ${data_path}/GUE/splice/$data \ 183 | --kmer -1 \ 184 | --run_name ${run_name}_splice_${data}_seed${seed} \ 185 | 
--model_max_length 80 \ 186 | --use_lora \ 187 | --lora_target_modules 'query,value,key,dense' \ 188 | --per_device_train_batch_size 8 \ 189 | --per_device_eval_batch_size 16 \ 190 | --gradient_accumulation_steps 1 \ 191 | --lora_alpha 16 \ 192 | --learning_rate 1e-4 \ 193 | --num_train_epochs 5 \ 194 | --fp16 \ 195 | --save_steps 200 \ 196 | --output_dir output/nt_${run_name} \ 197 | --evaluation_strategy steps \ 198 | --eval_steps 200 \ 199 | --warmup_steps 50 \ 200 | --logging_steps 100000 \ 201 | --overwrite_output_dir True \ 202 | --log_level info \ 203 | --seed ${seed} \ 204 | --find_unused_parameters False 205 | done 206 | 207 | 208 | 209 | for data in covid 210 | do 211 | python train.py \ 212 | --model_name_or_path ${model} \ 213 | --data_path ${data_path}/GUE/virus/$data \ 214 | --kmer -1 \ 215 | --run_name ${run_name}_virus_${data}_seed${seed} \ 216 | --model_max_length 256 \ 217 | --use_lora \ 218 | --lora_target_modules 'query,value,key,dense' \ 219 | --per_device_train_batch_size 2 \ 220 | --per_device_eval_batch_size 4 \ 221 | --gradient_accumulation_steps 4 \ 222 | --learning_rate 1e-4 \ 223 | --num_train_epochs 3 \ 224 | --fp16 \ 225 | --save_steps 10000 \ 226 | --output_dir output/nt_${run_name} \ 227 | --evaluation_strategy steps \ 228 | --eval_steps 200 \ 229 | --warmup_steps 200 \ 230 | --logging_steps 100000 \ 231 | --overwrite_output_dir True \ 232 | --log_level info \ 233 | --seed ${seed} \ 234 | --find_unused_parameters False 235 | done 236 | 237 | for data in 0 1 2 3 4 238 | do 239 | python train.py \ 240 | --model_name_or_path ${model} \ 241 | --data_path ${data_path}/GUE/mouse/$data \ 242 | --kmer -1 \ 243 | --run_name ${run_name}_mouse_${data}_seed${seed} \ 244 | --model_max_length 30 \ 245 | --use_lora \ 246 | --lora_target_modules 'query,value,key,dense' \ 247 | --per_device_train_batch_size 8 \ 248 | --per_device_eval_batch_size 64 \ 249 | --gradient_accumulation_steps 1 \ 250 | --lora_alpha 16 \ 251 | --learning_rate 1e-4 \ 252 | --num_train_epochs 5 \ 253 | --max_steps 1000 \ 254 | --fp16 \ 255 | --save_steps 200 \ 256 | --output_dir output/nt_${run_name} \ 257 | --evaluation_strategy steps \ 258 | --eval_steps 200 \ 259 | --warmup_steps 100 \ 260 | --logging_steps 100000 \ 261 | --overwrite_output_dir True \ 262 | --log_level info \ 263 | --seed ${seed} \ 264 | --find_unused_parameters False 265 | done 266 | 267 | 268 | for data in 0 1 2 3 4 269 | do 270 | python train.py \ 271 | --model_name_or_path ${model} \ 272 | --data_path ${data_path}/GUE/tf/$data \ 273 | --kmer -1 \ 274 | --run_name ${run_name}_tf_${data}_seed${seed} \ 275 | --model_max_length 30 \ 276 | --use_lora \ 277 | --lora_target_modules 'query,value,key,dense' \ 278 | --per_device_train_batch_size 8 \ 279 | --per_device_eval_batch_size 64 \ 280 | --gradient_accumulation_steps 1 \ 281 | --lora_alpha 16 \ 282 | --learning_rate 1e-4 \ 283 | --num_train_epochs 3 \ 284 | --fp16 \ 285 | --save_steps 200 \ 286 | --output_dir output/nt_${run_name} \ 287 | --evaluation_strategy steps \ 288 | --eval_steps 200 \ 289 | --warmup_steps 30 \ 290 | --logging_steps 100000 \ 291 | --overwrite_output_dir True \ 292 | --log_level info \ 293 | --seed ${seed} \ 294 | --find_unused_parameters False 295 | done 296 | done -------------------------------------------------------------------------------- /2_BaselineModel/Baseline_model_DNABERT2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | 
"source": [ 7 | "# Baseline Model\n", 8 | "\n", 9 | "## DNABERT2 model\n", 10 | "\n", 11 | "This model will be used to calculate the embeddings from DNA sequences. The embeddings would be used to train a classifier to predict the binding affinity of the DNA sequences. The model is based on the DNABERT2 model, which is a transformer model that is pretrained on DNA." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import torch\n", 21 | "from transformers import AutoTokenizer, AutoModel" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stderr", 31 | "output_type": "stream", 32 | "text": [ 33 | "/Users/babaaammar/mambaforge/envs/dna_llm/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n", 34 | " warnings.warn(\n" 35 | ] 36 | }, 37 | { 38 | "data": { 39 | "application/vnd.jupyter.widget-view+json": { 40 | "model_id": "2440bf06b6a24b9d99391083599122ac", 41 | "version_major": 2, 42 | "version_minor": 0 43 | }, 44 | "text/plain": [ 45 | "tokenizer_config.json: 0%| | 0.00/158 [00:00 The model can generate DNA embeddings that naturally cluster and segregate genomes of different species in the embedding space, making it useful for comparative genomics" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "> Embedding is the process of representing a DNA sequence as a fixed-length vector. The embedding of a DNA sequence is a numerical representation of the sequence that captures its biological properties. The embedding can be used as input to machine learning models for various tasks such as classification, clustering, and similarity search." 
218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 3, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "name": "stdout", 227 | "output_type": "stream", 228 | "text": [ 229 | "torch.Size([768])\n", 230 | "torch.Size([768])\n" 231 | ] 232 | } 233 | ], 234 | "source": [ 235 | "dna = \"ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC\"\n", 236 | "inputs = tokenizer(dna, return_tensors = 'pt')[\"input_ids\"]\n", 237 | "hidden_states = model(inputs)[0] # [1, sequence_length, 768]\n", 238 | "\n", 239 | "# embedding with mean pooling\n", 240 | "embedding_mean = torch.mean(hidden_states[0], dim=0)\n", 241 | "print(embedding_mean.shape) # expect to be 768\n", 242 | "\n", 243 | "# embedding with max pooling\n", 244 | "embedding_max = torch.max(hidden_states[0], dim=0)[0]\n", 245 | "print(embedding_max.shape) # expect to be 768" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 7, 251 | "metadata": {}, 252 | "outputs": [ 253 | { 254 | "name": "stdout", 255 | "output_type": "stream", 256 | "text": [ 257 | "torch.Size([768])\n", 258 | "torch.Size([768])\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "dna = \"AAAAAGCCTGTGAAGCACAGAGAGCAGCCAGCCAGAGCTGATGCTCAATGGCAGAAACTGCTTAGTCACGCTGAAAGGGAGCCAAGGCAATAGCAGAGTGG\"\n", 264 | "inputs = tokenizer(dna, return_tensors = 'pt')[\"input_ids\"]\n", 265 | "hidden_states = model(inputs)[0] # [1, sequence_length, 768]\n", 266 | "\n", 267 | "# embedding with mean pooling\n", 268 | "embedding_mean = torch.mean(hidden_states[0], dim=0)\n", 269 | "print(embedding_mean.shape) # expect to be 768\n", 270 | "\n", 271 | "# embedding with max pooling\n", 272 | "embedding_max = torch.max(hidden_states[0], dim=0)[0]\n", 273 | "print(embedding_max.shape) # expect to be 768" 274 | ] 275 | } 276 | ], 277 | "metadata": { 278 | "kernelspec": { 279 | "display_name": "dna_llm", 280 | "language": "python", 281 | "name": "python3" 282 | }, 283 | "language_info": { 284 | "codemirror_mode": { 285 | "name": "ipython", 286 | "version": 3 287 | }, 288 | "file_extension": ".py", 289 | "mimetype": "text/x-python", 290 | "name": "python", 291 | "nbconvert_exporter": "python", 292 | "pygments_lexer": "ipython3", 293 | "version": "3.8.undefined" 294 | } 295 | }, 296 | "nbformat": 4, 297 | "nbformat_minor": 2 298 | } 299 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 
22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | 
--------------------------------------------------------------------------------
/3_Model_fine_tuning/finetune/train.py:
--------------------------------------------------------------------------------
1 | import os
2 | import csv
3 | import copy
4 | import json
5 | import logging
6 | from dataclasses import dataclass, field
7 | from typing import Optional, Dict, Sequence, Tuple, List
8 | 
9 | import torch
10 | import transformers
11 | import sklearn
12 | import numpy as np
13 | from torch.utils.data import Dataset
14 | 
15 | from peft import (
16 |     LoraConfig,
17 |     get_peft_model,
18 |     get_peft_model_state_dict,
19 | )
20 | 
21 | 
22 | @dataclass
23 | class ModelArguments:
24 |     model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
25 |     use_lora: bool = field(default=False, metadata={"help": "whether to use LoRA"})
26 |     lora_r: int = field(default=8, metadata={"help": "hidden dimension for LoRA"})
27 |     lora_alpha: int = field(default=32, metadata={"help": "alpha for LoRA"})
28 |     lora_dropout: float = field(default=0.05, metadata={"help": "dropout rate for LoRA"})
29 |     lora_target_modules: str = field(default="query,value", metadata={"help": "where to perform LoRA"})
30 | 
31 | 
32 | @dataclass
33 | class DataArguments:
34 |     data_path: str = field(default=None, metadata={"help": "Path to the training data."})
35 |     kmer: int = field(default=-1, metadata={"help": "k-mer for input sequence. -1 means not using k-mer."})
36 | 
37 | 
38 | @dataclass
39 | class TrainingArguments(transformers.TrainingArguments):
40 |     cache_dir: Optional[str] = field(default=None)
41 |     run_name: str = field(default="run")
42 |     optim: str = field(default="adamw_torch")
43 |     model_max_length: int = field(default=512, metadata={"help": "Maximum sequence length."})
44 |     gradient_accumulation_steps: int = field(default=1)
45 |     per_device_train_batch_size: int = field(default=1)
46 |     per_device_eval_batch_size: int = field(default=1)
47 |     num_train_epochs: int = field(default=1)
48 |     fp16: bool = field(default=False)
49 |     logging_steps: int = field(default=100)
50 |     save_steps: int = field(default=100)
51 |     eval_steps: int = field(default=100)
52 |     evaluation_strategy: str = field(default="steps")
53 |     warmup_steps: int = field(default=50)
54 |     weight_decay: float = field(default=0.01)
55 |     learning_rate: float = field(default=1e-4)
56 |     save_total_limit: int = field(default=3)
57 |     load_best_model_at_end: bool = field(default=True)
58 |     output_dir: str = field(default="output")
59 |     find_unused_parameters: bool = field(default=False)
60 |     checkpointing: bool = field(default=False)
61 |     dataloader_pin_memory: bool = field(default=False)
62 |     eval_and_save_results: bool = field(default=True)
63 |     save_model: bool = field(default=False)
64 |     seed: int = field(default=42)
65 | 
66 | 
67 | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
68 |     """Collect the state dict and dump it to disk."""
69 |     state_dict = trainer.model.state_dict()
70 |     if trainer.args.should_save:
71 |         cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
72 |         del state_dict
73 |         trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
74 | 
75 | 
76 | """
77 | Get the complement of the original DNA sequence (the reverse-complement variant is left commented out in the function body).
78 | """ 79 | def get_alter_of_dna_sequence(sequence: str): 80 | MAP = {"A": "T", "T": "A", "C": "G", "G": "C"} 81 | # return "".join([MAP[c] for c in reversed(sequence)]) 82 | return "".join([MAP[c] for c in sequence]) 83 | 84 | """ 85 | Transform a dna sequence to k-mer string 86 | """ 87 | def generate_kmer_str(sequence: str, k: int) -> str: 88 | """Generate k-mer string from DNA sequence.""" 89 | return " ".join([sequence[i:i+k] for i in range(len(sequence) - k + 1)]) 90 | 91 | 92 | """ 93 | Load or generate k-mer string for each DNA sequence. The generated k-mer string will be saved to the same directory as the original data with the same name but with a suffix of "_{k}mer". 94 | """ 95 | def load_or_generate_kmer(data_path: str, texts: List[str], k: int) -> List[str]: 96 | """Load or generate k-mer string for each DNA sequence.""" 97 | kmer_path = data_path.replace(".csv", f"_{k}mer.json") 98 | if os.path.exists(kmer_path): 99 | logging.warning(f"Loading k-mer from {kmer_path}...") 100 | with open(kmer_path, "r") as f: 101 | kmer = json.load(f) 102 | else: 103 | logging.warning(f"Generating k-mer...") 104 | kmer = [generate_kmer_str(text, k) for text in texts] 105 | with open(kmer_path, "w") as f: 106 | logging.warning(f"Saving k-mer to {kmer_path}...") 107 | json.dump(kmer, f) 108 | 109 | return kmer 110 | 111 | class SupervisedDataset(Dataset): 112 | """Dataset for supervised fine-tuning.""" 113 | 114 | def __init__(self, 115 | data_path: str, 116 | tokenizer: transformers.PreTrainedTokenizer, 117 | kmer: int = -1): 118 | 119 | super(SupervisedDataset, self).__init__() 120 | 121 | # load data from the disk 122 | with open(data_path, "r") as f: 123 | data = list(csv.reader(f))[1:] 124 | if len(data[0]) == 2: 125 | # data is in the format of [text, label] 126 | logging.warning("Perform single sequence classification...") 127 | texts = [d[0] for d in data] 128 | labels = [int(d[1]) for d in data] 129 | elif len(data[0]) == 3: 130 | # data is in the format of [text1, text2, label] 131 | logging.warning("Perform sequence-pair classification...") 132 | texts = [[d[0], d[1]] for d in data] 133 | labels = [int(d[2]) for d in data] 134 | else: 135 | raise ValueError("Data format not supported.") 136 | 137 | if kmer != -1: 138 | # only write file on the first process 139 | if torch.distributed.get_rank() not in [0, -1]: 140 | torch.distributed.barrier() 141 | 142 | logging.warning(f"Using {kmer}-mer as input...") 143 | texts = load_or_generate_kmer(data_path, texts, kmer) 144 | 145 | if torch.distributed.get_rank() == 0: 146 | torch.distributed.barrier() 147 | 148 | output = tokenizer( 149 | texts, 150 | return_tensors="pt", 151 | padding="longest", 152 | max_length=tokenizer.model_max_length, 153 | truncation=True, 154 | ) 155 | 156 | self.input_ids = output["input_ids"] 157 | self.attention_mask = output["attention_mask"] 158 | self.labels = labels 159 | self.num_labels = len(set(labels)) 160 | 161 | def __len__(self): 162 | return len(self.input_ids) 163 | 164 | def __getitem__(self, i) -> Dict[str, torch.Tensor]: 165 | return dict(input_ids=self.input_ids[i], labels=self.labels[i]) 166 | 167 | 168 | @dataclass 169 | class DataCollatorForSupervisedDataset(object): 170 | """Collate examples for supervised fine-tuning.""" 171 | 172 | tokenizer: transformers.PreTrainedTokenizer 173 | 174 | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]: 175 | input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels")) 176 | input_ids = 
torch.nn.utils.rnn.pad_sequence( 177 | input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id 178 | ) 179 | labels = torch.Tensor(labels).long() 180 | return dict( 181 | input_ids=input_ids, 182 | labels=labels, 183 | attention_mask=input_ids.ne(self.tokenizer.pad_token_id), 184 | ) 185 | 186 | """ 187 | Manually calculate the accuracy, f1, matthews_correlation, precision, recall with sklearn. 188 | """ 189 | def calculate_metric_with_sklearn(logits: np.ndarray, labels: np.ndarray): 190 | if logits.ndim == 3: 191 | # Reshape logits to 2D if needed 192 | logits = logits.reshape(-1, logits.shape[-1]) 193 | predictions = np.argmax(logits, axis=-1) 194 | valid_mask = labels != -100 # Exclude padding tokens (assuming -100 is the padding token ID) 195 | valid_predictions = predictions[valid_mask] 196 | valid_labels = labels[valid_mask] 197 | return { 198 | "accuracy": sklearn.metrics.accuracy_score(valid_labels, valid_predictions), 199 | "f1": sklearn.metrics.f1_score( 200 | valid_labels, valid_predictions, average="macro", zero_division=0 201 | ), 202 | "matthews_correlation": sklearn.metrics.matthews_corrcoef( 203 | valid_labels, valid_predictions 204 | ), 205 | "precision": sklearn.metrics.precision_score( 206 | valid_labels, valid_predictions, average="macro", zero_division=0 207 | ), 208 | "recall": sklearn.metrics.recall_score( 209 | valid_labels, valid_predictions, average="macro", zero_division=0 210 | ), 211 | } 212 | 213 | 214 | """ 215 | Compute metrics used for huggingface trainer. 216 | """ 217 | def compute_metrics(eval_pred): 218 | logits, labels = eval_pred 219 | if isinstance(logits, tuple): # Unpack logits if it's a tuple 220 | logits = logits[0] 221 | return calculate_metric_with_sklearn(logits, labels) 222 | 223 | 224 | 225 | def train(): 226 | parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments)) 227 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 228 | 229 | # load tokenizer 230 | tokenizer = transformers.AutoTokenizer.from_pretrained( 231 | model_args.model_name_or_path, 232 | cache_dir=training_args.cache_dir, 233 | model_max_length=training_args.model_max_length, 234 | padding_side="right", 235 | use_fast=True, 236 | trust_remote_code=True, 237 | ) 238 | 239 | if "InstaDeepAI" in model_args.model_name_or_path: 240 | tokenizer.eos_token = tokenizer.pad_token 241 | 242 | # define datasets and data collator 243 | train_dataset = SupervisedDataset(tokenizer=tokenizer, 244 | data_path=os.path.join(data_args.data_path, "train.csv"), 245 | kmer=data_args.kmer) 246 | val_dataset = SupervisedDataset(tokenizer=tokenizer, 247 | data_path=os.path.join(data_args.data_path, "dev.csv"), 248 | kmer=data_args.kmer) 249 | test_dataset = SupervisedDataset(tokenizer=tokenizer, 250 | data_path=os.path.join(data_args.data_path, "test.csv"), 251 | kmer=data_args.kmer) 252 | data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer) 253 | 254 | 255 | # load model 256 | model = transformers.AutoModelForSequenceClassification.from_pretrained( 257 | model_args.model_name_or_path, 258 | cache_dir=training_args.cache_dir, 259 | num_labels=train_dataset.num_labels, 260 | trust_remote_code=True, 261 | ) 262 | 263 | # configure LoRA 264 | if model_args.use_lora: 265 | lora_config = LoraConfig( 266 | r=model_args.lora_r, 267 | lora_alpha=model_args.lora_alpha, 268 | target_modules=list(model_args.lora_target_modules.split(",")), 269 | lora_dropout=model_args.lora_dropout, 270 | bias="none", 271 | 
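            # Note: with this config, only the low-rank LoRA adapter matrices injected into the modules listed in --lora_target_modules (the scripts above pass 'query,value,key,dense') are trained; the base model weights stay frozen.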
task_type="SEQ_CLS", 272 | inference_mode=False, 273 | ) 274 | model = get_peft_model(model, lora_config) 275 | model.print_trainable_parameters() 276 | 277 | # define trainer 278 | trainer = transformers.Trainer(model=model, 279 | tokenizer=tokenizer, 280 | args=training_args, 281 | compute_metrics=compute_metrics, 282 | train_dataset=train_dataset, 283 | eval_dataset=val_dataset, 284 | data_collator=data_collator) 285 | trainer.train() 286 | 287 | if training_args.save_model: 288 | trainer.save_state() 289 | safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir) 290 | 291 | # get the evaluation results from trainer 292 | if training_args.eval_and_save_results: 293 | results_path = os.path.join(training_args.output_dir, "results", training_args.run_name) 294 | results = trainer.evaluate(eval_dataset=test_dataset) 295 | os.makedirs(results_path, exist_ok=True) 296 | with open(os.path.join(results_path, "eval_results.json"), "w") as f: 297 | json.dump(results, f) 298 | 299 | 300 | 301 | 302 | if __name__ == "__main__": 303 | train() 304 | --------------------------------------------------------------------------------
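The baseline notebook above computes mean-pooled DNABERT-2 embeddings but stops before the classifier step it describes. The sketch below is one minimal way to finish that step, assuming the `zhihan1996/DNABERT-2-117M` checkpoint, the `sequence,label` CSV layout under `3_Model_fine_tuning/fine_tuing_Data/`, and scikit-learn's `LogisticRegression`; these names are illustrative assumptions, not code shipped in this repository.

```python
# Sketch: train a simple classifier on mean-pooled DNABERT-2 embeddings.
# Assumptions: the checkpoint name, the CSV paths, and the scikit-learn
# classifier are illustrative choices, not part of this repository.
import csv

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

checkpoint = "zhihan1996/DNABERT-2-117M"  # assumed DNABERT-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
model.eval()


def embed(sequence: str) -> torch.Tensor:
    """Mean-pool the last hidden state of a single DNA sequence."""
    inputs = tokenizer(sequence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        hidden_states = model(inputs)[0]        # [1, seq_len, 768]
    return torch.mean(hidden_states[0], dim=0)  # [768]


def load_split(path: str):
    """Read a `sequence,label` CSV (with header) into embeddings and labels."""
    with open(path) as f:
        rows = list(csv.reader(f))[1:]
    X = torch.stack([embed(seq) for seq, _ in rows]).numpy()
    y = [int(label) for _, label in rows]
    return X, y


X_train, y_train = load_split("3_Model_fine_tuning/fine_tuing_Data/train.csv")
X_dev, y_dev = load_split("3_Model_fine_tuning/fine_tuing_Data/dev.csv")

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("dev accuracy:", accuracy_score(y_dev, clf.predict(X_dev)))
```

Because each sequence is embedded one at a time on CPU, this is only practical for small files like the sample CSVs above; for the full GUE tasks, the LoRA fine-tuning driven by `train.py` and the shell scripts remains the intended route.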