├── 2_BaselineModel ├── README.md └── Baseline_model_DNABERT2.ipynb ├── 4_Presentation ├── README.md └── Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf ├── CoverImage └── cover_image.png ├── README.md ├── 3_Model_fine_tuning ├── fine_tuing_Data │ ├── dev.csv │ ├── test.csv │ └── train.csv ├── fine_tuning_script.sh ├── README.md └── finetune │ ├── scripts │ ├── run_dnabert2.sh │ ├── run_dnabert1.sh │ └── run_nt.sh │ └── train.py ├── 0_LiteratureReview └── README.md ├── 1_DatasetCharacteristics └── README.md └── LICENSE /2_BaselineModel/README.md: -------------------------------------------------------------------------------- 1 | # Baseline Model 2 | 3 | **[Notebook](Baseline_model_DNABERT2.ipynb)** 4 | -------------------------------------------------------------------------------- /4_Presentation/README.md: -------------------------------------------------------------------------------- 1 | # Presentation 2 | 3 | **[Slides](./Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf)** 4 | -------------------------------------------------------------------------------- /CoverImage/cover_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AammarTufail/fine_tuning_LLM_for_genome_understanding_of_covid_19/HEAD/CoverImage/cover_image.png -------------------------------------------------------------------------------- /4_Presentation/Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AammarTufail/fine_tuning_LLM_for_genome_understanding_of_covid_19/HEAD/4_Presentation/Presentation_Fine_Tuning_LLM_for_genome_understanding_Muhammad_Aammar_Tufail.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # **Genome Understanding using LLMs** 2 | 3 | ## Repository Link 4 | 5 | [https://github.com/AammarTufail/fine_tuning_LLM_for_genome_understanding_of_covid_19] 6 | 7 | ## Description 8 | 9 | This project aims to apply large language models (LLMs) to the problem of understanding genomic sequences, specifically FASTA files. The primary goal is to leverage transformer-based architectures such as DNABERT, DNABERT-2, and the Nucleotide Transformer to analyze and interpret DNA sequences. This approach seeks to improve the prediction of genomic elements (e.g., promoters, splice sites, transcription factor binding sites) and to explore the potential of pre-trained models in genomics for bioinformatics, personalized medicine, and biotechnology applications. 10 | 11 | ### Task Type 12 | 13 | Fine-Tuning LLMs for Genome Understanding 14 | 15 | ### Results Summary 16 | 17 | - **Best Model:** DNABERT-2 18 | - **Evaluation Metric:** None reported; training on this dataset took too long, so I stopped the run after a few hours. 19 | - **Result:** The training could not be completed, but the model appeared to be learning the patterns in the data. 20 | 21 | ## Documentation 22 | 23 | 1. **[Literature Review](0_LiteratureReview/README.md)** 24 | 2. **[Dataset Characteristics](1_DatasetCharacteristics/README.md)** 25 | 3. **[Baseline Model](2_BaselineModel/Baseline_model_DNABERT2.ipynb)** 26 | 4. **[Fine-Tuning LLM](3_Model_fine_tuning/README.md)** 27 | 5.
**[Presentation](4_Presentation/README.md)** 28 | 29 | ## Cover Image 30 | 31 | ![Project Cover Image](CoverImage/cover_image.png) 32 | 33 | ---- 34 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuing_Data/dev.csv: -------------------------------------------------------------------------------- 1 | sequence,label 2 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 3 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 4 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 5 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 6 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 7 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 8 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 9 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 10 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 11 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 12 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 13 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 14 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 15 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 16 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 17 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuing_Data/test.csv: -------------------------------------------------------------------------------- 1 | sequence,label 2 | GTTCCCATGGGAGAGTTCTGTAGGACTATAGGTTGGTCGTATATTGCTGAGTCAGCACTTATGGAGCACTACCAAAAGAATCATTATCATGCATGCATTAG,1 3 | AGATGCAGTCAGCAGGAAATGTCAAGATTTTATGCAAGTGGCTGTCACACCATCTTGTGAGCTCACATTGAGTATGTAATTTTCACTCAACATGGGGTCAA,0 4 | ACATGTCTTGTGCCATGAGCCCAAAGAGACCATGAGTCAGCACACCGACCTCAGACATCATTACACAGCAGCCACGTCAGAGTGTTAGCCAAGTAGATAAA,1 5 | TATCTCCCACATCCATCCAAGTTTTGTGTTTTTCTCAGCCCATTTCATTTTTTTGTCTCGTTAAGCTTGTACTGTCGTTTTTGATTCAGTCATTAACATGT,0 6 | CGCATCTCTCAGGAAAAAAGACTCAGCAAAGCCCACACCAATTTGTTTCTTTGTTTATTCACCTACTTATATTAACAGCTCCCACCATTGTGAACCGCAGA,0 7 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 8 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 9 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 10 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 11 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 12 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 13 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 14 | 
ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 15 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 16 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 17 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuing_Data/train.csv: -------------------------------------------------------------------------------- 1 | sequence,label 2 | AAAAAGCCTGTGAAGCACAGAGAGCAGCCAGCCAGAGCTGATGCTCAATGGCAGAAACTGCTTAGTCACGCTGAAAGGGAGCCAAGGCAATAGCAGAGTGG,1 3 | ACCTGCTAACAATTAAGGCCTCCAGGTCTACCCTGCAGCTGGGCCTGAGGAGGTCCTCTTGAAAGGAGTGGGTAACAGCGCACTATTGAGGGCCTGTGAAG,0 4 | ACTGACCATGTGCATCCTCACTGATACCAGTCTTGCCACAGTGTGCCTTGGAAACTCTTTCACAGGCAGTTATGGTCCCTACAGATAGGGGGCAGAGTATG,0 5 | ATATTACTCAACCGCCTAACAGAACAAAAGCATTCTTGGCTTGATCTCTAGAGTCCCTTTGAACAATTGGGACGATGTTCACCGAACTCTGATAAGCTAGC,0 6 | CAACCATCCTACTTGCTCGTGGGCTAGCTGCGGGCGCGTCGCGAGCTCGTGAAGCTGACATGGCTTTCCGAGGGCACAACACGAGAACTGAATCTTGCCTT,0 7 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 8 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 9 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 10 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 11 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 12 | TCCATGTGTCATCATGGACAGGCTTGTAATGCTTCTCGGGCTTTCACATCCCCACATACTGCAGAGAGTAGTCCAAATTCTTGGGAGATTCCGTATAATAA,0 13 | CTGAGAGGAAAAGGCCCTGAAGAACTGGGAATGCTGAGTCAGCTTTCCAGAAGTTCTGTTGTAGAGCGGTGCACAGCAGCGAGCATAGCTAGGTTTTTAGA,1 14 | ACATCAGGGTGAATGTTGAGGAGTTTAATTATTTTGCTTGGTGACAGTGAGATGTTCACAGTATCAGTTAGTGTGCTTGGAATAGCGCTCCCATGAGATGG,0 15 | TCAGACTAAGTGCCAGTTCTGTGAGTGAATGTTCAATGACTGAGGTGCACATTATTTGCCATATCTACTGAAAAGCAATGCTATCTTTCCGCTTCTGAGGA,0 16 | GTTGGGAACCCCAGCTGACTACTTAAGTCTCCCCATTAATTTCTGCTTGATCGGCAGTTTTTAACCTGTGGCCTTAACCTTCAGGCTTTGCATAGTAAATA,1 17 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/fine_tuning_script.sh: -------------------------------------------------------------------------------- 1 | cd finetune 2 | 3 | export DATA_PATH=../GUE # I used here only virus dataset, but you can use any dataset in this format to fine-tune the model 4 | export MAX_LENGTH=100 # Please set the number as 0.25 * your sequence length. 
5 | # e.g., set it as 250 if your DNA sequences have 1000 nucleotide bases 6 | # This is because the tokenizer will reduce the sequence length by about 5 times 7 | export LR=3e-5 # Learning rate 8 | 9 | # Training using DataParallel; checkpoints are saved every 200 steps, and the fine-tuned model and logs are written to output/dnabert2 10 | python train.py \ 11 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 12 | --data_path ${DATA_PATH} \ 13 | --kmer -1 \ 14 | --run_name DNABERT2_${DATA_PATH} \ 15 | --model_max_length ${MAX_LENGTH} \ 16 | --per_device_train_batch_size 8 \ 17 | --per_device_eval_batch_size 16 \ 18 | --gradient_accumulation_steps 1 \ 19 | --learning_rate ${LR} \ 20 | --num_train_epochs 5 \ 21 | --fp16 \ 22 | --save_steps 200 \ 23 | --output_dir output/dnabert2 \ 24 | --evaluation_strategy steps \ 25 | --eval_steps 200 \ 26 | --warmup_steps 50 \ 27 | --logging_steps 100 \ 28 | --overwrite_output_dir True \ 29 | --log_level info \ 30 | --find_unused_parameters False 31 | 32 | # Training using DistributedDataParallel (more efficient) 33 | export num_gpu=32 # please change the value based on your setup 34 | 35 | torchrun --nproc-per-node=${num_gpu} train.py \ 36 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 37 | --data_path ${DATA_PATH} \ 38 | --kmer -1 \ 39 | --run_name DNABERT2_${DATA_PATH} \ 40 | --model_max_length ${MAX_LENGTH} \ 41 | --per_device_train_batch_size 8 \ 42 | --per_device_eval_batch_size 16 \ 43 | --gradient_accumulation_steps 1 \ 44 | --learning_rate ${LR} \ 45 | --num_train_epochs 5 \ 46 | --fp16 \ 47 | --save_steps 200 \ 48 | --output_dir output/dnabert2 \ 49 | --evaluation_strategy steps \ 50 | --eval_steps 200 \ 51 | --warmup_steps 50 \ 52 | --logging_steps 100 \ 53 | --overwrite_output_dir True \ 54 | --log_level info \ 55 | --find_unused_parameters False -------------------------------------------------------------------------------- /3_Model_fine_tuning/README.md: -------------------------------------------------------------------------------- 1 | # Fine-Tuning LLM for Genome Understanding 2 | 3 | ## To fine-tune an LLM for genome understanding, we need to follow these steps: 4 | 5 | First, I generated 3 CSV files from the dataset: `train.csv`, `dev.csv`, and `test.csv`. During training, the model is trained on `train.csv` and evaluated on the `dev.csv` file. After the training is finished, the checkpoint with the smallest loss on the `dev.csv` file is loaded and evaluated on `test.csv`. 6 | 7 | > `Note:` If anyone using this repository does not have a validation set, please just make `dev.csv` and `test.csv` the same. Please see the `fine_tuing_Data` folder for the data format. Each file should be in the same format, with the first row as the header `sequence,label`. Each following row should contain a DNA sequence and a numerical label separated by a comma (e.g., `ACGTCAGTCAGCGTACGT,1`). 8 | 9 | 10 | I followed the steps below to fine-tune the model, using the shell scripts (`.sh`) and Python scripts described in the [GitHub repository](https://github.com/MAGICS-LAB/DNABERT_2?tab=readme-ov-file#62-fine-tune-dnabert2-on-your-own-datasets) of the DNABERT-2 model: 11 | 12 | Here is the [shell script](./fine_tuning_script.sh) used to fine-tune the model, which is also shown here: 13 | 14 | ```bash 15 | cd finetune 16 | 17 | export DATA_PATH=../GUE # I used here only virus dataset, but you can use any dataset in this format to fine-tune the model 18 | export MAX_LENGTH=100 # Please set the number as 0.25 * your sequence length.
19 | # e.g., set it as 250 if your DNA sequences have 1000 nucleotide bases 20 | # This is because the tokenizer will reduce the sequence length by about 5 times 21 | export LR=3e-5 # Learning rate 22 | 23 | # Training using DataParallel; checkpoints are saved every 200 steps, and the fine-tuned model and logs are written to output/dnabert2 24 | python train.py \ 25 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 26 | --data_path ${DATA_PATH} \ 27 | --kmer -1 \ 28 | --run_name DNABERT2_${DATA_PATH} \ 29 | --model_max_length ${MAX_LENGTH} \ 30 | --per_device_train_batch_size 8 \ 31 | --per_device_eval_batch_size 16 \ 32 | --gradient_accumulation_steps 1 \ 33 | --learning_rate ${LR} \ 34 | --num_train_epochs 5 \ 35 | --fp16 \ 36 | --save_steps 200 \ 37 | --output_dir output/dnabert2 \ 38 | --evaluation_strategy steps \ 39 | --eval_steps 200 \ 40 | --warmup_steps 50 \ 41 | --logging_steps 100 \ 42 | --overwrite_output_dir True \ 43 | --log_level info \ 44 | --find_unused_parameters False 45 | 46 | # Training using DistributedDataParallel (more efficient) 47 | export num_gpu=32 # please change the value based on your setup 48 | 49 | torchrun --nproc-per-node=${num_gpu} train.py \ 50 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 51 | --data_path ${DATA_PATH} \ 52 | --kmer -1 \ 53 | --run_name DNABERT2_${DATA_PATH} \ 54 | --model_max_length ${MAX_LENGTH} \ 55 | --per_device_train_batch_size 8 \ 56 | --per_device_eval_batch_size 16 \ 57 | --gradient_accumulation_steps 1 \ 58 | --learning_rate ${LR} \ 59 | --num_train_epochs 5 \ 60 | --fp16 \ 61 | --save_steps 200 \ 62 | --output_dir output/dnabert2 \ 63 | --evaluation_strategy steps \ 64 | --eval_steps 200 \ 65 | --warmup_steps 50 \ 66 | --logging_steps 100 \ 67 | --overwrite_output_dir True \ 68 | --log_level info \ 69 | --find_unused_parameters False 70 | ``` 71 | 72 | 73 | 74 | -------------------------------------------------------------------------------- /0_LiteratureReview/README.md: -------------------------------------------------------------------------------- 1 | # **Literature Review - Genome Understanding using LLMs** 2 | **Author:** `Muhammad Aammar Tufail (Ph.D.)` 3 | 4 | ## `Overview` 5 | 6 | The purpose of this literature review is to investigate the use of large language models (LLMs) for understanding genomic sequences, specifically `FASTA` sequences. As genomics is a data-rich field, the ability to effectively analyze and interpret these sequences is essential for advances in `Bioinformatics`, `Biotechnology`, and `Personalized Medicine`. This review explores the commonly used models for analyzing genomic data, the format requirements of training data, the amount of data typically used in related studies, and whether there are existing pretrained models suitable for this problem. 7 | 8 | ### `Key Questions to Address` 9 | - Which models are commonly used for genomic sequence analysis? 10 | - What format must the training data have? 11 | - How much training data is typically used for similar problems? 12 | - Are there pretrained models available for genomic sequence analysis using LLMs? 13 | 14 | The following summarizes approaches and solutions that have been tried before on similar projects.
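
As a concrete illustration of the tokenization step these works revolve around, here is a minimal Python sketch (illustrative code written for this review, not taken from the papers) of the overlapping k-mer tokenization used by DNABERT; DNABERT-2 replaces this scheme with learned BPE subwords:

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA string into overlapping k-mer tokens (DNABERT uses k = 3-6)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A sequence of length L yields L - k + 1 tokens, i.e. roughly one token per base,
# which is why k-mer models need a model_max_length close to the raw sequence length.
print(kmer_tokenize("ACGTAGCATCGG", k=6))
# ['ACGTAG', 'CGTAGC', 'GTAGCA', 'TAGCAT', 'AGCATC', 'GCATCG', 'CATCGG']
```

By contrast, DNABERT-2's BPE tokenizer merges frequent base combinations into single tokens and shortens inputs by roughly a factor of five, which is why the fine-tuning script in this repository sets `MAX_LENGTH` to about 0.25 * the sequence length.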
15 | 16 | **Summary of Each Work**: 17 | 18 | - **Source 1**: `DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome` 19 | 20 | - **[Link](https://academic.oup.com/bioinformatics/article/37/15/2112/6128680?login=false)** 21 | - **Objective**: 22 | - To adapt large language models (LLMs) to effectively interpret and analyze DNA sequences from genomic FASTA files, enhancing their ability to predict and understand genomic elements. 23 | - **Methods**: 24 | - `Tokenization:` DNA sequences are converted into k-mer tokens, which capture contextual information by representing overlapping sequences. 25 | - `Pre-training:` The model learns DNA syntax and semantics through self-supervision, using masked language modeling on sequences from the human genome. 26 | - `Attention Mechanism:` DNABERT uses a multi-head self-attention mechanism to capture global contextual information from the entire input sequence 27 | - `Fine-tuning:` The pre-trained model is fine-tuned with task-specific data to predict genomic elements like promoters and splice sites. 28 | - **Outcomes**: 29 | - `Improved Prediction Accuracy:` DNABERT achieves state-of-the-art performance in predicting genomic elements such as promoters, splice sites, and transcription factor binding sites, even with limited task-specific labeled data 30 | - `Cross-Organism Applicability:` The pre-trained DNABERT model, initially trained on the human genome, can be effectively applied to other organisms, demonstrating exceptional performance across different species 31 | - **Relation to the Project**: 32 | - This approach aligns with the project's aim to leverage advanced LLM pretraining techniques for genomic analysis, providing a robust framework for understanding complex DNA sequences and facilitating cross-organism genomic studies. 33 | 34 | - **Source 2**: `DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes` 35 | 36 | - **[Link](https://arxiv.org/html/2306.15006v2)** 37 | - **Objective**: 38 | - To enhance the understanding of genome sequences by fine-tuning large language models (LLMs) using efficient tokenization and model adaptation techniques. 39 | - **Methods**: 40 | - Replace traditional k-mer tokenization with Byte Pair Encoding (BPE) to improve computational efficiency and overcome sample inefficiencies. 41 | - Utilize Attention with Linear Biases (ALiBi) and Low-Rank Adaptation (LoRA) to handle long sequences and optimize model parameters for genomic data. 42 | - Implement standard fine-tuning with specific hyperparameters like learning rate and batch size for models like DNABERT and DNABERT-2. 43 | - **Outcomes**: 44 | - DNABERT-2 achieves comparable performance to state-of-the-art models with significantly fewer parameters and reduced GPU time, demonstrating efficiency in genome understanding tasks. 45 | - **Relation to the Project**: 46 | - The methods and outcomes provide a framework for fine-tuning LLMs to effectively process and understand DNA sequences in FASTA files, aligning with the project's goal of improving genome analysis. 47 | 48 | - **Source 3**: `The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics` 49 | 50 | - **[Link](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v3)** 51 | - **Objective**: 52 | - To bridge the gap between genetic information and observable traits by predicting molecular phenotypes from DNA sequences using foundation models pre-trained on DNA sequences. 
53 | - **Methods**: 54 | - Utilize transformer models, named the Nucleotide Transformer, pre-trained on a large dataset of DNA sequences from diverse genomes. 55 | - Fine-tune these models in low-data settings to solve various genomics applications, focusing on key genomic elements like enhancers. 56 | - **Outcomes**: 57 | - Achieved accurate molecular phenotype predictions even with limited data. 58 | - Improved prioritization of functional genetic variants using model representations. 59 | - **Relation to the Project**: 60 | - The study demonstrates the potential of using pre-trained models for genomic data, which can be applied to understand and analyze DNA sequences in FASTA files effectively. 61 | 62 | - **Source 4**: `BioGPT: generative pre-trained transformer for biomedical text generation and mining` 63 | 64 | - **[Link](https://academic.oup.com/bib/article/23/6/bbac409/6713511)** 65 | - **Objective**: 66 | - The paper discusses the development of BioGPT, a generative pre-trained transformer model specifically designed for biomedical text generation and mining. While it does not directly address genomic data, the objective is to enhance the model's ability to handle domain-specific tasks in the biomedical field. 67 | - **Methods**: 68 | - BioGPT is pre-trained on large-scale biomedical literature, which involves using a vast corpus of domain-specific texts to fine-tune the model's understanding and generation capabilities. This approach could be adapted for genomic data by using a corpus of DNA sequences in FASTA format for pre-training. 69 | - **Outcomes**: 70 | - The model demonstrates superior performance on various biomedical NLP tasks, such as relation extraction and question answering, indicating its potential effectiveness in understanding complex domain-specific data. 71 | - **Relation to the Project**: 72 | - While the paper does not specifically address genomic data, the methods and outcomes of BioGPT suggest a framework that could be adapted for fine-tuning LLMs to understand genomic sequences by using a similar pre-training approach with genomic data. 73 | 74 | 75 | ---------------------------------------------------------------------------------------------------------------------------- 76 | > ## Summary of Key Findings 77 | >1. **Common Models**: The most common models used for genomic sequence analysis are transformer-based architectures, including DNABERT1, DNABERT2, NT, and BioGPT. These models, when adapted to biological data, show significant promise in understanding complex genomic structures. 78 | >2. **Training Data Format**: Genomic data must be preprocessed and tokenized into formats that these LLMs can handle. FASTA sequences are often converted into tokenized sequences of nucleotides, which serve as input data for the models. 79 | >3. **Amount of Training Data**: Similar problems typically require large amounts of training data, often millions of sequences, to ensure that the models can capture the diverse and complex patterns found in genomic data. 80 | >4. **Pretrained Models**: While there are pretrained models such as BioGPT and DNABERT, these models often require additional fine-tuning on specific genomic tasks to achieve optimal performance. Researchers may need to train their own models if domain-specific tasks are highly specialized. 81 | 82 | > ## Conclusion 83 | >The use of large language models in genomics, specifically for analyzing FASTA sequences, represents an exciting frontier in computational biology. 
Pretrained models such as DNABERT and BioGPT provide a strong starting point, but substantial training data and fine-tuning are often required for specific applications. The ability of LLMs to learn complex patterns within genomic sequences offers significant potential for advancing personalized medicine and biotechnological research. 84 | --- 85 | 86 | 87 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/finetune/scripts/run_dnabert2.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_path=$1 4 | lr=3e-5 5 | 6 | echo "The provided data_path is $data_path" 7 | 8 | for seed in 42 9 | do 10 | for data in H3 H3K14ac H3K36me3 H3K4me1 H3K4me2 H3K4me3 H3K79me3 H3K9ac H4 H4ac 11 | do 12 | python train.py \ 13 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 14 | --data_path $data_path/GUE/EMP/$data \ 15 | --kmer -1 \ 16 | --run_name DNABERT2_${vocab}_${lr}_EMP_${data}_seed${seed} \ 17 | --model_max_length 128 \ 18 | --per_device_train_batch_size 8 \ 19 | --per_device_eval_batch_size 16 \ 20 | --gradient_accumulation_steps 1 \ 21 | --learning_rate ${lr} \ 22 | --num_train_epochs 3 \ 23 | --fp16 \ 24 | --save_steps 200 \ 25 | --output_dir output/dnabert2 \ 26 | --evaluation_strategy steps \ 27 | --eval_steps 200 \ 28 | --warmup_steps 50 \ 29 | --logging_steps 100000 \ 30 | --overwrite_output_dir True \ 31 | --log_level info \ 32 | --find_unused_parameters False 33 | done 34 | 35 | 36 | 37 | for data in prom_core_all prom_core_notata 38 | do 39 | python train.py \ 40 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 41 | --data_path $data_path/GUE/prom/$data \ 42 | --kmer -1 \ 43 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 44 | --model_max_length 20 \ 45 | --per_device_train_batch_size 8 \ 46 | --per_device_eval_batch_size 16 \ 47 | --gradient_accumulation_steps 1 \ 48 | --learning_rate ${lr} \ 49 | --num_train_epochs 4 \ 50 | --fp16 \ 51 | --save_steps 400 \ 52 | --output_dir output/dnabert2 \ 53 | --evaluation_strategy steps \ 54 | --eval_steps 400 \ 55 | --warmup_steps 50 \ 56 | --logging_steps 100000 \ 57 | --overwrite_output_dir True \ 58 | --log_level info \ 59 | --find_unused_parameters False 60 | done 61 | 62 | 63 | for data in prom_core_tata 64 | do 65 | python train.py \ 66 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 67 | --data_path $data_path/GUE/prom/$data \ 68 | --kmer -1 \ 69 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 70 | --model_max_length 20 \ 71 | --per_device_train_batch_size 8 \ 72 | --per_device_eval_batch_size 16 \ 73 | --gradient_accumulation_steps 1 \ 74 | --learning_rate ${lr} \ 75 | --num_train_epochs 10 \ 76 | --fp16 \ 77 | --save_steps 200 \ 78 | --output_dir output/dnabert2 \ 79 | --evaluation_strategy steps \ 80 | --eval_steps 200 \ 81 | --warmup_steps 50 \ 82 | --logging_steps 100000 \ 83 | --overwrite_output_dir True \ 84 | --log_level info \ 85 | --find_unused_parameters False 86 | done 87 | 88 | for data in prom_300_all prom_300_notata 89 | do 90 | python train.py \ 91 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 92 | --data_path $data_path/GUE/prom/$data \ 93 | --kmer -1 \ 94 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 95 | --model_max_length 70 \ 96 | --per_device_train_batch_size 8 \ 97 | --per_device_eval_batch_size 16 \ 98 | --gradient_accumulation_steps 1 \ 99 | --learning_rate ${lr} \ 100 | --num_train_epochs 4 \ 101 | --fp16 \ 102 | --save_steps 400 \ 103 | 
--output_dir output/dnabert2 \ 104 | --evaluation_strategy steps \ 105 | --eval_steps 400 \ 106 | --warmup_steps 50 \ 107 | --logging_steps 100000 \ 108 | --overwrite_output_dir True \ 109 | --log_level info \ 110 | --find_unused_parameters False 111 | done 112 | 113 | 114 | 115 | for data in prom_300_tata 116 | do 117 | python train.py \ 118 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 119 | --data_path $data_path/GUE/prom/$data \ 120 | --kmer -1 \ 121 | --run_name DNABERT2_${vocab}_${lr}_prom_${data}_seed${seed} \ 122 | --model_max_length 70 \ 123 | --per_device_train_batch_size 8 \ 124 | --per_device_eval_batch_size 16 \ 125 | --gradient_accumulation_steps 1 \ 126 | --learning_rate ${lr} \ 127 | --num_train_epochs 10 \ 128 | --fp16 \ 129 | --save_steps 200 \ 130 | --output_dir output/dnabert2 \ 131 | --evaluation_strategy steps \ 132 | --eval_steps 200 \ 133 | --warmup_steps 50 \ 134 | --logging_steps 100000 \ 135 | --overwrite_output_dir True \ 136 | --log_level info \ 137 | --find_unused_parameters False 138 | done 139 | 140 | 141 | for data in reconstructed 142 | do 143 | python train.py \ 144 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 145 | --data_path $data_path/GUE/splice/$data \ 146 | --kmer -1 \ 147 | --run_name DNABERT2_${vocab}_${lr}_splice_${data}_seed${seed} \ 148 | --model_max_length 80 \ 149 | --per_device_train_batch_size 8 \ 150 | --per_device_eval_batch_size 16 \ 151 | --gradient_accumulation_steps 1 \ 152 | --learning_rate ${lr} \ 153 | --num_train_epochs 5 \ 154 | --fp16 \ 155 | --save_steps 200 \ 156 | --output_dir output/dnabert2 \ 157 | --evaluation_strategy steps \ 158 | --eval_steps 200 \ 159 | --warmup_steps 50 \ 160 | --logging_steps 100000 \ 161 | --overwrite_output_dir True \ 162 | --log_level info \ 163 | --find_unused_parameters False 164 | done 165 | 166 | 167 | 168 | for data in covid 169 | do 170 | python train.py \ 171 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 172 | --data_path $data_path/GUE/virus/$data \ 173 | --kmer -1 \ 174 | --run_name DNABERT2_${vocab}_${lr}_virus_${data}_seed${seed} \ 175 | --model_max_length 256 \ 176 | --per_device_train_batch_size 32 \ 177 | --per_device_eval_batch_size 32 \ 178 | --gradient_accumulation_steps 1 \ 179 | --learning_rate ${lr} \ 180 | --num_train_epochs 8 \ 181 | --fp16 \ 182 | --save_steps 200 \ 183 | --output_dir output/dnabert2 \ 184 | --evaluation_strategy steps \ 185 | --eval_steps 200 \ 186 | --warmup_steps 50 \ 187 | --logging_steps 100000 \ 188 | --overwrite_output_dir True \ 189 | --log_level info \ 190 | --find_unused_parameters False 191 | done 192 | 193 | for data in 0 1 2 3 4 194 | do 195 | python train.py \ 196 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 197 | --data_path $data_path/GUE/mouse/$data \ 198 | --kmer -1 \ 199 | --run_name DNABERT2_${vocab}_${lr}_mouse_${data}_seed${seed} \ 200 | --model_max_length 30 \ 201 | --per_device_train_batch_size 8 \ 202 | --per_device_eval_batch_size 64 \ 203 | --gradient_accumulation_steps 1 \ 204 | --learning_rate ${lr} \ 205 | --num_train_epochs 5 \ 206 | --max_steps 1000 \ 207 | --fp16 \ 208 | --save_steps 200 \ 209 | --output_dir output/dnabert2 \ 210 | --evaluation_strategy steps \ 211 | --eval_steps 200 \ 212 | --warmup_steps 30 \ 213 | --logging_steps 100000 \ 214 | --overwrite_output_dir True \ 215 | --log_level info \ 216 | --find_unused_parameters False 217 | done 218 | 219 | 220 | for data in 0 1 2 3 4 221 | do 222 | python train.py \ 223 | --model_name_or_path zhihan1996/DNABERT-2-117M \ 224 | --data_path 
$data_path/GUE/tf/$data \ 225 | --kmer -1 \ 226 | --run_name DNABERT2_${vocab}_${lr}_tf_${data}_seed${seed} \ 227 | --model_max_length 30 \ 228 | --per_device_train_batch_size 8 \ 229 | --per_device_eval_batch_size 64 \ 230 | --gradient_accumulation_steps 1 \ 231 | --learning_rate ${lr} \ 232 | --num_train_epochs 3 \ 233 | --fp16 \ 234 | --save_steps 200 \ 235 | --output_dir output/dnabert2 \ 236 | --evaluation_strategy steps \ 237 | --eval_steps 200 \ 238 | --warmup_steps 30 \ 239 | --logging_steps 100000 \ 240 | --overwrite_output_dir True \ 241 | --log_level info \ 242 | --find_unused_parameters False 243 | done 244 | done 245 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/finetune/scripts/run_dnabert1.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # This is your argument 4 | data_path=$1 5 | kmer=$2 6 | 7 | echo "The provided kmer is: $kmer, data_path is $data_path" 8 | 9 | # sh scripts/run_dna1.sh 3 ; sh scripts/run_dna1.sh 4 ; sh scripts/run_dna1.sh 5 ; sh scripts/run_dna1.sh 6 10 | 11 | for seed in 42 12 | do 13 | for data in H3 H3K14ac H3K36me3 H3K4me1 H3K4me2 H3K4me3 H3K79me3 H3K9ac H4 H4ac 14 | do 15 | python train.py \ 16 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 17 | --data_path ${data_path}/GUE/EMP/$data \ 18 | --kmer ${kmer} \ 19 | --run_name DNABERT1_${kmer}_EMP_${data}_seed${seed} \ 20 | --model_max_length 512 \ 21 | --per_device_train_batch_size 8 \ 22 | --per_device_eval_batch_size 16 \ 23 | --gradient_accumulation_steps 1 \ 24 | --learning_rate 3e-5 \ 25 | --num_train_epochs 3 \ 26 | --fp16 \ 27 | --save_steps 200 \ 28 | --output_dir output/dnabert1_${kmer} \ 29 | --evaluation_strategy steps \ 30 | --eval_steps 200 \ 31 | --warmup_steps 50 \ 32 | --logging_steps 100000 \ 33 | --overwrite_output_dir True \ 34 | --log_level info \ 35 | --seed ${seed} \ 36 | --find_unused_parameters False 37 | done 38 | 39 | 40 | for data in prom_core_all prom_core_notata 41 | do 42 | python train.py \ 43 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 44 | --data_path ${data_path}/GUE/prom/$data \ 45 | --kmer ${kmer} \ 46 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 47 | --model_max_length 80 \ 48 | --per_device_train_batch_size 8 \ 49 | --per_device_eval_batch_size 16 \ 50 | --gradient_accumulation_steps 1 \ 51 | --learning_rate 3e-5 \ 52 | --num_train_epochs 4 \ 53 | --fp16 \ 54 | --save_steps 400 \ 55 | --output_dir output/dnabert1_${kmer} \ 56 | --evaluation_strategy steps \ 57 | --eval_steps 400 \ 58 | --warmup_steps 50 \ 59 | --logging_steps 100000 \ 60 | --overwrite_output_dir True \ 61 | --log_level info \ 62 | --seed ${seed} \ 63 | --find_unused_parameters False 64 | done 65 | 66 | 67 | for data in prom_core_tata 68 | do 69 | python train.py \ 70 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 71 | --data_path ${data_path}/GUE/prom/$data \ 72 | --kmer ${kmer} \ 73 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 74 | --model_max_length 80 \ 75 | --per_device_train_batch_size 8 \ 76 | --per_device_eval_batch_size 16 \ 77 | --gradient_accumulation_steps 1 \ 78 | --learning_rate 3e-5 \ 79 | --num_train_epochs 10 \ 80 | --fp16 \ 81 | --save_steps 200 \ 82 | --output_dir output/dnabert1_${kmer} \ 83 | --evaluation_strategy steps \ 84 | --eval_steps 200 \ 85 | --warmup_steps 50 \ 86 | --logging_steps 100000 \ 87 | --overwrite_output_dir True \ 88 | --log_level info \ 89 | --seed ${seed} \ 90 | --find_unused_parameters False 91 | 
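    # Note on --model_max_length for DNABERT-1: k-mer tokenization yields overlapping
    # k-mers, so a sequence of length L produces roughly L tokens. The values in this
    # script therefore track the raw sequence length (80 for the short promoter-core
    # sets here, 310 for the 300 bp promoter sets below, 512 for the histone tasks above),
    # whereas DNABERT-2's BPE tokenizer needs only about 0.25 x the sequence length.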
done 92 | 93 | for data in prom_300_all prom_300_notata 94 | do 95 | python train.py \ 96 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 97 | --data_path ${data_path}/GUE/prom/$data \ 98 | --kmer ${kmer} \ 99 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 100 | --model_max_length 310 \ 101 | --per_device_train_batch_size 8 \ 102 | --per_device_eval_batch_size 16 \ 103 | --gradient_accumulation_steps 1 \ 104 | --learning_rate 3e-5 \ 105 | --num_train_epochs 4 \ 106 | --fp16 \ 107 | --save_steps 400 \ 108 | --output_dir output/dnabert1_${kmer} \ 109 | --evaluation_strategy steps \ 110 | --eval_steps 400 \ 111 | --warmup_steps 50 \ 112 | --logging_steps 100000 \ 113 | --overwrite_output_dir True \ 114 | --log_level info \ 115 | --seed ${seed} \ 116 | --find_unused_parameters False 117 | done 118 | 119 | 120 | for data in prom_300_tata 121 | do 122 | python train.py \ 123 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 124 | --data_path ${data_path}/GUE/prom/$data \ 125 | --kmer ${kmer} \ 126 | --run_name DNABERT1_${kmer}_prom_${data}_seed${seed} \ 127 | --model_max_length 310 \ 128 | --per_device_train_batch_size 8 \ 129 | --per_device_eval_batch_size 16 \ 130 | --gradient_accumulation_steps 1 \ 131 | --learning_rate 3e-5 \ 132 | --num_train_epochs 10 \ 133 | --fp16 \ 134 | --save_steps 200 \ 135 | --output_dir output/dnabert1_${kmer} \ 136 | --evaluation_strategy steps \ 137 | --eval_steps 200 \ 138 | --warmup_steps 50 \ 139 | --logging_steps 100000 \ 140 | --overwrite_output_dir True \ 141 | --log_level info \ 142 | --seed ${seed} \ 143 | --find_unused_parameters False 144 | done 145 | 146 | for data in reconstructed 147 | do 148 | python train.py \ 149 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 150 | --data_path ${data_path}/GUE/splice/$data \ 151 | --kmer ${kmer} \ 152 | --run_name DNABERT1_${kmer}_splice_${data}_seed${seed} \ 153 | --model_max_length 410 \ 154 | --per_device_train_batch_size 8 \ 155 | --per_device_eval_batch_size 16 \ 156 | --gradient_accumulation_steps 1 \ 157 | --learning_rate 3e-5 \ 158 | --num_train_epochs 5 \ 159 | --fp16 \ 160 | --save_steps 200 \ 161 | --output_dir output/dnabert1_${kmer} \ 162 | --evaluation_strategy steps \ 163 | --eval_steps 200 \ 164 | --warmup_steps 50 \ 165 | --logging_steps 100000 \ 166 | --overwrite_output_dir True \ 167 | --log_level info \ 168 | --seed ${seed} \ 169 | --find_unused_parameters False 170 | done 171 | 172 | 173 | for data in covid 174 | do 175 | python train.py \ 176 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 177 | --data_path ${data_path}/GUE/virus/$data \ 178 | --kmer ${kmer} \ 179 | --run_name DNABERT1_${kmer}_virus_${data}_seed${seed} \ 180 | --model_max_length 1024 \ 181 | --per_device_train_batch_size 8 \ 182 | --per_device_eval_batch_size 8 \ 183 | --gradient_accumulation_steps 4 \ 184 | --learning_rate 3e-5 \ 185 | --num_train_epochs 9 \ 186 | --fp16 \ 187 | --save_steps 200 \ 188 | --output_dir output/dnabert1_${kmer} \ 189 | --evaluation_strategy steps \ 190 | --eval_steps 200 \ 191 | --warmup_steps 50 \ 192 | --logging_steps 100000 \ 193 | --overwrite_output_dir True \ 194 | --log_level info \ 195 | --seed ${seed} \ 196 | --find_unused_parameters False 197 | done 198 | 199 | 200 | for data in 0 1 2 3 4 201 | do 202 | python train.py \ 203 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 204 | --data_path ${data_path}/GUE/mouse/$data \ 205 | --kmer ${kmer} \ 206 | --run_name DNABERT1_${kmer}_mouse_${data}_seed${seed} \ 207 | --model_max_length 110 \ 208 | 
--per_device_train_batch_size 8 \ 209 | --per_device_eval_batch_size 64 \ 210 | --gradient_accumulation_steps 1 \ 211 | --learning_rate 3e-5 \ 212 | --num_train_epochs 5 \ 213 | --max_steps 1000 \ 214 | --fp16 \ 215 | --save_steps 200 \ 216 | --output_dir output/dnabert1_${kmer} \ 217 | --evaluation_strategy steps \ 218 | --eval_steps 200 \ 219 | --warmup_steps 30 \ 220 | --logging_steps 100000 \ 221 | --overwrite_output_dir True \ 222 | --log_level info \ 223 | --seed ${seed} \ 224 | --find_unused_parameters False 225 | done 226 | 227 | 228 | for data in 0 1 2 3 4 229 | do 230 | python train.py \ 231 | --model_name_or_path zhihan1996/DNA_bert_${kmer} \ 232 | --data_path ${data_path}/GUE/tf/$data \ 233 | --kmer ${kmer} \ 234 | --run_name DNABERT1_${kmer}_tf_${data}_seed${seed} \ 235 | --model_max_length 110 \ 236 | --per_device_train_batch_size 8 \ 237 | --per_device_eval_batch_size 64 \ 238 | --gradient_accumulation_steps 1 \ 239 | --learning_rate 3e-5 \ 240 | --num_train_epochs 3 \ 241 | --fp16 \ 242 | --save_steps 200 \ 243 | --output_dir output/dnabert1_${kmer} \ 244 | --evaluation_strategy steps \ 245 | --eval_steps 200 \ 246 | --warmup_steps 30 \ 247 | --logging_steps 100000 \ 248 | --overwrite_output_dir True \ 249 | --log_level info \ 250 | --seed ${seed} \ 251 | --find_unused_parameters False 252 | done 253 | done -------------------------------------------------------------------------------- /1_DatasetCharacteristics/README.md: -------------------------------------------------------------------------------- 1 | # Dataset Characteristics for Genomic Data (DNA sequences) 2 | ## Project: `Genome Understanding using LLMs` 3 | **Author:** `Muhammad Aammar Tufail (Ph.D.)` 4 | 5 | ## `Overview` 6 | 7 | Genomic data, particularly DNA sequences, are fundamental to understanding the genetic basis of life. The characteristics of genomic datasets play a crucial role in determining the success of large language models (LLMs), in analyzing and interpreting these sequences. This document explores the key characteristics of genomic datasets, focusing on DNA sequences in FASTA format, to provide insights into the data requirements for training and fine-tuning LLMs for genomics applications. 8 | 9 | There are several key characteristics of genomic datasets that influence the performance and applicability of LLMs for genomic analysis. These characteristics include the format of the data, the length and complexity of the sequences, the presence of repetitive elements, the distribution of genomic elements, and the availability of labeled data for supervised learning tasks. Understanding these characteristics is essential for effectively leveraging LLMs to analyze genomic data and extract meaningful insights. 10 | 11 | The following table summarizes the key characteristics of genomic datasets and their implications for training and fine-tuning LLMs for genomic analysis and also includes the challenges of handling DNA sequence data compared to NLP data for LLM model training or fine-tuning: 12 | 13 | | **Characteristic** | **Normal NLP Data** | **Genomic Data** | **Challenges in Handling DNA Sequences** | 14 | |-----------------------------|-----------------------------------------------------------|--------------------------------------------------------------|------------------------------------------| 15 | | **Data Units** | Words, sentences, paragraphs | Nucleotide bases (A, T, C, G), codons, genes, sequences | Need to handle very limited alphabet size and repetitive sequences effectively. 
| 16 | | **Vocabulary Size** | Large (10,000 - 100,000+ words in various languages) | Small (4 bases: A, T, C, G) | Smaller vocabulary leads to potential difficulties in capturing complex patterns. | 17 | | **Length of Sequences** | Varies (e.g., tweets: ~20-50 words; books: thousands) | Varies (e.g., gene: hundreds to thousands of base pairs; genome: billions) | Extremely long sequences create memory and computational challenges for LLMs. | 18 | | **Data Structure** | Hierarchical (words → sentences → paragraphs → documents) | Linear (base pairs forming genes, regions, chromosomes) | Linear data structure lacks hierarchical organization, requiring new techniques for long-range dependencies. | 19 | | **Semantic Meaning** | Contextual (based on grammar, syntax, and meaning) | Contextual but biological (based on functional regions, regulatory elements, mutations) | Lack of clear, interpretable "semantic" structure akin to human language; meaning is highly domain-specific. | 20 | | **Context Dependencies** | Grammar rules, language syntax, domain-specific knowledge | Biological rules (e.g., codon usage, regulatory elements, gene expression) | Complex, non-intuitive dependencies such as epigenetic markers or chromatin interactions. | 21 | | **Annotations** | Part-of-speech tags, named entities, sentiment labels | Gene annotations, exon/intron boundaries, mutation labels, functional regions | Proper annotation is sparse and requires expert knowledge, making supervised learning harder. | 22 | | **Multimodal Connections** | Text linked with images, videos, or speech data | DNA linked with RNA, proteins, epigenetic marks, phenotypic data | Multimodal data integration requires specialized techniques for biological data. | 23 | | **Language Dynamics** | Constantly evolving with new words, phrases, slang | Relatively stable (evolution over long periods; mutations occur slowly) | Slow evolution of sequences provides fewer data changes, limiting adaptive learning opportunities. | 24 | | **Data Diversity** | Different languages, dialects, writing styles | Different species, populations, tissue types, genetic diversity | Species and population-specific variations require model generalization across highly diverse datasets. | 25 | | **Noise and Ambiguity** | Typos, slang, grammatical errors | Sequencing errors, ambiguous nucleotide calls (e.g., N bases) | High precision required due to biological significance of each base; sequencing errors can heavily impact outcomes. | 26 | | **Sequence Alignment** | Not relevant; words are mostly independent | Highly relevant; sequences must be aligned for comparative analysis (e.g., multiple sequence alignment) | Proper alignment is computationally expensive and critical for correct downstream interpretation. | 27 | | **Interpretation Requirements** | Requires cultural, syntactic, and contextual understanding | Requires biological and evolutionary understanding; interpretation of functional regions | Interpretation relies on complex biological phenomena, often requiring specialized knowledge not inherent to LLMs. | 28 | | **Size of Typical Dataset** | Varies (from a few hundred samples to billions of words) | Large (from a few sequences to terabytes of genomic data for entire populations) | Huge data sizes require advanced storage and processing infrastructure. 
| 29 | | **Contextual Meaning** | Based on co-occurrence, syntactic roles, and pragmatic context | Based on biological function, codon usage, and sequence conservation | Capturing functional meaning is difficult because it requires integrating biological rules rather than linguistic ones. | 30 | | **Specialized Tasks** | Sentiment analysis, translation, summarization | Gene prediction, variant calling, functional annotation, sequence classification | LLMs need to be retrained or fine-tuned for specialized biological tasks that are highly domain-specific. | 31 | | **Data Privacy Concerns** | Sensitive when handling personal texts or private documents | Sensitive due to implications for personal genetic information and health data | Privacy regulations (e.g., GDPR, HIPAA) impose strict controls on handling genomic data, complicating model access and use. | 32 | | **Data Representation** | Tokenized words and sentences | Tokenized as base pairs or k-mers (short nucleotide sequences) | Representing long sequences effectively while retaining functional information is challenging. | 33 | | **Error Tolerance** | Can tolerate minor errors (e.g., typos, incorrect grammar) | Low tolerance for errors due to the importance of each base in determining biological function | Errors in sequence prediction or handling can have significant downstream effects, e.g., on mutation detection or drug response prediction. | 34 | 35 | ### **Summary of Challenges**: 36 | - **Scale**: Genomic datasets are significantly larger and more complex than typical NLP datasets, which creates memory, processing, and storage challenges. 37 | - **Sequence Complexity**: The biological rules governing DNA sequences are more complex and less intuitive than the linguistic rules governing natural language. 38 | - **Interpretation**: Unlike human language, where meaning is derived from word co-occurrences and grammar, genomic data requires biological context and functional knowledge for accurate interpretation. 39 | - **Precision**: DNA sequences require much greater precision, as errors can have profound biological consequences, unlike minor errors in NLP that typically do not change the overall meaning of the text. 40 | 41 | These challenges necessitate specialized approaches to model design, training techniques, and data handling for DNA sequence data compared to standard NLP tasks. 42 | 43 | ## `Relation to Existing Literature` 44 | 45 | The characteristics of genomic datasets discussed here align with findings from existing literature on the application of LLMs in genomics. For example, studies on [DNABERT](https://academic.oup.com/bioinformatics/article/37/15/2112/6128680?login=false), [DNABERT-2](https://arxiv.org/html/2306.15006v2), and the [Nucleotide Transformer](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v3) highlight the importance of handling DNA sequences effectively, including tokenization, model adaptation, and fine-tuning for genomic tasks. These studies emphasize the need for specialized models and training techniques to address the unique characteristics of genomic data and achieve optimal performance in genomics applications. 46 | 47 | ## `Example of DNA Sequences in FASTA Format` 48 | Here you may see a sample DNA sequence in FASTA format. 
49 | 50 | ```plaintext 51 | >sequence_1 52 | ATCGTACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG 53 | >sequence_2 54 | GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC 55 | >sequence_3 56 | TACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG 57 | ``` 58 | `Note`: The above sequences are for illustrative purposes only and do not represent actual genomic data. 59 | This sequence file contains three sequences, each labeled with a unique identifier (e.g., sequence_1, sequence_2, sequence_3) and the corresponding DNA sequence. The sequences consist of combinations of four nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). Each sequence is represented as a linear string of nucleotide bases, with each base corresponding to a specific position in the DNA sequence. 60 | 61 | The Gene features of Bacteroides fragillis genome are in [this file](./genome_Bacteroides_fragillis.fasta). 62 | 63 | ## `Conclusion` 64 | 65 | Understanding the key characteristics of genomic datasets is essential for effectively training and fine-tuning large language models for genomics applications. By recognizing the unique properties of DNA sequences, researchers can develop specialized models, data processing pipelines, and training strategies to optimize the performance of LLMs in genomics tasks. The challenges posed by genomic data, such as sequence complexity, interpretational requirements, and precision demands, necessitate tailored approaches to model design and training to ensure accurate and meaningful analysis of DNA sequences. 66 | 67 | --- 68 | 69 | 70 | -------------------------------------------------------------------------------- /3_Model_fine_tuning/finetune/scripts/run_nt.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | data_path=$1 4 | m=$2 5 | 6 | 7 | if [ "$m" -eq 0 ]; then 8 | model=InstaDeepAI/nucleotide-transformer-500m-1000g 9 | run_name=NT_500_1000g 10 | elif [ "$m" -eq 1 ]; then 11 | model=InstaDeepAI/nucleotide-transformer-500m-human-ref 12 | run_name=NT_500_human 13 | elif [ "$m" -eq 2 ]; then 14 | model=InstaDeepAI/nucleotide-transformer-2.5b-1000g 15 | run_name=NT_2500_1000g 16 | elif [ "$m" -eq 3 ]; then 17 | model=InstaDeepAI/nucleotide-transformer-2.5b-multi-species 18 | run_name=NT_2500_multi 19 | else 20 | echo "Wrong argument" 21 | exit 1 22 | fi 23 | echo "Use: $model" 24 | 25 | 26 | for seed in 42 27 | do 28 | for data in H3 H3K14ac H3K36me3 H3K4me1 H3K4me2 H3K4me3 H3K79me3 H3K9ac H4 H4ac 29 | do 30 | python train.py \ 31 | --model_name_or_path ${model} \ 32 | --data_path ${data_path}/GUE/EMP/$data \ 33 | --kmer -1 \ 34 | --run_name ${run_name}_EMP_${data}_seed${seed} \ 35 | --model_max_length 100 \ 36 | --use_lora \ 37 | --lora_target_modules 'query,value,key,dense' \ 38 | --per_device_train_batch_size 8 \ 39 | --per_device_eval_batch_size 16 \ 40 | --gradient_accumulation_steps 1 \ 41 | --lora_alpha 16 \ 42 | --learning_rate 1e-4 \ 43 | --num_train_epochs 3 \ 44 | --fp16 \ 45 | --save_steps 200 \ 46 | --output_dir output/nt_${run_name} \ 47 | --evaluation_strategy steps \ 48 | --eval_steps 200 \ 49 | --warmup_steps 50 \ 50 | --logging_steps 100000 \ 51 | --overwrite_output_dir True \ 52 | --log_level info \ 53 | --seed ${seed} \ 54 | --find_unused_parameters False 55 | done 56 | 57 | 58 | 59 | for data in prom_core_all prom_core_notata 60 | do 61 | python train.py \ 62 | --model_name_or_path ${model} \ 63 | --data_path ${data_path}/GUE/prom/$data \ 64 | --kmer -1 \ 65 | 
--run_name ${run_name}_prom_${data}_seed${seed} \ 66 | --model_max_length 20 \ 67 | --use_lora \ 68 | --lora_target_modules 'query,value,key,dense' \ 69 | --per_device_train_batch_size 8 \ 70 | --per_device_eval_batch_size 16 \ 71 | --gradient_accumulation_steps 1 \ 72 | --lora_alpha 16 \ 73 | --learning_rate 1e-4 \ 74 | --num_train_epochs 4 \ 75 | --fp16 \ 76 | --save_steps 400 \ 77 | --output_dir output/nt_${run_name} \ 78 | --evaluation_strategy steps \ 79 | --eval_steps 400 \ 80 | --warmup_steps 50 \ 81 | --logging_steps 100000 \ 82 | --overwrite_output_dir True \ 83 | --log_level info \ 84 | --seed ${seed} \ 85 | --find_unused_parameters False 86 | done 87 | 88 | 89 | for data in prom_core_tata 90 | do 91 | python train.py \ 92 | --model_name_or_path ${model} \ 93 | --data_path ${data_path}/GUE/prom/$data \ 94 | --kmer -1 \ 95 | --run_name ${run_name}_prom_${data}_seed${seed} \ 96 | --model_max_length 20 \ 97 | --use_lora \ 98 | --lora_target_modules 'query,value,key,dense' \ 99 | --per_device_train_batch_size 8 \ 100 | --per_device_eval_batch_size 16 \ 101 | --gradient_accumulation_steps 1 \ 102 | --lora_alpha 16 \ 103 | --learning_rate 1e-4 \ 104 | --num_train_epochs 10 \ 105 | --fp16 \ 106 | --save_steps 200 \ 107 | --output_dir output/nt_${run_name} \ 108 | --evaluation_strategy steps \ 109 | --eval_steps 200 \ 110 | --warmup_steps 50 \ 111 | --logging_steps 100000 \ 112 | --overwrite_output_dir True \ 113 | --log_level info \ 114 | --seed ${seed} \ 115 | --find_unused_parameters False 116 | done 117 | 118 | for data in prom_300_all prom_300_notata 119 | do 120 | python train.py \ 121 | --model_name_or_path ${model} \ 122 | --data_path ${data_path}/GUE/prom/$data \ 123 | --kmer -1 \ 124 | --run_name ${run_name}_prom_${data}_seed${seed} \ 125 | --model_max_length 70 \ 126 | --use_lora \ 127 | --lora_target_modules 'query,value,key,dense' \ 128 | --per_device_train_batch_size 8 \ 129 | --per_device_eval_batch_size 16 \ 130 | --gradient_accumulation_steps 1 \ 131 | --lora_alpha 16 \ 132 | --learning_rate 1e-4 \ 133 | --num_train_epochs 4 \ 134 | --fp16 \ 135 | --save_steps 400 \ 136 | --output_dir output/nt_${run_name} \ 137 | --evaluation_strategy steps \ 138 | --eval_steps 400 \ 139 | --warmup_steps 50 \ 140 | --logging_steps 100000 \ 141 | --overwrite_output_dir True \ 142 | --log_level info \ 143 | --seed ${seed} \ 144 | --find_unused_parameters False 145 | done 146 | 147 | 148 | for data in prom_300_tata 149 | do 150 | python train.py \ 151 | --model_name_or_path ${model} \ 152 | --data_path ${data_path}/GUE/prom/$data \ 153 | --kmer -1 \ 154 | --run_name ${run_name}_prom_${data}_seed${seed} \ 155 | --model_max_length 70 \ 156 | --use_lora \ 157 | --lora_target_modules 'query,value,key,dense' \ 158 | --per_device_train_batch_size 8 \ 159 | --per_device_eval_batch_size 16 \ 160 | --gradient_accumulation_steps 1 \ 161 | --lora_alpha 16 \ 162 | --learning_rate 1e-4 \ 163 | --num_train_epochs 10 \ 164 | --fp16 \ 165 | --save_steps 200 \ 166 | --output_dir output/nt_${run_name} \ 167 | --evaluation_strategy steps \ 168 | --eval_steps 200 \ 169 | --warmup_steps 50 \ 170 | --logging_steps 100000 \ 171 | --overwrite_output_dir True \ 172 | --log_level info \ 173 | --seed ${seed} \ 174 | --find_unused_parameters False 175 | done 176 | 177 | 178 | for data in reconstructed 179 | do 180 | python train.py \ 181 | --model_name_or_path ${model} \ 182 | --data_path ${data_path}/GUE/splice/$data \ 183 | --kmer -1 \ 184 | --run_name ${run_name}_splice_${data}_seed${seed} \ 185 | 
--model_max_length 80 \ 186 | --use_lora \ 187 | --lora_target_modules 'query,value,key,dense' \ 188 | --per_device_train_batch_size 8 \ 189 | --per_device_eval_batch_size 16 \ 190 | --gradient_accumulation_steps 1 \ 191 | --lora_alpha 16 \ 192 | --learning_rate 1e-4 \ 193 | --num_train_epochs 5 \ 194 | --fp16 \ 195 | --save_steps 200 \ 196 | --output_dir output/nt_${run_name} \ 197 | --evaluation_strategy steps \ 198 | --eval_steps 200 \ 199 | --warmup_steps 50 \ 200 | --logging_steps 100000 \ 201 | --overwrite_output_dir True \ 202 | --log_level info \ 203 | --seed ${seed} \ 204 | --find_unused_parameters False 205 | done 206 | 207 | 208 | 209 | for data in covid 210 | do 211 | python train.py \ 212 | --model_name_or_path ${model} \ 213 | --data_path ${data_path}/GUE/virus/$data \ 214 | --kmer -1 \ 215 | --run_name ${run_name}_virus_${data}_seed${seed} \ 216 | --model_max_length 256 \ 217 | --use_lora \ 218 | --lora_target_modules 'query,value,key,dense' \ 219 | --per_device_train_batch_size 2 \ 220 | --per_device_eval_batch_size 4 \ 221 | --gradient_accumulation_steps 4 \ 222 | --learning_rate 1e-4 \ 223 | --num_train_epochs 3 \ 224 | --fp16 \ 225 | --save_steps 10000 \ 226 | --output_dir output/nt_${run_name} \ 227 | --evaluation_strategy steps \ 228 | --eval_steps 200 \ 229 | --warmup_steps 200 \ 230 | --logging_steps 100000 \ 231 | --overwrite_output_dir True \ 232 | --log_level info \ 233 | --seed ${seed} \ 234 | --find_unused_parameters False 235 | done 236 | 237 | for data in 0 1 2 3 4 238 | do 239 | python train.py \ 240 | --model_name_or_path ${model} \ 241 | --data_path ${data_path}/GUE/mouse/$data \ 242 | --kmer -1 \ 243 | --run_name ${run_name}_mouse_${data}_seed${seed} \ 244 | --model_max_length 30 \ 245 | --use_lora \ 246 | --lora_target_modules 'query,value,key,dense' \ 247 | --per_device_train_batch_size 8 \ 248 | --per_device_eval_batch_size 64 \ 249 | --gradient_accumulation_steps 1 \ 250 | --lora_alpha 16 \ 251 | --learning_rate 1e-4 \ 252 | --num_train_epochs 5 \ 253 | --max_steps 1000 \ 254 | --fp16 \ 255 | --save_steps 200 \ 256 | --output_dir output/nt_${run_name} \ 257 | --evaluation_strategy steps \ 258 | --eval_steps 200 \ 259 | --warmup_steps 100 \ 260 | --logging_steps 100000 \ 261 | --overwrite_output_dir True \ 262 | --log_level info \ 263 | --seed ${seed} \ 264 | --find_unused_parameters False 265 | done 266 | 267 | 268 | for data in 0 1 2 3 4 269 | do 270 | python train.py \ 271 | --model_name_or_path ${model} \ 272 | --data_path ${data_path}/GUE/tf/$data \ 273 | --kmer -1 \ 274 | --run_name ${run_name}_tf_${data}_seed${seed} \ 275 | --model_max_length 30 \ 276 | --use_lora \ 277 | --lora_target_modules 'query,value,key,dense' \ 278 | --per_device_train_batch_size 8 \ 279 | --per_device_eval_batch_size 64 \ 280 | --gradient_accumulation_steps 1 \ 281 | --lora_alpha 16 \ 282 | --learning_rate 1e-4 \ 283 | --num_train_epochs 3 \ 284 | --fp16 \ 285 | --save_steps 200 \ 286 | --output_dir output/nt_${run_name} \ 287 | --evaluation_strategy steps \ 288 | --eval_steps 200 \ 289 | --warmup_steps 30 \ 290 | --logging_steps 100000 \ 291 | --overwrite_output_dir True \ 292 | --log_level info \ 293 | --seed ${seed} \ 294 | --find_unused_parameters False 295 | done 296 | done -------------------------------------------------------------------------------- /2_BaselineModel/Baseline_model_DNABERT2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | 
"source": [ 7 | "# Baseline Model\n", 8 | "\n", 9 | "## DNABERT2 model\n", 10 | "\n", 11 | "This model will be used to calculate the embeddings from DNA sequences. The embeddings would be used to train a classifier to predict the binding affinity of the DNA sequences. The model is based on the DNABERT2 model, which is a transformer model that is pretrained on DNA." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import torch\n", 21 | "from transformers import AutoTokenizer, AutoModel" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stderr", 31 | "output_type": "stream", 32 | "text": [ 33 | "/Users/babaaammar/mambaforge/envs/dna_llm/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n", 34 | " warnings.warn(\n" 35 | ] 36 | }, 37 | { 38 | "data": { 39 | "application/vnd.jupyter.widget-view+json": { 40 | "model_id": "2440bf06b6a24b9d99391083599122ac", 41 | "version_major": 2, 42 | "version_minor": 0 43 | }, 44 | "text/plain": [ 45 | "tokenizer_config.json: 0%| | 0.00/158 [00:00 The model can generate DNA embeddings that naturally cluster and segregate genomes of different species in the embedding space, making it useful for comparative genomics" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "> Embedding is the process of representing a DNA sequence as a fixed-length vector. The embedding of a DNA sequence is a numerical representation of the sequence that captures its biological properties. The embedding can be used as input to machine learning models for various tasks such as classification, clustering, and similarity search." 
218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 3, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "name": "stdout", 227 | "output_type": "stream", 228 | "text": [ 229 | "torch.Size([768])\n", 230 | "torch.Size([768])\n" 231 | ] 232 | } 233 | ], 234 | "source": [ 235 | "dna = \"ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC\"\n", 236 | "inputs = tokenizer(dna, return_tensors = 'pt')[\"input_ids\"]\n", 237 | "hidden_states = model(inputs)[0] # [1, sequence_length, 768]\n", 238 | "\n", 239 | "# embedding with mean pooling\n", 240 | "embedding_mean = torch.mean(hidden_states[0], dim=0)\n", 241 | "print(embedding_mean.shape) # expect to be 768\n", 242 | "\n", 243 | "# embedding with max pooling\n", 244 | "embedding_max = torch.max(hidden_states[0], dim=0)[0]\n", 245 | "print(embedding_max.shape) # expect to be 768" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 7, 251 | "metadata": {}, 252 | "outputs": [ 253 | { 254 | "name": "stdout", 255 | "output_type": "stream", 256 | "text": [ 257 | "torch.Size([768])\n", 258 | "torch.Size([768])\n" 259 | ] 260 | } 261 | ], 262 | "source": [ 263 | "dna = \"AAAAAGCCTGTGAAGCACAGAGAGCAGCCAGCCAGAGCTGATGCTCAATGGCAGAAACTGCTTAGTCACGCTGAAAGGGAGCCAAGGCAATAGCAGAGTGG\"\n", 264 | "inputs = tokenizer(dna, return_tensors = 'pt')[\"input_ids\"]\n", 265 | "hidden_states = model(inputs)[0] # [1, sequence_length, 768]\n", 266 | "\n", 267 | "# embedding with mean pooling\n", 268 | "embedding_mean = torch.mean(hidden_states[0], dim=0)\n", 269 | "print(embedding_mean.shape) # expect to be 768\n", 270 | "\n", 271 | "# embedding with max pooling\n", 272 | "embedding_max = torch.max(hidden_states[0], dim=0)[0]\n", 273 | "print(embedding_max.shape) # expect to be 768" 274 | ] 275 | } 276 | ], 277 | "metadata": { 278 | "kernelspec": { 279 | "display_name": "dna_llm", 280 | "language": "python", 281 | "name": "python3" 282 | }, 283 | "language_info": { 284 | "codemirror_mode": { 285 | "name": "ipython", 286 | "version": 3 287 | }, 288 | "file_extension": ".py", 289 | "mimetype": "text/x-python", 290 | "name": "python", 291 | "nbconvert_exporter": "python", 292 | "pygments_lexer": "ipython3", 293 | "version": "3.8.undefined" 294 | } 295 | }, 296 | "nbformat": 4, 297 | "nbformat_minor": 2 298 | } 299 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 
22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | 
--------------------------------------------------------------------------------
/3_Model_fine_tuning/finetune/train.py:
--------------------------------------------------------------------------------
1 | import os
2 | import csv
3 | import copy
4 | import json
5 | import logging
6 | from dataclasses import dataclass, field
7 | from typing import Optional, Dict, Sequence, Tuple, List
8 | 
9 | import torch
10 | import transformers
11 | import sklearn
12 | import numpy as np
13 | from torch.utils.data import Dataset
14 | 
15 | from peft import (
16 |     LoraConfig,
17 |     get_peft_model,
18 |     get_peft_model_state_dict,
19 | )
20 | 
21 | 
22 | @dataclass
23 | class ModelArguments:
24 |     model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
25 |     use_lora: bool = field(default=False, metadata={"help": "whether to use LoRA"})
26 |     lora_r: int = field(default=8, metadata={"help": "hidden dimension for LoRA"})
27 |     lora_alpha: int = field(default=32, metadata={"help": "alpha for LoRA"})
28 |     lora_dropout: float = field(default=0.05, metadata={"help": "dropout rate for LoRA"})
29 |     lora_target_modules: str = field(default="query,value", metadata={"help": "where to perform LoRA"})
30 | 
31 | 
32 | @dataclass
33 | class DataArguments:
34 |     data_path: str = field(default=None, metadata={"help": "Path to the training data."})
35 |     kmer: int = field(default=-1, metadata={"help": "k-mer for input sequence. -1 means not using k-mer."})
36 | 
37 | 
38 | @dataclass
39 | class TrainingArguments(transformers.TrainingArguments):
40 |     cache_dir: Optional[str] = field(default=None)
41 |     run_name: str = field(default="run")
42 |     optim: str = field(default="adamw_torch")
43 |     model_max_length: int = field(default=512, metadata={"help": "Maximum sequence length."})
44 |     gradient_accumulation_steps: int = field(default=1)
45 |     per_device_train_batch_size: int = field(default=1)
46 |     per_device_eval_batch_size: int = field(default=1)
47 |     num_train_epochs: int = field(default=1)
48 |     fp16: bool = field(default=False)
49 |     logging_steps: int = field(default=100)
50 |     save_steps: int = field(default=100)
51 |     eval_steps: int = field(default=100)
52 |     evaluation_strategy: str = field(default="steps")
53 |     warmup_steps: int = field(default=50)
54 |     weight_decay: float = field(default=0.01)
55 |     learning_rate: float = field(default=1e-4)
56 |     save_total_limit: int = field(default=3)
57 |     load_best_model_at_end: bool = field(default=True)
58 |     output_dir: str = field(default="output")
59 |     find_unused_parameters: bool = field(default=False)
60 |     checkpointing: bool = field(default=False)
61 |     dataloader_pin_memory: bool = field(default=False)
62 |     eval_and_save_results: bool = field(default=True)
63 |     save_model: bool = field(default=False)
64 |     seed: int = field(default=42)
65 | 
66 | 
67 | def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
68 |     """Collect the state dict and dump it to disk."""
69 |     state_dict = trainer.model.state_dict()
70 |     if trainer.args.should_save:
71 |         cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
72 |         del state_dict
73 |         trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
74 | 
75 | 
76 | """
77 | Get the complement of the original DNA sequence (the reverse-complement variant is left commented out in the function body).
78 | """ 79 | def get_alter_of_dna_sequence(sequence: str): 80 | MAP = {"A": "T", "T": "A", "C": "G", "G": "C"} 81 | # return "".join([MAP[c] for c in reversed(sequence)]) 82 | return "".join([MAP[c] for c in sequence]) 83 | 84 | """ 85 | Transform a dna sequence to k-mer string 86 | """ 87 | def generate_kmer_str(sequence: str, k: int) -> str: 88 | """Generate k-mer string from DNA sequence.""" 89 | return " ".join([sequence[i:i+k] for i in range(len(sequence) - k + 1)]) 90 | 91 | 92 | """ 93 | Load or generate k-mer string for each DNA sequence. The generated k-mer string will be saved to the same directory as the original data with the same name but with a suffix of "_{k}mer". 94 | """ 95 | def load_or_generate_kmer(data_path: str, texts: List[str], k: int) -> List[str]: 96 | """Load or generate k-mer string for each DNA sequence.""" 97 | kmer_path = data_path.replace(".csv", f"_{k}mer.json") 98 | if os.path.exists(kmer_path): 99 | logging.warning(f"Loading k-mer from {kmer_path}...") 100 | with open(kmer_path, "r") as f: 101 | kmer = json.load(f) 102 | else: 103 | logging.warning(f"Generating k-mer...") 104 | kmer = [generate_kmer_str(text, k) for text in texts] 105 | with open(kmer_path, "w") as f: 106 | logging.warning(f"Saving k-mer to {kmer_path}...") 107 | json.dump(kmer, f) 108 | 109 | return kmer 110 | 111 | class SupervisedDataset(Dataset): 112 | """Dataset for supervised fine-tuning.""" 113 | 114 | def __init__(self, 115 | data_path: str, 116 | tokenizer: transformers.PreTrainedTokenizer, 117 | kmer: int = -1): 118 | 119 | super(SupervisedDataset, self).__init__() 120 | 121 | # load data from the disk 122 | with open(data_path, "r") as f: 123 | data = list(csv.reader(f))[1:] 124 | if len(data[0]) == 2: 125 | # data is in the format of [text, label] 126 | logging.warning("Perform single sequence classification...") 127 | texts = [d[0] for d in data] 128 | labels = [int(d[1]) for d in data] 129 | elif len(data[0]) == 3: 130 | # data is in the format of [text1, text2, label] 131 | logging.warning("Perform sequence-pair classification...") 132 | texts = [[d[0], d[1]] for d in data] 133 | labels = [int(d[2]) for d in data] 134 | else: 135 | raise ValueError("Data format not supported.") 136 | 137 | if kmer != -1: 138 | # only write file on the first process 139 | if torch.distributed.get_rank() not in [0, -1]: 140 | torch.distributed.barrier() 141 | 142 | logging.warning(f"Using {kmer}-mer as input...") 143 | texts = load_or_generate_kmer(data_path, texts, kmer) 144 | 145 | if torch.distributed.get_rank() == 0: 146 | torch.distributed.barrier() 147 | 148 | output = tokenizer( 149 | texts, 150 | return_tensors="pt", 151 | padding="longest", 152 | max_length=tokenizer.model_max_length, 153 | truncation=True, 154 | ) 155 | 156 | self.input_ids = output["input_ids"] 157 | self.attention_mask = output["attention_mask"] 158 | self.labels = labels 159 | self.num_labels = len(set(labels)) 160 | 161 | def __len__(self): 162 | return len(self.input_ids) 163 | 164 | def __getitem__(self, i) -> Dict[str, torch.Tensor]: 165 | return dict(input_ids=self.input_ids[i], labels=self.labels[i]) 166 | 167 | 168 | @dataclass 169 | class DataCollatorForSupervisedDataset(object): 170 | """Collate examples for supervised fine-tuning.""" 171 | 172 | tokenizer: transformers.PreTrainedTokenizer 173 | 174 | def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]: 175 | input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels")) 176 | input_ids = 
torch.nn.utils.rnn.pad_sequence( 177 | input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id 178 | ) 179 | labels = torch.Tensor(labels).long() 180 | return dict( 181 | input_ids=input_ids, 182 | labels=labels, 183 | attention_mask=input_ids.ne(self.tokenizer.pad_token_id), 184 | ) 185 | 186 | """ 187 | Manually calculate the accuracy, f1, matthews_correlation, precision, recall with sklearn. 188 | """ 189 | def calculate_metric_with_sklearn(logits: np.ndarray, labels: np.ndarray): 190 | if logits.ndim == 3: 191 | # Reshape logits to 2D if needed 192 | logits = logits.reshape(-1, logits.shape[-1]) 193 | predictions = np.argmax(logits, axis=-1) 194 | valid_mask = labels != -100 # Exclude padding tokens (assuming -100 is the padding token ID) 195 | valid_predictions = predictions[valid_mask] 196 | valid_labels = labels[valid_mask] 197 | return { 198 | "accuracy": sklearn.metrics.accuracy_score(valid_labels, valid_predictions), 199 | "f1": sklearn.metrics.f1_score( 200 | valid_labels, valid_predictions, average="macro", zero_division=0 201 | ), 202 | "matthews_correlation": sklearn.metrics.matthews_corrcoef( 203 | valid_labels, valid_predictions 204 | ), 205 | "precision": sklearn.metrics.precision_score( 206 | valid_labels, valid_predictions, average="macro", zero_division=0 207 | ), 208 | "recall": sklearn.metrics.recall_score( 209 | valid_labels, valid_predictions, average="macro", zero_division=0 210 | ), 211 | } 212 | 213 | 214 | """ 215 | Compute metrics used for huggingface trainer. 216 | """ 217 | def compute_metrics(eval_pred): 218 | logits, labels = eval_pred 219 | if isinstance(logits, tuple): # Unpack logits if it's a tuple 220 | logits = logits[0] 221 | return calculate_metric_with_sklearn(logits, labels) 222 | 223 | 224 | 225 | def train(): 226 | parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments)) 227 | model_args, data_args, training_args = parser.parse_args_into_dataclasses() 228 | 229 | # load tokenizer 230 | tokenizer = transformers.AutoTokenizer.from_pretrained( 231 | model_args.model_name_or_path, 232 | cache_dir=training_args.cache_dir, 233 | model_max_length=training_args.model_max_length, 234 | padding_side="right", 235 | use_fast=True, 236 | trust_remote_code=True, 237 | ) 238 | 239 | if "InstaDeepAI" in model_args.model_name_or_path: 240 | tokenizer.eos_token = tokenizer.pad_token 241 | 242 | # define datasets and data collator 243 | train_dataset = SupervisedDataset(tokenizer=tokenizer, 244 | data_path=os.path.join(data_args.data_path, "train.csv"), 245 | kmer=data_args.kmer) 246 | val_dataset = SupervisedDataset(tokenizer=tokenizer, 247 | data_path=os.path.join(data_args.data_path, "dev.csv"), 248 | kmer=data_args.kmer) 249 | test_dataset = SupervisedDataset(tokenizer=tokenizer, 250 | data_path=os.path.join(data_args.data_path, "test.csv"), 251 | kmer=data_args.kmer) 252 | data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer) 253 | 254 | 255 | # load model 256 | model = transformers.AutoModelForSequenceClassification.from_pretrained( 257 | model_args.model_name_or_path, 258 | cache_dir=training_args.cache_dir, 259 | num_labels=train_dataset.num_labels, 260 | trust_remote_code=True, 261 | ) 262 | 263 | # configure LoRA 264 | if model_args.use_lora: 265 | lora_config = LoraConfig( 266 | r=model_args.lora_r, 267 | lora_alpha=model_args.lora_alpha, 268 | target_modules=list(model_args.lora_target_modules.split(",")), 269 | lora_dropout=model_args.lora_dropout, 270 | bias="none", 271 | 
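            # Note: with this config, only the low-rank LoRA adapter matrices injected into the modules listed in --lora_target_modules (the scripts above pass 'query,value,key,dense') are trained; the base model weights stay frozen.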
task_type="SEQ_CLS", 272 | inference_mode=False, 273 | ) 274 | model = get_peft_model(model, lora_config) 275 | model.print_trainable_parameters() 276 | 277 | # define trainer 278 | trainer = transformers.Trainer(model=model, 279 | tokenizer=tokenizer, 280 | args=training_args, 281 | compute_metrics=compute_metrics, 282 | train_dataset=train_dataset, 283 | eval_dataset=val_dataset, 284 | data_collator=data_collator) 285 | trainer.train() 286 | 287 | if training_args.save_model: 288 | trainer.save_state() 289 | safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir) 290 | 291 | # get the evaluation results from trainer 292 | if training_args.eval_and_save_results: 293 | results_path = os.path.join(training_args.output_dir, "results", training_args.run_name) 294 | results = trainer.evaluate(eval_dataset=test_dataset) 295 | os.makedirs(results_path, exist_ok=True) 296 | with open(os.path.join(results_path, "eval_results.json"), "w") as f: 297 | json.dump(results, f) 298 | 299 | 300 | 301 | 302 | if __name__ == "__main__": 303 | train() 304 | --------------------------------------------------------------------------------
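The baseline notebook above computes mean-pooled DNABERT-2 embeddings but stops before the classifier step it describes. The sketch below is one minimal way to finish that step, assuming the `zhihan1996/DNABERT-2-117M` checkpoint, the `sequence,label` CSV layout under `3_Model_fine_tuning/fine_tuing_Data/`, and scikit-learn's `LogisticRegression`; these names are illustrative assumptions, not code shipped in this repository.

```python
# Sketch: train a simple classifier on mean-pooled DNABERT-2 embeddings.
# Assumptions: the checkpoint name, the CSV paths, and the scikit-learn
# classifier are illustrative choices, not part of this repository.
import csv

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

checkpoint = "zhihan1996/DNABERT-2-117M"  # assumed DNABERT-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
model.eval()


def embed(sequence: str) -> torch.Tensor:
    """Mean-pool the last hidden state of a single DNA sequence."""
    inputs = tokenizer(sequence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        hidden_states = model(inputs)[0]        # [1, seq_len, 768]
    return torch.mean(hidden_states[0], dim=0)  # [768]


def load_split(path: str):
    """Read a `sequence,label` CSV (with header) into embeddings and labels."""
    with open(path) as f:
        rows = list(csv.reader(f))[1:]
    X = torch.stack([embed(seq) for seq, _ in rows]).numpy()
    y = [int(label) for _, label in rows]
    return X, y


X_train, y_train = load_split("3_Model_fine_tuning/fine_tuing_Data/train.csv")
X_dev, y_dev = load_split("3_Model_fine_tuning/fine_tuing_Data/dev.csv")

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("dev accuracy:", accuracy_score(y_dev, clf.predict(X_dev)))
```

Because each sequence is embedded one at a time on CPU, this is only practical for small files like the sample CSVs above; for the full GUE tasks, the LoRA fine-tuning driven by `train.py` and the shell scripts remains the intended route.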