├── README.md
├── config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt

/README.md:
--------------------------------------------------------------------------------
# BERT Base Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the BERT base uncased model, fine-tuned for spam detection. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

## Model Details

- **Model Architecture:** BERT Base Uncased
- **Task:** Spam Email Detection
- **Datasets:** Hugging Face's `mail_spam_ham_dataset` and `spam-mail`
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "AventIQ-AI/bert-spam-detection"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Move the model to GPU (if available) and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict_spam_quantized(text):
    """Predicts whether a given text is spam (1) or ham (0) using the quantized BERT model."""

    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

    # Move inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted label (0 = ham, 1 = spam)
    prediction = torch.argmax(outputs.logits, dim=1).item()

    return "Spam" if prediction == 1 else "Ham"


# Sample test messages
print(predict_spam_quantized("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."))
# Expected output: Spam

print(predict_spam_quantized("Hi, are we still on for lunch tomorrow at noon?"))
# Expected output: Ham
```

## 📊 Classification Report (Quantized Model - float16)

| Metric        | Class 0 (Non-Spam) | Class 1 (Spam) | Macro Avg | Weighted Avg |
|---------------|--------------------|----------------|-----------|--------------|
| **Precision** | 1.00               | 0.98           | 0.99      | 0.99         |
| **Recall**    | 0.99               | 0.99           | 0.99      | 0.99         |
| **F1-Score**  | 0.99               | 0.99           | 0.99      | 0.99         |
| **Accuracy**  | **99%**            | **99%**        | **99%**   | **99%**      |

### 🔍 **Observations**

✅ **Precision:** High (1.00 for non-spam, 0.98 for spam) → **few false positives**
✅ **Recall:** High (0.99 for both classes) → **few false negatives**
✅ **F1-Score:** **Near-perfect balance** between precision and recall

## Fine-Tuning Details

### Dataset

The Hugging Face `spam-mail` and `mail_spam_ham_dataset` datasets were combined for fine-tuning; together they contain both spam and ham (non-spam) examples.
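The exact Hugging Face Hub identifiers for these corpora are not given in this README, so the snippet below is only a minimal sketch of how two such corpora can be merged with the `datasets` library; the dataset IDs and the `text`/`label` column names are assumptions to be replaced with the real ones.

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder Hub IDs -- substitute the actual paths of the
# `spam-mail` and `mail_spam_ham_dataset` corpora.
spam_mail = load_dataset("spam-mail", split="train")
mail_spam_ham = load_dataset("mail_spam_ham_dataset", split="train")

# Assumes both corpora expose the same "text" column and an integer
# "label" column (0 = ham, 1 = spam); rename/remap columns first if not.
combined = concatenate_datasets([spam_mail, mail_spam_ham]).shuffle(seed=42)

# Hold out 10% of the combined data for evaluation.
splits = combined.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
```

The main thing to verify before concatenation is that both sources use the same label convention, since the classifier head maps 0 to ham and 1 to spam.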
### Training

- Number of epochs: 3
- Batch size: 8
- Evaluation strategy: epoch
- Learning rate: 2e-5

### Quantization

Post-training quantization to float16 was applied with PyTorch to reduce the model size and improve inference efficiency.

## Repository Structure

```
.
├── config.json               # Model configuration (float16)
├── special_tokens_map.json   # Special-token definitions
├── tokenizer_config.json     # Tokenizer configuration
├── vocab.txt                 # Tokenizer vocabulary
├── model.safetensors         # Fine-tuned model weights
└── README.md                 # Model documentation
```

## Limitations

- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.

--------------------------------------------------------------------------------
/config.json:
--------------------------------------------------------------------------------
{
  "_name_or_path": "/content/fine_tuned_bert_model_spam_detection",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float16",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
--------------------------------------------------------------------------------
/special_tokens_map.json:
--------------------------------------------------------------------------------
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
--------------------------------------------------------------------------------
/tokenizer_config.json:
--------------------------------------------------------------------------------
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
"normalized": false, 23 | "rstrip": false, 24 | "single_word": false, 25 | "special": true 26 | }, 27 | "102": { 28 | "content": "[SEP]", 29 | "lstrip": false, 30 | "normalized": false, 31 | "rstrip": false, 32 | "single_word": false, 33 | "special": true 34 | }, 35 | "103": { 36 | "content": "[MASK]", 37 | "lstrip": false, 38 | "normalized": false, 39 | "rstrip": false, 40 | "single_word": false, 41 | "special": true 42 | } 43 | }, 44 | "clean_up_tokenization_spaces": true, 45 | "cls_token": "[CLS]", 46 | "do_basic_tokenize": true, 47 | "do_lower_case": true, 48 | "extra_special_tokens": {}, 49 | "mask_token": "[MASK]", 50 | "model_max_length": 512, 51 | "never_split": null, 52 | "pad_token": "[PAD]", 53 | "sep_token": "[SEP]", 54 | "strip_accents": null, 55 | "tokenize_chinese_chars": true, 56 | "tokenizer_class": "BertTokenizer", 57 | "unk_token": "[UNK]" 58 | } 59 | --------------------------------------------------------------------------------