├── README.md
├── config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt

/README.md:
--------------------------------------------------------------------------------
# BERT Base Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the BERT base uncased model, fine-tuned for spam detection. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

## Model Details

- **Model Architecture:** BERT Base Uncased
- **Task:** Spam Email Detection
- **Datasets:** Hugging Face's `mail_spam_ham_dataset` and `spam-mail`
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "AventIQ-AI/bert-spam-detection"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Move the model to GPU (if available) and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict_spam_quantized(text):
    """Predicts whether a given text is spam (1) or ham (0) using the quantized BERT model."""

    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

    # Move inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted label (0 = ham, 1 = spam)
    prediction = torch.argmax(outputs.logits, dim=1).item()

    return "Spam" if prediction == 1 else "Ham"


# Sample test messages
print(predict_spam_quantized("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."))
# Expected output: Spam

print(predict_spam_quantized("Hi, are we still on for lunch tomorrow at noon?"))
# Expected output: Ham
```

## 📊 Classification Report (Quantized Model - float16)

| Metric        | Class 0 (Non-Spam) | Class 1 (Spam) | Macro Avg | Weighted Avg |
|---------------|--------------------|----------------|-----------|--------------|
| **Precision** | 1.00               | 0.98           | 0.99      | 0.99         |
| **Recall**    | 0.99               | 0.99           | 0.99      | 0.99         |
| **F1-Score**  | 0.99               | 0.99           | 0.99      | 0.99         |
| **Accuracy**  | **99%**            | **99%**        | **99%**   | **99%**      |

### 🔍 **Observations**

✅ **Precision:** High (1.00 for non-spam, 0.98 for spam) → **few false positives**
✅ **Recall:** High (0.99 for both classes) → **few false negatives**
✅ **F1-Score:** **Near-perfect balance** between precision and recall

## Fine-Tuning Details

### Dataset

The Hugging Face `spam-mail` and `mail_spam_ham_dataset` datasets were combined for fine-tuning; together they contain both spam and ham (non-spam) examples.
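The exact Hugging Face Hub identifiers for these corpora are not given in this README, so the snippet below is only a minimal sketch of how two such corpora can be merged with the `datasets` library; the dataset IDs and the `text`/`label` column names are assumptions to be replaced with the real ones.

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder Hub IDs -- substitute the actual paths of the
# `spam-mail` and `mail_spam_ham_dataset` corpora.
spam_mail = load_dataset("spam-mail", split="train")
mail_spam_ham = load_dataset("mail_spam_ham_dataset", split="train")

# Assumes both corpora expose the same "text" column and an integer
# "label" column (0 = ham, 1 = spam); rename/remap columns first if not.
combined = concatenate_datasets([spam_mail, mail_spam_ham]).shuffle(seed=42)

# Hold out 10% of the combined data for evaluation.
splits = combined.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
```

The main thing to verify before concatenation is that both sources use the same label convention, since the classifier head maps 0 to ham and 1 to spam.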
### Training

- Number of epochs: 3
- Batch size: 8
- Evaluation strategy: epoch
- Learning rate: 2e-5

### Quantization

Post-training quantization to float16 was applied with PyTorch to reduce the model size and improve inference efficiency.

## Repository Structure

```
.
├── config.json               # Model configuration (float16)
├── special_tokens_map.json   # Special-token definitions
├── tokenizer_config.json     # Tokenizer configuration
├── vocab.txt                 # Tokenizer vocabulary
├── model.safetensors         # Fine-tuned model weights
└── README.md                 # Model documentation
```

## Limitations

- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.

--------------------------------------------------------------------------------
/config.json:
--------------------------------------------------------------------------------
{
  "_name_or_path": "/content/fine_tuned_bert_model_spam_detection",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float16",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
--------------------------------------------------------------------------------
/special_tokens_map.json:
--------------------------------------------------------------------------------
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
--------------------------------------------------------------------------------
/tokenizer_config.json:
--------------------------------------------------------------------------------
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
"normalized": false, 23 | "rstrip": false, 24 | "single_word": false, 25 | "special": true 26 | }, 27 | "102": { 28 | "content": "[SEP]", 29 | "lstrip": false, 30 | "normalized": false, 31 | "rstrip": false, 32 | "single_word": false, 33 | "special": true 34 | }, 35 | "103": { 36 | "content": "[MASK]", 37 | "lstrip": false, 38 | "normalized": false, 39 | "rstrip": false, 40 | "single_word": false, 41 | "special": true 42 | } 43 | }, 44 | "clean_up_tokenization_spaces": true, 45 | "cls_token": "[CLS]", 46 | "do_basic_tokenize": true, 47 | "do_lower_case": true, 48 | "extra_special_tokens": {}, 49 | "mask_token": "[MASK]", 50 | "model_max_length": 512, 51 | "never_split": null, 52 | "pad_token": "[PAD]", 53 | "sep_token": "[SEP]", 54 | "strip_accents": null, 55 | "tokenize_chinese_chars": true, 56 | "tokenizer_class": "BertTokenizer", 57 | "unk_token": "[UNK]" 58 | } 59 | --------------------------------------------------------------------------------