nahiar committed on
Commit b31c870 · verified · 1 Parent(s): 3c797c8

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +179 -3
  2. config.json +42 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +67 -0
  7. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,179 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - id
+ license: mit
+ tags:
+ - text-classification
+ - bert
+ - spam-detection
+ - indonesian
+ - twitter
+ - retrained
+ datasets:
+ - nahiar/mail_data
+ pipeline_tag: text-classification
+ inference: true
+ base_model: nahiar/spam-detection-bert-v2
+ model_type: bert
+ library_name: transformers
+ widget:
+ - text: "Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo"
+   example_title: "Ham Example"
+ - text: "Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296"
+   example_title: "Spam Example"
+ model-index:
+ - name: spam-detection-bert-v3
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: Mail Data Indonesian Spam Detection
+       type: csv
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.95
+     - name: F1 Score (Weighted)
+       type: f1
+       value: 0.95
+     - name: Precision (HAM)
+       type: precision
+       value: 0.98
+     - name: Recall (HAM)
+       type: recall
+       value: 0.96
+     - name: Precision (SPAM)
+       type: precision
+       value: 0.77
+     - name: Recall (SPAM)
+       type: recall
+       value: 0.85
+ ---
+
+ # Indonesian Spam Detection BERT
+
+ A BERT model for spam detection in Indonesian with **95% accuracy**. This v3 model was fine-tuned from the v2 model on an email dataset to improve performance on Indonesian content.
+
+ ## Quick Start
+
+ ```python
+ from transformers import pipeline
+
+ # The easiest way to use the model
+ classifier = pipeline(
+     "text-classification",
+     model="nahiar/spam-detection-bert-v3",
+     tokenizer="nahiar/spam-detection-bert-v3",
+ )
+
+ # Test with text
+ texts = [
+     "lacak hp hilang by no hp / imei lacak penipu/scammer/tabrak lari/terror/revengeporn sadap / hack / pulihkan akun",
+     "Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo",
+     "Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296",
+ ]
+
+ results = classifier(texts)
+ for text, result in zip(texts, results):
+     print(f"Text: {text}")
+     print(f"Result: {result['label']} (confidence: {result['score']:.4f})")
+     print("---")
+ ```
+
+ ## Model Details
+
+ - **Base Model**: nahiar/spam-detection-bert-v2
+ - **Task**: Binary Text Classification (Spam vs Ham)
+ - **Language**: Indonesian (Bahasa Indonesia)
+ - **Model Size**: ~110M parameters
+ - **Max Sequence Length**: 512 tokens
+ - **Training Epochs**: 3
+ - **Batch Size**: 16
+ - **Learning Rate**: 2e-5
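+
+ As a quick local sanity check on the figures above (not part of the original model card), the parameter count and maximum sequence length can be read straight from the loaded checkpoint:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert-v3")
+
+ # Total parameter count (a BERT-base classifier with a 32k vocab lands at roughly 110M)
+ print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
+
+ # Maximum sequence length supported by the position embeddings
+ print(f"Max positions: {model.config.max_position_embeddings}")
+ ```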
+
+ ## Performance
+
+ | Metric               | HAM | SPAM | Overall |
+ | -------------------- | --- | ---- | ------- |
+ | Precision            | 98% | 77%  | 95%     |
+ | Recall               | 96% | 85%  | 95%     |
+ | F1-Score             | 97% | 81%  | 95%     |
+ | **Overall Accuracy** | -   | -    | **95%** |
+
+ ### Confusion Matrix
+
+ - True HAM correctly predicted: 953/988 (96%)
+ - True SPAM correctly predicted: 115/135 (85%)
+ - False Positives (HAM predicted as SPAM): 35
+ - False Negatives (SPAM predicted as HAM): 20
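+
+ The table above follows directly from these counts; a small worked check (added for illustration, not the original evaluation script):
+
+ ```python
+ # Counts from the confusion matrix above
+ tp_spam, fn_spam = 115, 20   # true SPAM predicted as SPAM / as HAM
+ tn_ham, fp_ham = 953, 35     # true HAM predicted as HAM / as SPAM
+
+ precision_spam = tp_spam / (tp_spam + fp_ham)    # 115 / 150   ~ 0.77
+ recall_spam = tp_spam / (tp_spam + fn_spam)      # 115 / 135   ~ 0.85
+ precision_ham = tn_ham / (tn_ham + fn_spam)      # 953 / 973   ~ 0.98
+ recall_ham = tn_ham / (tn_ham + fp_ham)          # 953 / 988   ~ 0.96
+ accuracy = (tp_spam + tn_ham) / (988 + 135)      # 1068 / 1123 ~ 0.95
+
+ print(precision_spam, recall_spam, precision_ham, recall_ham, accuracy)
+ ```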
+
+ ## Key Features
+
+ ✅ **Fine-tuned** from the v2 model on an email dataset
+ ✅ **Good accuracy** (95%) on spam detection in an Indonesian context
+ ✅ **Better handling** of spam email content
+ ✅ **Enhanced performance** on Indonesian email text
+ ✅ **Optimized** for Indonesian email and social media spam detection
+
+ ## Label Mapping
+
+ ```
+ 0: "HAM" (not spam)
+ 1: "SPAM" (spam)
+ ```
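+
+ Note that `config.json` in this repository stores the labels in lowercase (`ham` / `spam`), so that is what the pipeline returns as `label`. A minimal sketch (added for illustration) for reading the mapping from the checkpoint itself:
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("nahiar/spam-detection-bert-v3")
+ print(config.id2label)   # {0: 'ham', 1: 'spam'}
+ print(config.label2id)   # {'ham': 0, 'spam': 1}
+ ```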
+
+ ## Training Process
+
+ This model was retrained using:
+
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 2e-5
+ - **Epochs**: 3
+ - **Batch Size**: 16
+ - **Max Length**: 128 tokens
+ - **Train/Validation Split**: 80/20
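+
+ The original training script is not part of this commit; the sketch below is an assumed reconstruction using the hyperparameters listed above (AdamW is the `Trainer` default). The CSV path and the `text` / `label` column names are placeholders and may not match the actual mail_data schema.
+
+ ```python
+ import pandas as pd
+ from datasets import Dataset
+ from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
+                           Trainer, TrainingArguments)
+
+ # Hypothetical schema: one text column and one 0/1 label column
+ df = pd.read_csv("mail_data.csv")
+ dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=42)
+
+ tokenizer = AutoTokenizer.from_pretrained("nahiar/spam-detection-bert-v2")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "nahiar/spam-detection-bert-v2", num_labels=2)
+
+ def tokenize(batch):
+     return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
+
+ tokenized = dataset.map(tokenize, batched=True)
+
+ args = TrainingArguments(
+     output_dir="spam-detection-bert-v3",
+     learning_rate=2e-5,
+     num_train_epochs=3,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+ )
+
+ trainer = Trainer(model=model, args=args,
+                   train_dataset=tokenized["train"],
+                   eval_dataset=tokenized["test"])
+ trainer.train()
+ ```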
+
+ ## Usage Example
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("nahiar/spam-detection-bert-v3")
+ model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert-v3")
+
+ def predict_spam(text):
+     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+     with torch.no_grad():  # inference only, no gradients needed
+         outputs = model(**inputs)
+     probs = torch.softmax(outputs.logits, dim=1)
+     predicted_label = torch.argmax(probs, dim=1).item()
+     confidence = probs[0][predicted_label].item()
+     label_map = {0: "HAM", 1: "SPAM"}
+     return label_map[predicted_label], confidence
+
+ # Test
+ text = "Dapatkan uang dengan mudah! Klik link ini sekarang!"
+ result, confidence = predict_spam(text)
+ print(f"Prediksi: {result} (Confidence: {confidence:.4f})")
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @misc{nahiar_spam_detection_bert,
+   title={Indonesian Spam Detection BERT},
+   author={Raihan Hidayatullah Djunaedi},
+   year={2025},
+   url={https://huggingface.co/nahiar/spam-detection-bert-v3}
+ }
+ ```
+
+ ## Changelog
+
+ ### Current Version v3 (August 2025)
+
+ - Fine-tuned from the v2 model with an email dataset (mail_data.csv)
+ - Enhanced handling for Indonesian spam email content
+ - Good performance (95% accuracy) on email spam detection
+ - Optimized for Indonesian email and social media content
+ - Improved with GPU-accelerated training on an RTX 3080
config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "finetuning_task": "text-classification",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "ham",
+     "1": "spam"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "ham": 0,
+     "spam": 1
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pipeline_tag": "text-classification",
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "task_specific_params": {
+     "text-classification": {
+       "num_labels": 2,
+       "problem_type": "single_label_classification"
+     }
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.54.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4bf720dc341cb0087ea8fc4346c5b03f3f38a352ed5e859310207d263afdc740
+ size 442499064
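
The three lines above are a Git LFS pointer, not the weights themselves. A small sketch (added for illustration) for fetching the actual ~442 MB file with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Downloads the real model.safetensors; the LFS pointer only records its hash and size
path = hf_hub_download(repo_id="nahiar/spam-detection-bert-v3", filename="model.safetensors")
print(path)
```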
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,67 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "pipeline_tag": "text-classification",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff