nahiar committed on
Commit b31c870 · verified · 1 Parent(s): 3c797c8

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +179 -3
  2. config.json +42 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +67 -0
  7. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,179 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - id
+ license: mit
+ tags:
+ - text-classification
+ - bert
+ - spam-detection
+ - indonesian
+ - twitter
+ - retrained
+ datasets:
+ - nahiar/mail_data
+ pipeline_tag: text-classification
+ inference: true
+ base_model: nahiar/spam-detection-bert-v2
+ model_type: bert
+ library_name: transformers
+ widget:
+ - text: "Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo"
+   example_title: "Ham Example"
+ - text: "Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296"
+   example_title: "Spam Example"
+ model-index:
+ - name: spam-detection-bert-v3
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: Mail Data Indonesian Spam Detection
+       type: csv
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.95
+     - name: F1 Score (Weighted)
+       type: f1
+       value: 0.95
+     - name: Precision (HAM)
+       type: precision
+       value: 0.98
+     - name: Recall (HAM)
+       type: recall
+       value: 0.96
+     - name: Precision (SPAM)
+       type: precision
+       value: 0.77
+     - name: Recall (SPAM)
+       type: recall
+       value: 0.85
+ ---
+
+ # Indonesian Spam Detection BERT
+
+ A BERT model for spam detection in Indonesian with **95% accuracy**. This v3 model was fine-tuned from the v2 model on an email dataset to improve performance on Indonesian content.
+
+ ## Quick Start
+
+ ```python
+ from transformers import pipeline
+
+ # The easiest way to use the model
+ classifier = pipeline(
+     "text-classification",
+     model="nahiar/spam-detection-bert-v3",
+     tokenizer="nahiar/spam-detection-bert-v3",
+ )
+
+ # Test with text
+ texts = [
+     "lacak hp hilang by no hp / imei lacak penipu/scammer/tabrak lari/terror/revengeporn sadap / hack / pulihkan akun",
+     "Senin, 21 Juli 2025, Samapta Polsek Ngaglik melaksanakan patroli stasioner balong jalan palagan donoharjo",
+     "Mari berkontribusi terhadap gerakan rakyat dengan membeli baju ini seharga Rp 160.000. Hubungi kami melalui WA 08977472296",
+ ]
+
+ results = classifier(texts)
+ for text, result in zip(texts, results):
+     print(f"Text: {text}")
+     print(f"Result: {result['label']} (confidence: {result['score']:.4f})")
+     print("---")
+ ```
+
+ ## Model Details
+
+ - **Base Model**: nahiar/spam-detection-bert-v2
+ - **Task**: Binary Text Classification (Spam vs Ham)
+ - **Language**: Indonesian (Bahasa Indonesia)
+ - **Model Size**: ~110M parameters
+ - **Max Sequence Length**: 512 tokens
+ - **Training Epochs**: 3
+ - **Batch Size**: 16
+ - **Learning Rate**: 2e-5
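+
+ As a quick local sanity check on the figures above (not part of the original model card), the parameter count and maximum sequence length can be read straight from the loaded checkpoint:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert-v3")
+
+ # Total parameter count (a BERT-base classifier with a 32k vocab lands at roughly 110M)
+ print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
+
+ # Maximum sequence length supported by the position embeddings
+ print(f"Max positions: {model.config.max_position_embeddings}")
+ ```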
+
+ ## Performance
+
+ | Metric               | HAM | SPAM | Overall |
+ | -------------------- | --- | ---- | ------- |
+ | Precision            | 98% | 77%  | 95%     |
+ | Recall               | 96% | 85%  | 95%     |
+ | F1-Score             | 97% | 81%  | 95%     |
+ | **Overall Accuracy** | -   | -    | **95%** |
+
+ ### Confusion Matrix
+
+ - True HAM correctly predicted: 953/988 (96%)
+ - True SPAM correctly predicted: 115/135 (85%)
+ - False Positives (HAM predicted as SPAM): 35
+ - False Negatives (SPAM predicted as HAM): 20
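+
+ The table above follows directly from these counts; a small worked check (added for illustration, not the original evaluation script):
+
+ ```python
+ # Counts from the confusion matrix above
+ tp_spam, fn_spam = 115, 20   # true SPAM predicted as SPAM / as HAM
+ tn_ham, fp_ham = 953, 35     # true HAM predicted as HAM / as SPAM
+
+ precision_spam = tp_spam / (tp_spam + fp_ham)    # 115 / 150   ~ 0.77
+ recall_spam = tp_spam / (tp_spam + fn_spam)      # 115 / 135   ~ 0.85
+ precision_ham = tn_ham / (tn_ham + fn_spam)      # 953 / 973   ~ 0.98
+ recall_ham = tn_ham / (tn_ham + fp_ham)          # 953 / 988   ~ 0.96
+ accuracy = (tp_spam + tn_ham) / (988 + 135)      # 1068 / 1123 ~ 0.95
+
+ print(precision_spam, recall_spam, precision_ham, recall_ham, accuracy)
+ ```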
+
+ ## Key Features
+
+ ✅ **Fine-tuned** from the v2 model on an email dataset
+ ✅ **Good accuracy** (95%) on spam detection in an Indonesian context
+ ✅ **Better handling** of spam email content
+ ✅ **Enhanced performance** on Indonesian email text
+ ✅ **Optimized** for Indonesian email and social media spam detection
+
+ ## Label Mapping
+
+ ```
+ 0: "HAM" (not spam)
+ 1: "SPAM" (spam)
+ ```
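+
+ Note that `config.json` in this repository stores the labels in lowercase (`ham` / `spam`), so that is what the pipeline returns as `label`. A minimal sketch (added for illustration) for reading the mapping from the checkpoint itself:
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("nahiar/spam-detection-bert-v3")
+ print(config.id2label)   # {0: 'ham', 1: 'spam'}
+ print(config.label2id)   # {'ham': 0, 'spam': 1}
+ ```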
+
+ ## Training Process
+
+ This model was retrained using:
+
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 2e-5
+ - **Epochs**: 3
+ - **Batch Size**: 16
+ - **Max Length**: 128 tokens
+ - **Train/Validation Split**: 80/20
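+
+ The original training script is not part of this commit; the sketch below is an assumed reconstruction using the hyperparameters listed above (AdamW is the `Trainer` default). The CSV path and the `text` / `label` column names are placeholders and may not match the actual mail_data schema.
+
+ ```python
+ import pandas as pd
+ from datasets import Dataset
+ from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
+                           Trainer, TrainingArguments)
+
+ # Hypothetical schema: one text column and one 0/1 label column
+ df = pd.read_csv("mail_data.csv")
+ dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=42)
+
+ tokenizer = AutoTokenizer.from_pretrained("nahiar/spam-detection-bert-v2")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "nahiar/spam-detection-bert-v2", num_labels=2)
+
+ def tokenize(batch):
+     return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
+
+ tokenized = dataset.map(tokenize, batched=True)
+
+ args = TrainingArguments(
+     output_dir="spam-detection-bert-v3",
+     learning_rate=2e-5,
+     num_train_epochs=3,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+ )
+
+ trainer = Trainer(model=model, args=args,
+                   train_dataset=tokenized["train"],
+                   eval_dataset=tokenized["test"])
+ trainer.train()
+ ```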
+
+ ## Usage Example
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("nahiar/spam-detection-bert-v3")
+ model = AutoModelForSequenceClassification.from_pretrained("nahiar/spam-detection-bert-v3")
+
+ def predict_spam(text):
+     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+     with torch.no_grad():  # inference only, no gradients needed
+         outputs = model(**inputs)
+     probs = torch.softmax(outputs.logits, dim=1)
+     predicted_label = torch.argmax(probs, dim=1).item()
+     confidence = probs[0][predicted_label].item()
+     label_map = {0: "HAM", 1: "SPAM"}
+     return label_map[predicted_label], confidence
+
+ # Test
+ text = "Dapatkan uang dengan mudah! Klik link ini sekarang!"
+ result, confidence = predict_spam(text)
+ print(f"Prediksi: {result} (Confidence: {confidence:.4f})")
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @misc{nahiar_spam_detection_bert,
+   title={Indonesian Spam Detection BERT},
+   author={Raihan Hidayatullah Djunaedi},
+   year={2025},
+   url={https://huggingface.co/nahiar/spam-detection-bert-v3}
+ }
+ ```
+
+ ## Changelog
+
+ ### Current Version v3 (August 2025)
+
+ - Fine-tuned from the v2 model with an email dataset (mail_data.csv)
+ - Enhanced handling for Indonesian spam email content
+ - Good performance (95% accuracy) on email spam detection
+ - Optimized for Indonesian email and social media content
+ - Improved with GPU-accelerated training on an RTX 3080
config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "finetuning_task": "text-classification",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "ham",
+     "1": "spam"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "ham": 0,
+     "spam": 1
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pipeline_tag": "text-classification",
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "task_specific_params": {
+     "text-classification": {
+       "num_labels": 2,
+       "problem_type": "single_label_classification"
+     }
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.54.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4bf720dc341cb0087ea8fc4346c5b03f3f38a352ed5e859310207d263afdc740
+ size 442499064
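
The three lines above are a Git LFS pointer, not the weights themselves. A small sketch (added for illustration) for fetching the actual ~442 MB file with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Downloads the real model.safetensors; the LFS pointer only records its hash and size
path = hf_hub_download(repo_id="nahiar/spam-detection-bert-v3", filename="model.safetensors")
print(path)
```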
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,67 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "full_tokenizer_file": null,
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "pipeline_tag": "text-classification",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff