---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---

# SmartBERT V2 CodeBERT

![SmartBERT](./framework.png)

## Overview

SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**. It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.

The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective. This domain-adaptive pretraining enables the model to better capture **semantic patterns, structure, and intent** within smart contract functions compared to general-purpose code models.

SmartBERT V2 can be used for tasks such as:

- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval

SmartBERT V2 is a pre-trained model specifically developed for **[SmartIntent V2](https://github.com/web3se-lab/web3-sekit)**. It was trained on **16,000 smart contracts**, with no overlap with the SmartIntent V2 evaluation dataset, to avoid data leakage.

For production use or general smart contract representation tasks, we recommend **SmartBERT V3**: https://huggingface.co/web3se/SmartBERT-v3

---

## Training Data

SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.

To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of smart contract functions.
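The function-level split described above can be approximated with a toy regex sketch. The helper name `extract_functions` and the regex are illustrative assumptions: a real pipeline would use a proper Solidity parser (AST-based) rather than pattern matching, and this sketch only handles simple, non-nested function bodies.

```python
import re

# Toy sketch: pull top-level function definitions out of Solidity source.
# Handles plain bodies and interface-style signatures ending in ";".
# Assumption for illustration only -- no nested braces, no modifiers with blocks.
FUNC_RE = re.compile(r"function\s+\w+\s*\([^)]*\)[^{;]*(?:\{[^{}]*\}|;)")

def extract_functions(source: str) -> list[str]:
    """Return each function definition (or signature) as a separate string."""
    return [m.group(0) for m in FUNC_RE.finditer(source)]

contract = """
contract Token {
    function totalSupply() external view returns (uint256);
    function transfer(address to, uint256 value) public returns (bool) {
        return true;
    }
}
"""

for fn in extract_functions(contract):
    print(fn.splitlines()[0])
```

Each extracted function string would then go through preprocessing and tokenization as an independent training sample.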
For benchmarking purposes in **SmartIntent V2**, the pretraining corpus was intentionally limited to this **16,000-contract dataset**. The **evaluation dataset (4,000 smart contracts)** was strictly held out and **not included in the pretraining data**, ensuring that downstream evaluations remain unbiased and free from data leakage.

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**. This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.

---

## Base Model

SmartBERT V2 is initialized from:

- **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

CodeBERT is a transformer-based model trained on source code and natural language pairs. SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.

---

## Training Objective

The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.

During training:

- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
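As a rough illustration of the preprocessing and MLM steps above, the sketch below normalizes whitespace and then randomly hides ~15% of tokens. The helper names, the whitespace-level tokenization, and the seed are assumptions for illustration; the real pipeline masks at the subword level via the RoBERTa tokenizer and HuggingFace's MLM data collator.

```python
import random
import re

def normalize(code: str) -> str:
    """Replace each newline/tab with a single space, as in preprocessing."""
    return re.sub(r"[\n\t]", " ", code)

def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=1):
    """Toy MLM masking: hide ~15% of tokens; the model must recover the labels.
    (The real collator also applies the 80/10/10 mask/random/keep split.)"""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)          # loss is computed at this position
        else:
            masked.append(tok)
            labels.append("<ignore>")   # no loss at unmasked positions
    return masked, labels

code = "function totalSupply()\n\texternal view returns (uint256);"
tokens = normalize(code).split()
masked, labels = mask_tokens(tokens)
print(masked)
```

The model sees only the masked sequence and is trained to reconstruct the hidden tokens from the surrounding code context.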
---

## Training Setup

Training was conducted using the **HuggingFace Transformers** framework with the following configuration:

- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** ~10 hours
- **Training Dataset:** 16,000 smart contracts
- **Evaluation Dataset:** 4,000 smart contracts

Example training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```

---

## Evaluation

The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.

SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.

---

## How to Use

You can load SmartBERT V2 using the **HuggingFace Transformers** library.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```

Mean pooling is often recommended when using the model for **code representation or similarity tasks**.
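When embedding a padded batch of functions rather than a single string, the plain mean above would also average over padding tokens. A minimal NumPy sketch of attention-mask-weighted mean pooling and cosine similarity; the toy arrays stand in for `outputs.last_hidden_state` and `attention_mask` and are assumptions for illustration.

```python
import numpy as np

# Toy stand-ins for model outputs on a padded batch of two functions:
# shape (batch, seq_len, hidden). Real values come from outputs.last_hidden_state.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
# 1 = real token, 0 = padding (as returned by tokenizer(..., padding=True)).
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 1, 1]])

def masked_mean_pool(hidden_states, mask):
    """Average token embeddings, ignoring padding positions."""
    mask = mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = masked_mean_pool(hidden, attention_mask)
print(cosine_similarity(emb[0], emb[1]))
```

The same pooling translates directly to PyTorch tensors by replacing the NumPy calls with their `torch` equivalents.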
---

## GitHub Repository

To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository: [https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)

---

## Citation

If you use **SmartBERT** in your research, please cite:

```tex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```

---

## Acknowledgement

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)