---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---

# SmartBERT V2 CodeBERT

![SmartBERT](./framework.png)

## Overview

SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**. It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.

The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective. This domain-adaptive pretraining enables the model to better capture **semantic patterns, structure, and intent** within smart contract functions compared to general-purpose code models.

SmartBERT V2 can be used for tasks such as:

- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval

SmartBERT V2 is a pre-trained model specifically developed for **[SmartIntent V2](https://github.com/web3se-lab/web3-sekit)**. It was trained on **16,000 smart contracts**, with no overlap with the SmartIntent V2 evaluation dataset, to avoid data leakage.

For production use or general smart contract representation tasks, we recommend **SmartBERT V3**: https://huggingface.co/web3se/SmartBERT-v3

---

## Training Data

SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.

To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of smart contract functions.
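The function-level split described above can be approximated with a toy regex sketch. The helper name `extract_functions` and the regex are illustrative assumptions: a real pipeline would use a proper Solidity parser (AST-based) rather than pattern matching, and this sketch only handles simple, non-nested function bodies.

```python
import re

# Toy sketch: pull top-level function definitions out of Solidity source.
# Handles plain bodies and interface-style signatures ending in ";".
# Assumption for illustration only -- no nested braces, no modifiers with blocks.
FUNC_RE = re.compile(r"function\s+\w+\s*\([^)]*\)[^{;]*(?:\{[^{}]*\}|;)")

def extract_functions(source: str) -> list[str]:
    """Return each function definition (or signature) as a separate string."""
    return [m.group(0) for m in FUNC_RE.finditer(source)]

contract = """
contract Token {
    function totalSupply() external view returns (uint256);
    function transfer(address to, uint256 value) public returns (bool) {
        return true;
    }
}
"""

for fn in extract_functions(contract):
    print(fn.splitlines()[0])
```

Each extracted function string would then go through preprocessing and tokenization as an independent training sample.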
For benchmarking purposes in **SmartIntent V2**, the pretraining corpus was intentionally limited to this **16,000-contract dataset**. The **evaluation dataset (4,000 smart contracts)** was strictly held out and **not included in the pretraining data**, ensuring that downstream evaluations remain unbiased and free from data leakage.

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**. This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.

---

## Base Model

SmartBERT V2 is initialized from:

- **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

CodeBERT is a transformer-based model trained on source code and natural language pairs. SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.

---

## Training Objective

The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.

During training:

- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
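As a rough illustration of the preprocessing and MLM steps above, the sketch below normalizes whitespace and then randomly hides ~15% of tokens. The helper names, the whitespace-level tokenization, and the seed are assumptions for illustration; the real pipeline masks at the subword level via the RoBERTa tokenizer and HuggingFace's MLM data collator.

```python
import random
import re

def normalize(code: str) -> str:
    """Replace each newline/tab with a single space, as in preprocessing."""
    return re.sub(r"[\n\t]", " ", code)

def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=1):
    """Toy MLM masking: hide ~15% of tokens; the model must recover the labels.
    (The real collator also applies the 80/10/10 mask/random/keep split.)"""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)          # loss is computed at this position
        else:
            masked.append(tok)
            labels.append("<ignore>")   # no loss at unmasked positions
    return masked, labels

code = "function totalSupply()\n\texternal view returns (uint256);"
tokens = normalize(code).split()
masked, labels = mask_tokens(tokens)
print(masked)
```

The model sees only the masked sequence and is trained to reconstruct the hidden tokens from the surrounding code context.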
---

## Training Setup

Training was conducted using the **HuggingFace Transformers** framework with the following configuration:

- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** ~10 hours
- **Training Dataset:** 16,000 smart contracts
- **Evaluation Dataset:** 4,000 smart contracts

Example training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```

---

## Evaluation

The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.

SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.

---

## How to Use

You can load SmartBERT V2 using the **HuggingFace Transformers** library.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```

Mean pooling is often recommended when using the model for **code representation or similarity tasks**.
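When embedding a padded batch of functions rather than a single string, the plain mean above would also average over padding tokens. A minimal NumPy sketch of attention-mask-weighted mean pooling and cosine similarity; the toy arrays stand in for `outputs.last_hidden_state` and `attention_mask` and are assumptions for illustration.

```python
import numpy as np

# Toy stand-ins for model outputs on a padded batch of two functions:
# shape (batch, seq_len, hidden). Real values come from outputs.last_hidden_state.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
# 1 = real token, 0 = padding (as returned by tokenizer(..., padding=True)).
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 1, 1]])

def masked_mean_pool(hidden_states, mask):
    """Average token embeddings, ignoring padding positions."""
    mask = mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = masked_mean_pool(hidden, attention_mask)
print(cosine_similarity(emb[0], emb[1]))
```

The same pooling translates directly to PyTorch tensors by replacing the NumPy calls with their `torch` equivalents.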
---

## GitHub Repository

To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository: [https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)

---

## Citation

If you use **SmartBERT** in your research, please cite:

```tex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```

---

## Acknowledgement

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)