EVM SemanticBytecode
A transformer model trained on EVM bytecode for smart contract analysis.
Model Description
This model was trained on EVM (Ethereum Virtual Machine) bytecode data from multiple blockchain networks. It learns meaningful representations of smart contract bytecode, transaction calldata, and contract creation code.
Intended Uses
- Bytecode analysis: Understanding smart contract structure and patterns
- Smart contract embeddings: Generating vector representations for similarity search
- On-chain data analysis: Encoding transaction calldata for downstream tasks
- Security research: Analyzing bytecode patterns for vulnerability detection
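For the security-research use case, one common pattern is to fine-tune a classification head on top of this encoder. The sketch below is illustrative only: the binary vulnerable/benign labels and any training data are assumptions, not part of this release.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")

# Load the MLM encoder with a freshly initialized classification head.
# num_labels=2 assumes a hypothetical vulnerable/benign labelling scheme.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "evm-alpha/semantic-evm-mlm-chkp1000",
    num_labels=2,
)
# Fine-tune on your own labelled bytecode (e.g. with transformers.Trainer);
# inputs must be byte-mapped as shown in "How to Use" below.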
Training Data
The model was trained on the following data:
- Chains: Ethereum, Optimism, Base, Unichain, Berachain, BNB Smart Chain (BSC)
- Data types: Transaction calldata, contract creation bytecode, deployed (runtime) bytecode
- Training date: 2026-01-01
Training Procedure
- Architecture: DeBERTaV2
- Objective: Masked Language Modeling (MLM)
- Vocab size: 32,768 (byte-level BPE)
- Max length: 512 tokens
- Masking probability: 30%
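The 30% masking rate can be reproduced with the standard Hugging Face data collator. This is a minimal sketch assuming the published tokenizer exposes a mask token; it illustrates the objective, not the exact original training pipeline.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")

# Mask 30% of tokens, matching the stated MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

# "example_text" stands in for a byte-mapped bytecode string (see "How to Use" below).
example_text = "placeholder"
batch = collator([tokenizer(example_text, truncation=True, max_length=512)])
print(batch["input_ids"].shape, batch["labels"].shape)  # masked inputs and MLM labels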
Model Configuration
{
  "hidden_size": 256,
  "num_hidden_layers": 6,
  "num_attention_heads": 4,
  "vocab_size": 32768
}
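For reference, the same configuration can be expressed with transformers' DebertaV2Config. The intermediate_size value below is an assumption (4x hidden size) that is not stated in the card; max_position_embeddings follows the 512-token limit above.

from transformers import DebertaV2Config, DebertaV2Model

config = DebertaV2Config(
    vocab_size=32768,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,        # assumption: 4x hidden_size, not stated in the card
    max_position_embeddings=512,   # matches the 512-token max length above
)
untrained_model = DebertaV2Model(config)  # randomly initialized; use from_pretrained for the trained weights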
How to Use
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")
model = AutoModel.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")
# This tokenizer uses byte-level BPE (GPT-2 style).
# You must convert raw bytes to the GPT-2 unicode format:
def bytes_to_unicode_map():
    """GPT-2 byte-to-unicode mapping."""
    bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

BYTE_MAP = bytes_to_unicode_map()

def encode_evm_bytes(data: bytes) -> str:
    """Convert raw EVM bytes to tokenizer input string."""
    return "".join(BYTE_MAP[b] for b in data)
# Example: encode raw EVM bytecode
bytecode_hex = "608060405234801561001057600080fd5b50"
raw_bytes = bytes.fromhex(bytecode_hex)
tokenizer_input = encode_evm_bytes(raw_bytes)
inputs = tokenizer(tokenizer_input, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get one embedding per contract via attention-masked mean pooling
embeddings = outputs.last_hidden_state                # [batch, seq_len, hidden_size]
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (embeddings * mask).sum(1) / mask.sum(1)     # [batch, hidden_size]
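For the similarity-search use case listed under Intended Uses, pooled embeddings can be compared with cosine similarity. The helper below is a sketch that reuses tokenizer, model, and encode_evm_bytes from the example above; both bytecode strings are short illustrative placeholders.

import torch.nn.functional as F

def embed(bytecode_hex: str) -> torch.Tensor:
    """Return one mean-pooled embedding for a hex-encoded bytecode string."""
    text = encode_evm_bytes(bytes.fromhex(bytecode_hex))
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc)
    m = enc["attention_mask"].unsqueeze(-1).float()
    return (out.last_hidden_state * m).sum(1) / m.sum(1)

emb_a = embed("608060405234801561001057600080fd5b50")
emb_b = embed("6080604052600436106100295760003560e01c")
print(F.cosine_similarity(emb_a, emb_b).item())  # closer to 1.0 = more similar bytecode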
Limitations
- Trained on specific EVM chains; may not generalize to all EVM variants
- Max context: 512 tokens; longer bytecode must be truncated or split into chunks (see the chunking sketch after this list)
- Optimized for EVM bytecode patterns; not intended for natural language
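Bytecode longer than the context window can still be embedded by chunking. The sketch below (reusing tokenizer, model, and encode_evm_bytes from "How to Use") splits the byte-mapped input into 512-token windows via the fast tokenizer's overflow feature and averages the per-chunk pooled embeddings; averaging is a pragmatic workaround, not an evaluated recommendation.

import torch

def embed_long(bytecode: bytes) -> torch.Tensor:
    """Embed bytecode longer than 512 tokens by averaging per-chunk embeddings."""
    enc = tokenizer(
        encode_evm_bytes(bytecode),
        max_length=512,
        truncation=True,
        return_overflowing_tokens=True,  # requires a fast tokenizer
        padding=True,
        return_tensors="pt",
    )
    enc.pop("overflow_to_sample_mapping", None)  # bookkeeping field, not a model input
    with torch.no_grad():
        out = model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    chunk_embs = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # [num_chunks, hidden]
    return chunk_embs.mean(0)                                         # [hidden]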
License
This model is released under the GNU Affero General Public License v3.0.
Citation
@misc{evm-semanticbytecode,
  author = {evm-alpha},
  title = {EVM SemanticBytecode},
  year = {2026},
  url = {https://huggingface.co/evm-alpha/semantic-evm-mlm-chkp1000}
}