EVM SemanticBytecode

A transformer model trained on EVM bytecode for smart contract analysis.

Model Description

This model was trained on EVM (Ethereum Virtual Machine) bytecode data from multiple blockchain networks. It learns meaningful representations of smart contract bytecode, transaction calldata, and contract creation code.

Intended Uses

  • Bytecode analysis: Understanding smart contract structure and patterns
  • Smart contract embeddings: Generating vector representations for similarity search
  • On-chain data analysis: Encoding transaction calldata for downstream tasks
  • Security research: Analyzing bytecode patterns for vulnerability detection

Training Data

The model was trained on the following data:

  • Chains: Ethereum, Optimism, Base, Unichain, Bera, BSC
  • Data types: Contract calldata, creation bytecode, deployed bytecode
  • Training date: 2026-01-01

Training Procedure

  • Architecture: DeBERTaV2
  • Objective: Masked Language Modeling (MLM)
  • Vocab size: 32,768 (byte-level BPE)
  • Max length: 512 tokens
  • Masking probability: 30%
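
The 30% masking can be reproduced with the standard Hugging Face MLM collator. The snippet below is only a minimal sketch of that setup (it assumes the released tokenizer defines a mask token, which an MLM checkpoint normally does); it is not the original training script.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")

# Randomly select 30% of input tokens for the MLM objective
# (standard 80/10/10 mask/random/keep split), mirroring the
# masking probability listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,
)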

Model Configuration

{
  "hidden_size": 256,
  "num_hidden_layers": 6,
  "num_attention_heads": 4,
  "vocab_size": 32768
}
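
For reference, the same hyperparameters map onto a transformers configuration object as shown below. This is an illustrative sketch only; the checkpoint ships its own config.json, and max_position_embeddings is inferred from the 512-token max length above rather than taken from the JSON.

from transformers import DebertaV2Config

# Values copied from the JSON block above; max_position_embeddings is assumed.
config = DebertaV2Config(
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    vocab_size=32768,
    max_position_embeddings=512,
)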

How to Use

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")
model = AutoModel.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")

# This tokenizer uses byte-level BPE (GPT-2 style).
# You must convert raw bytes to the GPT-2 unicode format:

def bytes_to_unicode_map():
    """GPT-2 byte-to-unicode mapping."""
    bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

BYTE_MAP = bytes_to_unicode_map()

def encode_evm_bytes(data: bytes) -> str:
    """Convert raw EVM bytes to tokenizer input string."""
    return "".join(BYTE_MAP[b] for b in data)

# Example: encode raw EVM bytecode
bytecode_hex = "608060405234801561001057600080fd5b50"
raw_bytes = bytes.fromhex(bytecode_hex)
tokenizer_input = encode_evm_bytes(raw_bytes)

inputs = tokenizer(tokenizer_input, return_tensors="pt")
outputs = model(**inputs)

# Get embeddings (mean pooling)
embeddings = outputs.last_hidden_state  # [batch, seq_len, hidden_size]
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)
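
The pooled vectors can feed similarity search directly. The sketch below reuses encode_evm_bytes, tokenizer, and model from above; the second bytecode string (other_hex) is only a placeholder for illustration.

import torch
import torch.nn.functional as F

def embed(bytecode_hex: str) -> torch.Tensor:
    """Mean-pooled embedding for a hex-encoded bytecode string."""
    text = encode_evm_bytes(bytes.fromhex(bytecode_hex))
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    m = enc["attention_mask"].unsqueeze(-1)
    return (hidden * m).sum(1) / m.sum(1)

# Placeholder second contract, for illustration only.
other_hex = "6080604052600080fd"
similarity = F.cosine_similarity(embed(bytecode_hex), embed(other_hex))
print(similarity.item())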

Limitations

  • Trained on specific EVM chains; may not generalize to all EVM variants
  • Max context: 512 tokens (~256 bytes of bytecode); for longer contracts, see the chunking sketch after this list
  • Optimized for EVM bytecode patterns; not intended for natural language
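
For contracts that exceed the context window, one workaround is to window the raw bytes and average the per-chunk embeddings. This is a sketch under the assumption that chunk-averaging is acceptable for the downstream task; the 256-byte window size is a guess, not a tuned value. It reuses the helpers from How to Use.

import torch

def embed_long(raw: bytes, window: int = 256) -> torch.Tensor:
    """Embed long bytecode by averaging per-chunk mean-pooled embeddings."""
    chunks = [raw[i:i + window] for i in range(0, len(raw), window)] or [b""]
    pooled = []
    for chunk in chunks:
        enc = tokenizer(encode_evm_bytes(chunk), return_tensors="pt",
                        truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state
        m = enc["attention_mask"].unsqueeze(-1)
        pooled.append((hidden * m).sum(1) / m.sum(1))
    return torch.stack(pooled).mean(0)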

License

This model is released under the GNU Affero General Public License v3.0.

Citation

@misc{evm-semanticbytecode,
  author = {evm-alpha},
  title = {EVM SemanticBytecode},
  year = {2026},
  url = {https://huggingface.co/evm-alpha/semantic-evm-mlm-chkp1000}
}