EVM SemanticBytecode

A transformer model trained on EVM bytecode for smart contract analysis.

Model Description

This model was trained on EVM (Ethereum Virtual Machine) bytecode data from multiple blockchain networks. It learns meaningful representations of smart contract bytecode, transaction calldata, and contract creation code.

Intended Uses

  • Bytecode analysis: Understanding smart contract structure and patterns
  • Smart contract embeddings: Generating vector representations for similarity search
  • On-chain data analysis: Encoding transaction calldata for downstream tasks
  • Security research: Analyzing bytecode patterns for vulnerability detection

Training Data

The model was trained on the following data:

  • Chains: Ethereum, Optimism, Base, Unichain, Bera, BSC
  • Data types: Contract calldata, creation bytecode, deployed bytecode
  • Training date: 2026-01-01

Training Procedure

  • Architecture: DeBERTaV2
  • Objective: Masked Language Modeling (MLM)
  • Vocab size: 32,768 (byte-level BPE)
  • Max length: 512 tokens
  • Masking probability: 30%
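
The 30% masking can be reproduced with the standard Hugging Face MLM collator. The snippet below is only a minimal sketch of that setup (it assumes the released tokenizer defines a mask token, which an MLM checkpoint normally does); it is not the original training script.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")

# Randomly select 30% of input tokens for the MLM objective
# (standard 80/10/10 mask/random/keep split), mirroring the
# masking probability listed above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,
)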

Model Configuration

{
  "hidden_size": 256,
  "num_hidden_layers": 6,
  "num_attention_heads": 4,
  "vocab_size": 32768
}
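
For reference, the same hyperparameters map onto a transformers configuration object as shown below. This is an illustrative sketch only; the checkpoint ships its own config.json, and max_position_embeddings is inferred from the 512-token max length above rather than taken from the JSON.

from transformers import DebertaV2Config

# Values copied from the JSON block above; max_position_embeddings is assumed.
config = DebertaV2Config(
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    vocab_size=32768,
    max_position_embeddings=512,
)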

How to Use

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")
model = AutoModel.from_pretrained("evm-alpha/semantic-evm-mlm-chkp1000")

# This tokenizer uses byte-level BPE (GPT-2 style).
# You must convert raw bytes to the GPT-2 unicode format:

def bytes_to_unicode_map():
    """GPT-2 byte-to-unicode mapping."""
    bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

BYTE_MAP = bytes_to_unicode_map()

def encode_evm_bytes(data: bytes) -> str:
    """Convert raw EVM bytes to tokenizer input string."""
    return "".join(BYTE_MAP[b] for b in data)

# Example: encode raw EVM bytecode
bytecode_hex = "608060405234801561001057600080fd5b50"
raw_bytes = bytes.fromhex(bytecode_hex)
tokenizer_input = encode_evm_bytes(raw_bytes)

inputs = tokenizer(tokenizer_input, return_tensors="pt")
outputs = model(**inputs)

# Get embeddings (mean pooling)
embeddings = outputs.last_hidden_state  # [batch, seq_len, hidden_size]
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)
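
The pooled vectors can feed similarity search directly. The sketch below reuses encode_evm_bytes, tokenizer, and model from above; the second bytecode string (other_hex) is only a placeholder for illustration.

import torch
import torch.nn.functional as F

def embed(bytecode_hex: str) -> torch.Tensor:
    """Mean-pooled embedding for a hex-encoded bytecode string."""
    text = encode_evm_bytes(bytes.fromhex(bytecode_hex))
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    m = enc["attention_mask"].unsqueeze(-1)
    return (hidden * m).sum(1) / m.sum(1)

# Placeholder second contract, for illustration only.
other_hex = "6080604052600080fd"
similarity = F.cosine_similarity(embed(bytecode_hex), embed(other_hex))
print(similarity.item())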

Limitations

  • Trained on specific EVM chains; may not generalize to all EVM variants
  • Max context: 512 tokens (~256 bytes of bytecode); for longer contracts, see the chunking sketch after this list
  • Optimized for EVM bytecode patterns; not intended for natural language
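
For contracts that exceed the context window, one workaround is to window the raw bytes and average the per-chunk embeddings. This is a sketch under the assumption that chunk-averaging is acceptable for the downstream task; the 256-byte window size is a guess, not a tuned value. It reuses the helpers from How to Use.

import torch

def embed_long(raw: bytes, window: int = 256) -> torch.Tensor:
    """Embed long bytecode by averaging per-chunk mean-pooled embeddings."""
    chunks = [raw[i:i + window] for i in range(0, len(raw), window)] or [b""]
    pooled = []
    for chunk in chunks:
        enc = tokenizer(encode_evm_bytes(chunk), return_tensors="pt",
                        truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state
        m = enc["attention_mask"].unsqueeze(-1)
        pooled.append((hidden * m).sum(1) / m.sum(1))
    return torch.stack(pooled).mean(0)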

License

This model is released under the GNU Affero General Public License v3.0.

Citation

@misc{evm-semanticbytecode,
  author = {evm-alpha},
  title = {EVM SemanticBytecode},
  year = {2026},
  url = {https://huggingface.co/evm-alpha/semantic-evm-mlm-chkp1000}
}