IntelliTeX: Natural Language → LaTeX (Experimental)

Model summary

IntelliTeX is an experimental Small Language Model (SLM) study on converting spoken-style English math descriptions into a single LaTeX equation. It is intended as a research artifact (training regimes, decoding constraints, stress tests), not a production-ready LaTeX authoring system.

  • Base model: Salesforce/codet5p-220m (CodeT5+ 220M)
  • Primary task: text → LaTeX equation generation (single equation output)
  • Primary language: English

What the model is for

Intended use

  • Drafting LaTeX equations from short natural-language descriptions
  • Prototyping or benchmarking compact models on domain-specific translation

Not recommended

  • Fully automated formula generation without verification

Training approach (experimental study)

We evaluated multiple training configurations to understand what improves a compact model most (a minimal fine-tuning sketch follows the list):

  1. LoRA fine-tuning: rapid iteration and capability checks
  2. Full-parameter fine-tuning (FPFT): to measure the performance ceiling (LoRA often underperformed FPFT)
  3. Two-stage pipeline (continued pretraining → FPFT) inspired by CodeT5+ training recipes:
    • Stage 1: domain-adaptive continued pretraining on TeXTeller with span-denoising + causal LM objectives (~4B tokens, ~76k steps)
    • Stage 2: supervised FPFT on Speech2LaTeX text→LaTeX pairs
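
For orientation, below is a minimal sketch of a LoRA setup for the base CodeT5+ 220M checkpoint. The hyperparameters (rank, alpha, dropout) are illustrative assumptions, not the exact settings from our report; FPFT uses the same seq2seq objective but updates all weights instead of adapter matrices.

# Minimal LoRA sketch for CodeT5+ 220M (illustrative hyperparameters).
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base_id = "Salesforce/codet5p-220m"
model = AutoModelForSeq2SeqLM.from_pretrained(base_id)

# CodeT5+ 220M follows the T5 layout, so the attention projections are named "q" and "v".
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,               # assumed adapter rank
    lora_alpha=32,      # assumed scaling factor
    lora_dropout=0.05,  # assumed dropout
    target_modules=["q", "v"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of the 220M weights are trainable

Training then proceeds with standard teacher-forced seq2seq fine-tuning on the text→LaTeX pairs.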

Experiment Results

  • Full-parameter fine-tuning (FPFT) was the largest single driver of gains in our experiments: FPFT CodeT5+ 220M reached EM 0.467, roughly 4× higher than Qwen2.5-Coder 32B Instruct (EM 0.121) under the same evaluation setup.
  • On the main Speech2LaTeX (S2L) benchmark, FPFT CodeT5+ 220M outperformed the larger FPFT Qwen2.5-Coder 0.5B baseline (EM 0.467 vs. 0.405), indicating that training regime and architecture can matter more than parameter count for this task.
  • Stage 1 (domain-adaptive continued pretraining) primarily improved robustness rather than average-case performance: it did not materially change EM on the main S2L test set (0.463 vs. 0.467) but helped more under the stress conditions below.
  • On MathBridge stress tests, CodeT5+ 220M with Stage 1 + FPFT closely matched a much larger 3B comparator on long-context and long-target subsets, and outperformed the model with only FPFT.

Main benchmark (Speech2LaTeX test set)

  • Qwen2.5-Coder 32B (base instruct, no fine-tuning): EM 0.121
  • FPFT Qwen2.5-Coder 0.5B: EM 0.405
  • Stage 1 + FPFT CodeT5+ 220M: EM 0.463
  • FPFT CodeT5+ 220M: EM 0.467
  • FPFT Qwen2.5-Coder 3B: EM 0.507

Stress tests (MathBridge subsets)

  • Long-context inputs (source length > 115 chars):
    • FPFT CodeT5+ 220M: EM 0.150
    • Stage 1 + FPFT CodeT5+ 220M: EM 0.195
    • FPFT Qwen2.5-Coder 3B: EM 0.209
  • Long-target outputs (target length > 60 chars):
    • FPFT CodeT5+ 220M: EM 0.049
    • FPFT Qwen2.5-Coder 3B: EM 0.070
    • Stage 1 + FPFT CodeT5+ 220M: EM 0.076

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "duanxianpi/IntelliTeX"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The model expects the instruction prefix followed by the spoken-style description.
text = "the integral from zero to one of x squared dx"
prompt = f"Convert natural-language math into a STRICT LaTeX equation\n{text}"

inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_length=512,  # upper bound on the generated LaTeX length
)

print(tok.decode(out[0], skip_special_tokens=True))
# Output: $$\int_{0}^{1}x^{2}\,dx$$
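
The snippet above uses greedy decoding. Continuing from it (reusing model, tok, and inputs), beam search is a common alternative that can help on longer formulas; the beam width here is an assumption, not a tuned recommendation from our report.

# Optional: beam search instead of greedy decoding (illustrative setting).
out = model.generate(
    **inputs,
    max_length=512,
    num_beams=4,          # assumed beam width
    early_stopping=True,  # stop once all beams have finished
)
print(tok.decode(out[0], skip_special_tokens=True))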

Running in the browser with Transformers.js

A live, in-browser demo built with the Transformers.js library showcases the efficiency advantage of a 220M-parameter model on typical CPU hardware. The demo lets you compare two models: IntelliTeX (ours) and Qwen2.5-Coder-0.5B-Instruct.
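
Transformers.js loads ONNX weights; a minimal export sketch using Optimum is shown below. The exact export and quantization settings used for the demo are not documented here, and the output directory name is arbitrary, so treat this as an assumption.

# Hypothetical ONNX export so the checkpoint can be consumed by Transformers.js.
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "duanxianpi/IntelliTeX"
ort_model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
tok = AutoTokenizer.from_pretrained(model_id)

# Save the ONNX graph and tokenizer files for the browser demo to load.
ort_model.save_pretrained("intellitex-onnx")
tok.save_pretrained("intellitex-onnx")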

Full Evaluation Results

1. Comprehensive performance on the S2L test dataset (2745 samples)

| Model | Method | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| --- | --- | --- | --- | --- | --- |
| SmolLM2 (135M) | Base | 0.005 | 0.790 | 42.40 | 0.743 |
| SmolLM2 (135M) | Base + Grammar | 0.011 | 0.822 | 7.90 | 0.279 |
| SmolLM2 (135M) | LoRA | 0.126 | 0.957 | 0.90 | 0.823 |
| SmolLM2 (135M) | LoRA + Grammar | 0.127 | 0.957 | 0.91 | 0.824 |
| SmolLM2 (360M) | Base | 0.107 | 0.695 | 9.38 | 0.802 |
| SmolLM2 (360M) | Base + Grammar | 0.142 | 0.760 | 10.00 | 0.812 |
| SmolLM2 (360M) | LoRA | 0.242 | 0.980 | 0.49 | 0.861 |
| SmolLM2 (360M) | LoRA + Grammar | 0.243 | 0.980 | 0.49 | 0.862 |
| CodeT5+ (220M) | Base | 0.000 | 0.921 | 96.01 | 0.725 |
| CodeT5+ (220M) | LoRA | 0.258 | 0.913 | 0.39 | 0.874 |
| CodeT5+ (220M) | FPFT | 0.467 | 0.982 | 0.22 | 0.912 |
| CodeT5+ (220M) | Stage 1 + FPFT (IntelliTeX) | 0.463 | 0.998 | 0.22 | 0.915 |
| Qwen2.5-Coder (0.5B) | Base | 0.161 | 0.974 | 1.27 | 0.830 |
| Qwen2.5-Coder (0.5B) | Base + Grammar | 0.160 | 0.978 | 1.27 | 0.831 |
| Qwen2.5-Coder (0.5B) | LoRA | 0.155 | 0.909 | 2.71 | 0.836 |
| Qwen2.5-Coder (0.5B) | LoRA + Grammar | 0.155 | 0.967 | 1.75 | 0.838 |
| Qwen2.5-Coder (0.5B) | FPFT | 0.405 | 0.990 | 0.24 | 0.902 |
| Qwen2.5-Coder (3B) | Base | 0.294 | 0.991 | 0.46 | 0.869 |
| Qwen2.5-Coder (3B) | Base + Grammar | 0.293 | 0.996 | 0.45 | 0.870 |
| Qwen2.5-Coder (3B) | FPFT | 0.507 | 0.997 | 0.18 | 0.919 |
| Qwen2.5-Coder (32B) | Base | 0.121 | 1.000 | 0.38 | 0.863 |

Note: EM = Exact Match, CR = Compilable Rate, CER = Character Error Rate. Base = Original Instruct Model, Grammar = Structured Decoding, Stage 1 = Domain-Adaptive Pre-training.
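
For reference, here is a minimal sketch of how EM and CER can be computed on a (prediction, reference) pair; this is a reconstruction of the standard definitions, not necessarily the exact evaluation script used for the report.

# Illustrative metric helpers (not the exact evaluation code).
def exact_match(pred: str, ref: str) -> bool:
    # EM: strict string equality after trimming surrounding whitespace.
    return pred.strip() == ref.strip()

def levenshtein(a: str, b: str) -> int:
    # Character-level edit distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(pred: str, ref: str) -> float:
    # CER: edit distance normalized by the reference length.
    return levenshtein(pred, ref) / max(len(ref), 1)

print(exact_match(r"$$x^{2}$$", r"$$x^{2}$$"))    # True
print(round(cer(r"$$x^{2}$$", r"$$x^{3}$$"), 3))  # 0.111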

2. Stress Test Analysis

Performance on Long Context Inputs (Source > 115 chars)

Demonstrates the model's ability to understand lengthy natural language descriptions.

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| --- | --- | --- | --- | --- |
| CodeT5+ (220M) | 0.150 | 0.967 | 0.219 | 0.868 |
| IntelliTeX (Stage 1 + FPFT) | 0.195 | 0.997 | 0.211 | 0.873 |
| Qwen2.5-Coder (0.5B) | 0.129 | 0.976 | 0.292 | 0.859 |
| Qwen2.5-Coder (3B) | 0.209 | 0.996 | 0.199 | 0.874 |

Performance on Long Sequence Generation (Target > 60 chars)

Demonstrates the model's ability to generate complex, long LaTeX formulas.

| Model (FPFT) | EM ↑ | CR ↑ | CER ↓ | TexBLEU ↑ |
| --- | --- | --- | --- | --- |
| CodeT5+ (220M) | 0.049 | 0.940 | 0.297 | 0.827 |
| IntelliTeX (Stage 1 + FPFT) | 0.076 | 0.991 | 0.312 | 0.828 |
| Qwen2.5-Coder (0.5B) | 0.037 | 0.967 | 0.394 | 0.816 |
| Qwen2.5-Coder (3B) | 0.070 | 0.988 | 0.350 | 0.822 |
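
The stress subsets are defined purely by the length thresholds above. A minimal sketch of how such subsets can be filtered, assuming MathBridge is already loaded as a list of (source, target) string pairs:

# Illustrative subset construction using the thresholds from the tables above.
pairs = [
    ("the integral from zero to one of x squared dx", r"$$\int_{0}^{1}x^{2}\,dx$$"),
    # ... remaining (source, target) pairs from MathBridge ...
]

long_context = [(s, t) for s, t in pairs if len(s) > 115]  # long-context inputs
long_target = [(s, t) for s, t in pairs if len(t) > 60]    # long-target outputs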