Introduction

Xmodel-2 is a 1.2-billion-parameter large language model designed specifically for reasoning tasks. Its architecture enables different model scales to share a unified set of hyperparameters, allowing for extensive experimentation on smaller models and seamless transfer of optimal configurations to larger models. To maximize training efficiency and stability, Xmodel-2 employs the WSD (Warmup-Stable-Decay) learning rate scheduler from MiniCPM. Pretrained on 1.5 trillion tokens from diverse sources, Xmodel-2 achieves state-of-the-art performance in complex reasoning and agent-based tasks, while maintaining low training costs. These results highlight the potential of efficient model design and training strategies in advancing reasoning capabilities. Model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/Xmodel-2
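
To give a concrete picture of the WSD idea, here is a minimal sketch of such a schedule in Python. It is illustrative only: the warmup, stable, and decay fractions and the linear decay shape are assumptions, not Xmodel-2's actual training configuration.

def wsd_lr(step, total_steps, max_lr, min_lr=0.0,
           warmup_frac=0.01, decay_frac=0.1):
    """Warmup-Stable-Decay: linear warmup, a long flat plateau at max_lr,
    then a short decay phase at the end of training.
    The fractions here are illustrative defaults, not Xmodel-2's settings."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:              # warmup phase
        return max_lr * step / max(1, warmup_steps)
    if step < stable_end:                # stable phase
        return max_lr
    progress = (step - stable_end) / max(1, decay_steps)
    return max_lr + (min_lr - max_lr) * progress   # decay phase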

For details, see the paper at https://huggingface.co/papers/2412.19638

Running inference with Xmodel-2 takes only a few lines of code, as shown below. Make sure you are using a recent version of transformers and an up-to-date environment.

import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the downloaded Xmodel-2 checkpoint
model_path = os.path.expanduser("/path/to/Xmodel-2")

# Load the model and tokenizer; trust_remote_code is required because
# Xmodel-2 ships its own model code with the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True
)

prompt = "Give me a short introduction to large language model."
messages = [{"role": "user", "content": prompt}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Stop generation at the chat-template delimiters
stop_tokens = ["<|im_end|>", "<|im_start|>"]

# Generate with nucleus sampling; stop_strings ends generation when a
# stop marker appears (this requires passing the tokenizer to generate)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
    stop_strings=stop_tokens,
    tokenizer=tokenizer
)

# Decode only the newly generated tokens, then strip any stop markers
output = tokenizer.decode(
    generated_ids[0][len(model_inputs.input_ids[0]):],
    skip_special_tokens=True
)

for stop_token in stop_tokens:
    output = output.replace(stop_token, "")

output = output.split("<|im_start|>")[0]
output = output.strip()

print("Generated Response:")
print(output) 

A possible response generated by this code:

Generated Response:
Large language models are advanced artificial intelligence systems that are trained on massive amounts of text data to generate human-like text. These models are typically trained on a large corpus of text data, such as books, articles, and websites, and are able to generate text that is coherent and contextually appropriate.

Large language models are often used in natural language processing (NLP) tasks, such as language translation, text summarization, and text generation. They are also used in a variety of other applications, such as chatbots, virtual assistants, and language learning tools.

Large language models are a key component of the field of artificial intelligence and are being used in a variety of industries and applications. They are a powerful tool for generating human-like text and are helping to transform the way that we interact with technology.

Evaluation

Commonsense Reasoning

We evaluate Xmodel-2 on a range of commonsense reasoning benchmarks using the Language Model Evaluation Harness: ARC-Challenge, ARC-Easy, BoolQ, HellaSwag, OpenBookQA, PiQA, SciQ, TriviaQA, and Winogrande. For fairness and reproducibility, all models were evaluated in the same environment, and we report raw accuracy.

Model ARC-c ARC-e BoolQ HellaSwag OpenBookQA PiQA SciQ Winogrande Avg
MobiLLama-1B 28.24 61.53 60.92 46.74 21.80 75.14 88.20 59.27 55.23
TinyLLaMA1.1-1.1B 30.97 61.66 55.99 46.70 25.20 72.63 89.30 59.43 55.24
OLMo-1B 28.67 63.34 61.74 46.97 25.00 75.03 87.00 59.98 55.97
OpenELM-1.1B 28.84 62.37 63.58 48.36 25.40 74.76 90.60 61.72 56.95
Llama-3.2-1B 31.31 65.36 63.73 47.78 26.40 74.48 91.50 61.01 57.70
MiniCPM-1.2B 36.86 70.29 67.92 49.91 23.60 74.43 91.80 60.77 59.45
Fox-1-1.6B 34.73 69.91 71.77 46.33 24.60 75.24 93.20 60.77 59.57
InternLM2.5-1.8B 35.24 66.37 79.82 46.99 22.00 73.29 94.90 62.67 60.16
Qwen2-1.5B 33.11 66.41 72.60 48.57 27.00 75.57 94.60 65.75 60.45
StableLM-2-zephyr-1.6B 36.52 66.79 80.00 53.26 26.80 74.86 88.00 64.09 61.29
SmolLM-1.7B 43.43 76.47 65.93 49.58 30.00 75.79 93.20 60.93 61.92
Qwen2.5-1.5B 41.21 75.21 72.97 50.15 31.80 75.90 94.30 63.61 63.14
DCLM-1B 41.30 74.79 71.41 53.59 32.20 76.93 94.00 66.22 63.81
Phi-1.5-1.3B 44.80 76.22 74.95 47.96 38.60 76.66 93.30 72.93 65.68
Xmodel-2-1.2B 39.16 71.55 74.65 47.45 29.20 74.81 93.60 63.93 61.79
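
For reference, an evaluation like this can be launched through the Language Model Evaluation Harness's Python API. The sketch below is hedged: the task list mirrors the table above, but the checkpoint path, batch size, and harness version are assumptions rather than the exact configuration used for these numbers.

import lm_eval

# Evaluate a local Hugging Face checkpoint on the commonsense suite
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/Xmodel-2,trust_remote_code=True",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
           "openbookqa", "piqa", "sciq", "winogrande"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)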

Complex Reasoning

To evaluate the complex reasoning abilities of Xmodel-2, we conducted tests using several well-established benchmarks, including GSM8K, MATH, BBH, MMLU, HumanEval, and MBPP. The first four benchmarks were assessed using the Language Model Evaluation Harness, while the last two were evaluated with the Code Generation LM Evaluation Harness.

Model GSM8K (5-shot) MATH (4-shot) BBH (3-shot) MMLU (0-shot) HumanEval (pass@1) MBPP (pass@1) Avg
OpenELM-1.1B 0.45 1.06 6.62 25.52 8.54 6.80 8.16
OLMo-1B 2.35 1.46 25.60 24.46 5.49 0.20 9.93
TinyLLaMA1.1-1.1B 2.50 1.48 25.57 25.35 1.83 3.40 10.02
MobiLLama-1B 1.97 1.54 25.76 25.26 7.93 5.40 11.31
DCLM-1B 4.93 2.14 30.70 46.43 8.54 6.80 16.59
Llama-3.2-1B 6.60 1.78 31.44 36.63 14.63 22.20 18.88
SmolLM-1.7B 7.51 3.18 29.21 27.73 21.34 31.80 20.13
Fox-1-1.6B 34.34 7.94 28.75 39.55 14.02 9.00 22.27
StableLM-2-zephyr-1.6B 41.32 10.12 32.71 41.30 25.61 19.40 28.41
Phi-1.5-1.3B 32.15 3.18 28.81 41.75 36.59 35.40 29.65
InternLM2.5-1.8B 27.90 16.68 41.76 46.30 27.40 29.60 31.61
MiniCPM-1.2B 40.11 10.98 35.42 43.99 43.90 36.80 35.20
Qwen2-1.5B 57.62 22.90 33.05 55.11 20.73 30.40 36.64
Qwen2.5-1.5B 62.40 28.28 43.99 59.72 5.49 40.00 39.98
Xmodel-2-1.2B 55.88 25.50 48.40 48.87 29.88 29.20 39.62
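
HumanEval and MBPP are scored with pass@1. As background on how pass@k is typically estimated, here is the standard unbiased estimator from the Codex paper; the sample counts in the example are illustrative and not necessarily the harness's exact configuration.

from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 1 correct solution out of 10 generated samples
print(pass_at_k(n=10, c=1, k=1))  # 0.1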

Agent Capabilities

We evaluate Xmodel-2's performance on four agent tasks using the ReAct prompting technique: HotpotQA, FEVER, AlfWorld, and WebShop. We use EM (Exact Match) as the evaluation metric for HotpotQA and FEVER, and success rate for AlfWorld and WebShop.

Model HotpotQA (EM) FEVER (EM) AlfWorld (success rate) WebShop (success rate) Avg
DCLM-1B 4.92 24.39 0.75 0.00 7.52
MobiLLama-1B 0.00 30.43 0.00 0.00 7.61
TinyLLama1.1-1.1B 2.11 28.77 0.00 0.20 7.77
OpenELM-1.1B 2.70 28.37 0.00 0.40 7.87
StableLM-2-zephyr-1.6B 1.44 20.81 8.96 2.20 8.35
SmolLM-1.7B 2.28 31.31 0.00 0.60 8.55
Fox-1-1.6B 5.37 30.88 0.00 0.60 9.21
Llama-3.2-1B 4.87 27.67 8.21 3.20 10.99
Qwen2.5-1.5B 13.53 27.58 5.97 0.60 11.92
MiniCPM-1.2B 11.00 36.57 1.60 1.00 12.52
InternLM2.5-1.8B 12.84 34.02 2.99 1.00 12.71
Xmodel-2-1.2B 13.70 40.00 0.78 2.20 14.21
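
For context, ReAct interleaves free-form reasoning ("Thought") with tool calls ("Action") and environment feedback ("Observation"). The sketch below is a simplified illustration of that loop: the prompt wording, the Search/Finish action set, the llm_generate and search_tool callables, and the parsing are assumptions, not the exact prompts or tools used in these evaluations.

REACT_PROMPT = """Answer the question by interleaving Thought, Action and Observation steps.
Available actions: Search[query], Finish[answer].

Question: {question}
"""

def react_episode(question, llm_generate, search_tool, max_steps=6):
    """Run a simplified ReAct loop: the model emits Thought/Action lines,
    the environment returns an Observation, and the transcript grows
    until the model issues Finish[...] or the step limit is reached."""
    transcript = REACT_PROMPT.format(question=question)
    for _ in range(max_steps):
        step = llm_generate(transcript)      # e.g. "Thought: ...\nAction: Search[Xmodel-2]"
        transcript += step + "\n"
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].rstrip("]")
        if "Search[" in step:
            query = step.split("Search[", 1)[1].rstrip("]")
            observation = search_tool(query)  # external tool call
            transcript += f"Observation: {observation}\n"
    return None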