Part of the Brie Model Family: This is the foundational model in our architecture comparison study. See also: Brie Qwen 2.5 3B | Brie Llama 3.2 3B

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation (Karman, 2025)

v2.0 (Jan 2026): Added theoretical framework (Li et al. 2025), corrected training config documentation

Brie Qwen 2.5 0.5B

LoRA adapter for Qwen/Qwen2.5-0.5B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work.

Part of a controlled study demonstrating an LLM-assisted data authoring methodology: the researcher authored 1,213 training examples through iterative discussions, using LLMs as authoring tools. The model achieves a 77% win rate on in-domain tasks (n=13), 71.9% on a comprehensive multi-domain suite (n=57), and 40% on out-of-domain tasks (n=15).

Model Details

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct (618M parameters)
  • Training Method: LoRA (Low-Rank Adaptation)
  • Training Data: 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools
  • Training Duration: 2 epochs (290 steps, ~5 hours on Apple M4 MacBook)
  • Training Cost: Negligible (consumer hardware)
  • Adapter Size: 4.1 MB
  • License: Apache 2.0
  • Training: October 2025
  • Evaluation: October 2025

LoRA Configuration

from peft import LoraConfig

LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
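As an illustrative sketch (not the training code; 896 matches Qwen2.5-0.5B's hidden size, though individual projections differ in shape), here is the low-rank update this config implies, and why the adapter stays in the single-digit-megabyte range:

```python
import numpy as np

# Hypothetical square projection layer at Qwen2.5-0.5B's hidden size.
d_in, d_out = 896, 896
r, lora_alpha = 8, 16                 # values from the LoRA config above

# LoRA replaces a full-rank update dW (d_out x d_in) with B @ A,
# where A is (r x d_in) and B is (d_out x r).
A = np.random.randn(r, d_in) * 0.01   # initialized small
B = np.zeros((d_out, r))              # initialized zero, so dW starts at 0
scaling = lora_alpha / r              # = 2.0

delta_W = scaling * (B @ A)
assert delta_W.shape == (d_out, d_in)

full_params = d_in * d_out            # 802,816 per layer
lora_params = r * (d_in + d_out)      # 14,336 per layer (~1.8% of full)
print(f"{lora_params / full_params:.1%} of full fine-tuning params")
```

Summed over the targeted projections in all layers, these small A/B factors account for the 4.1 MB adapter.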

Performance

Blind A/B testing (85+ comparisons) using Claude Opus 4 and Claude 3.7 Sonnet as judges.

| Test Type | Samples | Win Rate | Interpretation |
|---|---|---|---|
| Philosophy/Creative (In-Domain) | 13 | 77% | Exceptional domain expertise |
| Coding/Math/Practical (Out-of-Domain) | 15 | 40% | Maintained competitiveness |
| Comprehensive Multi-Domain | 57 | 71.9% | Strong overall performance |

Note: The comprehensive evaluation (71.9%, n=57) includes both in-domain and out-of-domain tasks. The dedicated in-domain subset (77%, n=13) shows stronger performance on philosophy/creative tasks specifically.
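Win rates at these sample sizes carry wide uncertainty. As a rough illustration (not a figure from the paper), a 95% Wilson score interval on 10 wins out of 13 (≈77%) spans roughly 50-92%:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(10, 13)      # 10/13 ≈ 77% in-domain wins
print(f"95% CI: {lo:.0%}-{hi:.0%}")   # roughly 50%-92%
```

This is why the comprehensive n=57 figure (71.9%) is the more stable estimate of overall performance.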

Domain Performance

In-Domain (77% win rate):

  • Continental philosophy (phenomenology, existentialism, critical theory)
  • Speculative and conceptual reframing
  • Contemplative prose
  • Philosophical argumentation

Out-of-Domain (40% win rate):

  • Math: 33%
  • Practical tasks: 67%
  • Creative writing: 67%
  • Factual knowledge: 33%
  • Coding: 0%
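These per-category figures are consistent with the 40% aggregate if the 15 out-of-domain prompts split evenly, 3 per category (an assumption; the exact split is in the evaluation suite): 1 + 2 + 2 + 1 + 0 = 6 wins of 15.

```python
# Hypothetical even split: 3 prompts per category, 15 total.
category_rates = {"math": 0.33, "practical": 0.67,
                  "creative": 0.67, "factual": 0.33, "coding": 0.0}

wins = sum(round(rate * 3) for rate in category_rates.values())  # 1+2+2+1+0
total = 3 * len(category_rates)
print(f"{wins}/{total} = {wins / total:.0%}")  # 6/15 = 40%
```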

Training Notes

Second Epoch Essential

Critical methodological finding:

  • Checkpoint-100 (1 epoch): ~10% performance (undertrained)
  • Checkpoint-290 (2 epochs): 77% in-domain performance
  • Impact: 60+ percentage point improvement from completing training

Lesson: For small datasets (~1k examples), don't evaluate early checkpoints as representative of final performance. Training to completion (2+ epochs) is critical.

No Catastrophic Forgetting

Domain-specific fine-tuning with LoRA successfully specializes without losing general capabilities:

  • 77% in-domain (exceptional specialization)
  • 40% out-of-domain (maintained competitiveness)
  • Creative skills transferred to new contexts (67%)

Small Dataset Success

1,213 examples authored through the LLM-assisted methodology proved sufficient for domain expertise:

  • Quality > quantity for domain-specific fine-tuning
  • LLM-assisted data authoring enables domain experts to capture specialized reasoning patterns
  • LoRA prevents overfitting on small datasets
  • Careful curation more important than scale

Data Authoring Process: Training data was authored using Claude (Anthropic), ChatGPT (OpenAI), Mistral, and Kimi as discussion partners. Notably, no training data was generated using Qwen or Llama models to prevent potential data contamination in fine-tuning experiments.

Multi-Response Sampling Methodology

A key methodological innovation: rather than single responses per prompt, the training data contains 202 unique prompts with multiple high-quality responses per prompt (averaging ~6 responses each, totaling 1,213 examples).

Why This Matters:

  • The model learns the distribution of valid responses rather than memorizing fixed prompt-response pairs
  • Teaches multiple valid reasoning paths and stylistic variations within domain constraints
  • Explains strong generalization despite relatively few unique prompts
  • Provides robustness: model learns what makes a response valid, not just one "correct" answer

This multi-response approach is critical to understanding why 1,213 examples achieve 77% in-domain performance—the model learns patterns and variance, not memorization.
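A minimal sketch of how such a dataset flattens into training rows (field names and prompt text are illustrative, not the actual dataset schema):

```python
# Each prompt maps to several accepted responses; every pair becomes one row.
prompt_responses = {
    "What does Heidegger mean by 'thrownness'?": [
        "Thrownness (Geworfenheit) names the fact that...",
        "For Heidegger, we always find ourselves already...",
        "One way in: we never choose the situation...",
    ],
    "Sketch a phenomenology of waiting.": [
        "Waiting stretches time...",
        "To wait is to be claimed by...",
    ],
}

rows = [
    {"prompt": p, "response": r}
    for p, responses in prompt_responses.items()
    for r in responses
]
print(len(rows))  # 2 prompts -> 5 rows; the real set: 202 prompts -> 1,213 rows
```

During training, the same prompt recurs with different targets, so gradient updates pull toward the response distribution rather than a single memorized completion.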

Reference: This approach aligns with cognitive grounding principles (causal, compositional, revisable reasoning) discussed in the paper.

Use Cases

Use Brie when:

  • Writing about continental philosophy
  • Exploring philosophical concepts in depth
  • Creative brainstorming on philosophical topics
  • Contemplative/meditative writing
  • Tasks requiring nuanced, multi-faceted analysis

Use baseline Qwen when:

  • Coding/programming tasks
  • Pure mathematical problems
  • Technical documentation
  • Factual knowledge retrieval

Usage

Installation

pip install transformers peft torch

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "closestfriend/brie-qwen2.5-0.5b")

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant specializing in philosophy and creative writing."},
    {"role": "user", "content": "Explain Heidegger's concept of 'Being-in-the-world'."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Training Details

Training Metrics

  • Initial Loss: 3.319
  • Final Loss: 1.4824 (55% reduction)
  • Validation Loss: 1.5031
  • Training Time: ~5 hours (2 epochs)
  • Hardware: Apple M4 MacBook Pro (16GB RAM, MPS backend)

Training Configuration

TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=20,
    logging_steps=10,
    eval_steps=50,
    save_steps=100,
)
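These settings imply an effective batch of 8 and, with the reported 290 steps over 2 epochs, roughly 1,160 training examples per epoch; the gap to the full 1,213 is consistent with a small validation holdout (an inference from the separate validation loss, not something the card states explicitly). A quick arithmetic check:

```python
per_device_batch = 2
grad_accum = 4
effective_batch = per_device_batch * grad_accum   # 8 examples per optimizer step

steps, epochs = 290, 2
examples_per_epoch = steps // epochs * effective_batch  # 145 * 8 = 1,160
print(effective_batch, examples_per_epoch)
```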

Evaluation Methodology

Rigorous Testing

  • 85+ blind A/B comparisons across multiple test suites
  • Randomized presentation order to avoid position bias
  • Multiple judge models (Claude Opus 4, Claude 3.7 Sonnet)
  • Reproducibility testing across 3 independent runs
  • Variance characterization (40-60% range with small samples)
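Randomized presentation can be sketched as: per comparison, flip a coin on which model's output appears as "A", record the flip, and invert it when tallying verdicts. (Illustrative only; not the actual evaluation harness.)

```python
import random

def blind_pairs(outputs, seed=0):
    """Randomize which model appears as 'A' per comparison; keep the key."""
    rng = random.Random(seed)
    presented, key = [], []
    for brie_out, base_out in outputs:
        flip = rng.random() < 0.5
        presented.append((base_out, brie_out) if flip else (brie_out, base_out))
        key.append(flip)
    return presented, key

def unblind(verdicts, key):
    """Map per-pair 'A'/'B' verdicts back to brie/base wins."""
    return ["brie" if (v == "A") != flip else "base"
            for v, flip in zip(verdicts, key)]

pairs = [("brie says...", "base says...")] * 4
presented, key = blind_pairs(pairs, seed=42)
# If the judge always picks the Brie output, unblinding recovers all-brie:
verdicts = ["A" if p[0] == "brie says..." else "B" for p in presented]
print(unblind(verdicts, key))  # ['brie', 'brie', 'brie', 'brie']
```

Without the recorded key, a judge with a position bias (e.g., always preferring "A") would contribute noise rather than a systematic skew toward either model.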

Test Suites

  1. In-Domain Test (13 prompts): Philosophy, brainstorming, contemplative writing
  2. Out-of-Domain Test (15 prompts): Coding, math, practical tasks, factual questions
  3. Comprehensive Eval (57 prompts): Multi-domain blind comparisons
  4. Reproducibility Test (15 prompts): Variance analysis across runs

Evaluation Criteria (1-5 scale)

  • Creativity & Originality
  • Coherence & Structure
  • Depth & Insight
  • Engagement & Interest
  • Writing Quality

Limitations

  • Specialized, not universal: Excels in philosophy/creative domains but not coding (0% on programming tasks)
  • Sampling variance: With temperature 0.75 and small samples (n<20), win rates can swing across a 40-60% range between runs
  • Judge subjectivity: Different AI judges prefer different qualities (depth vs clarity)
  • Small base model: 0.5B parameters means limited overall capability compared to larger models
  • English only: Trained on English examples, performance on other languages not tested

Citation

If you use this model in your research or applications, please cite:

@misc{brie-2025,
  author = {Hunter Karman},
  title = {Brie: Domain-Specific Fine-Tuning with LLM-Assisted Data Authoring},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/closestfriend/brie-qwen2.5-0.5b}},
  note = {77\% in-domain performance with 1,213 examples authored through LLM-assisted methodology}
}

Acknowledgments

  • Base Model: Qwen Team for Qwen 2.5 0.5B Instruct
  • Evaluation Judges: Anthropic's Claude Opus 4 and Claude 3.7 Sonnet, OpenAI's GPT-4o, Google's Gemini 2.5 Flash Lite
  • Training Framework: HuggingFace PEFT & TRL libraries

Model Card Authors

Created by Hunter Noah Shokrian Karman

Model Card Contact

For questions or feedback: hnshokrian@gmail.com

Links

Full evaluation details and training code: github.com/closestfriend/efficient-domain-adaptation
