Part of the Brie Model Family: This is the foundational model in our architecture comparison study. See also: Brie Qwen 2.5 3B | Brie Llama 3.2 3B
Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation (Karman, 2025)
v2.0 (Jan 2026): Added theoretical framework (Li et al. 2025), corrected training config documentation
Brie Qwen 2.5 0.5B
LoRA adapter for Qwen/Qwen2.5-0.5B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work.
Part of a controlled study demonstrating an LLM-assisted data-authoring methodology, in which the researcher authored 1,213 training examples through iterative discussions, using LLMs as authoring tools. The model achieves a 77% win rate on in-domain tasks (n=13), 71.9% on a comprehensive multi-domain suite (n=57), and 40% on out-of-domain tasks (n=15).
Model Details
- Base Model: Qwen/Qwen2.5-0.5B-Instruct (618M parameters)
- Training Method: LoRA (Low-Rank Adaptation)
- Training Data: 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools
- Training Duration: 2 epochs (290 steps, ~5 hours on Apple M4 MacBook)
- Training Cost: Negligible (consumer hardware)
- Adapter Size: 4.1 MB
- License: Apache 2.0
- Training: October 2025
- Evaluation: October 2025
LoRA Configuration
```python
LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM"
)
```
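As a rough sanity check, the number of trainable LoRA parameters can be estimated from this configuration. The model dimensions below (hidden size, KV projection size, MLP width, layer count) are assumptions taken from Qwen2.5-0.5B's published config, not figures stated in this card:

```python
# Rough LoRA trainable-parameter estimate for the config above.
# Dimensions are assumed from Qwen2.5-0.5B's published config
# (hidden 896, grouped-query attention with 2 KV heads of dim 64,
# MLP width 4864, 24 layers) and may not match the exact checkpoint.
r = 8
hidden, kv_dim, mlp, layers = 896, 128, 4864, 24

# Each adapted weight of shape (out, in) adds r * (in + out) parameters
# (one r-by-in A matrix plus one out-by-r B matrix).
per_layer = (
    r * (hidden + hidden)    # q_proj
    + r * (hidden + kv_dim)  # k_proj
    + r * (hidden + kv_dim)  # v_proj
    + r * (hidden + hidden)  # o_proj
    + r * (hidden + mlp)     # gate_proj
    + r * (hidden + mlp)     # up_proj
    + r * (mlp + hidden)     # down_proj
)
total = per_layer * layers
print(f"~{total / 1e6:.1f}M trainable LoRA parameters")
```

This lands in the low millions of trainable parameters, a small fraction of the 618M-parameter base model, which is what makes the adapter so compact.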
Performance
Blind A/B testing (85+ comparisons) using Claude Opus 4 and Claude 3.7 Sonnet as judges.
| Test Type | Samples | Win Rate | Interpretation |
|---|---|---|---|
| Philosophy/Creative (In-Domain) | 13 | 77% | Exceptional domain expertise |
| Coding/Math/Practical (Out-of-Domain) | 15 | 40% | Maintained competitiveness |
| Comprehensive Multi-Domain | 57 | 71.9% | Strong overall performance |
Note: The comprehensive evaluation (71.9%, n=57) includes both in-domain and out-of-domain tasks. The dedicated in-domain subset (77%, n=13) shows stronger performance on philosophy/creative tasks specifically.
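The headline percentages are consistent with whole-number win counts over the stated sample sizes. A quick check, where the win counts (10, 6, 41) are inferred from the reported rates rather than stated in this card:

```python
# Back out the win counts implied by the reported win rates.
# Counts (10, 6, 41) are inferred from the percentages, not
# reported directly in this card.
def win_rate(wins, n):
    return round(100 * wins / n, 1)

print(win_rate(10, 13))  # 76.9 -> reported as 77% (in-domain)
print(win_rate(6, 15))   # 40.0 (out-of-domain)
print(win_rate(41, 57))  # 71.9 (comprehensive)
```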
Domain Performance
In-Domain (77% win rate):
- Continental philosophy (phenomenology, existentialism, critical theory)
- Speculative and conceptual reframing
- Contemplative prose
- Philosophical argumentation
Out-of-Domain (40% win rate):
- Math: 33%
- Practical tasks: 67%
- Creative writing: 67%
- Factual knowledge: 33%
- Coding: 0%
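The subdomain rates are internally consistent with an even split of 3 prompts per category; that split is an inference from the round percentages, not something stated in this card:

```python
# Per-subdomain win counts consistent with the reported rates,
# assuming 3 prompts per category (inferred, not stated in the card).
wins = {"math": 1, "practical": 2, "creative": 2, "factual": 1, "coding": 0}
n_per = 3

for name, w in wins.items():
    print(name, round(100 * w / n_per))  # 33, 67, 67, 33, 0

overall = round(100 * sum(wins.values()) / (n_per * len(wins)))
print("overall", overall)  # 40 -> matches the 40% out-of-domain rate
```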
Training Notes
Second Epoch Essential
Critical methodological finding:
- Checkpoint-100 (1 epoch): ~10% performance (undertrained)
- Checkpoint-290 (2 epochs): 77% in-domain performance
- Impact: 60+ percentage point improvement from completing training
Lesson: For small datasets (~1k examples), don't evaluate early checkpoints as representative of final performance. Training to completion (2+ epochs) is critical.
No Catastrophic Forgetting
Domain-specific fine-tuning with LoRA successfully specializes without losing general capabilities:
- 77% in-domain (exceptional specialization)
- 40% out-of-domain (maintained competitiveness)
- Creative skills transferred to new contexts (67%)
Small Dataset Success
1,213 examples authored through the LLM-assisted methodology proved sufficient for domain expertise:
- Quality > quantity for domain-specific fine-tuning
- LLM-assisted data authoring enables domain experts to capture specialized reasoning patterns
- LoRA prevents overfitting on small datasets
- Careful curation more important than scale
Data Authoring Process: Training data was authored using Claude (Anthropic), ChatGPT (OpenAI), Mistral, and Kimi as discussion partners. Notably, no training data was generated using Qwen or Llama models to prevent potential data contamination in fine-tuning experiments.
Multi-Response Sampling Methodology
A key methodological innovation: rather than single responses per prompt, the training data contains 202 unique prompts with multiple high-quality responses per prompt (averaging ~6 responses each, totaling 1,213 examples).
Why This Matters:
- The model learns the distribution of valid responses rather than memorizing fixed prompt-response pairs
- Teaches multiple valid reasoning paths and stylistic variations within domain constraints
- Explains strong generalization despite relatively few unique prompts
- Provides robustness: model learns what makes a response valid, not just one "correct" answer
This multi-response approach is critical to understanding why 1,213 examples achieve 77% in-domain performance—the model learns patterns and variance, not memorization.
Reference: This approach aligns with cognitive grounding principles (causal, compositional, revisable reasoning) discussed in the paper.
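A minimal sketch of how such a multi-response dataset can be structured and its per-prompt ratio checked. The field names (`prompt`, `response`) are illustrative, not the actual dataset schema:

```python
from collections import defaultdict

# Toy multi-response dataset: several responses per unique prompt.
# Field names are illustrative only, not the actual schema.
examples = [
    {"prompt": "What is Dasein?", "response": "Heidegger's term for..."},
    {"prompt": "What is Dasein?", "response": "The kind of being that..."},
    {"prompt": "Define bad faith.", "response": "Sartre's notion of..."},
]

by_prompt = defaultdict(list)
for ex in examples:
    by_prompt[ex["prompt"]].append(ex["response"])

avg = len(examples) / len(by_prompt)
print(f"{len(by_prompt)} prompts, avg {avg:.1f} responses each")

# At the dataset's reported scale, the ratio matches the "~6" figure:
print(f"{1213 / 202:.1f}")  # 6.0 responses per prompt
```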
Use Cases
Use Brie when:
- Writing about continental philosophy
- Exploring philosophical concepts in depth
- Creative brainstorming on philosophical topics
- Contemplative/meditative writing
- Tasks requiring nuanced, multi-faceted analysis
Use baseline Qwen when:
- Coding/programming tasks
- Pure mathematical problems
- Technical documentation
- Factual knowledge retrieval
Usage
Installation
```shell
pip install transformers peft torch
```
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "closestfriend/brie-qwen2.5-0.5b")

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant specializing in philosophy and creative writing."},
    {"role": "user", "content": "Explain Heidegger's concept of 'Being-in-the-world'."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True
)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
Training Details
Training Metrics
- Initial Loss: 3.319
- Final Loss: 1.4824 (55% reduction)
- Validation Loss: 1.5031
- Training Time: ~5 hours (2 epochs)
- Hardware: Apple M4 MacBook Pro (16GB RAM, MPS backend)
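The quoted 55% reduction follows directly from the loss figures above:

```python
# Relative loss reduction from the reported initial and final losses.
initial, final = 3.319, 1.4824
reduction = (initial - final) / initial
print(f"{reduction:.1%}")  # 55.3%
```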
Training Configuration
```python
TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=20,
    logging_steps=10,
    eval_steps=50,
    save_steps=100,
)
```
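The 290-step count reported in the training notes is consistent with this configuration. In the check below, the ~53-example validation holdout is inferred from the arithmetic, not stated in this card:

```python
import math

per_device, grad_accum, epochs = 2, 4, 2
effective_batch = per_device * grad_accum  # 8 examples per optimizer step

# 290 reported steps over 2 epochs implies 145 steps/epoch, i.e.
# ~1160 training examples out of 1213 total -- consistent with a
# small (~53-example, inferred) validation holdout.
train_examples = 1213 - 53
steps = math.ceil(train_examples / effective_batch) * epochs
print(effective_batch, steps)  # 8 290
```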
Evaluation Methodology
Rigorous Testing
- 85+ blind A/B comparisons across multiple test suites
- Randomized presentation order to avoid position bias
- Multiple judge models (Claude Opus 4, Claude 3.7 Sonnet)
- Reproducibility testing across 3 independent runs
- Variance characterization (40-60% range with small samples)
Test Suites
- In-Domain Test (13 prompts): Philosophy, brainstorming, contemplative writing
- Out-of-Domain Test (15 prompts): Coding, math, practical tasks, factual questions
- Comprehensive Eval (57 prompts): Multi-domain blind comparisons
- Reproducibility Test (15 prompts): Variance analysis across runs
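The 40-60% variance band observed with small samples is roughly what binomial noise alone predicts. A sketch assuming independent comparisons near a 50% true win rate (an idealized assumption, since judged comparisons may be correlated):

```python
import math

# One-sigma spread of an observed win rate over n independent
# comparisons at a true rate p (binomial standard error).
def win_rate_sigma(n, p=0.5):
    return math.sqrt(p * (1 - p) / n)

for n in (13, 15, 57):
    s = win_rate_sigma(n)
    print(f"n={n}: +/-{100 * s:.0f} pp one-sigma")
```

At n=15 this gives roughly ±13 percentage points around a 50% baseline, a one-sigma band of about 37-63%, in line with the observed 40-60% spread; at n=57 the band tightens to about ±7 points.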
Evaluation Criteria (1-5 scale)
- Creativity & Originality
- Coherence & Structure
- Depth & Insight
- Engagement & Interest
- Writing Quality
Limitations
- Specialized, not universal: Excels in philosophy/creative domains but not coding (0% on programming tasks)
- Sampling variance: with temperature 0.75 and small samples (n<20), observed win rates can range from roughly 40% to 60% across runs
- Judge subjectivity: Different AI judges prefer different qualities (depth vs clarity)
- Small base model: 0.5B parameters means limited overall capability compared to larger models
- English only: Trained on English examples, performance on other languages not tested
Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{brie-2025,
  author       = {Karman, Hunter},
  title        = {Brie: Domain-Specific Fine-Tuning with LLM-Assisted Data Authoring},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/closestfriend/brie-qwen2.5-0.5b}},
  note         = {77\% in-domain performance with 1,213 examples authored through LLM-assisted methodology}
}
```
Acknowledgments
- Base Model: Qwen Team for Qwen 2.5 0.5B Instruct
- Evaluation Judges: Anthropic's Claude Opus 4 and Claude 3.7 Sonnet, OpenAI's GPT-4o, Google's Gemini 2.5 Flash Lite
- Training Framework: HuggingFace PEFT & TRL libraries
Model Card Authors
Created by Hunter Noah Shokrian Karman
Model Card Contact
For questions or feedback: hnshokrian@gmail.com
Links
Brie Model Family:
- Brie Qwen 2.5 0.5B - Foundational model (this model)
- Brie Qwen 2.5 3B - Scaling study
- Brie Llama 3.2 3B - Cross-architecture study
Paper: Karman, H. (2025). "Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation." DOI: 10.5281/zenodo.17657737
Code Repository (full evaluation details and training code): https://github.com/closestfriend/efficient-domain-adaptation