Part of the Brie Model Family: This is the foundational model in our architecture comparison study. See also: Brie Qwen 2.5 3B | Brie Llama 3.2 3B

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation (Karman, 2025)

v2.0 (Jan 2026): Added theoretical framework (Li et al. 2025), corrected training config documentation

Brie Qwen 2.5 0.5B

LoRA adapter for Qwen/Qwen2.5-0.5B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work.

Part of a controlled study demonstrating an LLM-assisted data authoring methodology: the researcher authored 1,213 training examples through iterative discussions, using LLMs as authoring tools. The model achieves a 77% win rate on in-domain tasks (n=13), 71.9% on a comprehensive multi-domain suite (n=57), and 40% on out-of-domain tasks (n=15).

Model Details

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct (618M parameters)
  • Training Method: LoRA (Low-Rank Adaptation)
  • Training Data: 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools
  • Training Duration: 2 epochs (290 steps, ~5 hours on Apple M4 MacBook)
  • Training Cost: Negligible (consumer hardware)
  • Adapter Size: 4.1 MB
  • License: Apache 2.0
  • Training: October 2025
  • Evaluation: October 2025

LoRA Configuration

from peft import LoraConfig

LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
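As an illustrative sketch (not the training code; 896 matches Qwen2.5-0.5B's hidden size, though individual projections differ in shape), here is the low-rank update this config implies, and why the adapter stays in the single-digit-megabyte range:

```python
import numpy as np

# Hypothetical square projection layer at Qwen2.5-0.5B's hidden size.
d_in, d_out = 896, 896
r, lora_alpha = 8, 16                 # values from the LoRA config above

# LoRA replaces a full-rank update dW (d_out x d_in) with B @ A,
# where A is (r x d_in) and B is (d_out x r).
A = np.random.randn(r, d_in) * 0.01   # initialized small
B = np.zeros((d_out, r))              # initialized zero, so dW starts at 0
scaling = lora_alpha / r              # = 2.0

delta_W = scaling * (B @ A)
assert delta_W.shape == (d_out, d_in)

full_params = d_in * d_out            # 802,816 per layer
lora_params = r * (d_in + d_out)      # 14,336 per layer (~1.8% of full)
print(f"{lora_params / full_params:.1%} of full fine-tuning params")
```

Summed over the targeted projections in all layers, these small A/B factors account for the 4.1 MB adapter.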

Performance

Blind A/B testing (85+ comparisons) using Claude Opus 4 and Claude 3.7 Sonnet as judges.

| Test Type | Samples | Win Rate | Interpretation |
|---|---|---|---|
| Philosophy/Creative (In-Domain) | 13 | 77% | Exceptional domain expertise |
| Coding/Math/Practical (Out-of-Domain) | 15 | 40% | Maintained competitiveness |
| Comprehensive Multi-Domain | 57 | 71.9% | Strong overall performance |

Note: The comprehensive evaluation (71.9%, n=57) includes both in-domain and out-of-domain tasks. The dedicated in-domain subset (77%, n=13) shows stronger performance on philosophy/creative tasks specifically.
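Win rates at these sample sizes carry wide uncertainty. As a rough illustration (not a figure from the paper), a 95% Wilson score interval on 10 wins out of 13 (≈77%) spans roughly 50-92%:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(10, 13)      # 10/13 ≈ 77% in-domain wins
print(f"95% CI: {lo:.0%}-{hi:.0%}")   # roughly 50%-92%
```

This is why the comprehensive n=57 figure (71.9%) is the more stable estimate of overall performance.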

Domain Performance

In-Domain (77% win rate):

  • Continental philosophy (phenomenology, existentialism, critical theory)
  • Speculative and conceptual reframing
  • Contemplative prose
  • Philosophical argumentation

Out-of-Domain (40% win rate):

  • Math: 33%
  • Practical tasks: 67%
  • Creative writing: 67%
  • Factual knowledge: 33%
  • Coding: 0%
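These per-category figures are consistent with the 40% aggregate if the 15 out-of-domain prompts split evenly, 3 per category (an assumption; the exact split is in the evaluation suite): 1 + 2 + 2 + 1 + 0 = 6 wins of 15.

```python
# Hypothetical even split: 3 prompts per category, 15 total.
category_rates = {"math": 0.33, "practical": 0.67,
                  "creative": 0.67, "factual": 0.33, "coding": 0.0}

wins = sum(round(rate * 3) for rate in category_rates.values())  # 1+2+2+1+0
total = 3 * len(category_rates)
print(f"{wins}/{total} = {wins / total:.0%}")  # 6/15 = 40%
```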

Training Notes

Second Epoch Essential

Critical methodological finding:

  • Checkpoint-100 (1 epoch): ~10% performance (undertrained)
  • Checkpoint-290 (2 epochs): 77% in-domain performance
  • Impact: 60+ percentage point improvement from completing training

Lesson: For small datasets (~1k examples), don't evaluate early checkpoints as representative of final performance. Training to completion (2+ epochs) is critical.

No Catastrophic Forgetting

Domain-specific fine-tuning with LoRA successfully specializes without losing general capabilities:

  • 77% in-domain (exceptional specialization)
  • 40% out-of-domain (maintained competitiveness)
  • Creative skills transferred to new contexts (67%)

Small Dataset Success

1,213 examples authored through the LLM-assisted methodology proved sufficient for domain expertise:

  • Quality > quantity for domain-specific fine-tuning
  • LLM-assisted data authoring enables domain experts to capture specialized reasoning patterns
  • LoRA prevents overfitting on small datasets
  • Careful curation more important than scale

Data Authoring Process: Training data was authored using Claude (Anthropic), ChatGPT (OpenAI), Mistral, and Kimi as discussion partners. Notably, no training data was generated using Qwen or Llama models to prevent potential data contamination in fine-tuning experiments.

Multi-Response Sampling Methodology

A key methodological innovation: rather than single responses per prompt, the training data contains 202 unique prompts with multiple high-quality responses per prompt (averaging ~6 responses each, totaling 1,213 examples).

Why This Matters:

  • The model learns the distribution of valid responses rather than memorizing fixed prompt-response pairs
  • Teaches multiple valid reasoning paths and stylistic variations within domain constraints
  • Explains strong generalization despite relatively few unique prompts
  • Provides robustness: model learns what makes a response valid, not just one "correct" answer

This multi-response approach is critical to understanding why 1,213 examples achieve 77% in-domain performance—the model learns patterns and variance, not memorization.
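A minimal sketch of how such a dataset flattens into training rows (field names and prompt text are illustrative, not the actual dataset schema):

```python
# Each prompt maps to several accepted responses; every pair becomes one row.
prompt_responses = {
    "What does Heidegger mean by 'thrownness'?": [
        "Thrownness (Geworfenheit) names the fact that...",
        "For Heidegger, we always find ourselves already...",
        "One way in: we never choose the situation...",
    ],
    "Sketch a phenomenology of waiting.": [
        "Waiting stretches time...",
        "To wait is to be claimed by...",
    ],
}

rows = [
    {"prompt": p, "response": r}
    for p, responses in prompt_responses.items()
    for r in responses
]
print(len(rows))  # 2 prompts -> 5 rows; the real set: 202 prompts -> 1,213 rows
```

During training, the same prompt recurs with different targets, so gradient updates pull toward the response distribution rather than a single memorized completion.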

Reference: This approach aligns with cognitive grounding principles (causal, compositional, revisable reasoning) discussed in the paper.

Use Cases

Use Brie when:

  • Writing about continental philosophy
  • Exploring philosophical concepts in depth
  • Creative brainstorming on philosophical topics
  • Contemplative/meditative writing
  • Tasks requiring nuanced, multi-faceted analysis

Use baseline Qwen when:

  • Coding/programming tasks
  • Pure mathematical problems
  • Technical documentation
  • Factual knowledge retrieval

Usage

Installation

pip install transformers peft torch

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "closestfriend/brie-qwen2.5-0.5b")

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant specializing in philosophy and creative writing."},
    {"role": "user", "content": "Explain Heidegger's concept of 'Being-in-the-world'."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Training Details

Training Metrics

  • Initial Loss: 3.319
  • Final Loss: 1.4824 (55% reduction)
  • Validation Loss: 1.5031
  • Training Time: ~5 hours (2 epochs)
  • Hardware: Apple M4 MacBook Pro (16GB RAM, MPS backend)

Training Configuration

TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=20,
    logging_steps=10,
    eval_steps=50,
    save_steps=100,
)
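These settings imply an effective batch of 8 and, with the reported 290 steps over 2 epochs, roughly 1,160 training examples per epoch; the gap to the full 1,213 is consistent with a small validation holdout (an inference from the separate validation loss, not something the card states explicitly). A quick arithmetic check:

```python
per_device_batch = 2
grad_accum = 4
effective_batch = per_device_batch * grad_accum   # 8 examples per optimizer step

steps, epochs = 290, 2
examples_per_epoch = steps // epochs * effective_batch  # 145 * 8 = 1,160
print(effective_batch, examples_per_epoch)
```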

Evaluation Methodology

Rigorous Testing

  • 85+ blind A/B comparisons across multiple test suites
  • Randomized presentation order to avoid position bias
  • Multiple judge models (Claude Opus 4, Claude 3.7 Sonnet)
  • Reproducibility testing across 3 independent runs
  • Variance characterization (40-60% range with small samples)
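Randomized presentation can be sketched as: per comparison, flip a coin on which model's output appears as "A", record the flip, and invert it when tallying verdicts. (Illustrative only; not the actual evaluation harness.)

```python
import random

def blind_pairs(outputs, seed=0):
    """Randomize which model appears as 'A' per comparison; keep the key."""
    rng = random.Random(seed)
    presented, key = [], []
    for brie_out, base_out in outputs:
        flip = rng.random() < 0.5
        presented.append((base_out, brie_out) if flip else (brie_out, base_out))
        key.append(flip)
    return presented, key

def unblind(verdicts, key):
    """Map per-pair 'A'/'B' verdicts back to brie/base wins."""
    return ["brie" if (v == "A") != flip else "base"
            for v, flip in zip(verdicts, key)]

pairs = [("brie says...", "base says...")] * 4
presented, key = blind_pairs(pairs, seed=42)
# If the judge always picks the Brie output, unblinding recovers all-brie:
verdicts = ["A" if p[0] == "brie says..." else "B" for p in presented]
print(unblind(verdicts, key))  # ['brie', 'brie', 'brie', 'brie']
```

Without the recorded key, a judge with a position bias (e.g., always preferring "A") would contribute noise rather than a systematic skew toward either model.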

Test Suites

  1. In-Domain Test (13 prompts): Philosophy, brainstorming, contemplative writing
  2. Out-of-Domain Test (15 prompts): Coding, math, practical tasks, factual questions
  3. Comprehensive Eval (57 prompts): Multi-domain blind comparisons
  4. Reproducibility Test (15 prompts): Variance analysis across runs

Evaluation Criteria (1-5 scale)

  • Creativity & Originality
  • Coherence & Structure
  • Depth & Insight
  • Engagement & Interest
  • Writing Quality

Limitations

  • Specialized, not universal: Excels in philosophy/creative domains but not coding (0% on programming tasks)
  • Sampling variance: With temperature 0.75 and small samples (n<20), win rates can swing across a 40-60% range between runs
  • Judge subjectivity: Different AI judges prefer different qualities (depth vs clarity)
  • Small base model: 0.5B parameters means limited overall capability compared to larger models
  • English only: Trained on English examples, performance on other languages not tested

Citation

If you use this model in your research or applications, please cite:

@misc{brie-2025,
  author = {Hunter Karman},
  title = {Brie: Domain-Specific Fine-Tuning with LLM-Assisted Data Authoring},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/closestfriend/brie-qwen2.5-0.5b}},
  note = {77\% in-domain performance with 1,213 examples authored through LLM-assisted methodology}
}

Acknowledgments

  • Base Model: Qwen Team for Qwen 2.5 0.5B Instruct
  • Evaluation Judges: Anthropic's Claude Opus 4 and Claude 3.7 Sonnet, OpenAI's GPT-4o, Google's Gemini 2.5 Flash Lite
  • Training Framework: HuggingFace PEFT & TRL libraries

Model Card Authors

Created by Hunter Noah Shokrian Karman

Model Card Contact

For questions or feedback: hnshokrian@gmail.com

Links

Full evaluation details and training code: github.com/closestfriend/efficient-domain-adaptation
