Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)
This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct. It was trained using TRL, Hugging Face's library for post-training transformer language models. The snippet below shows how to query the model with the `transformers` pipeline:
```python
from transformers import pipeline

question = "Explain diabetes simply"

# Load the fine-tuned checkpoint as a chat-style text-generation pipeline.
generator = pipeline("text-generation", model="azherali/Qwen2.5-1.5B-Instruct-dpo", device="cuda")

# Pass the question in chat format; return only the newly generated answer.
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023). DPO fine-tunes the policy directly on preference pairs, using the language model's own log-probabilities as an implicit reward, so no separate reward model or RL loop is needed.
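For reference, here is a minimal sketch of how such a DPO run can be set up with TRL's `DPOTrainer`. The preference dataset (`trl-lib/ultrafeedback_binarized`) and the `beta` value are illustrative assumptions, not the actual recipe behind this checkpoint:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Base model and tokenizer (the model this checkpoint was fine-tuned from).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Illustrative preference dataset with "chosen"/"rejected" response pairs;
# the dataset actually used for this checkpoint is not documented here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# beta controls how far the policy may drift from the reference model;
# 0.1 is the TRL default, assumed here for illustration.
training_args = DPOConfig(output_dir="Qwen2.5-1.5B-Instruct-dpo", beta=0.1)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no `ref_model` is supplied, `DPOTrainer` creates a frozen copy of the policy to serve as the reference model in the DPO loss.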