training config
do you have a training setup for flux-klein?
i have trained a character before, that was ok:
10 images, lr 0.0001, Adam, rank 32, 2000 steps.
but for image-edit mode,
for a start i have 20 control and 20 target images,
and those parameters don't work.
Hello, for Flux Klein the rank used was 128, and for the other model 64. Normally at the beginning I train with low noise, then as the training progresses I stop and adjust according to the results I get from the samples. I used a total of about 300 image pairs or more, over 3500 steps. The training resolution also changed during training: at the beginning low resolution (256, 512), and closer to the end I switch to 1024, 1536. Flux Klein was easy to train because it already does some kind of head swap, but other models can be more complicated. The more samples and the more variability, the better.
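The progressive-resolution idea described above can be written as a tiny schedule function. This is only a sketch of the idea, not an AI Toolkit feature, and the 60% switch point is my assumption, since the exact point where the switch happens isn't stated:

```python
def resolutions_for_step(step: int, total_steps: int) -> list[int]:
    """Progressive resolution schedule: coarse buckets early so the
    model learns the transformation cheaply, full resolution late so
    it learns detail. The 0.6 cutover fraction is an assumption."""
    if step < 0.6 * total_steps:
        return [256, 512]   # early phase: low-resolution buckets
    return [1024, 1536]     # late phase: full-resolution buckets
```

With the 3500-step run described above, this sketch would switch resolutions at step 2100.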
sounds complicated anyway, if you keep changing the training setup (noise, resolution) until the end of the run...
what do you train with? OneTrainer? Kohya? AI Toolkit?
I trained all these versions using AI Toolkit.
Previously I used Kohya; now I also use AI Toolkit...
As I said, only 25 image pairs for now, but what needs to be "learned" is minimal. As in your case, "flux-klein" can already do this quite well (input low resolution, output high resolution). Do you think by step ~300 I should be able to see whether it's starting to work?
One question: should the loss curve tend to decrease as training goes on? Could I stop after just 10-50 steps if the curve only rises? And if it decreases, how low should it go?
Still strange that my character LoRA came out relatively okay with 20 images and 1000 steps... but with the image pairs I've already been testing for 3 days. ^^
It's not quite that simple. The idea of "if it goes up in the first 50 steps I can stop" isn't accurate. The curve will go up and down several times; what's not acceptable is for it to only go up the entire time, or to stay flat at the same level for too long.

90% of the success of your training depends on your dataset. If it has the same pattern in all your samples, it will work out in the end. It's also important for each pair to have the same dimensions: avoid training, for example, control/image_1.png (1024x1024) against target/image_1.png (512x1024).

Another thing: my rule of thumb is that below a loss of 0.10 I already start thinking about stopping. A very low loss may indicate overfitting, and a very high loss may indicate either a lack of samples in your dataset or something wrong at the configuration level.

I always train all my models with more than 100 samples, whether simple images or pairs, always more than 100; sometimes with detailed captions, sometimes with captions in instruction format (this is good for editing models) teaching the model what it needs to do with the image, and often with a simple trigger word.

In short, if your training isn't working, something is wrong with your dataset: too few samples, incorrect dimensions, unclear captions, or a very random dataset without any pattern.
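The dimension rule above is easy to check mechanically. A minimal sketch, assuming you have already collected each pair's sizes into dicts (in practice you would fill them by reading the files, e.g. with PIL's `Image.open(path).size`; the `control/` and `target/` folder names follow the example above):

```python
def find_dimension_mismatches(control_sizes: dict, target_sizes: dict) -> list:
    """Return the names of pairs whose control and target dimensions differ.

    control_sizes / target_sizes map a pair name (e.g. "image_1.png")
    to a (width, height) tuple.
    """
    bad = []
    for name, ctrl_wh in control_sizes.items():
        tgt_wh = target_sizes.get(name)
        if tgt_wh is not None and tgt_wh != ctrl_wh:
            bad.append(name)
    return sorted(bad)
```

For the example above, a control image at 1024x1024 paired with a 512x1024 target would be flagged so you can resize one of them before training.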
THX!
yeah, a lot of training depends on the dataset, I know... I've trained a lot of Flux.1 and SDXL.
for the rank I'm only at 32; higher is too much for my hardware (training needs more VRAM).
and to my understanding, the higher the rank, the less important your prompt becomes, like the turbo models (flux2) at rank 256 or 512.
i will try more... and it still starts with "klein"... ;)
Hi @Alissonerdx ! I have a question about the training strategy.
From what I understand, the training requires triplets of images: 1 target image paired with 2 control images (image1: body, image2: face). Please correct me if I'm wrong about this.
If that's the case, I'm curious about how you obtained such datasets - specifically, target images where the body matches image1 and the face matches image2. Could you share some insights on your data preparation process?
Thanks in advance!
This is the golden part of the process, because anyone can train a LoRA, but the most important part of the training is the dataset, hahaha. And when it comes to face and head swapping, there's almost nothing of good quality on the internet. So the first thing I did before training a LoRA was to find a way to do head swapping without a LoRA, using a workflow. I created several workflow versions until I had one refined enough, then automated it to run over many images and selected the best ones. After training the first LoRA, everything became easier, and I no longer needed a complex workflow. Based on that, I created new datasets and refined my versions more and more. Now I have a more consistent dataset. Another way to do this would be to try the Humo, Wan Animate, or Wan SCAIL video models, but then you work with only a few frames and discard the rest.
But anyway, now with these LoRAs of mine, anyone can create a good dataset without going through the suffering I went through: just swap many images and select the best ones. It took me months to build my datasets, and now everything is simply easy.
Thank you so much for the response! @Alissonerdx
I have one more question: When you trained the LoRA, did you train it on Flux2 Klein Base and then apply it to Flux2 Klein (the step-wise distilled version) for inference? Or did you train the LoRA directly on Flux2 Klein itself?
I'd appreciate any clarification on this!
I've been working on this same type of project (a complete head swap) for about 3-4 months. I started on FAL-AI. It trained some nice LoRAs, but unfortunately you can't save at intervals. I bit the bullet and upgraded my PC to an RTX 5090 with 128 GB of system RAM (but DDR4). With AI Toolkit and offloading, I can train the full Qwen 2509 model with no quantization of the transformer or text encoder. I haven't tried training the Klein model yet.

I've done a few trainings with Qwen 2509 and they are somewhat successful. I see now I need a much larger dataset: you are using 300 pairs, I am using 30! If you don't mind, would you elaborate on a few things for me?

When you say you begin to train with low noise, do you mean a low or high learn rate? I was thinking of starting with perhaps 4e-4 for maybe 300 steps (this is for 30 pairs), dropping to 2e-4 for 300 steps, 1e-4 for 1000, and maybe finishing off with 7e-5 until it was done.

Also, with the different resolutions, do you manually resize some of your dataset, or are you just selecting 256 (512, 1024, 1536) and turning off "match target resolution"?

I tried one of your early versions, but it was very low rank (4, I think). It worked pretty well, but it pretty much conformed to the shape of the face it was swapping; that was probably due to the lower rank. I'm going to try all of this; it seems I have to enlarge my dataset! I'd appreciate it if you'd elaborate on your training when you have the time. Thanks!
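The staged learning-rate plan proposed above, written out as a step-to-LR function (these are exactly the boundaries from the question; whether this beats a single decaying schedule is an open question):

```python
def lr_at_step(step: int) -> float:
    """Staged learning-rate plan from the question above:
    4e-4 for the first 300 steps, 2e-4 for the next 300,
    1e-4 for the next 1000, then 7e-5 until the end."""
    if step < 300:
        return 4e-4
    if step < 600:
        return 2e-4
    if step < 1600:
        return 1e-4
    return 7e-5
```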
When I say low noise, I mean that I intentionally bias the noise distribution during training to privilege lower noise levels.
During diffusion training, depending on the timestep sampling strategy you select, the model will see different noise distributions. This is what we refer to as timestep sampling — the strategy that defines how frequently each noise level (sigma/timestep) is sampled during training.
For example:
If a training sample receives low noise, the model is learning how to restore fine details from a slightly degraded image.
In this case, it focuses on subtle corrections such as:
- Skin texture
- Micro-details
- Small lighting variations
- Minor structural refinements
If a training sample receives high noise, the model must reconstruct the image from a heavily corrupted state.
In this regime, it learns more global and structural transformations, such as:
- Major shape changes
- Pose alterations
- Replacing a head with another one
- Large compositional adjustments
So effectively:
- Low-noise training → refinement learning
- High-noise training → structural learning
What I usually do during training is adjust this distribution over time.
For example:
- I may start training with a higher emphasis on high-noise timesteps so the model first learns strong structural transformations.
- Then, toward the end of training, I shift the sampling distribution toward low-noise timesteps so the model refines details and improves realism.
This staged approach helps balance:
- Structural capability
- Identity preservation
- Fine-detail quality
Instead of locking the model into only one regime, I progressively change the strategy depending on what I want the model to specialize in.
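The staged bias described above can be sketched with a simple sampler. A power transform on a uniform draw is one easy way to skew the timestep distribution; real trainers use more elaborate schemes (shift/sigmoid/weighted sampling), and the exponents here are illustrative assumptions:

```python
import random

def sample_timestep(bias: str = "balanced") -> float:
    """Sample a training timestep t in [0, 1] (t = 1 ~ pure noise).

    'high' biases toward high-noise steps (structural learning),
    'low'  biases toward low-noise steps (detail refinement).
    """
    u = random.random()
    if bias == "high":
        return u ** 0.5   # density 2t: mass concentrated near t = 1
    if bias == "low":
        return u ** 2.0   # mass concentrated near t = 0
    return u              # uniform / balanced
```

Early in training you would call it with `bias="high"`, then switch to `bias="low"` for the refinement phase.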
Got it: high-noise and low-noise bias, i.e. timestep bias. I've been training with it balanced; I'll give that a try. I assume I should still use "weighted" as the timestep type? I'll let you know how it goes. I still have to build up my dataset. Thanks for the help! Just one more question: did you use masking in your training?
Dumping the structure, I get:
diffusion_model.transformer_blocks.0.attn.add_k_proj.alpha shape=[]
diffusion_model.transformer_blocks.0.attn.add_k_proj.lora_down.weight shape=[4, 3072]
diffusion_model.transformer_blocks.0.attn.add_k_proj.lora_up.weight shape=[3072, 4]
diffusion_model.transformer_blocks.0.attn.add_q_proj.alpha shape=[]
diffusion_model.transformer_blocks.0.attn.add_q_proj.lora_down.weight shape=[4, 3072]
diffusion_model.transformer_blocks.0.attn.add_q_proj.lora_up.weight shape=[3072, 4]
diffusion_model.transformer_blocks.0.attn.add_v_proj.alpha shape=[]
diffusion_model.transformer_blocks.0.attn.add_v_proj.lora_down.weight shape=[4, 3072]
diffusion_model.transformer_blocks.0.attn.add_v_proj.lora_up.weight shape=[3072, 4]
diffusion_model.transformer_blocks.0.attn.to_add_out.alpha shape=[]
...
I don't see these alpha keys in my LoRA. The block looks like this:
diffusion_model.transformer_blocks.0.attn.add_k_proj.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.add_k_proj.lora_B.weight shape=[3072, 128]
diffusion_model.transformer_blocks.0.attn.add_q_proj.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.add_q_proj.lora_B.weight shape=[3072, 128]
diffusion_model.transformer_blocks.0.attn.add_v_proj.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.add_v_proj.lora_B.weight shape=[3072, 128]
diffusion_model.transformer_blocks.0.attn.to_add_out.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.to_add_out.lora_B.weight shape=[3072, 128]
diffusion_model.transformer_blocks.0.attn.to_k.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.to_k.lora_B.weight shape=[3072, 128]
diffusion_model.transformer_blocks.0.attn.to_out.0.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.to_out.0.lora_B.weight shape=[3072, 128]
diffusion_model.transformer_blocks.0.attn.to_q.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.to_q.lora_B.weight shape=[3072, 128]
diffusion_model.transformer_blocks.0.attn.to_v.lora_A.weight shape=[128, 3072]
diffusion_model.transformer_blocks.0.attn.to_v.lora_B.weight shape=[3072, 128]
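The two dumps differ only in naming convention: kohya-style checkpoints use `lora_down`/`lora_up` plus a scalar `alpha` per module, while diffusers/PEFT-style checkpoints use `lora_A`/`lora_B` with no alpha tensors (loaders then typically treat alpha as equal to the rank, but check your loader's convention). The shapes line up (`lora_A` is the down projection, `lora_B` the up projection), so a minimal renaming sketch is just string replacement on the keys:

```python
def peft_to_kohya_keys(state_dict_keys: list) -> list:
    """Map diffusers/PEFT-style LoRA key names (lora_A/lora_B) to
    kohya-style names (lora_down/lora_up). Note: this only renames
    keys; it does not synthesize the missing alpha tensors."""
    out = []
    for k in state_dict_keys:
        k2 = k.replace(".lora_A.weight", ".lora_down.weight")
        k2 = k2.replace(".lora_B.weight", ".lora_up.weight")
        out.append(k2)
    return out
```

This is only a sketch for inspecting/matching key names; for actual conversion between formats, use the converter your inference tool ships with if it has one.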
Hi, @Alissonerdx !
I am currently planning to develop a Hair Swap LoRA for qwen-image-edit-2511, and I have prepared a dataset of about 300 image pairs for this purpose.
I am still exploring the best approach to achieve this. If anyone has any experience, technical insights, or advice on how to tackle this, I would be very grateful if you could share your knowledge.
Thank you in advance!
I'm not @Alissonerdx, but I have been working on my own head-swap LoRA for about 4 months, and I can tell you a few things I've learned.

As is said everywhere, you need a very good dataset, and that is where 90% of the work is. Without a good dataset there is no magic that will make a good LoRA. In my case, I'm swapping heads (heads and hair, because hair influences the likeness of a person). How someone makes that dataset differs from person to person (people use different techniques). I found the easiest way for me is to start with 3D models: I pose a 3D model against a certain background, render it, then change only the head and hair of the model and render that. Rendered with the same settings and lighting, the two look identical except for the head. They also look like 3D renders, and that's where the work comes in. Making those renders look photorealistic with concrete identities is the heart of creating my dataset, and it is laborious. I've spent weeks, even months, on datasets and am constantly trying to improve them.

But once you have a good dataset, I think with any type of swap the training has to start with a high-noise bias. With 300 pairs, you probably want 1000-1500 steps of high-noise bias at a 1e-4 learn rate. At that point, check the LoRA to make sure it's getting the swap right, never mind likeness or detail. After that it becomes aesthetic and personal, what you think looks good: you move to a balanced noise bias and finally a low-noise bias. I think when dealing with people you should use sigmoid as the timestep type; sigmoid won't starve the low noise when using the high-noise bias. It sort of smooths things, but high noise still takes precedence. Also, you might want to look into using a cosine learn-rate scheduler; I've found it makes better LoRAs of people.

How do you know if you have a good dataset? You don't really know until you make a LoRA with it. Weaknesses become apparent after you use the resulting LoRAs for a bit; the more you make, the better you'll be able to judge your dataset. In my case, the LoRA gets everything right, but the skin texture is too smooth and Qwen-like. I'm working on that, and with Z-Image, SeedVR2, and Fluxmania I'm getting more skin texture into my dataset. But I might still have to move from Qwen 2509 to Flux Klein.
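The cosine learn-rate scheduler mentioned above is a standard option in most trainers; the formula behind it is just a half-cosine from the base rate down to a floor:

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Standard cosine learning-rate decay: starts at base_lr,
    ends at min_lr, with most of the decay in the middle of training."""
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The gentle start and finish are what tend to help with people: early steps keep nearly the full rate for structural learning, and the final steps take tiny updates that refine rather than disturb.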
Don't know why people train a LoRA when the model can already do 99% of it.
I've tested on 3 images, all works perfectly. ;)
preserve the overall impression of the portrait photo.
remove the hair and replace every single strand of hair with dark hair.
Now try this with objects placed on the face, or at different angles, and you'll see why people train. You've tested 3 examples out of an infinite number.
Maybe... if so, I'll use inpainting, because I'll probably want to replace a few other parts as well...