See axolotl config

axolotl version: `0.12.2`

```yaml
base_model: Coloss/Qwen3-8B-Instruct
#/leonardo_work/EUHPC_A04_045/training/model-fp32
#Coloss/Qwen3-8B-Instruct
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name
strict: false
#resume_from_checkpoint: /leonardo_work/EUHPC_A04_045/training/ale_outputs/pluto-8B-sft/checkpoint-4040 #
#auto_resume_from_checkpoints: true
#resume_from_checkpoint: /leonardo_work/EUHPC_A04_045/training/ale_outputs/pluto-8B-sft-32
#auto_resume_from_checkpoints: true
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_fused_linear_cross_entropy: true
liger_cross_entropy: false # Explicitly disabled to ensure the Fused version takes over
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
#chat_template: qwen3
datasets:
  - path: Coloss/Omnia-v5-Nesso
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
#dataset_prepared_path: ./ale_outputs/tokenized-omni-v5-v.2
dataset_prepared_path: /leonardo_work/EUHPC_A04_045/training/ale_outputs/tokenized-omnia-v6-nesso
val_set_size: 0.0005
output_dir: ./ale_outputs/pluto-8B-sft-v0.2
#do_bench_eval: true
#bench_dataset: /leonardo_work/EUHPC_A04_045/training/examples/qwen3/eval_mix_train.json
sequence_len: 8000
excess_length_strategy: truncate
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
#max_steps: 50
optimizer: adamw_torch_fused #adamw_bnb_8bit #adamw_torch_fused
lr_scheduler: cosine
learning_rate: 4e-5
bf16: auto #auto
fp16: false
tf32: true
wandb_mode: "offline"
wandb_project: pluto-8b
wandb_entity: mii-llm
wandb_name: pluto-8b-sft-v0.2
#gradient_checkpointing: true
#gradient_checkpointing_kwargs:
# use_reentrant: false
logging_steps: 1
sdp_attention: false
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 5
saves_per_epoch: 5
save_total_limit: 5
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_offload_optimizer: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT #SHARDED_STATE_DICT #FULL_STATE_DICT
  fsdp_activation_checkpointing: true
special_tokens:
```
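
To reproduce the run, the config above is handed to the Axolotl CLI. The sketch below is a minimal programmatic launch, assuming the YAML is saved as `pluto-8b-sft.yaml` (the filename is hypothetical) and that axolotl 0.12.x is installed in the active environment; on a cluster such as the one implied by the `/leonardo_work` paths, the same commands would typically sit inside a job script.

```python
# Minimal sketch: preprocess and train with the Axolotl CLI from Python.
# Assumes the config above is saved as "pluto-8b-sft.yaml" (hypothetical filename)
# and that axolotl 0.12.x is available in the environment.
import subprocess

CONFIG = "pluto-8b-sft.yaml"

# Tokenize and pack the dataset once, writing to dataset_prepared_path.
subprocess.run(["axolotl", "preprocess", CONFIG], check=True)

# Launch the multi-GPU FSDP training run described by the config.
subprocess.run(["axolotl", "train", CONFIG], check=True)
```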
# ale_outputs/pluto-8B-sft-v0.2
This model is a fine-tuned version of Coloss/Qwen3-8B-Instruct on the Coloss/Omnia-v5-Nesso dataset. It achieves the following results on the evaluation set:
- Loss: 0.6555
- Max memory active: 48.58 GiB
- Max memory allocated: 47.86 GiB
- Device memory reserved: 53.4 GiB
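
As a quick smoke test, the checkpoint can be loaded with transformers. A minimal sketch, assuming the final weights live in `./ale_outputs/pluto-8B-sft-v0.2` (the `output_dir` from the config above); substitute the Hub repo id if the model has been pushed:

```python
# Minimal sketch: load the fine-tuned checkpoint and run a single chat turn.
# The local path below is the output_dir from the config; swap in the Hub repo id
# if the model has been uploaded.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./ale_outputs/pluto-8B-sft-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Give a one-sentence summary of what you can do."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```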
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- total_eval_batch_size: 64
- optimizer: ADAMW_TORCH_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 195
- training_steps: 1950
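
The aggregate numbers above follow directly from the per-device settings; a quick arithmetic check:

```python
# Sanity-check the derived hyperparameters reported above.
micro_batch_size = 2             # per-device train batch size
eval_batch_size = 2              # per-device eval batch size
gradient_accumulation_steps = 4
num_devices = 32

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size)    # 2 * 4 * 32 = 256
print(total_eval_batch_size)     # 2 * 32 = 64

# warmup_ratio 0.1 from the config, applied to the total number of optimizer steps
training_steps = 1950
print(int(training_steps * 0.1)) # 195 = lr_scheduler_warmup_steps
```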
### Training results
| Training Loss | Epoch | Step | Validation Loss | Mem Active (GiB) | Mem Allocated (GiB) | Mem Reserved (GiB) |
|---|---|---|---|---|---|---|
| No log | 0 | 0 | 1.3858 | 48.57 | 47.85 | 49.2 |
| 0.7248 | 0.1999 | 195 | 0.7236 | 48.58 | 47.86 | 53.4 |
| 0.6917 | 0.3999 | 390 | 0.7014 | 48.58 | 47.86 | 53.4 |
| 0.6648 | 0.5998 | 585 | 0.6863 | 48.58 | 47.86 | 53.4 |
| 0.6738 | 0.7998 | 780 | 0.6747 | 48.58 | 47.86 | 53.4 |
| 0.6397 | 0.9997 | 975 | 0.6659 | 48.58 | 47.86 | 53.4 |
| 0.6131 | 1.1989 | 1170 | 0.6633 | 48.58 | 47.86 | 53.4 |
| 0.5895 | 1.3989 | 1365 | 0.6609 | 48.58 | 47.86 | 53.4 |
| 0.5819 | 1.5988 | 1560 | 0.6583 | 48.58 | 47.86 | 53.4 |
| 0.5996 | 1.7988 | 1755 | 0.6565 | 48.58 | 47.86 | 53.4 |
| 0.584 | 1.9987 | 1950 | 0.6555 | 48.58 | 47.86 | 53.4 |
### Framework versions
- Transformers 4.55.2
- PyTorch 2.6.0+cu126
- Datasets 4.0.0
- Tokenizers 0.21.4