[2024-11-13 16:42:16,554] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
[2024-11-13 16:42:19,202] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-11-13 16:42:19,202] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
11/13/2024 16:42:19 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
11/13/2024 16:42:19 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=507,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/QA2/qa_abcd_lora/runs/Nov13_16-42-19_amax,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=50.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/QA2/qa_abcd_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/QA2/qa_abcd_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=40000,
save_strategy=steps,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
11/13/2024 16:42:19 - INFO - __main__ - Loading Tokenizer: /home/amax/wjr/InternVL2-8B
[INFO|tokenization_utils_base.py:2025] 2024-11-13 16:42:19,349 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2025] 2024-11-13 16:42:19,349 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-11-13 16:42:19,349 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-11-13 16:42:19,349 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-11-13 16:42:19,349 >> loading file tokenizer.json
[WARNING|logging.py:314] 2024-11-13 16:42:19,516 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
11/13/2024 16:42:19 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2024-11-13 16:42:19,645 >> loading configuration file /home/amax/wjr/InternVL2-8B/config.json
[INFO|configuration_utils.py:792] 2024-11-13 16:42:19,646 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "llm_config": {
    "_name_or_path": "internlm/internlm2_5-7b-chat",
    "add_cross_attention": false,
    "architectures": [
      "InternLM2ForCausalLM"
    ],
    "attn_implementation": "flash_attention_2",
    "auto_map": {
      "AutoConfig": "configuration_internlm2.InternLM2Config",
      "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
      "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bias": false,
    "bos_token_id": 1,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 14336,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "min_length": 0,
    "model_type": "internlm2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 32,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 2,
    "prefix": null,
    "pretraining_tp": 1,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 2.0,
      "type": "dynamic"
    },
    "rope_theta": 1000000,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": false,
    "vocab_size": 92553
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "template": "internlm2-chat",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.1,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true
  }
}
11/13/2024 16:42:19 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3473] 2024-11-13 16:42:19,648 >> loading weights file /home/amax/wjr/InternVL2-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-11-13 16:42:19,648 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-11-13 16:42:19,650 >> Generate config GenerationConfig {}
this model
[INFO|configuration_utils.py:826] 2024-11-13 16:42:19,729 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2,
  "use_cache": false
}
motion_mlp.weight1 Parameter containing:
tensor([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], requires_grad=True)
motion_mlp.weight2 Parameter containing:
tensor([[0.0038, 0.0076, 0.0004, ..., 0.0092, 0.0042, 0.0098],
        [0.0033, 0.0081, 0.0055, ..., 0.0089, 0.0046, 0.0087],
        [0.0053, 0.0038, 0.0099, ..., 0.0016, 0.0053, 0.0082],
        ...,
        [0.0014, 0.0015, 0.0036, ..., 0.0059, 0.0014, 0.0088],
        [0.0058, 0.0034, 0.0062, ..., 0.0092, 0.0003, 0.0021],
        [0.0092, 0.0033, 0.0002, ..., 0.0039, 0.0040, 0.0054]], requires_grad=True)
motion_mlp.bias Parameter containing:
tensor([0., 0., 0., ..., 0., 0., 0.], requires_grad=True)
motion_mlp.weight1 Parameter containing:
tensor([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], requires_grad=True)
motion_mlp.weight2 Parameter containing:
tensor([[0.0087, 0.0012, 0.0095, ..., 0.0056, 0.0054, 0.0045],
        [0.0030, 0.0033, 0.0010, ..., 0.0004, 0.0079, 0.0005],
        [0.0034, 0.0050, 0.0075, ..., 0.0003, 0.0055, 0.0069],
        ...,
        [0.0020, 0.0028, 0.0082, ..., 0.0027, 0.0032, 0.0065],
        [0.0036, 0.0044, 0.0072, ..., 0.0036, 0.0079, 0.0013],
        [0.0085, 0.0073, 0.0087, ..., 0.0063, 0.0000, 0.0019]], requires_grad=True)
motion_mlp.bias Parameter containing:
tensor([0., 0., 0., ..., 0., 0., 0.], requires_grad=True)
Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing InternVLChatModel.
[WARNING|modeling_utils.py:4352] 2024-11-13 16:42:23,444 >> Some weights of InternVLChatModel were not initialized from the model checkpoint at /home/amax/wjr/InternVL2-8B and are newly initialized: ['motion_mlp.0.bias', 'motion_mlp.0.weight', 'motion_mlp.1.bias', 'motion_mlp.1.weight', 'motion_mlp.3.bias', 'motion_mlp.3.weight', 'slowfast_model.feature_extraction.0.multipathway_blocks.0.conv.weight', 'slowfast_model.feature_extraction.0.multipathway_blocks.0.norm.bias', 'slowfast_model.feature_extraction.0.multipathway_blocks.0.norm.num_batches_tracked', 'slowfast_model.feature_extraction.0.multipathway_blocks.0.norm.running_mean', 'slowfast_model.feature_extraction.0.multipathway_blocks.0.norm.running_var', 'slowfast_model.feature_extraction.0.multipathway_blocks.0.norm.weight', 'slowfast_model.feature_extraction.0.multipathway_blocks.1.conv.weight', 'slowfast_model.feature_extraction.0.multipathway_blocks.1.norm.bias', 'slowfast_model.feature_extraction.0.multipathway_blocks.1.norm.num_batches_tracked', 'slowfast_model.feature_extraction.0.multipathway_blocks.1.norm.running_mean', 'slowfast_model.feature_extraction.0.multipathway_blocks.1.norm.running_var', 'slowfast_model.feature_extraction.0.multipathway_blocks.1.norm.weight', 'slowfast_model.feature_extraction.0.multipathway_fusion.conv_fast_to_slow.weight', 'slowfast_model.feature_extraction.0.multipathway_fusion.norm.bias', 'slowfast_model.feature_extraction.0.multipathway_fusion.norm.num_batches_tracked', 'slowfast_model.feature_extraction.0.multipathway_fusion.norm.running_mean', 'slowfast_model.feature_extraction.0.multipathway_fusion.norm.running_var', 'slowfast_model.feature_extraction.0.multipathway_fusion.norm.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch1_conv.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch1_norm.bias', 
'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_mean', 
'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.1.branch2.norm_c.weight', 
'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_c.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.0.res_blocks.2.branch2.norm_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch1_conv.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch1_norm.bias', 
'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_mean', 
'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.1.branch2.norm_c.weight', 
'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_c.bias', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.1.multipathway_blocks.1.res_blocks.2.branch2.norm_c.weight', 'slowfast_model.feature_extraction.1.multipathway_fusion.conv_fast_to_slow.weight', 'slowfast_model.feature_extraction.1.multipathway_fusion.norm.bias', 
'slowfast_model.feature_extraction.1.multipathway_fusion.norm.num_batches_tracked', 'slowfast_model.feature_extraction.1.multipathway_fusion.norm.running_mean', 'slowfast_model.feature_extraction.1.multipathway_fusion.norm.running_var', 'slowfast_model.feature_extraction.1.multipathway_fusion.norm.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch1_conv.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch1_norm.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_mean', 
'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_b.weight', 
'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.1.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_c.num_batches_tracked', 
'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.2.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_c.running_var', 
'slowfast_model.feature_extraction.2.multipathway_blocks.0.res_blocks.3.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch1_conv.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch1_norm.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_b.weight', 
'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_c.num_batches_tracked', 
'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.1.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_var', 
'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.2.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.conv_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.conv_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.conv_c.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_a.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_a.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_b.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_b.weight', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_c.bias', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.2.multipathway_blocks.1.res_blocks.3.branch2.norm_c.weight', 'slowfast_model.feature_extraction.2.multipathway_fusion.conv_fast_to_slow.weight', 
'slowfast_model.feature_extraction.2.multipathway_fusion.norm.bias', 'slowfast_model.feature_extraction.2.multipathway_fusion.norm.num_batches_tracked', 'slowfast_model.feature_extraction.2.multipathway_fusion.norm.running_mean', 'slowfast_model.feature_extraction.2.multipathway_fusion.norm.running_var', 'slowfast_model.feature_extraction.2.multipathway_fusion.norm.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_conv.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_norm.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_b.num_batches_tracked', 
'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_var', 
'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.1.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_c.bias', 
'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.2.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_c.num_batches_tracked', 
'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.3.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_c.running_var', 
'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.4.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.0.res_blocks.5.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_conv.weight', 
'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_norm.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_c.num_batches_tracked', 
'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_var', 
'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.1.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.2.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.conv_a.weight', 
'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.3.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.conv_c.weight', 
'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.4.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.conv_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.conv_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.conv_c.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_a.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_a.num_batches_tracked', 
'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_a.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_b.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_b.weight', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_c.bias', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.3.multipathway_blocks.1.res_blocks.5.branch2.norm_c.weight', 'slowfast_model.feature_extraction.3.multipathway_fusion.conv_fast_to_slow.weight', 'slowfast_model.feature_extraction.3.multipathway_fusion.norm.bias', 'slowfast_model.feature_extraction.3.multipathway_fusion.norm.num_batches_tracked', 'slowfast_model.feature_extraction.3.multipathway_fusion.norm.running_mean', 'slowfast_model.feature_extraction.3.multipathway_fusion.norm.running_var', 'slowfast_model.feature_extraction.3.multipathway_fusion.norm.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_conv.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_norm.bias', 
'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_mean', 
'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.1.branch2.norm_c.weight', 
'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_c.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.0.res_blocks.2.branch2.norm_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_conv.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_norm.bias', 
'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_norm.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_norm.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_norm.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch1_norm.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.conv_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.conv_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.conv_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_a.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_b.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_c.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_mean', 
'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.0.branch2.norm_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.conv_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.conv_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.conv_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_a.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_b.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_c.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.1.branch2.norm_c.weight', 
'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.conv_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.conv_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.conv_c.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_a.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_a.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_a.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_a.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_b.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_b.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_b.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_b.weight', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_c.bias', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_c.num_batches_tracked', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_mean', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_c.running_var', 'slowfast_model.feature_extraction.4.multipathway_blocks.1.res_blocks.2.branch2.norm_c.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[INFO|configuration_utils.py:779] 2024-11-13 16:42:23,453 >> loading configuration file /home/amax/wjr/InternVL2-8B/generation_config.json
[INFO|configuration_utils.py:826] 2024-11-13 16:42:23,453 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] }
11/13/2024 16:42:23 - INFO - __main__ - Finished
11/13/2024 16:42:23 - INFO - __main__ - model.config.force_image_size: 448
11/13/2024 16:42:23 - INFO - __main__ - data_args.force_image_size: 448
11/13/2024 16:42:23 - INFO - __main__ - model.config.vision_config.image_size: 448
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] num_image_token: 256
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] dynamic_image_size: True
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] use_thumbnail: True
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
11/13/2024 16:42:23 - INFO - __main__ - Formatting inputs...Skip in lazy mode in in2
11/13/2024 16:42:23 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 4058
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] num_image_token: 256
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] dynamic_image_size: True
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] use_thumbnail: True
11/13/2024 16:42:23 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
11/13/2024 16:42:23 - INFO - __main__ - Formatting inputs...Skip in lazy mode
11/13/2024 16:42:24 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 1016 eval_dataset
trainable params: 37,748,736 || all params: 7,775,531,008 || trainable%: 0.4854811325575258
11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_A.default.weight
11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wqkv.lora_B.default.weight
11/13/2024 16:42:24 - INFO - __main__ -
language_model.base_model.model.model.layers.0.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.0.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.1.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.1.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.2.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.3.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.3.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.4.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.5.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.5.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.6.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.6.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.7.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.8.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.8.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.9.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.9.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.10.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.11.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.11.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.12.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - 
__main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.13.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - 
__main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.14.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.15.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - 
INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.16.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 
- INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.17.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.18.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 
- INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.19.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.20.feed_forward.w2.lora_B.default.weight 11/13/2024 
16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.21.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w1.lora_B.default.weight 11/13/2024 
16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.22.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.23.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wqkv.lora_B.default.weight 
11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.24.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - 
language_model.base_model.model.model.layers.25.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.25.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.26.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - 
__main__ - language_model.base_model.model.model.layers.27.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.27.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - 
INFO - __main__ - language_model.base_model.model.model.layers.28.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.29.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - 
INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.30.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wqkv.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.attention.wo.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w1.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w3.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_A.default.weight 11/13/2024 16:42:24 - INFO - __main__ - language_model.base_model.model.model.layers.31.feed_forward.w2.lora_B.default.weight 11/13/2024 16:42:24 - INFO - __main__ - mlp1.0.weight 11/13/2024 16:42:24 - INFO - __main__ - mlp1.0.bias 11/13/2024 16:42:24 - INFO - 
__main__ - mlp1.1.weight 11/13/2024 16:42:24 - INFO - __main__ - mlp1.1.bias 11/13/2024 16:42:24 - INFO - __main__ - mlp1.3.weight 11/13/2024 16:42:24 - INFO - __main__ - mlp1.3.bias 11/13/2024 16:42:24 - INFO - __main__ - motion_mlp.0.weight 11/13/2024 16:42:24 - INFO - __main__ - motion_mlp.0.bias 11/13/2024 16:42:24 - INFO - __main__ - motion_mlp.1.weight 11/13/2024 16:42:24 - INFO - __main__ - motion_mlp.1.bias 11/13/2024 16:42:24 - INFO - __main__ - motion_mlp.3.weight 11/13/2024 16:42:24 - INFO - __main__ - motion_mlp.3.bias
[INFO|trainer.py:571] 2024-11-13 16:42:24,932 >> Using auto half precision backend [2024-11-13 16:42:25,119] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown [2024-11-13 16:42:35,865] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /home/amax/.cache/torch_extensions/py39_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/amax/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers...
(overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.7072751522064209 seconds [2024-11-13 16:42:37,102] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2024-11-13 16:42:37,103] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-11-13 16:42:37,146] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2024-11-13 16:42:37,146] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2024-11-13 16:42:37,146] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2024-11-13 16:42:37,147] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2024-11-13 16:42:37,147] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2024-11-13 16:42:37,147] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2024-11-13 16:42:37,147] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False [2024-11-13 16:42:37,709] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states [2024-11-13 16:42:37,709] [INFO] [utils.py:801:see_memory_usage] MA 16.09 GB Max_MA 16.27 GB CA 16.49 GB Max_CA 16 GB [2024-11-13 16:42:37,710] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 29.19 GB, percent = 11.6% [2024-11-13 16:42:37,865] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states [2024-11-13 16:42:37,865] [INFO] [utils.py:801:see_memory_usage] MA 16.09 GB Max_MA 16.46 GB CA 16.86 GB Max_CA 17 GB [2024-11-13 16:42:37,865] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 29.19 GB, percent = 11.6% [2024-11-13 16:42:37,866] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized [2024-11-13 16:42:38,022] [INFO] [utils.py:800:see_memory_usage] 
After initializing ZeRO optimizer [2024-11-13 16:42:38,023] [INFO] [utils.py:801:see_memory_usage] MA 16.09 GB Max_MA 16.09 GB CA 16.86 GB Max_CA 17 GB [2024-11-13 16:42:38,023] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 29.2 GB, percent = 11.6% [2024-11-13 16:42:38,029] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw [2024-11-13 16:42:38,030] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2024-11-13 16:42:38,030] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2024-11-13 16:42:38,030] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2024-11-13 16:42:38,034] [INFO] [config.py:996:print] DeepSpeedEngine configuration: [2024-11-13 16:42:38,034] [INFO] [config.py:1000:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-11-13 16:42:38,034] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-11-13 16:42:38,034] [INFO] [config.py:1000:print] amp_enabled .................. False [2024-11-13 16:42:38,034] [INFO] [config.py:1000:print] amp_params ................... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] bfloat16_enabled ............. True [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] comms_config ................. [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] communication_data_type ...... None [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={} [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] dataloader_drop_last ......... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] disable_allgather ............ False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] dump_state ................... 
False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] elasticity_enabled ........... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] fp16_auto_cast ............... None [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] fp16_enabled ................. False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] global_rank .................. 0 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] grad_accum_dtype ............. None [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 1 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] graph_harvesting ............. 
False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 1 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] load_universal_checkpoint .... False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] loss_scale ................... 1.0 [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] memory_breakdown ............. False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False [2024-11-13 16:42:38,035] [INFO] [config.py:1000:print] mics_shard_size .............. -1 [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] optimizer_name ............... adamw [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] pld_enabled .................. False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] pld_params ................... False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] prescale_gradients ........... False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] scheduler_name ............... None [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] scheduler_params ............. None [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32 [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] sparse_attention ............. None [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] steps_per_print .............. inf [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] train_batch_size ............. 4 [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 4 [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] use_node_local_storage ....... False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] wall_clock_breakdown ......... True [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] weight_quantization_config ... None [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] world_size ................... 1 [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] zero_enabled ................. True [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True [2024-11-13 16:42:38,036] [INFO] [config.py:1000:print] zero_optimization_stage ...... 
1 [2024-11-13 16:42:38,036] [INFO] [config.py:986:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 4, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2024-11-13 16:42:38,036 >> ***** Running training ***** [INFO|trainer.py:1722] 2024-11-13 16:42:38,037 >> Num examples = 4,058 [INFO|trainer.py:1723] 2024-11-13 16:42:38,037 >> Num Epochs = 50 [INFO|trainer.py:1724] 2024-11-13 16:42:38,037 >> Instantaneous batch size per device = 4 [INFO|trainer.py:1727] 2024-11-13 16:42:38,037 >> Total train batch size (w. parallel, distributed & accumulation) = 4 [INFO|trainer.py:1728] 2024-11-13 16:42:38,037 >> Gradient Accumulation steps = 1 [INFO|trainer.py:1729] 2024-11-13 16:42:38,037 >> Total optimization steps = 50,750 [INFO|trainer.py:1730] 2024-11-13 16:42:38,041 >> Number of trainable parameters = 97,546,752 0%| | 0/50750 [00:00> Saving model checkpoint to work_dirs/QA2/qa_abcd_lora [INFO|configuration_utils.py:473] 2024-11-13 18:08:45,484 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/config.json [INFO|configuration_utils.py:594] 2024-11-13 18:08:45,484 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/generation_config.json [INFO|modeling_utils.py:2501] 2024-11-13 18:09:17,183 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. 
You can find where each parameters has been saved in the index located at work_dirs/QA2/qa_abcd_lora/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2433] 2024-11-13 18:09:17,185 >> tokenizer config file saved in work_dirs/QA2/qa_abcd_lora/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2024-11-13 18:09:17,185 >> Special tokens file saved in work_dirs/QA2/qa_abcd_lora/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2024-11-13 18:09:17,185 >> added tokens file saved in work_dirs/QA2/qa_abcd_lora/added_tokens.json 11/13/2024 18:09:18 - INFO - __main__ - Saved LoRA weights to work_dirs/QA2/qa_abcd_lora/lora_weights.pth dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:09:24,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-13 18:09:24,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1997.28 | bwd_microstep: 3808.98 | bwd_inner_microstep: 3801.17 | bwd_allreduce_microstep: 7.75 | step_microstep: 24.58 [2024-11-13 18:09:24,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1997.26 | bwd: 3809.00 | bwd_inner: 3801.17 | bwd_allreduce: 7.78 | step: 24.58 1%| | 508/50750 [1:26:46<9123:19:19, 653.72s/it] {'loss': 0.2419, 'learning_rate': 1.3342087984241629e-05, 'epoch': 0.5} 1%| | 508/50750 [1:26:46<9123:19:19, 653.72s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:09:30,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:09:30,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2006.35 | bwd_microstep: 3830.28 | bwd_inner_microstep: 3822.77 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.13 [2024-11-13 18:09:30,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2006.34 | bwd: 
3830.29 | bwd_inner: 3822.77 | bwd_allreduce: 7.48 | step: 21.14 1%| | 509/50750 [1:26:52<6410:50:23, 459.37s/it] {'loss': 0.004, 'learning_rate': 1.3368351936966514e-05, 'epoch': 0.5} 1%| | 509/50750 [1:26:52<6410:50:23, 459.37s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:09:36,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:09:36,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.46 | bwd_microstep: 3838.07 | bwd_inner_microstep: 3830.58 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.35 [2024-11-13 18:09:36,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.46 | bwd: 3838.08 | bwd_inner: 3830.58 | bwd_allreduce: 7.46 | step: 21.36 1%| | 510/50750 [1:26:58<4512:12:49, 323.33s/it] {'loss': 0.0563, 'learning_rate': 1.33946158896914e-05, 'epoch': 0.5} 1%| | 510/50750 [1:26:58<4512:12:49, 323.33s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:09:42,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:09:42,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2014.41 | bwd_microstep: 3842.46 | bwd_inner_microstep: 3834.75 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.68 [2024-11-13 18:09:42,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2014.41 | bwd: 3842.48 | bwd_inner: 3834.75 | bwd_allreduce: 7.68 | step: 21.68 1%| | 511/50750 [1:27:04<3183:13:10, 228.10s/it] {'loss': 0.0006, 'learning_rate': 1.3420879842416285e-05, 'epoch': 0.5} 1%| | 511/50750 [1:27:04<3183:13:10, 228.10s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:09:48,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.10 | optimizer_step: 4.94 [2024-11-13 18:09:48,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.61 | bwd_microstep: 3846.05 | bwd_inner_microstep: 3838.51 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-13 18:09:48,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.61 | bwd: 3846.06 | bwd_inner: 3838.51 | bwd_allreduce: 7.51 | step: 21.21 1%| | 512/50750 [1:27:10<2252:57:05, 161.44s/it] {'loss': 0.5178, 'learning_rate': 1.344714379514117e-05, 'epoch': 0.5} 1%| | 512/50750 [1:27:10<2252:57:05, 161.44s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:09:54,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 18:09:54,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.15 | bwd_microstep: 3862.33 | bwd_inner_microstep: 3854.56 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.91 [2024-11-13 18:09:54,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.15 | bwd: 3862.34 | bwd_inner: 3854.56 | bwd_allreduce: 7.74 | step: 21.91 1%| | 513/50750 [1:27:16<1601:54:49, 114.79s/it] {'loss': 0.8057, 'learning_rate': 1.3473407747866055e-05, 'epoch': 0.51} 1%| | 513/50750 [1:27:16<1601:54:49, 114.79s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:10:00,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:10:00,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.12 | bwd_microstep: 3852.31 | bwd_inner_microstep: 3844.77 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.29 [2024-11-13 18:10:00,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.12 | bwd: 3852.32 | bwd_inner: 3844.77 | bwd_allreduce: 7.51 | step: 21.30 1%| | 
514/50750 [1:27:22<1146:09:25, 82.14s/it] {'loss': 0.9521, 'learning_rate': 1.349967170059094e-05, 'epoch': 0.51} 1%| | 514/50750 [1:27:22<1146:09:25, 82.14s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:10:06,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:10:06,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2036.71 | bwd_microstep: 3850.57 | bwd_inner_microstep: 3842.99 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.44 [2024-11-13 18:10:06,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2036.70 | bwd: 3850.59 | bwd_inner: 3842.99 | bwd_allreduce: 7.55 | step: 21.44 1%| | 515/50750 [1:27:28<827:10:02, 59.28s/it] {'loss': 0.0009, 'learning_rate': 1.3525935653315826e-05, 'epoch': 0.51} 1%| | 515/50750 [1:27:28<827:10:02, 59.28s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:10:12,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.94 [2024-11-13 18:10:12,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2037.81 | bwd_microstep: 3846.19 | bwd_inner_microstep: 3838.63 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.28 [2024-11-13 18:10:12,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2037.81 | bwd: 3846.20 | bwd_inner: 3838.63 | bwd_allreduce: 7.53 | step: 21.28 1%| | 516/50750 [1:27:34<603:50:39, 43.27s/it] {'loss': 0.006, 'learning_rate': 1.355219960604071e-05, 'epoch': 0.51} 1%| | 516/50750 [1:27:34<603:50:39, 43.27s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:10:18,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:10:18,112] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2038.76 | bwd_microstep: 3851.24 | bwd_inner_microstep: 3843.68 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.66 [2024-11-13 18:10:18,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2038.74 | bwd: 3851.25 | bwd_inner: 3843.68 | bwd_allreduce: 7.53 | step: 21.67 1%| | 517/50750 [1:27:40<447:34:04, 32.08s/it] {'loss': 0.1867, 'learning_rate': 1.3578463558765595e-05, 'epoch': 0.51} 1%| | 517/50750 [1:27:40<447:34:04, 32.08s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:10:24,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:10:24,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2041.28 | bwd_microstep: 3848.78 | bwd_inner_microstep: 3841.20 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.19 [2024-11-13 18:10:24,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2041.27 | bwd: 3848.79 | bwd_inner: 3841.20 | bwd_allreduce: 7.56 | step: 21.19 1%| | 518/50750 [1:27:46<338:08:52, 24.23s/it] {'loss': 0.0951, 'learning_rate': 1.360472751149048e-05, 'epoch': 0.51} 1%| | 518/50750 [1:27:46<338:08:52, 24.23s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:10:29,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:10:29,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2038.38 | bwd_microstep: 3851.77 | bwd_inner_microstep: 3844.02 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.89 [2024-11-13 18:10:29,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2038.38 | bwd: 3851.78 | bwd_inner: 3844.02 | bwd_allreduce: 7.72 | step: 21.90 1%| | 519/50750 [1:27:51<261:33:46, 18.75s/it] {'loss': 0.5961, 'learning_rate': 
1.3630991464215366e-05, 'epoch': 0.51} 1%| | 519/50750 [1:27:51<261:33:46, 18.75s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:10:35,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 18:10:35,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2041.00 | bwd_microstep: 3849.85 | bwd_inner_microstep: 3842.26 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.72 [2024-11-13 18:10:35,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2040.98 | bwd: 3849.87 | bwd_inner: 3842.26 | bwd_allreduce: 7.57 | step: 21.73 1%| | 520/50750 [1:27:57<207:57:49, 14.90s/it] {'loss': 1.0325, 'learning_rate': 1.3657255416940251e-05, 'epoch': 0.51} 1%| | 520/50750 [1:27:57<207:57:49, 14.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:10:41,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 5.04 [2024-11-13 18:10:41,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2039.44 | bwd_microstep: 3861.51 | bwd_inner_microstep: 3854.00 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.55 [2024-11-13 18:10:41,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2039.44 | bwd: 3861.52 | bwd_inner: 3854.00 | bwd_allreduce: 7.48 | step: 21.55 1%| | 521/50750 [1:28:03<170:30:04, 12.22s/it] {'loss': 0.0131, 'learning_rate': 1.3683519369665135e-05, 'epoch': 0.51} 1%| | 521/50750 [1:28:03<170:30:04, 12.22s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:10:47,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:10:47,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2037.04 | bwd_microstep: 
3838.68 | bwd_inner_microstep: 3831.17 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.17 [2024-11-13 18:10:47,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2037.02 | bwd: 3838.69 | bwd_inner: 3831.18 | bwd_allreduce: 7.48 | step: 21.17 1%| | 522/50750 [1:28:09<144:09:00, 10.33s/it] {'loss': 0.0069, 'learning_rate': 1.370978332239002e-05, 'epoch': 0.51} 1%| | 522/50750 [1:28:09<144:09:00, 10.33s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:10:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:10:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.47 | bwd_microstep: 3838.71 | bwd_inner_microstep: 3831.17 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.48 [2024-11-13 18:10:53,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.46 | bwd: 3838.72 | bwd_inner: 3831.17 | bwd_allreduce: 7.52 | step: 21.48 1%| | 523/50750 [1:28:15<125:37:36, 9.00s/it] {'loss': 0.0012, 'learning_rate': 1.3736047275114906e-05, 'epoch': 0.52} 1%| | 523/50750 [1:28:15<125:37:36, 9.00s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:10:59,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 18:10:59,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.71 | bwd_microstep: 3841.51 | bwd_inner_microstep: 3832.69 | bwd_allreduce_microstep: 8.77 | step_microstep: 22.38 [2024-11-13 18:10:59,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.71 | bwd: 3841.52 | bwd_inner: 3832.69 | bwd_allreduce: 8.79 | step: 22.38 1%| | 524/50750 [1:28:21<112:41:57, 8.08s/it] {'loss': 0.5542, 'learning_rate': 1.3762311227839791e-05, 'epoch': 0.52} 1%| | 524/50750 [1:28:21<112:41:57, 8.08s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:11:05,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:11:05,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.43 | bwd_microstep: 3853.04 | bwd_inner_microstep: 3845.56 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.36 [2024-11-13 18:11:05,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.39 | bwd: 3853.05 | bwd_inner: 3845.56 | bwd_allreduce: 7.45 | step: 21.37 1%| | 525/50750 [1:28:27<103:41:27, 7.43s/it] {'loss': 0.1112, 'learning_rate': 1.3788575180564677e-05, 'epoch': 0.52} 1%| | 525/50750 [1:28:27<103:41:27, 7.43s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 18:11:11,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:11:11,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.30 | bwd_microstep: 3848.06 | bwd_inner_microstep: 3840.50 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.95 [2024-11-13 18:11:11,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.30 | bwd: 3848.07 | bwd_inner: 3840.50 | bwd_allreduce: 7.53 | step: 21.95 1%| | 526/50750 [1:28:33<97:20:23, 6.98s/it] {'loss': 0.0055, 'learning_rate': 1.3814839133289562e-05, 'epoch': 0.52} 1%| | 526/50750 [1:28:33<97:20:23, 6.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:11:17,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:11:17,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.95 | bwd_microstep: 3839.59 | bwd_inner_microstep: 3831.92 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.49 [2024-11-13 
18:11:17,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.95 | bwd: 3839.60 | bwd_inner: 3831.92 | bwd_allreduce: 7.64 | step: 21.50 1%| | 527/50750 [1:28:39<92:52:03, 6.66s/it] {'loss': 0.0893, 'learning_rate': 1.3841103086014446e-05, 'epoch': 0.52} 1%| | 527/50750 [1:28:39<92:52:03, 6.66s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:11:23,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:11:23,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3846.96 | bwd_inner_microstep: 3839.47 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.98 [2024-11-13 18:11:23,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.78 | bwd: 3846.97 | bwd_inner: 3839.47 | bwd_allreduce: 7.46 | step: 20.98 1%| | 528/50750 [1:28:45<89:46:25, 6.44s/it] {'loss': 0.0026, 'learning_rate': 1.3867367038739332e-05, 'epoch': 0.52} 1%| | 528/50750 [1:28:45<89:46:25, 6.44s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:11:29,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.92 [2024-11-13 18:11:29,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.19 | bwd_microstep: 3843.07 | bwd_inner_microstep: 3835.06 | bwd_allreduce_microstep: 7.91 | step_microstep: 22.48 [2024-11-13 18:11:29,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.19 | bwd: 3843.09 | bwd_inner: 3835.06 | bwd_allreduce: 7.94 | step: 22.47 1%| | 529/50750 [1:28:51<87:37:05, 6.28s/it] {'loss': 0.3674, 'learning_rate': 1.3893630991464217e-05, 'epoch': 0.52} 1%| | 529/50750 [1:28:51<87:37:05, 6.28s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:11:35,143] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:11:35,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.82 | bwd_microstep: 3843.94 | bwd_inner_microstep: 3836.45 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.13 [2024-11-13 18:11:35,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.81 | bwd: 3843.95 | bwd_inner: 3836.45 | bwd_allreduce: 7.46 | step: 21.13 1%| | 530/50750 [1:28:57<86:05:27, 6.17s/it] {'loss': 0.0064, 'learning_rate': 1.39198949441891e-05, 'epoch': 0.52} 1%| | 530/50750 [1:28:57<86:05:27, 6.17s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:11:41,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:11:41,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.67 | bwd_microstep: 3853.30 | bwd_inner_microstep: 3845.78 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.68 [2024-11-13 18:11:41,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.67 | bwd: 3853.31 | bwd_inner: 3845.78 | bwd_allreduce: 7.49 | step: 21.68 1%| | 531/50750 [1:29:03<85:04:00, 6.10s/it] {'loss': 0.2734, 'learning_rate': 1.3946158896913986e-05, 'epoch': 0.52} 1%| | 531/50750 [1:29:03<85:04:00, 6.10s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:11:46,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.98 [2024-11-13 18:11:46,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.74 | bwd_microstep: 3847.85 | bwd_inner_microstep: 3840.34 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.48 [2024-11-13 18:11:46,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.74 | bwd: 3847.86 | 
bwd_inner: 3840.34 | bwd_allreduce: 7.48 | step: 21.48 1%| | 532/50750 [1:29:08<84:19:21, 6.04s/it] {'loss': 0.0886, 'learning_rate': 1.3972422849638872e-05, 'epoch': 0.52} 1%| | 532/50750 [1:29:08<84:19:21, 6.04s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:11:52,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:11:52,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.73 | bwd_microstep: 3845.92 | bwd_inner_microstep: 3838.38 | bwd_allreduce_microstep: 7.50 | step_microstep: 23.53 [2024-11-13 18:11:52,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.73 | bwd: 3845.93 | bwd_inner: 3838.38 | bwd_allreduce: 7.52 | step: 23.53 1%| | 533/50750 [1:29:14<83:47:22, 6.01s/it] {'loss': 0.5049, 'learning_rate': 1.3998686802363757e-05, 'epoch': 0.53} 1%| | 533/50750 [1:29:14<83:47:22, 6.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:11:58,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 18:11:58,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.83 | bwd_microstep: 3848.81 | bwd_inner_microstep: 3841.26 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.76 [2024-11-13 18:11:58,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.82 | bwd: 3848.82 | bwd_inner: 3841.26 | bwd_allreduce: 7.52 | step: 21.76 1%| | 534/50750 [1:29:20<83:26:57, 5.98s/it] {'loss': 0.0239, 'learning_rate': 1.4024950755088643e-05, 'epoch': 0.53} 1%| | 534/50750 [1:29:20<83:26:57, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:12:04,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 
4.93 [2024-11-13 18:12:04,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.54 | bwd_microstep: 3841.80 | bwd_inner_microstep: 3834.29 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.48 [2024-11-13 18:12:04,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.53 | bwd: 3841.81 | bwd_inner: 3834.29 | bwd_allreduce: 7.48 | step: 21.48 1%| | 535/50750 [1:29:26<83:10:21, 5.96s/it] {'loss': 0.3534, 'learning_rate': 1.4051214707813528e-05, 'epoch': 0.53} 1%| | 535/50750 [1:29:26<83:10:21, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:12:10,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.57 | optimizer_step: 4.93 [2024-11-13 18:12:10,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.87 | bwd_microstep: 3850.19 | bwd_inner_microstep: 3842.58 | bwd_allreduce_microstep: 7.56 | step_microstep: 24.17 [2024-11-13 18:12:10,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.86 | bwd: 3850.20 | bwd_inner: 3842.58 | bwd_allreduce: 7.58 | step: 24.17 1%| | 536/50750 [1:29:32<83:01:47, 5.95s/it] {'loss': 0.4735, 'learning_rate': 1.4077478660538414e-05, 'epoch': 0.53} 1%| | 536/50750 [1:29:32<83:01:47, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:12:16,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:12:16,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.25 | bwd_microstep: 3852.09 | bwd_inner_microstep: 3844.61 | bwd_allreduce_microstep: 7.44 | step_microstep: 23.92 [2024-11-13 18:12:16,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.25 | bwd: 3852.11 | bwd_inner: 3844.61 | bwd_allreduce: 7.45 | step: 23.93 1%| | 537/50750 [1:29:38<82:55:27, 5.95s/it] {'loss': 0.045, 
'learning_rate': 1.4103742613263299e-05, 'epoch': 0.53} 1%| | 537/50750 [1:29:38<82:55:27, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:12:22,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:12:22,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | bwd_microstep: 3850.50 | bwd_inner_microstep: 3842.92 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.99 [2024-11-13 18:12:22,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.92 | bwd: 3850.52 | bwd_inner: 3842.92 | bwd_allreduce: 7.56 | step: 22.00 1%| | 538/50750 [1:29:44<82:50:22, 5.94s/it] {'loss': 0.6111, 'learning_rate': 1.4130006565988181e-05, 'epoch': 0.53} 1%| | 538/50750 [1:29:44<82:50:22, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:12:28,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.43 | optimizer_step: 4.93 [2024-11-13 18:12:28,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.90 | bwd_microstep: 3853.47 | bwd_inner_microstep: 3845.86 | bwd_allreduce_microstep: 7.56 | step_microstep: 23.79 [2024-11-13 18:12:28,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.88 | bwd: 3853.48 | bwd_inner: 3845.86 | bwd_allreduce: 7.58 | step: 23.81 1%| | 539/50750 [1:29:50<82:48:56, 5.94s/it] {'loss': 0.0015, 'learning_rate': 1.4156270518713067e-05, 'epoch': 0.53} 1%| | 539/50750 [1:29:50<82:48:56, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:12:34,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.33 | optimizer_step: 4.92 [2024-11-13 18:12:34,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.39 | 
bwd_microstep: 3852.11 | bwd_inner_microstep: 3843.96 | bwd_allreduce_microstep: 8.10 | step_microstep: 24.08 [2024-11-13 18:12:34,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.38 | bwd: 3852.13 | bwd_inner: 3843.96 | bwd_allreduce: 8.12 | step: 24.08 1%| | 540/50750 [1:29:56<82:49:10, 5.94s/it] {'loss': 0.0318, 'learning_rate': 1.4182534471437952e-05, 'epoch': 0.53} 1%| | 540/50750 [1:29:56<82:49:10, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:12:40,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:12:40,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.84 | bwd_microstep: 3842.93 | bwd_inner_microstep: 3835.32 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.62 [2024-11-13 18:12:40,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3842.95 | bwd_inner: 3835.32 | bwd_allreduce: 7.58 | step: 21.62 1%| | 541/50750 [1:30:02<82:44:38, 5.93s/it] {'loss': 0.0039, 'learning_rate': 1.4208798424162837e-05, 'epoch': 0.53} 1%| | 541/50750 [1:30:02<82:44:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:12:46,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:12:46,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.53 | bwd_microstep: 3849.09 | bwd_inner_microstep: 3841.34 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.76 [2024-11-13 18:12:46,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.51 | bwd: 3849.11 | bwd_inner: 3841.34 | bwd_allreduce: 7.71 | step: 22.76 1%| | 542/50750 [1:30:08<82:43:23, 5.93s/it] {'loss': 0.0143, 'learning_rate': 1.4235062376887723e-05, 'epoch': 0.53} 1%| | 542/50750 [1:30:08<82:43:23, 5.93s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:12:52,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:12:52,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.18 | bwd_microstep: 3840.93 | bwd_inner_microstep: 3833.37 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.64 [2024-11-13 18:12:52,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.16 | bwd: 3840.95 | bwd_inner: 3833.37 | bwd_allreduce: 7.54 | step: 21.64 1%| | 543/50750 [1:30:14<82:40:27, 5.93s/it] {'loss': 0.0014, 'learning_rate': 1.4261326329612608e-05, 'epoch': 0.53} 1%| | 543/50750 [1:30:14<82:40:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:12:58,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:12:58,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.67 | bwd_microstep: 3841.49 | bwd_inner_microstep: 3834.02 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.92 [2024-11-13 18:12:58,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.67 | bwd: 3841.50 | bwd_inner: 3834.02 | bwd_allreduce: 7.44 | step: 20.93 1%| | 544/50750 [1:30:20<82:37:32, 5.92s/it] {'loss': 0.0028, 'learning_rate': 1.4287590282337494e-05, 'epoch': 0.54} 1%| | 544/50750 [1:30:20<82:37:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:13:04,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:13:04,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3854.16 | bwd_inner_microstep: 3846.60 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.06 [2024-11-13 
18:13:04,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.24 | bwd: 3854.18 | bwd_inner: 3846.60 | bwd_allreduce: 7.54 | step: 22.06 1%| | 545/50750 [1:30:25<82:38:51, 5.93s/it] {'loss': 0.0113, 'learning_rate': 1.431385423506238e-05, 'epoch': 0.54} 1%| | 545/50750 [1:30:25<82:38:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:13:09,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:13:09,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.26 | bwd_microstep: 3866.18 | bwd_inner_microstep: 3858.69 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.12 [2024-11-13 18:13:09,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.25 | bwd: 3866.20 | bwd_inner: 3858.69 | bwd_allreduce: 7.47 | step: 21.12 1%| | 546/50750 [1:30:31<82:41:25, 5.93s/it] {'loss': 0.5571, 'learning_rate': 1.4340118187787265e-05, 'epoch': 0.54} 1%| | 546/50750 [1:30:31<82:41:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:13:15,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:13:15,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | bwd_microstep: 3851.93 | bwd_inner_microstep: 3844.41 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 18:13:15,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.23 | bwd: 3851.94 | bwd_inner: 3844.41 | bwd_allreduce: 7.50 | step: 21.12 1%| | 547/50750 [1:30:37<82:40:51, 5.93s/it] {'loss': 0.0313, 'learning_rate': 1.4366382140512147e-05, 'epoch': 0.54} 1%| | 547/50750 [1:30:37<82:40:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:13:21,813] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:13:21,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.69 | bwd_microstep: 3850.94 | bwd_inner_microstep: 3843.40 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.04 [2024-11-13 18:13:21,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3850.95 | bwd_inner: 3843.40 | bwd_allreduce: 7.51 | step: 21.04 1%| | 548/50750 [1:30:43<82:39:40, 5.93s/it] {'loss': 0.0838, 'learning_rate': 1.4392646093237032e-05, 'epoch': 0.54} 1%| | 548/50750 [1:30:43<82:39:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:13:27,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:13:27,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.53 | bwd_microstep: 3848.45 | bwd_inner_microstep: 3840.90 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.17 [2024-11-13 18:13:27,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.53 | bwd: 3848.46 | bwd_inner: 3840.90 | bwd_allreduce: 7.52 | step: 21.17 1%| | 549/50750 [1:30:49<82:37:56, 5.93s/it] {'loss': 0.0398, 'learning_rate': 1.4418910045961918e-05, 'epoch': 0.54} 1%| | 549/50750 [1:30:49<82:37:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:13:33,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:13:33,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3855.61 | bwd_inner_microstep: 3848.01 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.24 [2024-11-13 18:13:33,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3855.62 | bwd_inner: 3848.01 | 
bwd_allreduce: 7.57 | step: 21.25 1%| | 550/50750 [1:30:55<82:38:14, 5.93s/it] {'loss': 0.0014, 'learning_rate': 1.4445173998686803e-05, 'epoch': 0.54} 1%| | 550/50750 [1:30:55<82:38:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:13:39,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 18:13:39,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.29 | bwd_microstep: 3851.91 | bwd_inner_microstep: 3844.26 | bwd_allreduce_microstep: 7.60 | step_microstep: 23.90 [2024-11-13 18:13:39,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.29 | bwd: 3851.93 | bwd_inner: 3844.26 | bwd_allreduce: 7.62 | step: 23.90 1%| | 551/50750 [1:31:01<82:40:00, 5.93s/it] {'loss': 0.6791, 'learning_rate': 1.4471437951411689e-05, 'epoch': 0.54} 1%| | 551/50750 [1:31:01<82:40:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:13:45,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.58 | optimizer_step: 4.93 [2024-11-13 18:13:45,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.20 | bwd_microstep: 3852.11 | bwd_inner_microstep: 3844.23 | bwd_allreduce_microstep: 7.83 | step_microstep: 24.33 [2024-11-13 18:13:45,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.20 | bwd: 3852.13 | bwd_inner: 3844.23 | bwd_allreduce: 7.85 | step: 24.34 1%| | 552/50750 [1:31:07<82:42:59, 5.93s/it] {'loss': 0.0012, 'learning_rate': 1.4497701904136574e-05, 'epoch': 0.54} 1%| | 552/50750 [1:31:07<82:42:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:13:51,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-13 
18:13:51,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.34 | bwd_microstep: 3843.05 | bwd_inner_microstep: 3834.85 | bwd_allreduce_microstep: 8.13 | step_microstep: 25.73 [2024-11-13 18:13:51,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.32 | bwd: 3843.07 | bwd_inner: 3834.85 | bwd_allreduce: 8.16 | step: 25.73 1%| | 553/50750 [1:31:13<82:42:36, 5.93s/it] {'loss': 0.0162, 'learning_rate': 1.452396585686146e-05, 'epoch': 0.54} 1%| | 553/50750 [1:31:13<82:42:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:13:57,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:13:57,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.63 | bwd_microstep: 3837.55 | bwd_inner_microstep: 3829.71 | bwd_allreduce_microstep: 7.80 | step_microstep: 21.90 [2024-11-13 18:13:57,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.62 | bwd: 3837.57 | bwd_inner: 3829.71 | bwd_allreduce: 7.82 | step: 21.90 1%| | 554/50750 [1:31:19<82:40:11, 5.93s/it] {'loss': 0.0004, 'learning_rate': 1.4550229809586345e-05, 'epoch': 0.55} 1%| | 554/50750 [1:31:19<82:40:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:14:03,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-13 18:14:03,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.29 | bwd_microstep: 3849.49 | bwd_inner_microstep: 3841.62 | bwd_allreduce_microstep: 7.80 | step_microstep: 25.27 [2024-11-13 18:14:03,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.27 | bwd: 3849.51 | bwd_inner: 3841.62 | bwd_allreduce: 7.83 | step: 25.26 1%| | 555/50750 [1:31:25<82:42:01, 5.93s/it] {'loss': 1.2837, 'learning_rate': 
1.4576493762311227e-05, 'epoch': 0.55} 1%| | 555/50750 [1:31:25<82:42:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:14:09,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:14:09,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.58 | bwd_microstep: 3856.76 | bwd_inner_microstep: 3849.28 | bwd_allreduce_microstep: 7.44 | step_microstep: 22.07 [2024-11-13 18:14:09,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.58 | bwd: 3856.78 | bwd_inner: 3849.28 | bwd_allreduce: 7.46 | step: 22.07 1%| | 556/50750 [1:31:31<82:45:14, 5.94s/it] {'loss': 0.0025, 'learning_rate': 1.4602757715036113e-05, 'epoch': 0.55} 1%| | 556/50750 [1:31:31<82:45:14, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:14:15,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 18:14:15,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.72 | bwd_microstep: 3848.38 | bwd_inner_microstep: 3840.68 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.71 [2024-11-13 18:14:15,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.71 | bwd: 3848.40 | bwd_inner: 3840.68 | bwd_allreduce: 7.67 | step: 21.72 1%| | 557/50750 [1:31:37<82:43:44, 5.93s/it] {'loss': 0.0004, 'learning_rate': 1.4629021667760998e-05, 'epoch': 0.55} 1%| | 557/50750 [1:31:37<82:43:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:14:21,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-13 18:14:21,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.71 | bwd_microstep: 3854.04 | 
bwd_inner_microstep: 3846.46 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.96 [2024-11-13 18:14:21,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.71 | bwd: 3854.06 | bwd_inner: 3846.46 | bwd_allreduce: 7.56 | step: 21.96 1%| | 558/50750 [1:31:43<82:43:30, 5.93s/it] {'loss': 0.0014, 'learning_rate': 1.4655285620485884e-05, 'epoch': 0.55} 1%| | 558/50750 [1:31:43<82:43:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:14:27,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:14:27,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.17 | bwd_microstep: 3853.03 | bwd_inner_microstep: 3845.20 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.64 [2024-11-13 18:14:27,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.16 | bwd: 3853.05 | bwd_inner: 3845.20 | bwd_allreduce: 7.80 | step: 21.64 1%| | 559/50750 [1:31:49<82:44:14, 5.93s/it] {'loss': 1.0996, 'learning_rate': 1.4681549573210769e-05, 'epoch': 0.55} 1%| | 559/50750 [1:31:49<82:44:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:14:33,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 5.08 [2024-11-13 18:14:33,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.27 | bwd_microstep: 3857.82 | bwd_inner_microstep: 3850.30 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.71 [2024-11-13 18:14:33,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.25 | bwd: 3857.83 | bwd_inner: 3850.30 | bwd_allreduce: 7.50 | step: 21.71 1%| | 560/50750 [1:31:54<82:45:32, 5.94s/it] {'loss': 0.0003, 'learning_rate': 1.4707813525935654e-05, 'epoch': 0.55} 1%| | 560/50750 [1:31:54<82:45:32, 5.94s/it]dynamic ViT batch size: 32, images per 
sample: 8.0, dynamic token length: 2201 [2024-11-13 18:14:38,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:14:38,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.10 | bwd_microstep: 3849.57 | bwd_inner_microstep: 3842.07 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.18 [2024-11-13 18:14:38,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.10 | bwd: 3849.59 | bwd_inner: 3842.07 | bwd_allreduce: 7.48 | step: 21.18 1%| | 561/50750 [1:32:00<82:42:39, 5.93s/it] {'loss': 0.0006, 'learning_rate': 1.473407747866054e-05, 'epoch': 0.55} 1%| | 561/50750 [1:32:00<82:42:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:14:44,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.92 [2024-11-13 18:14:44,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.23 | bwd_microstep: 3866.95 | bwd_inner_microstep: 3859.16 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.49 [2024-11-13 18:14:44,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.23 | bwd: 3866.97 | bwd_inner: 3859.16 | bwd_allreduce: 7.76 | step: 22.49 1%| | 562/50750 [1:32:06<82:46:19, 5.94s/it] {'loss': 0.0523, 'learning_rate': 1.4760341431385425e-05, 'epoch': 0.55} 1%| | 562/50750 [1:32:06<82:46:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:14:50,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:14:50,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.77 | bwd_microstep: 3862.97 | bwd_inner_microstep: 3855.24 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.07 [2024-11-13 18:14:50,831] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.75 | bwd: 3862.99 | bwd_inner: 3855.24 | bwd_allreduce: 7.71 | step: 22.07 1%| | 563/50750 [1:32:12<82:49:58, 5.94s/it] {'loss': 0.0002, 'learning_rate': 1.4786605384110311e-05, 'epoch': 0.55} 1%| | 563/50750 [1:32:12<82:49:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:14:56,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.92 [2024-11-13 18:14:56,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.94 | bwd_microstep: 3850.95 | bwd_inner_microstep: 3843.49 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.79 [2024-11-13 18:14:56,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.94 | bwd: 3850.97 | bwd_inner: 3843.49 | bwd_allreduce: 7.44 | step: 20.80 1%| | 564/50750 [1:32:18<82:46:54, 5.94s/it] {'loss': 0.0031, 'learning_rate': 1.4812869336835195e-05, 'epoch': 0.56} 1%| | 564/50750 [1:32:18<82:46:54, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:15:02,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:15:02,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.10 | bwd_microstep: 3859.47 | bwd_inner_microstep: 3851.86 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.41 [2024-11-13 18:15:02,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.10 | bwd: 3859.49 | bwd_inner: 3851.86 | bwd_allreduce: 7.58 | step: 21.41 1%| | 565/50750 [1:32:24<82:47:02, 5.94s/it] {'loss': 0.2133, 'learning_rate': 1.483913328956008e-05, 'epoch': 0.56} 1%| | 565/50750 [1:32:24<82:47:02, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:15:08,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:15:08,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.84 | bwd_microstep: 3847.18 | bwd_inner_microstep: 3838.28 | bwd_allreduce_microstep: 8.85 | step_microstep: 22.43 [2024-11-13 18:15:08,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.84 | bwd: 3847.20 | bwd_inner: 3838.28 | bwd_allreduce: 8.87 | step: 22.44 1%| | 566/50750 [1:32:30<82:43:07, 5.93s/it] {'loss': 0.0538, 'learning_rate': 1.4865397242284964e-05, 'epoch': 0.56} 1%| | 566/50750 [1:32:30<82:43:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:15:14,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 18:15:14,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.67 | bwd_microstep: 3849.92 | bwd_inner_microstep: 3842.02 | bwd_allreduce_microstep: 7.85 | step_microstep: 22.26 [2024-11-13 18:15:14,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.67 | bwd: 3849.93 | bwd_inner: 3842.02 | bwd_allreduce: 7.87 | step: 22.26 1%| | 567/50750 [1:32:36<82:41:20, 5.93s/it] {'loss': 0.165, 'learning_rate': 1.489166119500985e-05, 'epoch': 0.56} 1%| | 567/50750 [1:32:36<82:41:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:15:20,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:15:20,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.66 | bwd_microstep: 3857.00 | bwd_inner_microstep: 3849.51 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.13 [2024-11-13 18:15:20,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.66 | bwd: 3857.02 | bwd_inner: 3849.51 | bwd_allreduce: 7.47 | step: 
21.14 1%| | 568/50750 [1:32:42<82:41:06, 5.93s/it] {'loss': 0.2207, 'learning_rate': 1.4917925147734735e-05, 'epoch': 0.56} 1%| | 568/50750 [1:32:42<82:41:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:15:26,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:15:26,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.04 | bwd_microstep: 3850.89 | bwd_inner_microstep: 3843.10 | bwd_allreduce_microstep: 7.73 | step_microstep: 24.75 [2024-11-13 18:15:26,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.04 | bwd: 3850.91 | bwd_inner: 3843.10 | bwd_allreduce: 7.75 | step: 24.77 1%| | 569/50750 [1:32:48<82:39:42, 5.93s/it] {'loss': 0.0012, 'learning_rate': 1.494418910045962e-05, 'epoch': 0.56} 1%| | 569/50750 [1:32:48<82:39:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:15:32,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:15:32,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.79 | bwd_microstep: 3856.84 | bwd_inner_microstep: 3849.29 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.54 [2024-11-13 18:15:32,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.79 | bwd: 3856.86 | bwd_inner: 3849.29 | bwd_allreduce: 7.53 | step: 21.54 1%| | 570/50750 [1:32:54<82:39:52, 5.93s/it] {'loss': 0.0001, 'learning_rate': 1.4970453053184506e-05, 'epoch': 0.56} 1%| | 570/50750 [1:32:54<82:39:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:15:38,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:15:38,269] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.09 | bwd_microstep: 3849.14 | bwd_inner_microstep: 3841.66 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.40 [2024-11-13 18:15:38,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.09 | bwd: 3849.15 | bwd_inner: 3841.66 | bwd_allreduce: 7.46 | step: 21.41 1%| | 571/50750 [1:33:00<82:38:19, 5.93s/it] {'loss': 0.0002, 'learning_rate': 1.4996717005909391e-05, 'epoch': 0.56} 1%| | 571/50750 [1:33:00<82:38:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:15:44,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:15:44,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.12 | bwd_microstep: 3842.73 | bwd_inner_microstep: 3835.03 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.37 [2024-11-13 18:15:44,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.12 | bwd: 3842.74 | bwd_inner: 3835.03 | bwd_allreduce: 7.66 | step: 21.37 1%| | 572/50750 [1:33:06<82:35:39, 5.93s/it] {'loss': 0.0272, 'learning_rate': 1.5022980958634277e-05, 'epoch': 0.56} 1%| | 572/50750 [1:33:06<82:35:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:15:50,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.98 [2024-11-13 18:15:50,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.84 | bwd_microstep: 3845.75 | bwd_inner_microstep: 3838.08 | bwd_allreduce_microstep: 7.63 | step_microstep: 22.11 [2024-11-13 18:15:50,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3845.77 | bwd_inner: 3838.08 | bwd_allreduce: 7.65 | step: 22.12 1%| | 573/50750 [1:33:12<82:35:38, 5.93s/it] {'loss': 0.3514, 'learning_rate': 1.504924491135916e-05, 
'epoch': 0.56} 1%| | 573/50750 [1:33:12<82:35:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:15:56,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:15:56,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3849.42 | bwd_inner_microstep: 3841.76 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.22 [2024-11-13 18:15:56,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.04 | bwd: 3849.43 | bwd_inner: 3841.76 | bwd_allreduce: 7.64 | step: 21.22 1%| | 574/50750 [1:33:18<82:35:22, 5.93s/it] {'loss': 0.8853, 'learning_rate': 1.5075508864084046e-05, 'epoch': 0.57} 1%| | 574/50750 [1:33:18<82:35:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:16:01,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:16:01,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.30 | bwd_microstep: 3857.81 | bwd_inner_microstep: 3850.21 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.81 [2024-11-13 18:16:01,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.30 | bwd: 3857.82 | bwd_inner: 3850.21 | bwd_allreduce: 7.57 | step: 21.81 1%| | 575/50750 [1:33:23<82:37:05, 5.93s/it] {'loss': 0.0013, 'learning_rate': 1.5101772816808931e-05, 'epoch': 0.57} 1%| | 575/50750 [1:33:23<82:37:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:16:07,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 18:16:07,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.46 | bwd_microstep: 3848.91 | bwd_inner_microstep: 
3841.20 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.77 [2024-11-13 18:16:07,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3848.92 | bwd_inner: 3841.20 | bwd_allreduce: 7.67 | step: 22.78 1%| | 576/50750 [1:33:29<82:36:00, 5.93s/it] {'loss': 0.3817, 'learning_rate': 1.5128036769533815e-05, 'epoch': 0.57} 1%| | 576/50750 [1:33:29<82:36:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:16:13,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.94 [2024-11-13 18:16:13,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.43 | bwd_microstep: 3850.98 | bwd_inner_microstep: 3843.18 | bwd_allreduce_microstep: 7.74 | step_microstep: 24.81 [2024-11-13 18:16:13,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.41 | bwd: 3851.00 | bwd_inner: 3843.18 | bwd_allreduce: 7.76 | step: 24.81 1%| | 577/50750 [1:33:35<82:37:37, 5.93s/it] {'loss': 0.088, 'learning_rate': 1.51543007222587e-05, 'epoch': 0.57} 1%| | 577/50750 [1:33:35<82:37:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:16:19,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:16:19,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.82 | bwd_microstep: 3846.05 | bwd_inner_microstep: 3838.39 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.62 [2024-11-13 18:16:19,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.81 | bwd: 3846.06 | bwd_inner: 3838.39 | bwd_allreduce: 7.63 | step: 21.62 1%| | 578/50750 [1:33:41<82:35:24, 5.93s/it] {'loss': 0.3579, 'learning_rate': 1.5180564674983586e-05, 'epoch': 0.57} 1%| | 578/50750 [1:33:41<82:35:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-13 18:16:25,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:16:25,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.99 | bwd_microstep: 3847.53 | bwd_inner_microstep: 3840.01 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.26 [2024-11-13 18:16:25,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.99 | bwd: 3847.54 | bwd_inner: 3840.01 | bwd_allreduce: 7.49 | step: 21.26 1%| | 579/50750 [1:33:47<82:34:55, 5.93s/it] {'loss': 0.0114, 'learning_rate': 1.5206828627708471e-05, 'epoch': 0.57} 1%| | 579/50750 [1:33:47<82:34:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:16:31,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:16:31,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.43 | bwd_microstep: 3849.53 | bwd_inner_microstep: 3842.03 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.25 [2024-11-13 18:16:31,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.43 | bwd: 3849.55 | bwd_inner: 3842.03 | bwd_allreduce: 7.48 | step: 21.25 1%| | 580/50750 [1:33:53<82:33:41, 5.92s/it] {'loss': 0.0024, 'learning_rate': 1.5233092580433357e-05, 'epoch': 0.57} 1%| | 580/50750 [1:33:53<82:33:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:16:37,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:16:37,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.00 | bwd_microstep: 3845.34 | bwd_inner_microstep: 3837.79 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.39 [2024-11-13 18:16:37,510] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd: 2023.00 | bwd: 3845.35 | bwd_inner: 3837.79 | bwd_allreduce: 7.52 | step: 21.40 1%| | 581/50750 [1:33:59<82:31:49, 5.92s/it] {'loss': 0.167, 'learning_rate': 1.525935653315824e-05, 'epoch': 0.57} 1%| | 581/50750 [1:33:59<82:31:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:16:43,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:16:43,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.80 | bwd_microstep: 3842.84 | bwd_inner_microstep: 3835.34 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.11 [2024-11-13 18:16:43,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.80 | bwd: 3842.85 | bwd_inner: 3835.34 | bwd_allreduce: 7.48 | step: 21.11 1%| | 582/50750 [1:34:05<82:29:25, 5.92s/it] {'loss': 0.0019, 'learning_rate': 1.5285620485883126e-05, 'epoch': 0.57} 1%| | 582/50750 [1:34:05<82:29:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:16:49,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:16:49,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3851.26 | bwd_inner_microstep: 3843.64 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.71 [2024-11-13 18:16:49,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.64 | bwd: 3851.27 | bwd_inner: 3843.64 | bwd_allreduce: 7.59 | step: 21.72 1%| | 583/50750 [1:34:11<82:30:39, 5.92s/it] {'loss': 0.1223, 'learning_rate': 1.531188443860801e-05, 'epoch': 0.57} 1%| | 583/50750 [1:34:11<82:30:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:16:55,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 18:16:55,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.01 | bwd_microstep: 3847.32 | bwd_inner_microstep: 3839.60 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.88 [2024-11-13 18:16:55,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.99 | bwd: 3847.33 | bwd_inner: 3839.60 | bwd_allreduce: 7.68 | step: 21.88 1%| | 584/50750 [1:34:17<82:31:21, 5.92s/it] {'loss': 1.4926, 'learning_rate': 1.5338148391332897e-05, 'epoch': 0.58} 1%| | 584/50750 [1:34:17<82:31:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:17:01,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.99 [2024-11-13 18:17:01,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.79 | bwd_microstep: 3852.16 | bwd_inner_microstep: 3842.81 | bwd_allreduce_microstep: 9.27 | step_microstep: 23.65 [2024-11-13 18:17:01,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.77 | bwd: 3852.19 | bwd_inner: 3842.81 | bwd_allreduce: 9.30 | step: 23.65 1%| | 585/50750 [1:34:23<82:32:37, 5.92s/it] {'loss': 0.5948, 'learning_rate': 1.536441234405778e-05, 'epoch': 0.58} 1%| | 585/50750 [1:34:23<82:32:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:17:07,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:17:07,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.66 | bwd_microstep: 3846.05 | bwd_inner_microstep: 3838.25 | bwd_allreduce_microstep: 7.76 | step_microstep: 22.22 [2024-11-13 18:17:07,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.66 | bwd: 3846.07 | bwd_inner: 3838.25 | bwd_allreduce: 7.78 | step: 22.23 1%| | 586/50750 
[1:34:29<82:32:39, 5.92s/it] {'loss': 0.001, 'learning_rate': 1.5390676296782668e-05, 'epoch': 0.58} 1%| | 586/50750 [1:34:29<82:32:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:17:13,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.46 | optimizer_step: 4.93 [2024-11-13 18:17:13,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.25 | bwd_microstep: 3848.81 | bwd_inner_microstep: 3840.90 | bwd_allreduce_microstep: 7.85 | step_microstep: 25.16 [2024-11-13 18:17:13,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.24 | bwd: 3848.83 | bwd_inner: 3840.90 | bwd_allreduce: 7.88 | step: 25.18 1%| | 587/50750 [1:34:35<82:36:07, 5.93s/it] {'loss': 0.1898, 'learning_rate': 1.5416940249507552e-05, 'epoch': 0.58} 1%| | 587/50750 [1:34:35<82:36:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:17:18,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:17:18,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.52 | bwd_microstep: 3848.63 | bwd_inner_microstep: 3841.11 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.17 [2024-11-13 18:17:18,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.51 | bwd: 3848.64 | bwd_inner: 3841.11 | bwd_allreduce: 7.49 | step: 21.17 1%| | 588/50750 [1:34:40<82:33:52, 5.93s/it] {'loss': 0.5485, 'learning_rate': 1.544320420223244e-05, 'epoch': 0.58} 1%| | 588/50750 [1:34:40<82:33:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:17:24,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 18:17:24,897] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2022.38 | bwd_microstep: 3844.43 | bwd_inner_microstep: 3836.63 | bwd_allreduce_microstep: 7.76 | step_microstep: 22.54 [2024-11-13 18:17:24,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.38 | bwd: 3844.45 | bwd_inner: 3836.63 | bwd_allreduce: 7.78 | step: 22.54 1%| | 589/50750 [1:34:46<82:33:53, 5.93s/it] {'loss': 0.0885, 'learning_rate': 1.5469468154957323e-05, 'epoch': 0.58} 1%| | 589/50750 [1:34:46<82:33:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:17:30,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:17:30,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.57 | bwd_microstep: 3843.58 | bwd_inner_microstep: 3835.95 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.65 [2024-11-13 18:17:30,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.55 | bwd: 3843.60 | bwd_inner: 3835.95 | bwd_allreduce: 7.60 | step: 21.65 1%| | 590/50750 [1:34:52<82:33:54, 5.93s/it] {'loss': 0.0839, 'learning_rate': 1.5495732107682207e-05, 'epoch': 0.58} 1%| | 590/50750 [1:34:52<82:33:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:17:36,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:17:36,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.16 | bwd_microstep: 3845.63 | bwd_inner_microstep: 3837.97 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.48 [2024-11-13 18:17:36,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.14 | bwd: 3845.64 | bwd_inner: 3837.97 | bwd_allreduce: 7.63 | step: 21.49 1%| | 591/50750 [1:34:58<82:34:17, 5.93s/it] {'loss': 0.0005, 'learning_rate': 1.552199606040709e-05, 'epoch': 0.58} 1%| | 591/50750 
[1:34:58<82:34:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:17:42,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.48 | optimizer_step: 4.93 [2024-11-13 18:17:42,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.05 | bwd_microstep: 3844.78 | bwd_inner_microstep: 3837.00 | bwd_allreduce_microstep: 7.73 | step_microstep: 22.37 [2024-11-13 18:17:42,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.03 | bwd: 3844.80 | bwd_inner: 3837.00 | bwd_allreduce: 7.75 | step: 22.38 1%| | 592/50750 [1:35:04<82:34:23, 5.93s/it] {'loss': 0.453, 'learning_rate': 1.5548260013131977e-05, 'epoch': 0.58} 1%| | 592/50750 [1:35:04<82:34:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:17:48,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 18:17:48,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.39 | bwd_microstep: 3844.85 | bwd_inner_microstep: 3837.31 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.24 [2024-11-13 18:17:48,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.38 | bwd: 3844.87 | bwd_inner: 3837.31 | bwd_allreduce: 7.52 | step: 21.25 1%| | 593/50750 [1:35:10<82:32:24, 5.92s/it] {'loss': 0.0319, 'learning_rate': 1.557452396585686e-05, 'epoch': 0.58} 1%| | 593/50750 [1:35:10<82:32:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:17:54,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:17:54,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.15 | bwd_microstep: 3842.67 | bwd_inner_microstep: 3834.18 | bwd_allreduce_microstep: 
8.41 | step_microstep: 21.51 [2024-11-13 18:17:54,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 3842.69 | bwd_inner: 3834.18 | bwd_allreduce: 8.44 | step: 21.50 1%| | 594/50750 [1:35:16<82:30:16, 5.92s/it] {'loss': 0.175, 'learning_rate': 1.560078791858175e-05, 'epoch': 0.59} 1%| | 594/50750 [1:35:16<82:30:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:18:00,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 18:18:00,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.01 | bwd_microstep: 3842.43 | bwd_inner_microstep: 3834.65 | bwd_allreduce_microstep: 7.74 | step_microstep: 23.35 [2024-11-13 18:18:00,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.00 | bwd: 3842.45 | bwd_inner: 3834.65 | bwd_allreduce: 7.75 | step: 23.35 1%| | 595/50750 [1:35:22<82:30:35, 5.92s/it] {'loss': 0.1454, 'learning_rate': 1.5627051871306632e-05, 'epoch': 0.59} 1%| | 595/50750 [1:35:22<82:30:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:18:06,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 18:18:06,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.90 | bwd_microstep: 3848.67 | bwd_inner_microstep: 3840.94 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.63 [2024-11-13 18:18:06,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.88 | bwd: 3848.69 | bwd_inner: 3840.94 | bwd_allreduce: 7.70 | step: 21.64 1%| | 596/50750 [1:35:28<82:31:50, 5.92s/it] {'loss': 0.5399, 'learning_rate': 1.565331582403152e-05, 'epoch': 0.59} 1%| | 596/50750 [1:35:28<82:31:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
18:18:12,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-13 18:18:12,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.38 | bwd_microstep: 3845.86 | bwd_inner_microstep: 3838.19 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.59 [2024-11-13 18:18:12,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.36 | bwd: 3845.88 | bwd_inner: 3838.19 | bwd_allreduce: 7.64 | step: 21.61 1%| | 597/50750 [1:35:34<82:32:04, 5.92s/it] {'loss': 0.0162, 'learning_rate': 1.5679579776756403e-05, 'epoch': 0.59} 1%| | 597/50750 [1:35:34<82:32:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:18:18,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:18:18,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.25 | bwd_microstep: 3854.11 | bwd_inner_microstep: 3845.98 | bwd_allreduce_microstep: 8.07 | step_microstep: 23.54 [2024-11-13 18:18:18,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.24 | bwd: 3854.13 | bwd_inner: 3845.98 | bwd_allreduce: 8.10 | step: 23.54 1%| | 598/50750 [1:35:40<82:36:03, 5.93s/it] {'loss': 0.1007, 'learning_rate': 1.570584372948129e-05, 'epoch': 0.59} 1%| | 598/50750 [1:35:40<82:36:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:18:24,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 5.02 [2024-11-13 18:18:24,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.95 | bwd_microstep: 3851.15 | bwd_inner_microstep: 3843.40 | bwd_allreduce_microstep: 7.69 | step_microstep: 24.71 [2024-11-13 18:18:24,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.94 
| bwd: 3851.17 | bwd_inner: 3843.40 | bwd_allreduce: 7.72 | step: 24.70 1%| | 599/50750 [1:35:46<82:37:22, 5.93s/it] {'loss': 0.0135, 'learning_rate': 1.5732107682206174e-05, 'epoch': 0.59} 1%| | 599/50750 [1:35:46<82:37:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:18:30,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:18:30,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.81 | bwd_microstep: 3844.88 | bwd_inner_microstep: 3836.79 | bwd_allreduce_microstep: 8.04 | step_microstep: 22.08 [2024-11-13 18:18:30,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.80 | bwd: 3844.89 | bwd_inner: 3836.79 | bwd_allreduce: 8.06 | step: 22.08 1%| | 600/50750 [1:35:52<82:36:56, 5.93s/it] {'loss': 0.0096, 'learning_rate': 1.5758371634931058e-05, 'epoch': 0.59} 1%| | 600/50750 [1:35:52<82:36:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:18:36,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:18:36,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.92 | bwd_microstep: 3845.25 | bwd_inner_microstep: 3837.72 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 18:18:36,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.92 | bwd: 3845.26 | bwd_inner: 3837.72 | bwd_allreduce: 7.50 | step: 21.12 1%| | 601/50750 [1:35:57<82:34:33, 5.93s/it] {'loss': 0.0294, 'learning_rate': 1.578463558765594e-05, 'epoch': 0.59} 1%| | 601/50750 [1:35:57<82:34:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:18:41,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | 
optimizer_step: 4.93 [2024-11-13 18:18:41,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.15 | bwd_microstep: 3851.15 | bwd_inner_microstep: 3843.62 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.31 [2024-11-13 18:18:41,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 3851.16 | bwd_inner: 3843.62 | bwd_allreduce: 7.50 | step: 21.31 1%| | 602/50750 [1:36:03<82:33:06, 5.93s/it] {'loss': 0.5081, 'learning_rate': 1.581089954038083e-05, 'epoch': 0.59} 1%| | 602/50750 [1:36:03<82:33:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:18:47,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:18:47,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.55 | bwd_microstep: 3861.68 | bwd_inner_microstep: 3854.15 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.24 [2024-11-13 18:18:47,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.55 | bwd: 3861.69 | bwd_inner: 3854.15 | bwd_allreduce: 7.50 | step: 21.24 1%| | 603/50750 [1:36:09<82:35:15, 5.93s/it] {'loss': 0.0493, 'learning_rate': 1.5837163493105712e-05, 'epoch': 0.59} 1%| | 603/50750 [1:36:09<82:35:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:18:53,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:18:53,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.28 | bwd_microstep: 3843.25 | bwd_inner_microstep: 3835.73 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.16 [2024-11-13 18:18:53,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.28 | bwd: 3843.26 | bwd_inner: 3835.73 | bwd_allreduce: 7.49 | step: 21.17 1%| | 604/50750 [1:36:15<82:32:44, 5.93s/it] 
{'loss': 0.1033, 'learning_rate': 1.58634274458306e-05, 'epoch': 0.6} 1%| | 604/50750 [1:36:15<82:32:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:18:59,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 18:18:59,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.93 | bwd_microstep: 3847.24 | bwd_inner_microstep: 3839.37 | bwd_allreduce_microstep: 7.82 | step_microstep: 22.05 [2024-11-13 18:18:59,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.93 | bwd: 3847.25 | bwd_inner: 3839.37 | bwd_allreduce: 7.84 | step: 22.05 1%| | 605/50750 [1:36:21<82:32:25, 5.93s/it] {'loss': 0.2928, 'learning_rate': 1.5889691398555483e-05, 'epoch': 0.6} 1%| | 605/50750 [1:36:21<82:32:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:19:05,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.67 | optimizer_step: 4.93 [2024-11-13 18:19:05,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.66 | bwd_microstep: 3844.30 | bwd_inner_microstep: 3836.19 | bwd_allreduce_microstep: 8.04 | step_microstep: 26.42 [2024-11-13 18:19:05,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.65 | bwd: 3844.32 | bwd_inner: 3836.19 | bwd_allreduce: 8.07 | step: 26.42 1%| | 606/50750 [1:36:27<82:34:02, 5.93s/it] {'loss': 0.0233, 'learning_rate': 1.591595535128037e-05, 'epoch': 0.6} 1%| | 606/50750 [1:36:27<82:34:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:19:11,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 18:19:11,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.55 | 
bwd_microstep: 3843.57 | bwd_inner_microstep: 3835.92 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.56 [2024-11-13 18:19:11,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.54 | bwd: 3843.58 | bwd_inner: 3835.92 | bwd_allreduce: 7.63 | step: 21.57 1%| | 607/50750 [1:36:33<82:31:55, 5.93s/it] {'loss': 0.0024, 'learning_rate': 1.5942219304005254e-05, 'epoch': 0.6} 1%| | 607/50750 [1:36:33<82:31:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 18:19:17,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:19:17,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3841.06 | bwd_inner_microstep: 3833.48 | bwd_allreduce_microstep: 7.54 | step_microstep: 23.39 [2024-11-13 18:19:17,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.32 | bwd: 3841.07 | bwd_inner: 3833.48 | bwd_allreduce: 7.55 | step: 23.39 1%| | 608/50750 [1:36:39<82:29:33, 5.92s/it] {'loss': 0.01, 'learning_rate': 1.5968483256730138e-05, 'epoch': 0.6} 1%| | 608/50750 [1:36:39<82:29:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:19:23,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 18:19:23,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.13 | bwd_microstep: 3867.18 | bwd_inner_microstep: 3859.50 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.70 [2024-11-13 18:19:23,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.13 | bwd: 3867.20 | bwd_inner: 3859.50 | bwd_allreduce: 7.65 | step: 21.70 1%| | 609/50750 [1:36:45<82:34:38, 5.93s/it] {'loss': 0.0443, 'learning_rate': 1.5994747209455025e-05, 'epoch': 0.6} 1%| | 609/50750 [1:36:45<82:34:38, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:19:29,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:19:29,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.10 | bwd_microstep: 3860.57 | bwd_inner_microstep: 3853.06 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.33 [2024-11-13 18:19:29,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.07 | bwd: 3860.58 | bwd_inner: 3853.06 | bwd_allreduce: 7.48 | step: 21.33 1%| | 610/50750 [1:36:51<82:35:31, 5.93s/it] {'loss': 0.7257, 'learning_rate': 1.602101116217991e-05, 'epoch': 0.6} 1%| | 610/50750 [1:36:51<82:35:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:19:35,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 18:19:35,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.81 | bwd_microstep: 3866.97 | bwd_inner_microstep: 3859.13 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.30 [2024-11-13 18:19:35,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.81 | bwd: 3866.98 | bwd_inner: 3859.13 | bwd_allreduce: 7.81 | step: 22.31 1%| | 611/50750 [1:36:57<82:38:33, 5.93s/it] {'loss': 0.0936, 'learning_rate': 1.6047275114904796e-05, 'epoch': 0.6} 1%| | 611/50750 [1:36:57<82:38:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:19:41,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:19:41,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.98 | bwd_microstep: 3860.27 | bwd_inner_microstep: 3852.41 | bwd_allreduce_microstep: 7.82 | step_microstep: 21.82 [2024-11-13 
18:19:41,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.97 | bwd: 3860.28 | bwd_inner: 3852.41 | bwd_allreduce: 7.83 | step: 21.83 1%| | 612/50750 [1:37:03<82:40:32, 5.94s/it] {'loss': 0.2418, 'learning_rate': 1.607353906762968e-05, 'epoch': 0.6} 1%| | 612/50750 [1:37:03<82:40:32, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:19:47,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:19:47,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.22 | bwd_microstep: 3851.02 | bwd_inner_microstep: 3843.49 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.17 [2024-11-13 18:19:47,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.22 | bwd: 3851.03 | bwd_inner: 3843.49 | bwd_allreduce: 7.51 | step: 21.17 1%| | 613/50750 [1:37:09<82:38:13, 5.93s/it] {'loss': 0.0107, 'learning_rate': 1.6099803020354564e-05, 'epoch': 0.6} 1%| | 613/50750 [1:37:09<82:38:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:19:53,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:19:53,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.43 | bwd_microstep: 3860.14 | bwd_inner_microstep: 3851.50 | bwd_allreduce_microstep: 8.60 | step_microstep: 21.48 [2024-11-13 18:19:53,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.43 | bwd: 3860.15 | bwd_inner: 3851.50 | bwd_allreduce: 8.61 | step: 21.48 1%| | 614/50750 [1:37:15<82:37:51, 5.93s/it] {'loss': 0.0064, 'learning_rate': 1.612606697307945e-05, 'epoch': 0.6} 1%| | 614/50750 [1:37:15<82:37:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:19:59,043] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:19:59,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.46 | bwd_microstep: 3855.41 | bwd_inner_microstep: 3847.89 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.03 [2024-11-13 18:19:59,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.46 | bwd: 3855.42 | bwd_inner: 3847.89 | bwd_allreduce: 7.49 | step: 21.03 1%| | 615/50750 [1:37:21<82:35:46, 5.93s/it] {'loss': 0.0254, 'learning_rate': 1.6152330925804335e-05, 'epoch': 0.61} 1%| | 615/50750 [1:37:21<82:35:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:20:04,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:20:04,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.31 | bwd_microstep: 3849.94 | bwd_inner_microstep: 3842.18 | bwd_allreduce_microstep: 7.70 | step_microstep: 23.93 [2024-11-13 18:20:04,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.31 | bwd: 3849.95 | bwd_inner: 3842.18 | bwd_allreduce: 7.72 | step: 23.93 1%| | 616/50750 [1:37:26<82:33:57, 5.93s/it] {'loss': 0.0069, 'learning_rate': 1.617859487852922e-05, 'epoch': 0.61} 1%| | 616/50750 [1:37:26<82:33:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:20:10,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:20:10,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.34 | bwd_microstep: 3855.17 | bwd_inner_microstep: 3847.62 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.39 [2024-11-13 18:20:10,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.34 | bwd: 3855.18 | bwd_inner: 3847.62 | 
bwd_allreduce: 7.52 | step: 21.39 1%| | 617/50750 [1:37:32<82:33:40, 5.93s/it] {'loss': 0.7355, 'learning_rate': 1.6204858831254106e-05, 'epoch': 0.61} 1%| | 617/50750 [1:37:32<82:33:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:20:16,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:20:16,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.70 | bwd_microstep: 3854.22 | bwd_inner_microstep: 3846.68 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.68 [2024-11-13 18:20:16,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.70 | bwd: 3854.24 | bwd_inner: 3846.68 | bwd_allreduce: 7.52 | step: 21.68 1%| | 618/50750 [1:37:38<82:33:38, 5.93s/it] {'loss': 0.0054, 'learning_rate': 1.623112278397899e-05, 'epoch': 0.61} 1%| | 618/50750 [1:37:38<82:33:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:20:22,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:20:22,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.61 | bwd_microstep: 3850.53 | bwd_inner_microstep: 3842.96 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.35 [2024-11-13 18:20:22,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.61 | bwd: 3850.54 | bwd_inner: 3842.96 | bwd_allreduce: 7.54 | step: 21.36 1%| | 619/50750 [1:37:44<82:32:11, 5.93s/it] {'loss': 0.8769, 'learning_rate': 1.6257386736703876e-05, 'epoch': 0.61} 1%| | 619/50750 [1:37:44<82:32:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:20:28,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 
18:20:28,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.04 | bwd_microstep: 3857.09 | bwd_inner_microstep: 3849.56 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.04 [2024-11-13 18:20:28,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.04 | bwd: 3857.10 | bwd_inner: 3849.56 | bwd_allreduce: 7.50 | step: 21.05 1%| | 620/50750 [1:37:50<82:32:31, 5.93s/it] {'loss': 0.1503, 'learning_rate': 1.628365068942876e-05, 'epoch': 0.61} 1%| | 620/50750 [1:37:50<82:32:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:20:34,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:20:34,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.10 | bwd_microstep: 3854.00 | bwd_inner_microstep: 3846.50 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.15 [2024-11-13 18:20:34,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.10 | bwd: 3854.01 | bwd_inner: 3846.50 | bwd_allreduce: 7.48 | step: 21.16 1%| | 621/50750 [1:37:56<82:32:27, 5.93s/it] {'loss': 0.0459, 'learning_rate': 1.6309914642153647e-05, 'epoch': 0.61} 1%| | 621/50750 [1:37:56<82:32:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:20:40,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:20:40,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.62 | bwd_microstep: 3860.54 | bwd_inner_microstep: 3853.01 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.34 [2024-11-13 18:20:40,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.62 | bwd: 3860.55 | bwd_inner: 3853.01 | bwd_allreduce: 7.50 | step: 21.34 1%| | 622/50750 [1:38:02<82:34:01, 5.93s/it] {'loss': 0.0038, 'learning_rate': 
1.633617859487853e-05, 'epoch': 0.61} 1%| | 622/50750 [1:38:02<82:34:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:20:46,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:20:46,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.37 | bwd_microstep: 3846.48 | bwd_inner_microstep: 3838.98 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.77 [2024-11-13 18:20:46,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.37 | bwd: 3846.49 | bwd_inner: 3838.98 | bwd_allreduce: 7.47 | step: 20.77 1%| | 623/50750 [1:38:08<82:32:56, 5.93s/it] {'loss': 0.9233, 'learning_rate': 1.6362442547603415e-05, 'epoch': 0.61} 1%| | 623/50750 [1:38:08<82:32:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:20:52,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:20:52,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.63 | bwd_microstep: 3858.44 | bwd_inner_microstep: 3850.82 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.28 [2024-11-13 18:20:52,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.61 | bwd: 3858.45 | bwd_inner: 3850.82 | bwd_allreduce: 7.60 | step: 21.29 1%| | 624/50750 [1:38:14<82:34:50, 5.93s/it] {'loss': 0.3638, 'learning_rate': 1.63887065003283e-05, 'epoch': 0.61} 1%| | 624/50750 [1:38:14<82:34:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:20:58,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:20:58,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.39 | bwd_microstep: 3851.01 | 
bwd_inner_microstep: 3843.38 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.40 [2024-11-13 18:20:58,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.38 | bwd: 3851.02 | bwd_inner: 3843.38 | bwd_allreduce: 7.61 | step: 21.40 1%| | 625/50750 [1:38:20<82:35:28, 5.93s/it] {'loss': 0.1686, 'learning_rate': 1.6414970453053186e-05, 'epoch': 0.62} 1%| | 625/50750 [1:38:20<82:35:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:21:04,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 18:21:04,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.46 | bwd_microstep: 3857.51 | bwd_inner_microstep: 3849.45 | bwd_allreduce_microstep: 8.00 | step_microstep: 22.29 [2024-11-13 18:21:04,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.45 | bwd: 3857.52 | bwd_inner: 3849.45 | bwd_allreduce: 8.03 | step: 22.30 1%| | 626/50750 [1:38:26<82:35:53, 5.93s/it] {'loss': 1.3234, 'learning_rate': 1.644123440577807e-05, 'epoch': 0.62} 1%| | 626/50750 [1:38:26<82:35:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:21:10,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.57 | optimizer_step: 4.93 [2024-11-13 18:21:10,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.93 | bwd_microstep: 3848.47 | bwd_inner_microstep: 3840.48 | bwd_allreduce_microstep: 7.92 | step_microstep: 29.69 [2024-11-13 18:21:10,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.91 | bwd: 3848.49 | bwd_inner: 3840.48 | bwd_allreduce: 7.95 | step: 29.69 1%| | 627/50750 [1:38:32<82:39:30, 5.94s/it] {'loss': 0.0004, 'learning_rate': 1.6467498358502957e-05, 'epoch': 0.62} 1%| | 627/50750 [1:38:32<82:39:30, 5.94s/it]dynamic ViT batch size: 32, images per 
sample: 8.0, dynamic token length: 2198 [2024-11-13 18:21:16,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.93 [2024-11-13 18:21:16,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.99 | bwd_microstep: 3850.39 | bwd_inner_microstep: 3841.74 | bwd_allreduce_microstep: 8.58 | step_microstep: 28.23 [2024-11-13 18:21:16,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.98 | bwd: 3850.41 | bwd_inner: 3841.74 | bwd_allreduce: 8.61 | step: 28.24 1%| | 628/50750 [1:38:38<82:41:45, 5.94s/it] {'loss': 0.0052, 'learning_rate': 1.649376231122784e-05, 'epoch': 0.62} 1%| | 628/50750 [1:38:38<82:41:45, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:21:22,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.45 | optimizer_step: 4.93 [2024-11-13 18:21:22,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.04 | bwd_microstep: 3855.08 | bwd_inner_microstep: 3847.17 | bwd_allreduce_microstep: 7.85 | step_microstep: 22.55 [2024-11-13 18:21:22,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.03 | bwd: 3855.09 | bwd_inner: 3847.17 | bwd_allreduce: 7.87 | step: 22.55 1%| | 629/50750 [1:38:44<82:41:25, 5.94s/it] {'loss': 0.0042, 'learning_rate': 1.6520026263952728e-05, 'epoch': 0.62} 1%| | 629/50750 [1:38:44<82:41:25, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:21:28,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:21:28,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.34 | bwd_microstep: 3840.43 | bwd_inner_microstep: 3832.95 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.16 [2024-11-13 18:21:28,016] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.32 | bwd: 3840.44 | bwd_inner: 3832.95 | bwd_allreduce: 7.45 | step: 21.16 1%| | 630/50750 [1:38:49<82:35:03, 5.93s/it] {'loss': 0.0497, 'learning_rate': 1.654629021667761e-05, 'epoch': 0.62} 1%| | 630/50750 [1:38:49<82:35:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:21:33,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:21:33,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.69 | bwd_microstep: 3851.24 | bwd_inner_microstep: 3843.77 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.85 [2024-11-13 18:21:33,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.69 | bwd: 3851.25 | bwd_inner: 3843.77 | bwd_allreduce: 7.45 | step: 20.86 1%| | 631/50750 [1:38:55<82:34:39, 5.93s/it] {'loss': 0.0405, 'learning_rate': 1.65725541694025e-05, 'epoch': 0.62} 1%| | 631/50750 [1:38:55<82:34:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:21:39,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 18:21:39,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.20 | bwd_microstep: 3870.34 | bwd_inner_microstep: 3862.55 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.18 [2024-11-13 18:21:39,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.20 | bwd: 3870.36 | bwd_inner: 3862.55 | bwd_allreduce: 7.76 | step: 22.18 1%| | 632/50750 [1:39:01<82:38:20, 5.94s/it] {'loss': 0.0896, 'learning_rate': 1.6598818122127382e-05, 'epoch': 0.62} 1%| | 632/50750 [1:39:01<82:38:20, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 18:21:45,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:21:45,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.30 | bwd_microstep: 3841.77 | bwd_inner_microstep: 3834.13 | bwd_allreduce_microstep: 7.60 | step_microstep: 23.30 [2024-11-13 18:21:45,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.29 | bwd: 3841.79 | bwd_inner: 3834.13 | bwd_allreduce: 7.62 | step: 23.31 1%| | 633/50750 [1:39:07<82:34:17, 5.93s/it] {'loss': 0.0049, 'learning_rate': 1.6625082074852266e-05, 'epoch': 0.62} 1%| | 633/50750 [1:39:07<82:34:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:21:51,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:21:51,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.08 | bwd_microstep: 3848.44 | bwd_inner_microstep: 3840.97 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.07 [2024-11-13 18:21:51,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.07 | bwd: 3848.45 | bwd_inner: 3840.97 | bwd_allreduce: 7.44 | step: 21.07 1%| | 634/50750 [1:39:13<82:32:34, 5.93s/it] {'loss': 0.5518, 'learning_rate': 1.665134602757715e-05, 'epoch': 0.62} 1%| | 634/50750 [1:39:13<82:32:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:21:57,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:21:57,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.55 | bwd_microstep: 3852.52 | bwd_inner_microstep: 3845.04 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.92 [2024-11-13 18:21:57,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.55 | bwd: 3852.54 | bwd_inner: 3845.04 | bwd_allreduce: 7.45 | step: 
20.93 1%|▏ | 635/50750 [1:39:19<82:32:04, 5.93s/it] {'loss': 0.0048, 'learning_rate': 1.6677609980302037e-05, 'epoch': 0.63} 1%|▏ | 635/50750 [1:39:19<82:32:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:22:03,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 18:22:03,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.20 | bwd_microstep: 3850.36 | bwd_inner_microstep: 3842.69 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.69 [2024-11-13 18:22:03,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.18 | bwd: 3850.37 | bwd_inner: 3842.69 | bwd_allreduce: 7.64 | step: 21.70 1%|▏ | 636/50750 [1:39:25<82:33:25, 5.93s/it] {'loss': 0.0002, 'learning_rate': 1.670387393302692e-05, 'epoch': 0.63} 1%|▏ | 636/50750 [1:39:25<82:33:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:22:09,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:22:09,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.44 | bwd_microstep: 3848.79 | bwd_inner_microstep: 3841.32 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.04 [2024-11-13 18:22:09,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.42 | bwd: 3848.80 | bwd_inner: 3841.32 | bwd_allreduce: 7.45 | step: 21.04 1%|▏ | 637/50750 [1:39:31<82:32:40, 5.93s/it] {'loss': 0.0568, 'learning_rate': 1.6730137885751808e-05, 'epoch': 0.63} 1%|▏ | 637/50750 [1:39:31<82:32:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:22:15,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:22:15,447] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.14 | bwd_microstep: 3847.40 | bwd_inner_microstep: 3839.90 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.35 [2024-11-13 18:22:15,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.13 | bwd: 3847.42 | bwd_inner: 3839.90 | bwd_allreduce: 7.48 | step: 21.35 1%|▏ | 638/50750 [1:39:37<82:30:01, 5.93s/it] {'loss': 0.271, 'learning_rate': 1.6756401838476692e-05, 'epoch': 0.63} 1%|▏ | 638/50750 [1:39:37<82:30:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:22:21,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 18:22:21,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.64 | bwd_microstep: 3847.01 | bwd_inner_microstep: 3839.31 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.60 [2024-11-13 18:22:21,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.64 | bwd: 3847.02 | bwd_inner: 3839.31 | bwd_allreduce: 7.67 | step: 21.61 1%|▏ | 639/50750 [1:39:43<82:27:57, 5.92s/it] {'loss': 0.442, 'learning_rate': 1.678266579120158e-05, 'epoch': 0.63} 1%|▏ | 639/50750 [1:39:43<82:27:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:22:27,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:22:27,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.30 | bwd_microstep: 3846.47 | bwd_inner_microstep: 3838.94 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.16 [2024-11-13 18:22:27,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.28 | bwd: 3846.48 | bwd_inner: 3838.94 | bwd_allreduce: 7.50 | step: 21.17 1%|▏ | 640/50750 [1:39:49<82:27:31, 5.92s/it] {'loss': 0.0586, 'learning_rate': 1.6808929743926463e-05, 
'epoch': 0.63} 1%|▏ | 640/50750 [1:39:49<82:27:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:22:33,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 18:22:33,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.21 | bwd_microstep: 3845.00 | bwd_inner_microstep: 3837.52 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.94 [2024-11-13 18:22:33,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.21 | bwd: 3845.01 | bwd_inner: 3837.52 | bwd_allreduce: 7.45 | step: 20.94 1%|▏ | 641/50750 [1:39:55<82:25:16, 5.92s/it] {'loss': 0.0064, 'learning_rate': 1.683519369665135e-05, 'epoch': 0.63} 1%|▏ | 641/50750 [1:39:55<82:25:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:22:39,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 18:22:39,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.09 | bwd_microstep: 3844.31 | bwd_inner_microstep: 3836.85 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-13 18:22:39,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.08 | bwd: 3844.32 | bwd_inner: 3836.85 | bwd_allreduce: 7.43 | step: 20.91 1%|▏ | 642/50750 [1:40:01<82:25:16, 5.92s/it] {'loss': 0.0035, 'learning_rate': 1.686145764937623e-05, 'epoch': 0.63} 1%|▏ | 642/50750 [1:40:01<82:25:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:22:45,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:22:45,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.82 | bwd_microstep: 3851.53 | bwd_inner_microstep: 
3844.02 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.27 [2024-11-13 18:22:45,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3851.54 | bwd_inner: 3844.02 | bwd_allreduce: 7.48 | step: 21.28 1%|▏ | 643/50750 [1:40:07<82:25:48, 5.92s/it] {'loss': 0.004, 'learning_rate': 1.6887721602101117e-05, 'epoch': 0.63} 1%|▏ | 643/50750 [1:40:07<82:25:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:22:50,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:22:50,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.41 | bwd_microstep: 3850.30 | bwd_inner_microstep: 3842.63 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.59 [2024-11-13 18:22:50,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.39 | bwd: 3850.31 | bwd_inner: 3842.63 | bwd_allreduce: 7.64 | step: 21.60 1%|▏ | 644/50750 [1:40:12<82:27:43, 5.92s/it] {'loss': 0.002, 'learning_rate': 1.6913985554826e-05, 'epoch': 0.63} 1%|▏ | 644/50750 [1:40:12<82:27:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:22:56,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-13 18:22:56,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.51 | bwd_microstep: 3852.74 | bwd_inner_microstep: 3845.10 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.55 [2024-11-13 18:22:56,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.51 | bwd: 3852.76 | bwd_inner: 3845.10 | bwd_allreduce: 7.62 | step: 21.56 1%|▏ | 645/50750 [1:40:18<82:27:55, 5.93s/it] {'loss': 0.0468, 'learning_rate': 1.694024950755089e-05, 'epoch': 0.64} 1%|▏ | 645/50750 [1:40:18<82:27:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2196 [2024-11-13 18:23:02,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:23:02,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.08 | bwd_microstep: 3845.83 | bwd_inner_microstep: 3838.35 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.91 [2024-11-13 18:23:02,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.06 | bwd: 3845.84 | bwd_inner: 3838.35 | bwd_allreduce: 7.45 | step: 20.91 1%|▏ | 646/50750 [1:40:24<82:27:04, 5.92s/it] {'loss': 0.001, 'learning_rate': 1.6966513460275772e-05, 'epoch': 0.64} 1%|▏ | 646/50750 [1:40:24<82:27:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:23:08,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:23:08,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.62 | bwd_microstep: 3855.02 | bwd_inner_microstep: 3847.31 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.76 [2024-11-13 18:23:08,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.62 | bwd: 3855.04 | bwd_inner: 3847.31 | bwd_allreduce: 7.68 | step: 21.77 1%|▏ | 647/50750 [1:40:30<82:29:00, 5.93s/it] {'loss': 0.0012, 'learning_rate': 1.699277741300066e-05, 'epoch': 0.64} 1%|▏ | 647/50750 [1:40:30<82:29:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:23:14,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 18:23:14,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.76 | bwd_microstep: 3852.14 | bwd_inner_microstep: 3844.44 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.76 [2024-11-13 18:23:14,695] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd: 2031.74 | bwd: 3852.16 | bwd_inner: 3844.44 | bwd_allreduce: 7.68 | step: 21.77 1%|▏ | 648/50750 [1:40:36<82:31:29, 5.93s/it] {'loss': 0.5911, 'learning_rate': 1.7019041365725543e-05, 'epoch': 0.64} 1%|▏ | 648/50750 [1:40:36<82:31:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:23:20,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:23:20,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.86 | bwd_microstep: 3841.61 | bwd_inner_microstep: 3834.13 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.97 [2024-11-13 18:23:20,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.85 | bwd: 3841.62 | bwd_inner: 3834.13 | bwd_allreduce: 7.45 | step: 20.97 1%|▏ | 649/50750 [1:40:42<82:29:26, 5.93s/it] {'loss': 0.4143, 'learning_rate': 1.704530531845043e-05, 'epoch': 0.64} 1%|▏ | 649/50750 [1:40:42<82:29:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:23:26,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:23:26,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.32 | bwd_microstep: 3847.46 | bwd_inner_microstep: 3839.98 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.99 [2024-11-13 18:23:26,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.32 | bwd: 3847.47 | bwd_inner: 3839.98 | bwd_allreduce: 7.45 | step: 21.99 1%|▏ | 650/50750 [1:40:48<82:28:38, 5.93s/it] {'loss': 0.5232, 'learning_rate': 1.707156927117531e-05, 'epoch': 0.64} 1%|▏ | 650/50750 [1:40:48<82:28:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:23:32,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:23:32,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.27 | bwd_microstep: 3847.62 | bwd_inner_microstep: 3840.14 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.85 [2024-11-13 18:23:32,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.27 | bwd: 3847.64 | bwd_inner: 3840.14 | bwd_allreduce: 7.46 | step: 20.85 1%|▏ | 651/50750 [1:40:54<82:26:13, 5.92s/it] {'loss': 0.3033, 'learning_rate': 1.7097833223900198e-05, 'epoch': 0.64} 1%|▏ | 651/50750 [1:40:54<82:26:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:23:38,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:23:38,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.89 | bwd_microstep: 3851.33 | bwd_inner_microstep: 3843.62 | bwd_allreduce_microstep: 7.66 | step_microstep: 24.14 [2024-11-13 18:23:38,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.89 | bwd: 3851.34 | bwd_inner: 3843.62 | bwd_allreduce: 7.67 | step: 24.14 1%|▏ | 652/50750 [1:41:00<82:27:56, 5.93s/it] {'loss': 0.0042, 'learning_rate': 1.712409717662508e-05, 'epoch': 0.64} 1%|▏ | 652/50750 [1:41:00<82:27:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:23:44,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:23:44,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.48 | bwd_microstep: 3853.15 | bwd_inner_microstep: 3845.63 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.03 [2024-11-13 18:23:44,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.46 | bwd: 3853.17 | bwd_inner: 3845.63 | bwd_allreduce: 7.49 | 
step: 21.03 1%|▏ | 653/50750 [1:41:06<82:30:27, 5.93s/it] {'loss': 0.4885, 'learning_rate': 1.715036112934997e-05, 'epoch': 0.64} 1%|▏ | 653/50750 [1:41:06<82:30:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:23:50,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 18:23:50,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.55 | bwd_microstep: 3863.48 | bwd_inner_microstep: 3855.57 | bwd_allreduce_microstep: 7.86 | step_microstep: 22.21 [2024-11-13 18:23:50,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.55 | bwd: 3863.49 | bwd_inner: 3855.57 | bwd_allreduce: 7.88 | step: 22.21 1%|▏ | 654/50750 [1:41:12<82:35:45, 5.94s/it] {'loss': 0.7283, 'learning_rate': 1.7176625082074852e-05, 'epoch': 0.64} 1%|▏ | 654/50750 [1:41:12<82:35:45, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:23:56,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:23:56,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.13 | bwd_microstep: 3855.39 | bwd_inner_microstep: 3847.83 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.81 [2024-11-13 18:23:56,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.12 | bwd: 3855.40 | bwd_inner: 3847.83 | bwd_allreduce: 7.53 | step: 21.81 1%|▏ | 655/50750 [1:41:18<82:35:41, 5.94s/it] {'loss': 0.0018, 'learning_rate': 1.720288903479974e-05, 'epoch': 0.65} 1%|▏ | 655/50750 [1:41:18<82:35:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:24:02,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:24:02,138] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.19 | bwd_microstep: 3848.09 | bwd_inner_microstep: 3840.40 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.39 [2024-11-13 18:24:02,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.17 | bwd: 3848.11 | bwd_inner: 3840.40 | bwd_allreduce: 7.66 | step: 21.39 1%|▏ | 656/50750 [1:41:24<82:32:14, 5.93s/it] {'loss': 0.2229, 'learning_rate': 1.7229152987524623e-05, 'epoch': 0.65} 1%|▏ | 656/50750 [1:41:24<82:32:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:24:08,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:24:08,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.57 | bwd_microstep: 3854.86 | bwd_inner_microstep: 3847.12 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.94 [2024-11-13 18:24:08,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.57 | bwd: 3854.88 | bwd_inner: 3847.12 | bwd_allreduce: 7.71 | step: 22.93 1%|▏ | 657/50750 [1:41:30<82:31:24, 5.93s/it] {'loss': 0.0014, 'learning_rate': 1.725541694024951e-05, 'epoch': 0.65} 1%|▏ | 657/50750 [1:41:30<82:31:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:24:14,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:24:14,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.57 | bwd_microstep: 3864.48 | bwd_inner_microstep: 3856.80 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.82 [2024-11-13 18:24:14,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.57 | bwd: 3864.49 | bwd_inner: 3856.80 | bwd_allreduce: 7.64 | step: 21.83 1%|▏ | 658/50750 [1:41:35<82:33:38, 5.93s/it] {'loss': 0.011, 'learning_rate': 
1.7281680892974394e-05, 'epoch': 0.65} 1%|▏ | 658/50750 [1:41:35<82:33:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:24:19,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-13 18:24:19,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.98 | bwd_microstep: 3860.13 | bwd_inner_microstep: 3852.01 | bwd_allreduce_microstep: 8.05 | step_microstep: 23.81 [2024-11-13 18:24:19,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.97 | bwd: 3860.15 | bwd_inner: 3852.01 | bwd_allreduce: 8.08 | step: 23.81 1%|▏ | 659/50750 [1:41:41<82:37:12, 5.94s/it] {'loss': 0.0069, 'learning_rate': 1.7307944845699278e-05, 'epoch': 0.65} 1%|▏ | 659/50750 [1:41:41<82:37:12, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:24:25,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 18:24:25,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.42 | bwd_microstep: 3853.76 | bwd_inner_microstep: 3845.86 | bwd_allreduce_microstep: 7.84 | step_microstep: 21.82 [2024-11-13 18:24:25,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.41 | bwd: 3853.78 | bwd_inner: 3845.86 | bwd_allreduce: 7.87 | step: 21.81 1%|▏ | 660/50750 [1:41:47<82:35:40, 5.94s/it] {'loss': 0.8973, 'learning_rate': 1.7334208798424165e-05, 'epoch': 0.65} 1%|▏ | 660/50750 [1:41:47<82:35:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:24:31,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:24:31,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.55 | bwd_microstep: 3860.24 
| bwd_inner_microstep: 3852.37 | bwd_allreduce_microstep: 7.81 | step_microstep: 23.62 [2024-11-13 18:24:31,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.53 | bwd: 3860.25 | bwd_inner: 3852.37 | bwd_allreduce: 7.83 | step: 23.62 1%|▏ | 661/50750 [1:41:53<82:38:14, 5.94s/it] {'loss': 0.0058, 'learning_rate': 1.736047275114905e-05, 'epoch': 0.65} 1%|▏ | 661/50750 [1:41:53<82:38:14, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:24:37,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.92 [2024-11-13 18:24:37,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.35 | bwd_microstep: 3850.75 | bwd_inner_microstep: 3842.57 | bwd_allreduce_microstep: 8.12 | step_microstep: 27.21 [2024-11-13 18:24:37,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.34 | bwd: 3850.77 | bwd_inner: 3842.57 | bwd_allreduce: 8.14 | step: 27.22 1%|▏ | 662/50750 [1:41:59<82:38:31, 5.94s/it] {'loss': 0.0423, 'learning_rate': 1.7386736703873933e-05, 'epoch': 0.65} 1%|▏ | 662/50750 [1:41:59<82:38:31, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:24:43,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:24:43,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.66 | bwd_microstep: 3852.43 | bwd_inner_microstep: 3844.89 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.40 [2024-11-13 18:24:43,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.65 | bwd: 3852.44 | bwd_inner: 3844.89 | bwd_allreduce: 7.51 | step: 21.40 1%|▏ | 663/50750 [1:42:05<82:37:32, 5.94s/it] {'loss': 0.8187, 'learning_rate': 1.741300065659882e-05, 'epoch': 0.65} 1%|▏ | 663/50750 [1:42:05<82:37:32, 5.94s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:24:49,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:24:49,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.22 | bwd_microstep: 3853.21 | bwd_inner_microstep: 3845.72 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.09 [2024-11-13 18:24:49,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.20 | bwd: 3853.22 | bwd_inner: 3845.72 | bwd_allreduce: 7.46 | step: 21.09 1%|▏ | 664/50750 [1:42:11<83:12:34, 5.98s/it] {'loss': 0.6697, 'learning_rate': 1.7439264609323704e-05, 'epoch': 0.65} 1%|▏ | 664/50750 [1:42:11<83:12:34, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:24:55,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:24:55,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.90 | bwd_microstep: 3862.39 | bwd_inner_microstep: 3854.85 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.28 [2024-11-13 18:24:55,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.90 | bwd: 3862.41 | bwd_inner: 3854.85 | bwd_allreduce: 7.51 | step: 21.29 1%|▏ | 665/50750 [1:42:17<83:00:28, 5.97s/it] {'loss': 0.0022, 'learning_rate': 1.746552856204859e-05, 'epoch': 0.66} 1%|▏ | 665/50750 [1:42:17<83:00:28, 5.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:25:01,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:25:01,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.70 | bwd_microstep: 3854.43 | bwd_inner_microstep: 3846.87 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.31 [2024-11-13 18:25:01,650] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.70 | bwd: 3854.45 | bwd_inner: 3846.87 | bwd_allreduce: 7.54 | step: 21.31 1%|▏ | 666/50750 [1:42:23<82:50:41, 5.95s/it] {'loss': 0.5771, 'learning_rate': 1.7491792514773475e-05, 'epoch': 0.66} 1%|▏ | 666/50750 [1:42:23<82:50:41, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:25:07,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:25:07,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.88 | bwd_microstep: 3855.36 | bwd_inner_microstep: 3847.82 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.68 [2024-11-13 18:25:07,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.88 | bwd: 3855.38 | bwd_inner: 3847.82 | bwd_allreduce: 7.51 | step: 21.68 1%|▏ | 667/50750 [1:42:29<82:44:12, 5.95s/it] {'loss': 0.0387, 'learning_rate': 1.751805646749836e-05, 'epoch': 0.66} 1%|▏ | 667/50750 [1:42:29<82:44:12, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:25:13,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-13 18:25:13,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.62 | bwd_microstep: 3859.88 | bwd_inner_microstep: 3851.96 | bwd_allreduce_microstep: 7.86 | step_microstep: 23.43 [2024-11-13 18:25:13,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.62 | bwd: 3859.90 | bwd_inner: 3851.96 | bwd_allreduce: 7.88 | step: 23.44 1%|▏ | 668/50750 [1:42:35<82:41:58, 5.94s/it] {'loss': 0.0179, 'learning_rate': 1.7544320420223246e-05, 'epoch': 0.66} 1%|▏ | 668/50750 [1:42:35<82:41:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2195 [2024-11-13 18:25:19,456] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 18:25:19,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.72 | bwd_microstep: 3853.63 | bwd_inner_microstep: 3845.95 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.74 [2024-11-13 18:25:19,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.71 | bwd: 3853.64 | bwd_inner: 3845.95 | bwd_allreduce: 7.65 | step: 21.74 1%|▏ | 669/50750 [1:42:41<82:40:45, 5.94s/it] {'loss': 0.0024, 'learning_rate': 1.757058437294813e-05, 'epoch': 0.66} 1%|▏ | 669/50750 [1:42:41<82:40:45, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:25:25,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:25:25,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.67 | bwd_microstep: 3856.84 | bwd_inner_microstep: 3849.29 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.72 [2024-11-13 18:25:25,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.66 | bwd: 3856.85 | bwd_inner: 3849.29 | bwd_allreduce: 7.52 | step: 21.72 1%|▏ | 670/50750 [1:42:47<82:40:11, 5.94s/it] {'loss': 0.2329, 'learning_rate': 1.7596848325673016e-05, 'epoch': 0.66} 1%|▏ | 670/50750 [1:42:47<82:40:11, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:25:31,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:25:31,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.38 | bwd_microstep: 3852.87 | bwd_inner_microstep: 3845.35 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.17 [2024-11-13 18:25:31,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.38 | bwd: 3852.89 | bwd_inner: 3845.35 | 
bwd_allreduce: 7.49 | step: 21.18 1%|▏ | 671/50750 [1:42:53<82:36:55, 5.94s/it] {'loss': 0.4403, 'learning_rate': 1.76231122783979e-05, 'epoch': 0.66} 1%|▏ | 671/50750 [1:42:53<82:36:55, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:25:37,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.64 | optimizer_step: 4.92 [2024-11-13 18:25:37,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.55 | bwd_microstep: 3853.48 | bwd_inner_microstep: 3845.45 | bwd_allreduce_microstep: 7.96 | step_microstep: 28.81 [2024-11-13 18:25:37,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.55 | bwd: 3853.50 | bwd_inner: 3845.45 | bwd_allreduce: 7.99 | step: 28.81 1%|▏ | 672/50750 [1:42:59<82:36:19, 5.94s/it] {'loss': 0.528, 'learning_rate': 1.7649376231122784e-05, 'epoch': 0.66} 1%|▏ | 672/50750 [1:42:59<82:36:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:25:43,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:25:43,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.07 | bwd_microstep: 3861.62 | bwd_inner_microstep: 3854.06 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.11 [2024-11-13 18:25:43,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.06 | bwd: 3861.63 | bwd_inner: 3854.06 | bwd_allreduce: 7.53 | step: 21.11 1%|▏ | 673/50750 [1:43:05<82:36:08, 5.94s/it] {'loss': 0.3852, 'learning_rate': 1.767564018384767e-05, 'epoch': 0.66} 1%|▏ | 673/50750 [1:43:05<82:36:08, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:25:49,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 
18:25:49,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3844.12 | bwd_inner_microstep: 3836.49 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.74 [2024-11-13 18:25:49,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3844.13 | bwd_inner: 3836.49 | bwd_allreduce: 7.60 | step: 21.74 1%|▏ | 674/50750 [1:43:11<82:31:01, 5.93s/it] {'loss': 0.0056, 'learning_rate': 1.7701904136572555e-05, 'epoch': 0.66} 1%|▏ | 674/50750 [1:43:11<82:31:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:25:55,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:25:55,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.48 | bwd_microstep: 3854.94 | bwd_inner_microstep: 3847.38 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.38 [2024-11-13 18:25:55,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.48 | bwd: 3854.95 | bwd_inner: 3847.38 | bwd_allreduce: 7.53 | step: 21.39 1%|▏ | 675/50750 [1:43:17<82:30:39, 5.93s/it] {'loss': 0.005, 'learning_rate': 1.7728168089297442e-05, 'epoch': 0.67} 1%|▏ | 675/50750 [1:43:17<82:30:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:26:00,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 18:26:00,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.21 | bwd_microstep: 3852.64 | bwd_inner_microstep: 3844.94 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.74 [2024-11-13 18:26:00,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.21 | bwd: 3852.65 | bwd_inner: 3844.94 | bwd_allreduce: 7.67 | step: 21.75 1%|▏ | 676/50750 [1:43:22<82:29:18, 5.93s/it] {'loss': 0.5325, 'learning_rate': 
1.7754432042022326e-05, 'epoch': 0.67} 1%|▏ | 676/50750 [1:43:22<82:29:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:26:06,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:26:06,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.13 | bwd_microstep: 3861.13 | bwd_inner_microstep: 3853.40 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.61 [2024-11-13 18:26:06,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.11 | bwd: 3861.15 | bwd_inner: 3853.40 | bwd_allreduce: 7.71 | step: 21.61 1%|▏ | 677/50750 [1:43:28<82:32:39, 5.93s/it] {'loss': 0.0259, 'learning_rate': 1.778069599474721e-05, 'epoch': 0.67} 1%|▏ | 677/50750 [1:43:28<82:32:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:26:12,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:26:12,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.85 | bwd_microstep: 3849.75 | bwd_inner_microstep: 3842.25 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.96 [2024-11-13 18:26:12,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.83 | bwd: 3849.76 | bwd_inner: 3842.25 | bwd_allreduce: 7.47 | step: 20.96 1%|▏ | 678/50750 [1:43:34<82:32:42, 5.93s/it] {'loss': 0.0026, 'learning_rate': 1.7806959947472097e-05, 'epoch': 0.67} 1%|▏ | 678/50750 [1:43:34<82:32:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:26:18,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:26:18,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.01 | bwd_microstep: 3855.22 
| bwd_inner_microstep: 3847.70 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.47 [2024-11-13 18:26:18,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.99 | bwd: 3855.23 | bwd_inner: 3847.70 | bwd_allreduce: 7.49 | step: 21.48 1%|▏ | 679/50750 [1:43:40<82:31:02, 5.93s/it] {'loss': 0.003, 'learning_rate': 1.783322390019698e-05, 'epoch': 0.67} 1%|▏ | 679/50750 [1:43:40<82:31:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:26:24,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.42 | optimizer_step: 4.93 [2024-11-13 18:26:24,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.48 | bwd_microstep: 3857.67 | bwd_inner_microstep: 3849.56 | bwd_allreduce_microstep: 8.05 | step_microstep: 27.78 [2024-11-13 18:26:24,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.48 | bwd: 3857.69 | bwd_inner: 3849.57 | bwd_allreduce: 8.08 | step: 27.78 1%|▏ | 680/50750 [1:43:46<82:33:50, 5.94s/it] {'loss': 0.0335, 'learning_rate': 1.7859487852921868e-05, 'epoch': 0.67} 1%|▏ | 680/50750 [1:43:46<82:33:50, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:26:30,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.41 | optimizer_step: 4.93 [2024-11-13 18:26:30,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.60 | bwd_microstep: 3868.83 | bwd_inner_microstep: 3861.15 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.93 [2024-11-13 18:26:30,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.59 | bwd: 3868.84 | bwd_inner: 3861.15 | bwd_allreduce: 7.64 | step: 21.93 1%|▏ | 681/50750 [1:43:52<82:37:02, 5.94s/it] {'loss': 0.0673, 'learning_rate': 1.788575180564675e-05, 'epoch': 0.67} 1%|▏ | 681/50750 [1:43:52<82:37:02, 5.94s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:26:36,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:26:36,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.51 | bwd_microstep: 3850.07 | bwd_inner_microstep: 3842.56 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.52 [2024-11-13 18:26:36,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.51 | bwd: 3850.09 | bwd_inner: 3842.56 | bwd_allreduce: 7.49 | step: 21.52 1%|▏ | 682/50750 [1:43:58<82:34:51, 5.94s/it] {'loss': 0.1946, 'learning_rate': 1.791201575837164e-05, 'epoch': 0.67} 1%|▏ | 682/50750 [1:43:58<82:34:51, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:26:42,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:26:42,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.52 | bwd_microstep: 3847.83 | bwd_inner_microstep: 3840.35 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.22 [2024-11-13 18:26:42,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.52 | bwd: 3847.84 | bwd_inner: 3840.35 | bwd_allreduce: 7.46 | step: 21.23 1%|▏ | 683/50750 [1:44:04<82:30:05, 5.93s/it] {'loss': 0.608, 'learning_rate': 1.7938279711096522e-05, 'epoch': 0.67} 1%|▏ | 683/50750 [1:44:04<82:30:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:26:48,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 18:26:48,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.08 | bwd_microstep: 3843.33 | bwd_inner_microstep: 3835.83 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-13 18:26:48,445] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.08 | bwd: 3843.34 | bwd_inner: 3835.83 | bwd_allreduce: 7.45 | step: 20.86 1%|▏ | 684/50750 [1:44:10<82:24:45, 5.93s/it] {'loss': 0.0026, 'learning_rate': 1.7964543663821406e-05, 'epoch': 0.67} 1%|▏ | 684/50750 [1:44:10<82:24:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:26:54,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:26:54,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.45 | bwd_microstep: 3857.45 | bwd_inner_microstep: 3849.94 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.31 [2024-11-13 18:26:54,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.45 | bwd: 3857.47 | bwd_inner: 3849.94 | bwd_allreduce: 7.49 | step: 21.32 1%|▏ | 685/50750 [1:44:16<82:26:31, 5.93s/it] {'loss': 0.0093, 'learning_rate': 1.799080761654629e-05, 'epoch': 0.67} 1%|▏ | 685/50750 [1:44:16<82:26:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 18:27:00,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:27:00,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.49 | bwd_microstep: 3850.94 | bwd_inner_microstep: 3842.99 | bwd_allreduce_microstep: 7.90 | step_microstep: 22.30 [2024-11-13 18:27:00,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3850.96 | bwd_inner: 3842.99 | bwd_allreduce: 7.92 | step: 22.30 1%|▏ | 686/50750 [1:44:22<82:26:08, 5.93s/it] {'loss': 0.0005, 'learning_rate': 1.8017071569271177e-05, 'epoch': 0.68} 1%|▏ | 686/50750 [1:44:22<82:26:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:27:06,231] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 18:27:06,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.28 | bwd_microstep: 3848.97 | bwd_inner_microstep: 3841.39 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.71 [2024-11-13 18:27:06,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.27 | bwd: 3848.98 | bwd_inner: 3841.39 | bwd_allreduce: 7.56 | step: 21.72 1%|▏ | 687/50750 [1:44:28<82:26:46, 5.93s/it] {'loss': 0.269, 'learning_rate': 1.804333552199606e-05, 'epoch': 0.68} 1%|▏ | 687/50750 [1:44:28<82:26:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:27:12,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:27:12,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.24 | bwd_microstep: 3846.32 | bwd_inner_microstep: 3838.80 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.39 [2024-11-13 18:27:12,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.24 | bwd: 3846.33 | bwd_inner: 3838.80 | bwd_allreduce: 7.49 | step: 21.40 1%|▏ | 688/50750 [1:44:34<82:23:40, 5.93s/it] {'loss': 0.4987, 'learning_rate': 1.8069599474720948e-05, 'epoch': 0.68} 1%|▏ | 688/50750 [1:44:34<82:23:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:27:18,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:27:18,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.53 | bwd_microstep: 3849.27 | bwd_inner_microstep: 3841.77 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-13 18:27:18,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.52 | bwd: 3849.28 | bwd_inner: 3841.78 | 
bwd_allreduce: 7.47 | step: 21.02 1%|▏ | 689/50750 [1:44:40<82:22:51, 5.92s/it] {'loss': 0.0093, 'learning_rate': 1.8095863427445832e-05, 'epoch': 0.68} 1%|▏ | 689/50750 [1:44:40<82:22:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:27:24,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 18:27:24,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.76 | bwd_microstep: 3854.63 | bwd_inner_microstep: 3846.75 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.38 [2024-11-13 18:27:24,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.76 | bwd: 3854.65 | bwd_inner: 3846.75 | bwd_allreduce: 7.85 | step: 22.39 1%|▏ | 690/50750 [1:44:45<82:26:06, 5.93s/it] {'loss': 0.0572, 'learning_rate': 1.812212738017072e-05, 'epoch': 0.68} 1%|▏ | 690/50750 [1:44:45<82:26:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:27:29,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:27:29,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.87 | bwd_microstep: 3853.05 | bwd_inner_microstep: 3845.53 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.88 [2024-11-13 18:27:29,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.86 | bwd: 3853.07 | bwd_inner: 3845.53 | bwd_allreduce: 7.49 | step: 21.89 1%|▏ | 691/50750 [1:44:51<82:28:22, 5.93s/it] {'loss': 0.0228, 'learning_rate': 1.8148391332895603e-05, 'epoch': 0.68} 1%|▏ | 691/50750 [1:44:51<82:28:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:27:35,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 
18:27:35,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.32 | bwd_microstep: 3850.30 | bwd_inner_microstep: 3842.80 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-13 18:27:35,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.31 | bwd: 3850.32 | bwd_inner: 3842.80 | bwd_allreduce: 7.48 | step: 21.10 1%|▏ | 692/50750 [1:44:57<82:27:42, 5.93s/it] {'loss': 0.2599, 'learning_rate': 1.817465528562049e-05, 'epoch': 0.68} 1%|▏ | 692/50750 [1:44:57<82:27:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:27:41,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 18:27:41,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.27 | bwd_microstep: 3852.08 | bwd_inner_microstep: 3844.57 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.30 [2024-11-13 18:27:41,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.27 | bwd: 3852.09 | bwd_inner: 3844.57 | bwd_allreduce: 7.49 | step: 21.30 1%|▏ | 693/50750 [1:45:03<82:27:19, 5.93s/it] {'loss': 0.0034, 'learning_rate': 1.820091923834537e-05, 'epoch': 0.68} 1%|▏ | 693/50750 [1:45:03<82:27:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:27:47,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:27:47,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.10 | bwd_microstep: 3848.28 | bwd_inner_microstep: 3840.70 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.28 [2024-11-13 18:27:47,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.11 | bwd: 3848.29 | bwd_inner: 3840.70 | bwd_allreduce: 7.55 | step: 21.29 1%|▏ | 694/50750 [1:45:09<82:24:11, 5.93s/it] {'loss': 1.5825, 'learning_rate': 
1.8227183191070257e-05, 'epoch': 0.68} 1%|▏ | 694/50750 [1:45:09<82:24:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:27:53,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 18:27:53,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.84 | bwd_microstep: 3856.15 | bwd_inner_microstep: 3848.56 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.47 [2024-11-13 18:27:53,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.84 | bwd: 3856.16 | bwd_inner: 3848.56 | bwd_allreduce: 7.56 | step: 21.48 1%|▏ | 695/50750 [1:45:15<82:25:01, 5.93s/it] {'loss': 0.4153, 'learning_rate': 1.825344714379514e-05, 'epoch': 0.68} 1%|▏ | 695/50750 [1:45:15<82:25:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:27:59,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 18:27:59,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.45 | bwd_microstep: 3850.68 | bwd_inner_microstep: 3842.83 | bwd_allreduce_microstep: 7.80 | step_microstep: 22.26 [2024-11-13 18:27:59,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.45 | bwd: 3850.70 | bwd_inner: 3842.83 | bwd_allreduce: 7.82 | step: 22.26 1%|▏ | 696/50750 [1:45:21<82:25:14, 5.93s/it] {'loss': 0.8804, 'learning_rate': 1.827971109652003e-05, 'epoch': 0.69} 1%|▏ | 696/50750 [1:45:21<82:25:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:28:05,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 18:28:05,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.56 | bwd_microstep: 3846.61 | 
bwd_inner_microstep: 3838.84 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.87 [2024-11-13 18:28:05,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.54 | bwd: 3846.63 | bwd_inner: 3838.84 | bwd_allreduce: 7.75 | step: 21.87 1%|▏ | 697/50750 [1:45:27<82:24:01, 5.93s/it] {'loss': 0.6026, 'learning_rate': 1.8305975049244912e-05, 'epoch': 0.69} 1%|▏ | 697/50750 [1:45:27<82:24:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:28:11,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-13 18:28:11,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.02 | bwd_microstep: 3849.53 | bwd_inner_microstep: 3841.92 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.82 [2024-11-13 18:28:11,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.00 | bwd: 3849.54 | bwd_inner: 3841.92 | bwd_allreduce: 7.58 | step: 21.82 1%|▏ | 698/50750 [1:45:33<82:25:18, 5.93s/it] {'loss': 0.268, 'learning_rate': 1.83322390019698e-05, 'epoch': 0.69} 1%|▏ | 698/50750 [1:45:33<82:25:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:28:17,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-13 18:28:17,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.87 | bwd_microstep: 3851.61 | bwd_inner_microstep: 3844.11 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.33 [2024-11-13 18:28:17,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.87 | bwd: 3851.62 | bwd_inner: 3844.11 | bwd_allreduce: 7.47 | step: 21.33 1%|▏ | 699/50750 [1:45:39<82:24:51, 5.93s/it] {'loss': 0.4024, 'learning_rate': 1.8358502954694683e-05, 'epoch': 0.69} 1%|▏ | 699/50750 [1:45:39<82:24:51, 5.93s/it]dynamic ViT batch size: 32, images 
per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:28:23,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:28:23,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3859.17 | bwd_inner_microstep: 3851.71 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.94 [2024-11-13 18:28:23,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3859.18 | bwd_inner: 3851.71 | bwd_allreduce: 7.43 | step: 20.95 1%|▏ | 700/50750 [1:45:45<82:25:16, 5.93s/it] {'loss': 0.0839, 'learning_rate': 1.838476690741957e-05, 'epoch': 0.69} 1%|▏ | 700/50750 [1:45:45<82:25:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:28:29,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:28:29,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.60 | bwd_microstep: 3852.82 | bwd_inner_microstep: 3845.33 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.96 [2024-11-13 18:28:29,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.60 | bwd: 3852.83 | bwd_inner: 3845.33 | bwd_allreduce: 7.46 | step: 20.97 1%|▏ | 701/50750 [1:45:51<82:23:10, 5.93s/it] {'loss': 0.9381, 'learning_rate': 1.8411030860144454e-05, 'epoch': 0.69} 1%|▏ | 701/50750 [1:45:51<82:23:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:28:35,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:28:35,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3850.43 | bwd_inner_microstep: 3842.92 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.21 [2024-11-13 18:28:35,140] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3850.44 | bwd_inner: 3842.92 | bwd_allreduce: 7.48 | step: 22.21 1%|▏ | 702/50750 [1:45:57<82:22:16, 5.93s/it] {'loss': 0.204, 'learning_rate': 1.8437294812869338e-05, 'epoch': 0.69} 1%|▏ | 702/50750 [1:45:57<82:22:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:28:41,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 18:28:41,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.07 | bwd_microstep: 3858.06 | bwd_inner_microstep: 3850.49 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.19 [2024-11-13 18:28:41,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.07 | bwd: 3858.07 | bwd_inner: 3850.49 | bwd_allreduce: 7.54 | step: 21.20 1%|▏ | 703/50750 [1:46:03<82:24:18, 5.93s/it] {'loss': 0.0754, 'learning_rate': 1.846355876559422e-05, 'epoch': 0.69} 1%|▏ | 703/50750 [1:46:03<82:24:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:28:47,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:28:47,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.47 | bwd_microstep: 3854.14 | bwd_inner_microstep: 3846.66 | bwd_allreduce_microstep: 7.44 | step_microstep: 22.83 [2024-11-13 18:28:47,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.45 | bwd: 3854.15 | bwd_inner: 3846.66 | bwd_allreduce: 7.45 | step: 22.83 1%|▏ | 704/50750 [1:46:08<82:24:05, 5.93s/it] {'loss': 0.0614, 'learning_rate': 1.848982271831911e-05, 'epoch': 0.69} 1%|▏ | 704/50750 [1:46:08<82:24:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:28:52,925] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:28:52,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.37 | bwd_microstep: 3849.32 | bwd_inner_microstep: 3841.78 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.68 [2024-11-13 18:28:52,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.37 | bwd: 3849.33 | bwd_inner: 3841.78 | bwd_allreduce: 7.51 | step: 21.68 1%|▏ | 705/50750 [1:46:14<82:23:47, 5.93s/it] {'loss': 0.1087, 'learning_rate': 1.8516086671043992e-05, 'epoch': 0.69} 1%|▏ | 705/50750 [1:46:14<82:23:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:28:58,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:28:58,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3854.44 | bwd_inner_microstep: 3846.76 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.08 [2024-11-13 18:28:58,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.85 | bwd: 3854.45 | bwd_inner: 3846.76 | bwd_allreduce: 7.65 | step: 21.09 1%|▏ | 706/50750 [1:46:20<82:23:50, 5.93s/it] {'loss': 0.1237, 'learning_rate': 1.854235062376888e-05, 'epoch': 0.7} 1%|▏ | 706/50750 [1:46:20<82:23:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:29:04,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:29:04,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.39 | bwd_microstep: 3847.42 | bwd_inner_microstep: 3839.78 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.43 [2024-11-13 18:29:04,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.39 | bwd: 3847.44 | bwd_inner: 3839.78 | bwd_allreduce: 7.62 
| step: 21.44 1%|▏ | 707/50750 [1:46:26<82:21:33, 5.92s/it] {'loss': 0.0248, 'learning_rate': 1.8568614576493763e-05, 'epoch': 0.7} 1%|▏ | 707/50750 [1:46:26<82:21:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:29:10,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:29:10,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.87 | bwd_microstep: 3847.47 | bwd_inner_microstep: 3839.61 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.90 [2024-11-13 18:29:10,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.87 | bwd: 3847.49 | bwd_inner: 3839.61 | bwd_allreduce: 7.84 | step: 21.90 1%|▏ | 708/50750 [1:46:32<82:21:36, 5.92s/it] {'loss': 0.0502, 'learning_rate': 1.859487852921865e-05, 'epoch': 0.7} 1%|▏ | 708/50750 [1:46:32<82:21:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2194 [2024-11-13 18:29:16,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:29:16,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.86 | bwd_microstep: 3847.65 | bwd_inner_microstep: 3839.88 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.95 [2024-11-13 18:29:16,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.84 | bwd: 3847.66 | bwd_inner: 3839.88 | bwd_allreduce: 7.74 | step: 21.96 1%|▏ | 709/50750 [1:46:38<82:21:54, 5.93s/it] {'loss': 0.0056, 'learning_rate': 1.8621142481943534e-05, 'epoch': 0.7} 1%|▏ | 709/50750 [1:46:38<82:21:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:29:22,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:29:22,555] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.95 | bwd_microstep: 3846.98 | bwd_inner_microstep: 3839.46 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.20 [2024-11-13 18:29:22,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.93 | bwd: 3847.00 | bwd_inner: 3839.46 | bwd_allreduce: 7.49 | step: 21.21 1%|▏ | 710/50750 [1:46:44<82:22:47, 5.93s/it] {'loss': 0.0091, 'learning_rate': 1.8647406434668418e-05, 'epoch': 0.7} 1%|▏ | 710/50750 [1:46:44<82:22:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:29:28,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 18:29:28,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.79 | bwd_microstep: 3847.80 | bwd_inner_microstep: 3839.83 | bwd_allreduce_microstep: 7.90 | step_microstep: 22.28 [2024-11-13 18:29:28,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.77 | bwd: 3847.82 | bwd_inner: 3839.83 | bwd_allreduce: 7.93 | step: 22.28 1%|▏ | 711/50750 [1:46:50<82:22:38, 5.93s/it] {'loss': 0.0179, 'learning_rate': 1.8673670387393302e-05, 'epoch': 0.7} 1%|▏ | 711/50750 [1:46:50<82:22:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:29:34,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:29:34,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.90 | bwd_microstep: 3845.68 | bwd_inner_microstep: 3838.19 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.09 [2024-11-13 18:29:34,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.90 | bwd: 3845.69 | bwd_inner: 3838.19 | bwd_allreduce: 7.46 | step: 21.09 1%|▏ | 712/50750 [1:46:56<82:21:37, 5.93s/it] {'loss': 0.0148, 'learning_rate': 1.869993434011819e-05, 
'epoch': 0.7} 1%|▏ | 712/50750 [1:46:56<82:21:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:29:40,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 18:29:40,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.40 | bwd_microstep: 3849.08 | bwd_inner_microstep: 3841.49 | bwd_allreduce_microstep: 7.54 | step_microstep: 23.15 [2024-11-13 18:29:40,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.40 | bwd: 3849.09 | bwd_inner: 3841.49 | bwd_allreduce: 7.56 | step: 23.15 1%|▏ | 713/50750 [1:47:02<82:21:18, 5.93s/it] {'loss': 0.7526, 'learning_rate': 1.8726198292843073e-05, 'epoch': 0.7} 1%|▏ | 713/50750 [1:47:02<82:21:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:29:46,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:29:46,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.61 | bwd_microstep: 3848.47 | bwd_inner_microstep: 3840.95 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.03 [2024-11-13 18:29:46,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.60 | bwd: 3848.48 | bwd_inner: 3840.95 | bwd_allreduce: 7.50 | step: 21.04 1%|▏ | 714/50750 [1:47:08<82:19:54, 5.92s/it] {'loss': 0.0483, 'learning_rate': 1.875246224556796e-05, 'epoch': 0.7} 1%|▏ | 714/50750 [1:47:08<82:19:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:29:52,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:29:52,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.68 | bwd_microstep: 3860.29 | bwd_inner_microstep: 
3852.53 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.68 [2024-11-13 18:29:52,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.68 | bwd: 3860.30 | bwd_inner: 3852.53 | bwd_allreduce: 7.73 | step: 21.69 1%|▏ | 715/50750 [1:47:14<82:23:45, 5.93s/it] {'loss': 0.2427, 'learning_rate': 1.8778726198292844e-05, 'epoch': 0.7} 1%|▏ | 715/50750 [1:47:14<82:23:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:29:58,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:29:58,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.11 | bwd_microstep: 3855.66 | bwd_inner_microstep: 3845.93 | bwd_allreduce_microstep: 9.64 | step_microstep: 21.75 [2024-11-13 18:29:58,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.10 | bwd: 3855.69 | bwd_inner: 3845.93 | bwd_allreduce: 9.68 | step: 21.73 1%|▏ | 716/50750 [1:47:20<82:23:42, 5.93s/it] {'loss': 0.7384, 'learning_rate': 1.880499015101773e-05, 'epoch': 0.71} 1%|▏ | 716/50750 [1:47:20<82:23:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:30:04,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:30:04,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.81 | bwd_microstep: 3847.63 | bwd_inner_microstep: 3840.14 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.94 [2024-11-13 18:30:04,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.80 | bwd: 3847.65 | bwd_inner: 3840.14 | bwd_allreduce: 7.47 | step: 20.94 1%|▏ | 717/50750 [1:47:25<82:21:11, 5.93s/it] {'loss': 0.0057, 'learning_rate': 1.8831254103742615e-05, 'epoch': 0.71} 1%|▏ | 717/50750 [1:47:26<82:21:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2199 [2024-11-13 18:30:09,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:30:09,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.85 | bwd_microstep: 3849.79 | bwd_inner_microstep: 3841.81 | bwd_allreduce_microstep: 7.92 | step_microstep: 25.31 [2024-11-13 18:30:09,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.85 | bwd: 3849.81 | bwd_inner: 3841.81 | bwd_allreduce: 7.94 | step: 25.30 1%|▏ | 718/50750 [1:47:31<82:23:07, 5.93s/it] {'loss': 0.2466, 'learning_rate': 1.8857518056467502e-05, 'epoch': 0.71} 1%|▏ | 718/50750 [1:47:31<82:23:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:30:15,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 18:30:15,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.76 | bwd_microstep: 3845.28 | bwd_inner_microstep: 3837.42 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.81 [2024-11-13 18:30:15,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.74 | bwd: 3845.29 | bwd_inner: 3837.42 | bwd_allreduce: 7.83 | step: 21.81 1%|▏ | 719/50750 [1:47:37<82:23:20, 5.93s/it] {'loss': 0.0065, 'learning_rate': 1.8883782009192386e-05, 'epoch': 0.71} 1%|▏ | 719/50750 [1:47:37<82:23:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:30:21,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:30:21,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.39 | bwd_microstep: 3849.33 | bwd_inner_microstep: 3841.86 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.88 [2024-11-13 18:30:21,827] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.34 | bwd: 3849.34 | bwd_inner: 3841.86 | bwd_allreduce: 7.45 | step: 20.89 1%|▏ | 720/50750 [1:47:43<82:23:09, 5.93s/it] {'loss': 0.4307, 'learning_rate': 1.891004596191727e-05, 'epoch': 0.71} 1%|▏ | 720/50750 [1:47:43<82:23:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:30:27,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:30:27,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.77 | bwd_microstep: 3853.42 | bwd_inner_microstep: 3845.84 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.48 [2024-11-13 18:30:27,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.77 | bwd: 3853.43 | bwd_inner: 3845.84 | bwd_allreduce: 7.55 | step: 21.49 1%|▏ | 721/50750 [1:47:49<82:22:08, 5.93s/it] {'loss': 0.0064, 'learning_rate': 1.8936309914642153e-05, 'epoch': 0.71} 1%|▏ | 721/50750 [1:47:49<82:22:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:30:33,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:30:33,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.50 | bwd_microstep: 3846.37 | bwd_inner_microstep: 3838.53 | bwd_allreduce_microstep: 7.80 | step_microstep: 21.84 [2024-11-13 18:30:33,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3846.39 | bwd_inner: 3838.53 | bwd_allreduce: 7.82 | step: 21.85 1%|▏ | 722/50750 [1:47:55<82:20:30, 5.93s/it] {'loss': 0.9709, 'learning_rate': 1.896257386736704e-05, 'epoch': 0.71} 1%|▏ | 722/50750 [1:47:55<82:20:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:30:39,601] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:30:39,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.84 | bwd_microstep: 3849.45 | bwd_inner_microstep: 3841.98 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.80 [2024-11-13 18:30:39,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.83 | bwd: 3849.46 | bwd_inner: 3841.98 | bwd_allreduce: 7.45 | step: 20.80 1%|▏ | 723/50750 [1:48:01<82:21:11, 5.93s/it] {'loss': 0.0036, 'learning_rate': 1.8988837820091924e-05, 'epoch': 0.71} 1%|▏ | 723/50750 [1:48:01<82:21:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:30:45,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:30:45,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.98 | bwd_microstep: 3857.34 | bwd_inner_microstep: 3849.85 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.40 [2024-11-13 18:30:45,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.98 | bwd: 3857.35 | bwd_inner: 3849.85 | bwd_allreduce: 7.46 | step: 21.40 1%|▏ | 724/50750 [1:48:07<82:21:49, 5.93s/it] {'loss': 0.0114, 'learning_rate': 1.901510177281681e-05, 'epoch': 0.71} 1%|▏ | 724/50750 [1:48:07<82:21:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:30:51,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:30:51,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.51 | bwd_microstep: 3853.60 | bwd_inner_microstep: 3845.84 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.79 [2024-11-13 18:30:51,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.48 | bwd: 3853.61 | bwd_inner: 3845.84 | bwd_allreduce: 
7.73 | step: 21.80 1%|▏ | 725/50750 [1:48:13<82:22:48, 5.93s/it] {'loss': 0.5966, 'learning_rate': 1.9041365725541695e-05, 'epoch': 0.71} 1%|▏ | 725/50750 [1:48:13<82:22:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:30:57,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 18:30:57,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.94 | bwd_microstep: 3852.87 | bwd_inner_microstep: 3845.33 | bwd_allreduce_microstep: 7.49 | step_microstep: 24.76 [2024-11-13 18:30:57,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | bwd: 3852.88 | bwd_inner: 3845.33 | bwd_allreduce: 7.51 | step: 24.76 1%|▏ | 726/50750 [1:48:19<82:23:14, 5.93s/it] {'loss': 0.255, 'learning_rate': 1.9067629678266582e-05, 'epoch': 0.72} 1%|▏ | 726/50750 [1:48:19<82:23:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 18:31:03,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:31:03,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.65 | bwd_microstep: 3847.38 | bwd_inner_microstep: 3839.52 | bwd_allreduce_microstep: 7.80 | step_microstep: 21.70 [2024-11-13 18:31:03,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.63 | bwd: 3847.40 | bwd_inner: 3839.52 | bwd_allreduce: 7.82 | step: 21.70 1%|▏ | 727/50750 [1:48:25<82:22:09, 5.93s/it] {'loss': 0.1429, 'learning_rate': 1.9093893630991466e-05, 'epoch': 0.72} 1%|▏ | 727/50750 [1:48:25<82:22:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:31:09,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 18:31:09,233] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3842.96 | bwd_inner_microstep: 3835.47 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.87 [2024-11-13 18:31:09,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.56 | bwd: 3842.97 | bwd_inner: 3835.47 | bwd_allreduce: 7.46 | step: 20.87 1%|▏ | 728/50750 [1:48:31<82:19:05, 5.92s/it] {'loss': 0.0004, 'learning_rate': 1.912015758371635e-05, 'epoch': 0.72} 1%|▏ | 728/50750 [1:48:31<82:19:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:31:15,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 18:31:15,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.06 | bwd_microstep: 3844.64 | bwd_inner_microstep: 3837.00 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.61 [2024-11-13 18:31:15,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.06 | bwd: 3844.66 | bwd_inner: 3837.00 | bwd_allreduce: 7.62 | step: 21.62 1%|▏ | 729/50750 [1:48:37<82:17:22, 5.92s/it] {'loss': 1.117, 'learning_rate': 1.9146421536441237e-05, 'epoch': 0.72} 1%|▏ | 729/50750 [1:48:37<82:17:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:31:21,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:31:21,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.45 | bwd_microstep: 3847.73 | bwd_inner_microstep: 3840.25 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-13 18:31:21,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.43 | bwd: 3847.74 | bwd_inner: 3840.25 | bwd_allreduce: 7.46 | step: 20.98 1%|▏ | 730/50750 [1:48:43<82:17:13, 5.92s/it] {'loss': 0.2076, 'learning_rate': 
1.917268548916612e-05, 'epoch': 0.72} 1%|▏ | 730/50750 [1:48:43<82:17:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:31:26,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:31:26,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.68 | bwd_microstep: 3850.66 | bwd_inner_microstep: 3843.20 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-13 18:31:26,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.69 | bwd: 3850.68 | bwd_inner: 3843.20 | bwd_allreduce: 7.43 | step: 20.88 1%|▏ | 731/50750 [1:48:48<82:17:45, 5.92s/it] {'loss': 0.2218, 'learning_rate': 1.9198949441891008e-05, 'epoch': 0.72} 1%|▏ | 731/50750 [1:48:48<82:17:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:31:32,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:31:32,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.15 | bwd_microstep: 3856.38 | bwd_inner_microstep: 3848.87 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.14 [2024-11-13 18:31:32,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 3856.40 | bwd_inner: 3848.87 | bwd_allreduce: 7.49 | step: 21.14 1%|▏ | 732/50750 [1:48:54<82:18:49, 5.92s/it] {'loss': 0.0835, 'learning_rate': 1.922521339461589e-05, 'epoch': 0.72} 1%|▏ | 732/50750 [1:48:54<82:18:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:31:38,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:31:38,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.89 | bwd_microstep: 3842.37 | 
bwd_inner_microstep: 3834.86 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.19 [2024-11-13 18:31:38,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.89 | bwd: 3842.39 | bwd_inner: 3834.86 | bwd_allreduce: 7.49 | step: 21.19 1%|▏ | 733/50750 [1:49:00<82:15:43, 5.92s/it] {'loss': 0.6087, 'learning_rate': 1.9251477347340775e-05, 'epoch': 0.72} 1%|▏ | 733/50750 [1:49:00<82:15:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:31:44,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:31:44,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.06 | bwd_microstep: 3854.35 | bwd_inner_microstep: 3846.85 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.01 [2024-11-13 18:31:44,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3854.36 | bwd_inner: 3846.85 | bwd_allreduce: 7.47 | step: 21.01 1%|▏ | 734/50750 [1:49:06<82:16:48, 5.92s/it] {'loss': 0.0564, 'learning_rate': 1.9277741300065662e-05, 'epoch': 0.72} 1%|▏ | 734/50750 [1:49:06<82:16:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:31:50,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:31:50,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.71 | bwd_microstep: 3843.71 | bwd_inner_microstep: 3836.10 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.32 [2024-11-13 18:31:50,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.71 | bwd: 3843.73 | bwd_inner: 3836.10 | bwd_allreduce: 7.57 | step: 21.32 1%|▏ | 735/50750 [1:49:12<82:14:58, 5.92s/it] {'loss': 0.1951, 'learning_rate': 1.9304005252790546e-05, 'epoch': 0.72} 1%|▏ | 735/50750 [1:49:12<82:14:58, 5.92s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:31:56,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:31:56,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.65 | bwd_microstep: 3846.56 | bwd_inner_microstep: 3839.08 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.92 [2024-11-13 18:31:56,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.65 | bwd: 3846.57 | bwd_inner: 3839.08 | bwd_allreduce: 7.45 | step: 20.93 1%|▏ | 736/50750 [1:49:18<82:14:34, 5.92s/it] {'loss': 0.0415, 'learning_rate': 1.933026920551543e-05, 'epoch': 0.73} 1%|▏ | 736/50750 [1:49:18<82:14:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:32:02,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:32:02,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.76 | bwd_microstep: 3845.09 | bwd_inner_microstep: 3837.63 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.89 [2024-11-13 18:32:02,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.76 | bwd: 3845.10 | bwd_inner: 3837.63 | bwd_allreduce: 7.43 | step: 20.89 1%|▏ | 737/50750 [1:49:24<82:13:21, 5.92s/it] {'loss': 0.0328, 'learning_rate': 1.9356533158240317e-05, 'epoch': 0.73} 1%|▏ | 737/50750 [1:49:24<82:13:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:32:08,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:32:08,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.30 | bwd_microstep: 3849.08 | bwd_inner_microstep: 3841.57 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.04 [2024-11-13 18:32:08,435] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.30 | bwd: 3849.09 | bwd_inner: 3841.57 | bwd_allreduce: 7.49 | step: 21.04 1%|▏ | 738/50750 [1:49:30<82:13:50, 5.92s/it] {'loss': 0.0085, 'learning_rate': 1.93827971109652e-05, 'epoch': 0.73} 1%|▏ | 738/50750 [1:49:30<82:13:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:32:14,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:32:14,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.28 | bwd_microstep: 3851.09 | bwd_inner_microstep: 3843.60 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.10 [2024-11-13 18:32:14,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.28 | bwd: 3851.11 | bwd_inner: 3843.60 | bwd_allreduce: 7.46 | step: 21.11 1%|▏ | 739/50750 [1:49:36<82:15:39, 5.92s/it] {'loss': 0.2562, 'learning_rate': 1.9409061063690088e-05, 'epoch': 0.73} 1%|▏ | 739/50750 [1:49:36<82:15:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:32:20,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:32:20,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.97 | bwd_microstep: 3847.36 | bwd_inner_microstep: 3839.90 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.94 [2024-11-13 18:32:20,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.96 | bwd: 3847.37 | bwd_inner: 3839.90 | bwd_allreduce: 7.43 | step: 20.94 1%|▏ | 740/50750 [1:49:42<82:16:38, 5.92s/it] {'loss': 0.0535, 'learning_rate': 1.9435325016414972e-05, 'epoch': 0.73} 1%|▏ | 740/50750 [1:49:42<82:16:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:32:26,205] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:32:26,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.56 | bwd_microstep: 3847.26 | bwd_inner_microstep: 3839.80 | bwd_allreduce_microstep: 7.42 | step_microstep: 23.40 [2024-11-13 18:32:26,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.56 | bwd: 3847.27 | bwd_inner: 3839.80 | bwd_allreduce: 7.44 | step: 23.40 1%|▏ | 741/50750 [1:49:48<82:15:40, 5.92s/it] {'loss': 0.0052, 'learning_rate': 1.946158896913986e-05, 'epoch': 0.73} 1%|▏ | 741/50750 [1:49:48<82:15:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:32:32,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:32:32,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.35 | bwd_microstep: 3848.70 | bwd_inner_microstep: 3841.18 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-13 18:32:32,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.35 | bwd: 3848.72 | bwd_inner: 3841.18 | bwd_allreduce: 7.49 | step: 21.08 1%|▏ | 742/50750 [1:49:54<82:15:29, 5.92s/it] {'loss': 0.6522, 'learning_rate': 1.9487852921864743e-05, 'epoch': 0.73} 1%|▏ | 742/50750 [1:49:54<82:15:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:32:38,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:32:38,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.78 | bwd_microstep: 3840.55 | bwd_inner_microstep: 3833.05 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.19 [2024-11-13 18:32:38,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.78 | bwd: 3840.56 | bwd_inner: 3833.05 | 
bwd_allreduce: 7.46 | step: 21.20 1%|▏ | 743/50750 [1:49:59<82:12:03, 5.92s/it] {'loss': 0.506, 'learning_rate': 1.9514116874589626e-05, 'epoch': 0.73} 1%|▏ | 743/50750 [1:50:00<82:12:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:32:43,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:32:43,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.89 | bwd_microstep: 3845.88 | bwd_inner_microstep: 3838.33 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.07 [2024-11-13 18:32:43,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.89 | bwd: 3845.89 | bwd_inner: 3838.33 | bwd_allreduce: 7.52 | step: 21.07 1%|▏ | 744/50750 [1:50:05<82:10:48, 5.92s/it] {'loss': 0.004, 'learning_rate': 1.9540380827314514e-05, 'epoch': 0.73} 1%|▏ | 744/50750 [1:50:05<82:10:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:32:49,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:32:49,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.45 | bwd_microstep: 3850.90 | bwd_inner_microstep: 3843.34 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.62 [2024-11-13 18:32:49,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.45 | bwd: 3850.91 | bwd_inner: 3843.34 | bwd_allreduce: 7.53 | step: 21.63 1%|▏ | 745/50750 [1:50:11<82:13:04, 5.92s/it] {'loss': 0.1205, 'learning_rate': 1.9566644780039397e-05, 'epoch': 0.73} 1%|▏ | 745/50750 [1:50:11<82:13:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:32:55,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 
18:32:55,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.90 | bwd_microstep: 3844.03 | bwd_inner_microstep: 3836.52 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.16 [2024-11-13 18:32:55,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.89 | bwd: 3844.05 | bwd_inner: 3836.52 | bwd_allreduce: 7.49 | step: 21.16 1%|▏ | 746/50750 [1:50:17<82:14:37, 5.92s/it] {'loss': 0.6777, 'learning_rate': 1.959290873276428e-05, 'epoch': 0.73} 1%|▏ | 746/50750 [1:50:17<82:14:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:33:01,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:33:01,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3842.48 | bwd_inner_microstep: 3834.77 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.45 [2024-11-13 18:33:01,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.58 | bwd: 3842.49 | bwd_inner: 3834.77 | bwd_allreduce: 7.68 | step: 21.46 1%|▏ | 747/50750 [1:50:23<82:12:53, 5.92s/it] {'loss': 0.0023, 'learning_rate': 1.961917268548917e-05, 'epoch': 0.74} 1%|▏ | 747/50750 [1:50:23<82:12:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:33:07,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 18:33:07,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.61 | bwd_microstep: 3841.31 | bwd_inner_microstep: 3833.81 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.97 [2024-11-13 18:33:07,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.61 | bwd: 3841.32 | bwd_inner: 3833.81 | bwd_allreduce: 7.47 | step: 20.97 1%|▏ | 748/50750 [1:50:29<82:10:33, 5.92s/it] {'loss': 0.0142, 'learning_rate': 
1.9645436638214052e-05, 'epoch': 0.74} 1%|▏ | 748/50750 [1:50:29<82:10:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:33:13,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:33:13,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.82 | bwd_microstep: 3846.00 | bwd_inner_microstep: 3838.51 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.92 [2024-11-13 18:33:13,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.82 | bwd: 3846.02 | bwd_inner: 3838.51 | bwd_allreduce: 7.47 | step: 20.93 1%|▏ | 749/50750 [1:50:35<82:10:32, 5.92s/it] {'loss': 0.4268, 'learning_rate': 1.967170059093894e-05, 'epoch': 0.74} 1%|▏ | 749/50750 [1:50:35<82:10:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:33:19,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:33:19,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.16 | bwd_microstep: 3851.16 | bwd_inner_microstep: 3843.66 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.32 [2024-11-13 18:33:19,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.16 | bwd: 3851.17 | bwd_inner: 3843.66 | bwd_allreduce: 7.48 | step: 21.32 1%|▏ | 750/50750 [1:50:41<82:12:05, 5.92s/it] {'loss': 0.0055, 'learning_rate': 1.9697964543663823e-05, 'epoch': 0.74} 1%|▏ | 750/50750 [1:50:41<82:12:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:33:25,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:33:25,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.91 | bwd_microstep: 3858.01 
| bwd_inner_microstep: 3850.12 | bwd_allreduce_microstep: 7.84 | step_microstep: 23.83 [2024-11-13 18:33:25,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.91 | bwd: 3858.03 | bwd_inner: 3850.12 | bwd_allreduce: 7.86 | step: 23.83 1%|▏ | 751/50750 [1:50:47<82:16:39, 5.92s/it] {'loss': 0.05, 'learning_rate': 1.972422849638871e-05, 'epoch': 0.74} 1%|▏ | 751/50750 [1:50:47<82:16:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:33:31,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.92 [2024-11-13 18:33:31,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.66 | bwd_microstep: 3849.19 | bwd_inner_microstep: 3840.81 | bwd_allreduce_microstep: 8.13 | step_microstep: 26.49 [2024-11-13 18:33:31,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.64 | bwd: 3849.21 | bwd_inner: 3840.81 | bwd_allreduce: 8.16 | step: 26.51 1%|▏ | 752/50750 [1:50:53<82:18:40, 5.93s/it] {'loss': 0.071, 'learning_rate': 1.9750492449113594e-05, 'epoch': 0.74} 1%|▏ | 752/50750 [1:50:53<82:18:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:33:37,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:33:37,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.15 | bwd_microstep: 3843.28 | bwd_inner_microstep: 3835.79 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.96 [2024-11-13 18:33:37,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.15 | bwd: 3843.29 | bwd_inner: 3835.79 | bwd_allreduce: 7.46 | step: 20.96 1%|▏ | 753/50750 [1:50:59<82:16:23, 5.92s/it] {'loss': 0.0062, 'learning_rate': 1.9776756401838478e-05, 'epoch': 0.74} 1%|▏ | 753/50750 [1:50:59<82:16:23, 5.92s/it]dynamic ViT batch size: 32, images 
per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:33:43,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.00 | optimizer_step: 4.93 [2024-11-13 18:33:43,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.98 | bwd_microstep: 3850.76 | bwd_inner_microstep: 3843.27 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.93 [2024-11-13 18:33:43,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.98 | bwd: 3850.78 | bwd_inner: 3843.27 | bwd_allreduce: 7.47 | step: 20.93 1%|▏ | 754/50750 [1:51:05<82:15:42, 5.92s/it] {'loss': 0.0084, 'learning_rate': 1.980302035456336e-05, 'epoch': 0.74} 1%|▏ | 754/50750 [1:51:05<82:15:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:33:49,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:33:49,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.17 | bwd_microstep: 3859.90 | bwd_inner_microstep: 3852.33 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.13 [2024-11-13 18:33:49,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.16 | bwd: 3859.92 | bwd_inner: 3852.33 | bwd_allreduce: 7.55 | step: 21.14 1%|▏ | 755/50750 [1:51:11<82:18:51, 5.93s/it] {'loss': 0.1383, 'learning_rate': 1.982928430728825e-05, 'epoch': 0.74} 1%|▏ | 755/50750 [1:51:11<82:18:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:33:55,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:33:55,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.13 | bwd_microstep: 3850.24 | bwd_inner_microstep: 3842.68 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.91 [2024-11-13 18:33:55,030] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.13 | bwd: 3850.25 | bwd_inner: 3842.68 | bwd_allreduce: 7.53 | step: 21.91 1%|▏ | 756/50750 [1:51:16<82:17:55, 5.93s/it] {'loss': 0.0188, 'learning_rate': 1.9855548260013132e-05, 'epoch': 0.74} 1%|▏ | 756/50750 [1:51:17<82:17:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:34:00,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.49 | optimizer_step: 4.93 [2024-11-13 18:34:00,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.14 | bwd_microstep: 3849.31 | bwd_inner_microstep: 3841.24 | bwd_allreduce_microstep: 8.02 | step_microstep: 22.61 [2024-11-13 18:34:00,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3849.33 | bwd_inner: 3841.24 | bwd_allreduce: 8.04 | step: 22.62 1%|▏ | 757/50750 [1:51:22<82:18:46, 5.93s/it] {'loss': 0.0112, 'learning_rate': 1.988181221273802e-05, 'epoch': 0.75} 1%|▏ | 757/50750 [1:51:22<82:18:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:34:06,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 18:34:06,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.68 | bwd_microstep: 3862.35 | bwd_inner_microstep: 3854.36 | bwd_allreduce_microstep: 7.92 | step_microstep: 23.86 [2024-11-13 18:34:06,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.67 | bwd: 3862.37 | bwd_inner: 3854.36 | bwd_allreduce: 7.95 | step: 23.86 1%|▏ | 758/50750 [1:51:28<82:24:11, 5.93s/it] {'loss': 0.7539, 'learning_rate': 1.9908076165462903e-05, 'epoch': 0.75} 1%|▏ | 758/50750 [1:51:28<82:24:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:34:12,839] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:34:12,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.65 | bwd_microstep: 3845.54 | bwd_inner_microstep: 3837.99 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.06 [2024-11-13 18:34:12,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.62 | bwd: 3845.55 | bwd_inner: 3837.99 | bwd_allreduce: 7.52 | step: 21.07 1%|▏ | 759/50750 [1:51:34<82:21:56, 5.93s/it] {'loss': 0.3037, 'learning_rate': 1.993434011818779e-05, 'epoch': 0.75} 1%|▏ | 759/50750 [1:51:34<82:21:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:34:18,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:34:18,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.82 | bwd_microstep: 3847.21 | bwd_inner_microstep: 3839.67 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.49 [2024-11-13 18:34:18,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.83 | bwd: 3847.22 | bwd_inner: 3839.67 | bwd_allreduce: 7.52 | step: 21.50 1%|▏ | 760/50750 [1:51:40<82:18:51, 5.93s/it] {'loss': 0.46, 'learning_rate': 1.9960604070912674e-05, 'epoch': 0.75} 1%|▏ | 760/50750 [1:51:40<82:18:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:34:24,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 18:34:24,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.50 | bwd_microstep: 3847.40 | bwd_inner_microstep: 3839.86 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.48 [2024-11-13 18:34:24,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.49 | bwd: 3847.41 | bwd_inner: 3839.86 | bwd_allreduce: 7.51 
| step: 22.48 1%|▏ | 761/50750 [1:51:46<82:17:31, 5.93s/it] {'loss': 0.0038, 'learning_rate': 1.998686802363756e-05, 'epoch': 0.75} 1%|▏ | 761/50750 [1:51:46<82:17:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:34:30,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 18:34:30,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.97 | bwd_microstep: 3846.15 | bwd_inner_microstep: 3838.43 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.04 [2024-11-13 18:34:30,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.97 | bwd: 3846.16 | bwd_inner: 3838.43 | bwd_allreduce: 7.69 | step: 22.04 2%|▏ | 762/50750 [1:51:52<82:16:12, 5.92s/it] {'loss': 0.0003, 'learning_rate': 2.0013131976362445e-05, 'epoch': 0.75} 2%|▏ | 762/50750 [1:51:52<82:16:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:34:36,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:34:36,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.66 | bwd_microstep: 3852.93 | bwd_inner_microstep: 3845.18 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.78 [2024-11-13 18:34:36,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.65 | bwd: 3852.95 | bwd_inner: 3845.18 | bwd_allreduce: 7.72 | step: 21.77 2%|▏ | 763/50750 [1:51:58<82:17:01, 5.93s/it] {'loss': 0.0004, 'learning_rate': 2.003939592908733e-05, 'epoch': 0.75} 2%|▏ | 763/50750 [1:51:58<82:17:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:34:42,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 18:34:42,453] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.55 | bwd_microstep: 3851.05 | bwd_inner_microstep: 3843.51 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.16 [2024-11-13 18:34:42,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.55 | bwd: 3851.06 | bwd_inner: 3843.51 | bwd_allreduce: 7.51 | step: 21.17 2%|▏ | 764/50750 [1:52:04<82:15:51, 5.92s/it] {'loss': 0.1692, 'learning_rate': 2.0065659881812216e-05, 'epoch': 0.75} 2%|▏ | 764/50750 [1:52:04<82:15:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:34:48,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:34:48,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.28 | bwd_microstep: 3848.67 | bwd_inner_microstep: 3841.14 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.12 [2024-11-13 18:34:48,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.28 | bwd: 3848.68 | bwd_inner: 3841.14 | bwd_allreduce: 7.50 | step: 21.12 2%|▏ | 765/50750 [1:52:10<82:14:10, 5.92s/it] {'loss': 0.0233, 'learning_rate': 2.00919238345371e-05, 'epoch': 0.75} 2%|▏ | 765/50750 [1:52:10<82:14:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:34:54,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 18:34:54,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.75 | bwd_microstep: 3852.75 | bwd_inner_microstep: 3845.05 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.29 [2024-11-13 18:34:54,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.75 | bwd: 3852.77 | bwd_inner: 3845.05 | bwd_allreduce: 7.67 | step: 22.28 2%|▏ | 766/50750 [1:52:16<82:16:57, 5.93s/it] {'loss': 0.013, 'learning_rate': 2.0118187787261984e-05, 
'epoch': 0.75} 2%|▏ | 766/50750 [1:52:16<82:16:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:35:00,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:35:00,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.94 | bwd_microstep: 3843.94 | bwd_inner_microstep: 3836.46 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.82 [2024-11-13 18:35:00,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.93 | bwd: 3843.95 | bwd_inner: 3836.46 | bwd_allreduce: 7.45 | step: 20.82 2%|▏ | 767/50750 [1:52:22<82:14:06, 5.92s/it] {'loss': 0.0088, 'learning_rate': 2.014445173998687e-05, 'epoch': 0.76} 2%|▏ | 767/50750 [1:52:22<82:14:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:35:06,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:35:06,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.37 | bwd_microstep: 3849.34 | bwd_inner_microstep: 3841.88 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.05 [2024-11-13 18:35:06,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.37 | bwd: 3849.35 | bwd_inner: 3841.88 | bwd_allreduce: 7.43 | step: 21.06 2%|▏ | 768/50750 [1:52:28<82:15:25, 5.92s/it] {'loss': 0.2827, 'learning_rate': 2.0170715692711755e-05, 'epoch': 0.76} 2%|▏ | 768/50750 [1:52:28<82:15:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:35:12,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:35:12,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.20 | bwd_microstep: 3846.69 | bwd_inner_microstep: 
3839.22 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.02 [2024-11-13 18:35:12,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.20 | bwd: 3846.71 | bwd_inner: 3839.22 | bwd_allreduce: 7.44 | step: 21.03 2%|▏ | 769/50750 [1:52:34<82:13:59, 5.92s/it] {'loss': 0.0024, 'learning_rate': 2.0196979645436642e-05, 'epoch': 0.76} 2%|▏ | 769/50750 [1:52:34<82:13:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:35:17,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:35:17,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.51 | bwd_microstep: 3849.30 | bwd_inner_microstep: 3841.72 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.80 [2024-11-13 18:35:17,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.49 | bwd: 3849.31 | bwd_inner: 3841.72 | bwd_allreduce: 7.55 | step: 21.80 2%|▏ | 770/50750 [1:52:39<82:13:50, 5.92s/it] {'loss': 0.8929, 'learning_rate': 2.0223243598161522e-05, 'epoch': 0.76} 2%|▏ | 770/50750 [1:52:39<82:13:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:35:23,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:35:23,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.77 | bwd_microstep: 3848.87 | bwd_inner_microstep: 3841.15 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.60 [2024-11-13 18:35:23,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.77 | bwd: 3848.88 | bwd_inner: 3841.15 | bwd_allreduce: 7.69 | step: 21.60 2%|▏ | 771/50750 [1:52:45<82:13:23, 5.92s/it] {'loss': 0.0022, 'learning_rate': 2.0249507550886413e-05, 'epoch': 0.76} 2%|▏ | 771/50750 [1:52:45<82:13:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2200 [2024-11-13 18:35:29,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:35:29,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.15 | bwd_microstep: 3843.08 | bwd_inner_microstep: 3835.55 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.08 [2024-11-13 18:35:29,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.13 | bwd: 3843.09 | bwd_inner: 3835.55 | bwd_allreduce: 7.50 | step: 21.08 2%|▏ | 772/50750 [1:52:51<82:12:05, 5.92s/it] {'loss': 0.7064, 'learning_rate': 2.0275771503611293e-05, 'epoch': 0.76} 2%|▏ | 772/50750 [1:52:51<82:12:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:35:35,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:35:35,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.91 | bwd_microstep: 3845.43 | bwd_inner_microstep: 3837.90 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.46 [2024-11-13 18:35:35,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.91 | bwd: 3845.45 | bwd_inner: 3837.90 | bwd_allreduce: 7.50 | step: 21.46 2%|▏ | 773/50750 [1:52:57<82:11:31, 5.92s/it] {'loss': 0.0032, 'learning_rate': 2.0302035456336184e-05, 'epoch': 0.76} 2%|▏ | 773/50750 [1:52:57<82:11:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:35:41,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:35:41,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.18 | bwd_microstep: 3853.01 | bwd_inner_microstep: 3845.21 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.97 [2024-11-13 18:35:41,680] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.18 | bwd: 3853.02 | bwd_inner: 3845.21 | bwd_allreduce: 7.76 | step: 21.98 2%|▏ | 774/50750 [1:53:03<82:14:24, 5.92s/it] {'loss': 0.0041, 'learning_rate': 2.0328299409061064e-05, 'epoch': 0.76} 2%|▏ | 774/50750 [1:53:03<82:14:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:35:47,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.96 [2024-11-13 18:35:47,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.57 | bwd_microstep: 3854.47 | bwd_inner_microstep: 3846.79 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.44 [2024-11-13 18:35:47,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.55 | bwd: 3854.49 | bwd_inner: 3846.79 | bwd_allreduce: 7.65 | step: 21.44 2%|▏ | 775/50750 [1:53:09<82:16:15, 5.93s/it] {'loss': 0.0105, 'learning_rate': 2.0354563361785948e-05, 'epoch': 0.76} 2%|▏ | 775/50750 [1:53:09<82:16:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:35:53,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.35 | optimizer_step: 4.93 [2024-11-13 18:35:53,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.39 | bwd_microstep: 3858.43 | bwd_inner_microstep: 3850.76 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.64 [2024-11-13 18:35:53,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.39 | bwd: 3858.45 | bwd_inner: 3850.76 | bwd_allreduce: 7.65 | step: 21.65 2%|▏ | 776/50750 [1:53:15<82:19:02, 5.93s/it] {'loss': 0.3063, 'learning_rate': 2.0380827314510835e-05, 'epoch': 0.76} 2%|▏ | 776/50750 [1:53:15<82:19:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:35:59,478] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:35:59,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.43 | bwd_microstep: 3853.27 | bwd_inner_microstep: 3845.51 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.94 [2024-11-13 18:35:59,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.41 | bwd: 3853.28 | bwd_inner: 3845.51 | bwd_allreduce: 7.73 | step: 21.94 2%|▏ | 777/50750 [1:53:21<82:18:30, 5.93s/it] {'loss': 0.0002, 'learning_rate': 2.040709126723572e-05, 'epoch': 0.77} 2%|▏ | 777/50750 [1:53:21<82:18:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:36:05,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:36:05,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.22 | bwd_microstep: 3852.39 | bwd_inner_microstep: 3844.86 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.33 [2024-11-13 18:36:05,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.21 | bwd: 3852.41 | bwd_inner: 3844.86 | bwd_allreduce: 7.50 | step: 21.34 2%|▏ | 778/50750 [1:53:27<82:18:26, 5.93s/it] {'loss': 0.0012, 'learning_rate': 2.0433355219960606e-05, 'epoch': 0.77} 2%|▏ | 778/50750 [1:53:27<82:18:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:36:11,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:36:11,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.91 | bwd_microstep: 3855.01 | bwd_inner_microstep: 3847.29 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.59 [2024-11-13 18:36:11,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.90 | bwd: 3855.02 | bwd_inner: 3847.29 | bwd_allreduce: 
7.69 | step: 21.60 2%|▏ | 779/50750 [1:53:33<82:19:00, 5.93s/it] {'loss': 0.2017, 'learning_rate': 2.045961917268549e-05, 'epoch': 0.77} 2%|▏ | 779/50750 [1:53:33<82:19:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:36:17,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:36:17,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.74 | bwd_microstep: 3850.35 | bwd_inner_microstep: 3842.80 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.02 [2024-11-13 18:36:17,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.72 | bwd: 3850.37 | bwd_inner: 3842.80 | bwd_allreduce: 7.53 | step: 21.02 2%|▏ | 780/50750 [1:53:39<82:19:17, 5.93s/it] {'loss': 0.6388, 'learning_rate': 2.0485883125410377e-05, 'epoch': 0.77} 2%|▏ | 780/50750 [1:53:39<82:19:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:36:23,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.92 [2024-11-13 18:36:23,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.18 | bwd_microstep: 3842.99 | bwd_inner_microstep: 3835.28 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.92 [2024-11-13 18:36:23,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.18 | bwd: 3843.00 | bwd_inner: 3835.28 | bwd_allreduce: 7.68 | step: 21.92 2%|▏ | 781/50750 [1:53:45<82:18:00, 5.93s/it] {'loss': 0.0019, 'learning_rate': 2.051214707813526e-05, 'epoch': 0.77} 2%|▏ | 781/50750 [1:53:45<82:18:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:36:29,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 18:36:29,131] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.04 | bwd_microstep: 3843.91 | bwd_inner_microstep: 3835.91 | bwd_allreduce_microstep: 7.94 | step_microstep: 24.20 [2024-11-13 18:36:29,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.03 | bwd: 3843.92 | bwd_inner: 3835.91 | bwd_allreduce: 7.97 | step: 24.20 2%|▏ | 782/50750 [1:53:51<82:19:14, 5.93s/it] {'loss': 0.05, 'learning_rate': 2.0538411030860148e-05, 'epoch': 0.77} 2%|▏ | 782/50750 [1:53:51<82:19:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:36:35,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-13 18:36:35,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.73 | bwd_microstep: 3850.68 | bwd_inner_microstep: 3842.94 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.62 [2024-11-13 18:36:35,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.72 | bwd: 3850.69 | bwd_inner: 3842.94 | bwd_allreduce: 7.71 | step: 22.62 2%|▏ | 783/50750 [1:53:57<82:22:48, 5.94s/it] {'loss': 0.001, 'learning_rate': 2.056467498358503e-05, 'epoch': 0.77} 2%|▏ | 783/50750 [1:53:57<82:22:48, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:36:41,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 18:36:41,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.47 | bwd_microstep: 3848.46 | bwd_inner_microstep: 3840.62 | bwd_allreduce_microstep: 7.79 | step_microstep: 21.90 [2024-11-13 18:36:41,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.46 | bwd: 3848.47 | bwd_inner: 3840.62 | bwd_allreduce: 7.81 | step: 21.91 2%|▏ | 784/50750 [1:54:02<82:23:20, 5.94s/it] {'loss': 0.3901, 'learning_rate': 
2.0590938936309915e-05, 'epoch': 0.77} 2%|▏ | 784/50750 [1:54:02<82:23:20, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:36:46,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:36:46,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.79 | bwd_microstep: 3844.08 | bwd_inner_microstep: 3836.59 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.01 [2024-11-13 18:36:46,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.78 | bwd: 3844.09 | bwd_inner: 3836.59 | bwd_allreduce: 7.47 | step: 21.02 2%|▏ | 785/50750 [1:54:08<82:19:32, 5.93s/it] {'loss': 0.0033, 'learning_rate': 2.0617202889034802e-05, 'epoch': 0.77} 2%|▏ | 785/50750 [1:54:08<82:19:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:36:52,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 18:36:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.93 | bwd_microstep: 3848.72 | bwd_inner_microstep: 3839.91 | bwd_allreduce_microstep: 8.76 | step_microstep: 22.47 [2024-11-13 18:36:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.94 | bwd: 3848.73 | bwd_inner: 3839.91 | bwd_allreduce: 8.78 | step: 22.47 2%|▏ | 786/50750 [1:54:14<82:18:40, 5.93s/it] {'loss': 0.5994, 'learning_rate': 2.0643466841759686e-05, 'epoch': 0.77} 2%|▏ | 786/50750 [1:54:14<82:18:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:36:58,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:36:58,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.09 | bwd_microstep: 3842.02 
| bwd_inner_microstep: 3834.52 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-13 18:36:58,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.08 | bwd: 3842.04 | bwd_inner: 3834.52 | bwd_allreduce: 7.48 | step: 20.96 2%|▏ | 787/50750 [1:54:20<82:14:57, 5.93s/it] {'loss': 0.2318, 'learning_rate': 2.0669730794484573e-05, 'epoch': 0.78} 2%|▏ | 787/50750 [1:54:20<82:14:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:37:04,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:37:04,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.91 | bwd_microstep: 3852.15 | bwd_inner_microstep: 3844.60 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.19 [2024-11-13 18:37:04,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.90 | bwd: 3852.17 | bwd_inner: 3844.60 | bwd_allreduce: 7.52 | step: 21.19 2%|▏ | 788/50750 [1:54:26<82:14:02, 5.93s/it] {'loss': 0.4791, 'learning_rate': 2.0695994747209457e-05, 'epoch': 0.78} 2%|▏ | 788/50750 [1:54:26<82:14:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:37:10,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:37:10,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.63 | bwd_microstep: 3840.07 | bwd_inner_microstep: 3831.97 | bwd_allreduce_microstep: 8.04 | step_microstep: 26.09 [2024-11-13 18:37:10,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.63 | bwd: 3840.09 | bwd_inner: 3831.97 | bwd_allreduce: 8.06 | step: 26.09 2%|▏ | 789/50750 [1:54:32<82:11:57, 5.92s/it] {'loss': 0.3782, 'learning_rate': 2.0722258699934344e-05, 'epoch': 0.78} 2%|▏ | 789/50750 [1:54:32<82:11:57, 5.92s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:37:16,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:37:16,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.75 | bwd_microstep: 3842.24 | bwd_inner_microstep: 3834.65 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.31 [2024-11-13 18:37:16,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.73 | bwd: 3842.25 | bwd_inner: 3834.65 | bwd_allreduce: 7.57 | step: 21.31 2%|▏ | 790/50750 [1:54:38<82:10:10, 5.92s/it] {'loss': 0.0345, 'learning_rate': 2.0748522652659228e-05, 'epoch': 0.78} 2%|▏ | 790/50750 [1:54:38<82:10:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:37:22,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:37:22,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.84 | bwd_microstep: 3852.52 | bwd_inner_microstep: 3844.98 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.59 [2024-11-13 18:37:22,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.84 | bwd: 3852.53 | bwd_inner: 3844.98 | bwd_allreduce: 7.51 | step: 21.59 2%|▏ | 791/50750 [1:54:44<82:12:39, 5.92s/it] {'loss': 0.5353, 'learning_rate': 2.0774786605384115e-05, 'epoch': 0.78} 2%|▏ | 791/50750 [1:54:44<82:12:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:37:28,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:37:28,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | bwd_microstep: 3843.91 | bwd_inner_microstep: 3836.39 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-13 
18:37:28,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.22 | bwd: 3843.92 | bwd_inner: 3836.39 | bwd_allreduce: 7.49 | step: 21.07 2%|▏ | 792/50750 [1:54:50<82:12:08, 5.92s/it] {'loss': 0.0123, 'learning_rate': 2.0801050558108996e-05, 'epoch': 0.78} 2%|▏ | 792/50750 [1:54:50<82:12:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 18:37:34,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:37:34,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.84 | bwd_microstep: 3845.95 | bwd_inner_microstep: 3838.43 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.13 [2024-11-13 18:37:34,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.84 | bwd: 3845.96 | bwd_inner: 3838.43 | bwd_allreduce: 7.49 | step: 21.13 2%|▏ | 793/50750 [1:54:56<82:09:44, 5.92s/it] {'loss': 0.3794, 'learning_rate': 2.082731451083388e-05, 'epoch': 0.78} 2%|▏ | 793/50750 [1:54:56<82:09:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:37:40,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:37:40,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.77 | bwd_microstep: 3848.14 | bwd_inner_microstep: 3840.39 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.24 [2024-11-13 18:37:40,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.77 | bwd: 3848.16 | bwd_inner: 3840.39 | bwd_allreduce: 7.73 | step: 21.25 2%|▏ | 794/50750 [1:55:02<82:08:48, 5.92s/it] {'loss': 0.0134, 'learning_rate': 2.0853578463558766e-05, 'epoch': 0.78} 2%|▏ | 794/50750 [1:55:02<82:08:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:37:46,142] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 18:37:46,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.42 | bwd_microstep: 3846.05 | bwd_inner_microstep: 3838.51 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.77 [2024-11-13 18:37:46,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.42 | bwd: 3846.06 | bwd_inner: 3838.51 | bwd_allreduce: 7.51 | step: 21.78 2%|▏ | 795/50750 [1:55:08<82:07:59, 5.92s/it] {'loss': 0.4459, 'learning_rate': 2.087984241628365e-05, 'epoch': 0.78} 2%|▏ | 795/50750 [1:55:08<82:07:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:37:52,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:37:52,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.32 | bwd_microstep: 3848.64 | bwd_inner_microstep: 3840.84 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.91 [2024-11-13 18:37:52,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.30 | bwd: 3848.65 | bwd_inner: 3840.84 | bwd_allreduce: 7.76 | step: 21.91 2%|▏ | 796/50750 [1:55:14<82:11:08, 5.92s/it] {'loss': 0.4413, 'learning_rate': 2.0906106369008537e-05, 'epoch': 0.78} 2%|▏ | 796/50750 [1:55:14<82:11:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:37:58,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:37:58,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.56 | bwd_microstep: 3845.35 | bwd_inner_microstep: 3837.80 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.60 [2024-11-13 18:37:58,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.54 | bwd: 3845.36 | 
bwd_inner: 3837.80 | bwd_allreduce: 7.52 | step: 21.60 2%|▏ | 797/50750 [1:55:19<82:14:03, 5.93s/it] {'loss': 0.5838, 'learning_rate': 2.093237032173342e-05, 'epoch': 0.79} 2%|▏ | 797/50750 [1:55:19<82:14:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:38:03,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:38:03,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.53 | bwd_microstep: 3848.13 | bwd_inner_microstep: 3840.41 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.17 [2024-11-13 18:38:03,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.51 | bwd: 3848.14 | bwd_inner: 3840.41 | bwd_allreduce: 7.70 | step: 21.17 2%|▏ | 798/50750 [1:55:25<82:13:08, 5.93s/it] {'loss': 0.3725, 'learning_rate': 2.095863427445831e-05, 'epoch': 0.79} 2%|▏ | 798/50750 [1:55:25<82:13:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:38:09,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 18:38:09,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.86 | bwd_microstep: 3853.90 | bwd_inner_microstep: 3845.97 | bwd_allreduce_microstep: 7.88 | step_microstep: 21.67 [2024-11-13 18:38:09,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.85 | bwd: 3853.91 | bwd_inner: 3845.97 | bwd_allreduce: 7.90 | step: 21.67 2%|▏ | 799/50750 [1:55:31<82:14:36, 5.93s/it] {'loss': 0.6967, 'learning_rate': 2.0984898227183192e-05, 'epoch': 0.79} 2%|▏ | 799/50750 [1:55:31<82:14:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:38:15,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | 
optimizer_step: 4.93 [2024-11-13 18:38:15,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.92 | bwd_microstep: 3850.60 | bwd_inner_microstep: 3843.06 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.34 [2024-11-13 18:38:15,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.91 | bwd: 3850.61 | bwd_inner: 3843.06 | bwd_allreduce: 7.51 | step: 21.34 2%|▏ | 800/50750 [1:55:37<82:16:33, 5.93s/it] {'loss': 0.8653, 'learning_rate': 2.101116217990808e-05, 'epoch': 0.79} 2%|▏ | 800/50750 [1:55:37<82:16:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:38:21,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 18:38:21,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.68 | bwd_microstep: 3844.74 | bwd_inner_microstep: 3837.19 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.43 [2024-11-13 18:38:21,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.66 | bwd: 3844.75 | bwd_inner: 3837.19 | bwd_allreduce: 7.52 | step: 21.44 2%|▏ | 801/50750 [1:55:43<82:14:55, 5.93s/it] {'loss': 0.1742, 'learning_rate': 2.1037426132632963e-05, 'epoch': 0.79} 2%|▏ | 801/50750 [1:55:43<82:14:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:38:27,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.92 [2024-11-13 18:38:27,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.81 | bwd_microstep: 3853.62 | bwd_inner_microstep: 3845.81 | bwd_allreduce_microstep: 7.75 | step_microstep: 28.33 [2024-11-13 18:38:27,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.79 | bwd: 3853.64 | bwd_inner: 3845.81 | bwd_allreduce: 7.77 | step: 28.33 2%|▏ | 802/50750 [1:55:49<82:17:43, 5.93s/it] 
{'loss': 0.6164, 'learning_rate': 2.1063690085357847e-05, 'epoch': 0.79} 2%|▏ | 802/50750 [1:55:49<82:17:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:38:33,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:38:33,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.45 | bwd_microstep: 3844.45 | bwd_inner_microstep: 3836.92 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.02 [2024-11-13 18:38:33,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.45 | bwd: 3844.46 | bwd_inner: 3836.92 | bwd_allreduce: 7.50 | step: 21.02 2%|▏ | 803/50750 [1:55:55<82:13:24, 5.93s/it] {'loss': 0.3272, 'learning_rate': 2.1089954038082734e-05, 'epoch': 0.79} 2%|▏ | 803/50750 [1:55:55<82:13:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:38:39,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 18:38:39,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.89 | bwd_microstep: 3840.91 | bwd_inner_microstep: 3832.86 | bwd_allreduce_microstep: 8.00 | step_microstep: 22.42 [2024-11-13 18:38:39,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3840.92 | bwd_inner: 3832.86 | bwd_allreduce: 8.02 | step: 22.42 2%|▏ | 804/50750 [1:56:01<82:12:00, 5.92s/it] {'loss': 0.112, 'learning_rate': 2.1116217990807618e-05, 'epoch': 0.79} 2%|▏ | 804/50750 [1:56:01<82:12:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:38:45,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-13 18:38:45,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2030.15 | bwd_microstep: 3843.56 | bwd_inner_microstep: 3835.65 | bwd_allreduce_microstep: 7.87 | step_microstep: 22.22 [2024-11-13 18:38:45,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.13 | bwd: 3843.58 | bwd_inner: 3835.65 | bwd_allreduce: 7.89 | step: 22.23 2%|▏ | 805/50750 [1:56:07<82:14:41, 5.93s/it] {'loss': 0.212, 'learning_rate': 2.1142481943532505e-05, 'epoch': 0.79} 2%|▏ | 805/50750 [1:56:07<82:14:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:38:51,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.94 [2024-11-13 18:38:51,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.31 | bwd_microstep: 3846.63 | bwd_inner_microstep: 3838.51 | bwd_allreduce_microstep: 8.06 | step_microstep: 23.95 [2024-11-13 18:38:51,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.30 | bwd: 3846.65 | bwd_inner: 3838.51 | bwd_allreduce: 8.08 | step: 23.94 2%|▏ | 806/50750 [1:56:13<82:17:12, 5.93s/it] {'loss': 0.0267, 'learning_rate': 2.116874589625739e-05, 'epoch': 0.79} 2%|▏ | 806/50750 [1:56:13<82:17:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:38:57,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:38:57,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.62 | bwd_microstep: 3844.30 | bwd_inner_microstep: 3836.46 | bwd_allreduce_microstep: 7.79 | step_microstep: 21.40 [2024-11-13 18:38:57,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.61 | bwd: 3844.31 | bwd_inner: 3836.46 | bwd_allreduce: 7.81 | step: 21.40 2%|▏ | 807/50750 [1:56:19<82:16:23, 5.93s/it] {'loss': 0.0194, 'learning_rate': 2.1195009848982276e-05, 'epoch': 0.8} 2%|▏ | 807/50750 [1:56:19<82:16:23, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:39:03,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:39:03,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.79 | bwd_microstep: 3862.67 | bwd_inner_microstep: 3853.34 | bwd_allreduce_microstep: 9.28 | step_microstep: 21.47 [2024-11-13 18:39:03,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.77 | bwd: 3862.68 | bwd_inner: 3853.34 | bwd_allreduce: 9.30 | step: 21.47 2%|▏ | 808/50750 [1:56:25<82:17:55, 5.93s/it] {'loss': 0.1695, 'learning_rate': 2.122127380170716e-05, 'epoch': 0.8} 2%|▏ | 808/50750 [1:56:25<82:17:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:39:09,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:39:09,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.33 | bwd_microstep: 3848.56 | bwd_inner_microstep: 3841.06 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.04 [2024-11-13 18:39:09,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.32 | bwd: 3848.57 | bwd_inner: 3841.06 | bwd_allreduce: 7.48 | step: 21.04 2%|▏ | 809/50750 [1:56:31<82:16:15, 5.93s/it] {'loss': 0.0902, 'learning_rate': 2.124753775443204e-05, 'epoch': 0.8} 2%|▏ | 809/50750 [1:56:31<82:16:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:39:15,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:39:15,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.49 | bwd_microstep: 3846.85 | bwd_inner_microstep: 3839.17 | bwd_allreduce_microstep: 7.63 | 
step_microstep: 21.68 [2024-11-13 18:39:15,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.50 | bwd: 3846.86 | bwd_inner: 3839.17 | bwd_allreduce: 7.65 | step: 21.69 2%|▏ | 810/50750 [1:56:37<82:13:14, 5.93s/it] {'loss': 0.0048, 'learning_rate': 2.127380170715693e-05, 'epoch': 0.8} 2%|▏ | 810/50750 [1:56:37<82:13:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:39:20,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:39:20,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | bwd_microstep: 3839.53 | bwd_inner_microstep: 3832.02 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.17 [2024-11-13 18:39:20,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.92 | bwd: 3839.54 | bwd_inner: 3832.02 | bwd_allreduce: 7.48 | step: 21.17 2%|▏ | 811/50750 [1:56:42<82:09:10, 5.92s/it] {'loss': 0.0151, 'learning_rate': 2.130006565988181e-05, 'epoch': 0.8} 2%|▏ | 811/50750 [1:56:42<82:09:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:39:26,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:39:26,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.03 | bwd_microstep: 3845.39 | bwd_inner_microstep: 3837.84 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.57 [2024-11-13 18:39:26,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.03 | bwd: 3845.41 | bwd_inner: 3837.84 | bwd_allreduce: 7.53 | step: 21.57 2%|▏ | 812/50750 [1:56:48<82:07:47, 5.92s/it] {'loss': 0.0153, 'learning_rate': 2.13263296126067e-05, 'epoch': 0.8} 2%|▏ | 812/50750 [1:56:48<82:07:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
18:39:32,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:39:32,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.44 | bwd_microstep: 3846.97 | bwd_inner_microstep: 3839.43 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.70 [2024-11-13 18:39:32,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.42 | bwd: 3846.99 | bwd_inner: 3839.43 | bwd_allreduce: 7.51 | step: 20.71 2%|▏ | 813/50750 [1:56:54<82:09:23, 5.92s/it] {'loss': 0.0006, 'learning_rate': 2.1352593565331582e-05, 'epoch': 0.8} 2%|▏ | 813/50750 [1:56:54<82:09:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:39:38,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:39:38,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.16 | bwd_microstep: 3862.12 | bwd_inner_microstep: 3854.01 | bwd_allreduce_microstep: 8.04 | step_microstep: 22.71 [2024-11-13 18:39:38,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.17 | bwd: 3862.14 | bwd_inner: 3854.01 | bwd_allreduce: 8.07 | step: 22.71 2%|▏ | 814/50750 [1:57:00<82:15:56, 5.93s/it] {'loss': 0.2981, 'learning_rate': 2.137885751805647e-05, 'epoch': 0.8} 2%|▏ | 814/50750 [1:57:00<82:15:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:39:44,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.92 [2024-11-13 18:39:44,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.52 | bwd_microstep: 3847.01 | bwd_inner_microstep: 3839.16 | bwd_allreduce_microstep: 7.80 | step_microstep: 22.08 [2024-11-13 18:39:44,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2030.51 | bwd: 3847.02 | bwd_inner: 3839.16 | bwd_allreduce: 7.82 | step: 22.08 2%|▏ | 815/50750 [1:57:06<82:16:11, 5.93s/it] {'loss': 0.0069, 'learning_rate': 2.1405121470781353e-05, 'epoch': 0.8} 2%|▏ | 815/50750 [1:57:06<82:16:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:39:50,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:39:50,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.22 | bwd_microstep: 3847.04 | bwd_inner_microstep: 3839.52 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.15 [2024-11-13 18:39:50,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.20 | bwd: 3847.05 | bwd_inner: 3839.52 | bwd_allreduce: 7.50 | step: 21.15 2%|▏ | 816/50750 [1:57:12<82:15:38, 5.93s/it] {'loss': 0.0074, 'learning_rate': 2.143138542350624e-05, 'epoch': 0.8} 2%|▏ | 816/50750 [1:57:12<82:15:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:39:56,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:39:56,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.80 | bwd_microstep: 3857.21 | bwd_inner_microstep: 3849.41 | bwd_allreduce_microstep: 7.74 | step_microstep: 23.04 [2024-11-13 18:39:56,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.80 | bwd: 3857.23 | bwd_inner: 3849.41 | bwd_allreduce: 7.76 | step: 23.04 2%|▏ | 817/50750 [1:57:18<82:16:02, 5.93s/it] {'loss': 0.0607, 'learning_rate': 2.1457649376231124e-05, 'epoch': 0.8} 2%|▏ | 817/50750 [1:57:18<82:16:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:40:02,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:40:02,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.73 | bwd_microstep: 3852.25 | bwd_inner_microstep: 3844.39 | bwd_allreduce_microstep: 7.82 | step_microstep: 22.82 [2024-11-13 18:40:02,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.73 | bwd: 3852.27 | bwd_inner: 3844.39 | bwd_allreduce: 7.84 | step: 22.82 2%|▏ | 818/50750 [1:57:24<82:15:01, 5.93s/it] {'loss': 0.0036, 'learning_rate': 2.1483913328956007e-05, 'epoch': 0.81} 2%|▏ | 818/50750 [1:57:24<82:15:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:40:08,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:40:08,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.30 | bwd_microstep: 3845.24 | bwd_inner_microstep: 3837.27 | bwd_allreduce_microstep: 7.91 | step_microstep: 21.49 [2024-11-13 18:40:08,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.29 | bwd: 3845.26 | bwd_inner: 3837.27 | bwd_allreduce: 7.93 | step: 21.48 2%|▏ | 819/50750 [1:57:30<82:11:52, 5.93s/it] {'loss': 0.0061, 'learning_rate': 2.1510177281680895e-05, 'epoch': 0.81} 2%|▏ | 819/50750 [1:57:30<82:11:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:40:14,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:40:14,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.77 | bwd_microstep: 3851.38 | bwd_inner_microstep: 3843.89 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.52 [2024-11-13 18:40:14,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.76 | bwd: 3851.39 | bwd_inner: 3843.89 | bwd_allreduce: 7.46 | step: 21.53 2%|▏ | 820/50750 
[1:57:36<82:12:45, 5.93s/it] {'loss': 0.6285, 'learning_rate': 2.153644123440578e-05, 'epoch': 0.81} 2%|▏ | 820/50750 [1:57:36<82:12:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:40:20,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 18:40:20,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.39 | bwd_microstep: 3847.65 | bwd_inner_microstep: 3840.01 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.83 [2024-11-13 18:40:20,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.39 | bwd: 3847.67 | bwd_inner: 3840.01 | bwd_allreduce: 7.62 | step: 21.83 2%|▏ | 821/50750 [1:57:42<82:13:23, 5.93s/it] {'loss': 0.0022, 'learning_rate': 2.1562705187130666e-05, 'epoch': 0.81} 2%|▏ | 821/50750 [1:57:42<82:13:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:40:26,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:40:26,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.85 | bwd_microstep: 3847.22 | bwd_inner_microstep: 3839.69 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.67 [2024-11-13 18:40:26,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.83 | bwd: 3847.23 | bwd_inner: 3839.69 | bwd_allreduce: 7.50 | step: 21.67 2%|▏ | 822/50750 [1:57:48<82:13:11, 5.93s/it] {'loss': 0.0266, 'learning_rate': 2.158896913985555e-05, 'epoch': 0.81} 2%|▏ | 822/50750 [1:57:48<82:13:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:40:32,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:40:32,136] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2027.03 | bwd_microstep: 3842.93 | bwd_inner_microstep: 3835.45 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.88 [2024-11-13 18:40:32,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.03 | bwd: 3842.94 | bwd_inner: 3835.45 | bwd_allreduce: 7.45 | step: 20.89 2%|▏ | 823/50750 [1:57:54<82:10:25, 5.93s/it] {'loss': 0.8759, 'learning_rate': 2.1615233092580436e-05, 'epoch': 0.81} 2%|▏ | 823/50750 [1:57:54<82:10:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:40:38,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:40:38,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.43 | bwd_microstep: 3846.50 | bwd_inner_microstep: 3839.00 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.93 [2024-11-13 18:40:38,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.43 | bwd: 3846.51 | bwd_inner: 3839.00 | bwd_allreduce: 7.47 | step: 20.93 2%|▏ | 824/50750 [1:58:00<82:09:32, 5.92s/it] {'loss': 0.0011, 'learning_rate': 2.164149704530532e-05, 'epoch': 0.81} 2%|▏ | 824/50750 [1:58:00<82:09:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:40:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:40:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.02 | bwd_microstep: 3847.58 | bwd_inner_microstep: 3840.09 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.97 [2024-11-13 18:40:43,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.02 | bwd: 3847.59 | bwd_inner: 3840.09 | bwd_allreduce: 7.46 | step: 20.97 2%|▏ | 825/50750 [1:58:05<82:08:47, 5.92s/it] {'loss': 0.5041, 'learning_rate': 2.1667760998030207e-05, 'epoch': 0.81} 2%|▏ | 
825/50750 [1:58:05<82:08:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:40:49,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:40:49,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.79 | bwd_microstep: 3843.74 | bwd_inner_microstep: 3836.23 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.45 [2024-11-13 18:40:49,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.79 | bwd: 3843.75 | bwd_inner: 3836.23 | bwd_allreduce: 7.48 | step: 21.45 2%|▏ | 826/50750 [1:58:11<82:09:22, 5.92s/it] {'loss': 0.0079, 'learning_rate': 2.169402495075509e-05, 'epoch': 0.81} 2%|▏ | 826/50750 [1:58:11<82:09:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:40:55,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:40:55,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.35 | bwd_microstep: 3854.38 | bwd_inner_microstep: 3846.84 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.27 [2024-11-13 18:40:55,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.35 | bwd: 3854.40 | bwd_inner: 3846.84 | bwd_allreduce: 7.51 | step: 21.27 2%|▏ | 827/50750 [1:58:17<82:10:08, 5.93s/it] {'loss': 0.0007, 'learning_rate': 2.1720288903479975e-05, 'epoch': 0.81} 2%|▏ | 827/50750 [1:58:17<82:10:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:41:01,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:41:01,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.39 | bwd_microstep: 3852.85 | bwd_inner_microstep: 3845.35 | 
bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-13 18:41:01,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.39 | bwd: 3852.86 | bwd_inner: 3845.35 | bwd_allreduce: 7.47 | step: 20.96 2%|▏ | 828/50750 [1:58:23<82:09:14, 5.92s/it] {'loss': 0.0055, 'learning_rate': 2.1746552856204862e-05, 'epoch': 0.82} 2%|▏ | 828/50750 [1:58:23<82:09:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:41:07,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:41:07,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.16 | bwd_microstep: 3855.46 | bwd_inner_microstep: 3847.88 | bwd_allreduce_microstep: 7.53 | step_microstep: 20.89 [2024-11-13 18:41:07,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3855.47 | bwd_inner: 3847.88 | bwd_allreduce: 7.55 | step: 20.89 2%|▏ | 829/50750 [1:58:29<82:09:59, 5.93s/it] {'loss': 0.7704, 'learning_rate': 2.1772816808929746e-05, 'epoch': 0.82} 2%|▏ | 829/50750 [1:58:29<82:09:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:41:13,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 18:41:13,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.81 | bwd_microstep: 3850.40 | bwd_inner_microstep: 3842.81 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.83 [2024-11-13 18:41:13,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.81 | bwd: 3850.41 | bwd_inner: 3842.81 | bwd_allreduce: 7.57 | step: 21.83 2%|▏ | 830/50750 [1:58:35<82:12:33, 5.93s/it] {'loss': 0.0404, 'learning_rate': 2.1799080761654633e-05, 'epoch': 0.82} 2%|▏ | 830/50750 [1:58:35<82:12:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2199 [2024-11-13 18:41:19,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:41:19,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | bwd_microstep: 3847.17 | bwd_inner_microstep: 3839.52 | bwd_allreduce_microstep: 7.60 | step_microstep: 20.82 [2024-11-13 18:41:19,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.71 | bwd: 3847.18 | bwd_inner: 3839.52 | bwd_allreduce: 7.62 | step: 20.82 2%|▏ | 831/50750 [1:58:41<82:10:21, 5.93s/it] {'loss': 0.0028, 'learning_rate': 2.1825344714379513e-05, 'epoch': 0.82} 2%|▏ | 831/50750 [1:58:41<82:10:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:41:25,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-13 18:41:25,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.17 | bwd_microstep: 3844.16 | bwd_inner_microstep: 3834.07 | bwd_allreduce_microstep: 10.00 | step_microstep: 22.87 [2024-11-13 18:41:25,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.17 | bwd: 3844.18 | bwd_inner: 3834.07 | bwd_allreduce: 10.04 | step: 22.86 2%|▏ | 832/50750 [1:58:47<82:09:13, 5.92s/it] {'loss': 0.206, 'learning_rate': 2.1851608667104404e-05, 'epoch': 0.82} 2%|▏ | 832/50750 [1:58:47<82:09:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:41:31,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 18:41:31,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.62 | bwd_microstep: 3848.49 | bwd_inner_microstep: 3840.76 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.01 [2024-11-13 18:41:31,385] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.61 | bwd: 3848.51 | bwd_inner: 3840.76 | bwd_allreduce: 7.70 | step: 22.01 2%|▏ | 833/50750 [1:58:53<82:09:27, 5.93s/it] {'loss': 0.0793, 'learning_rate': 2.1877872619829284e-05, 'epoch': 0.82} 2%|▏ | 833/50750 [1:58:53<82:09:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:41:37,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:41:37,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.55 | bwd_microstep: 3852.25 | bwd_inner_microstep: 3844.46 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.37 [2024-11-13 18:41:37,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.54 | bwd: 3852.27 | bwd_inner: 3844.46 | bwd_allreduce: 7.76 | step: 22.37 2%|▏ | 834/50750 [1:58:59<82:12:56, 5.93s/it] {'loss': 0.0005, 'learning_rate': 2.1904136572554175e-05, 'epoch': 0.82} 2%|▏ | 834/50750 [1:58:59<82:12:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:41:43,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:41:43,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.21 | bwd_microstep: 3843.76 | bwd_inner_microstep: 3836.23 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.03 [2024-11-13 18:41:43,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.19 | bwd: 3843.77 | bwd_inner: 3836.23 | bwd_allreduce: 7.50 | step: 21.04 2%|▏ | 835/50750 [1:59:05<82:10:33, 5.93s/it] {'loss': 0.0014, 'learning_rate': 2.1930400525279055e-05, 'epoch': 0.82} 2%|▏ | 835/50750 [1:59:05<82:10:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:41:49,170] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 18:41:49,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.56 | bwd_microstep: 3849.78 | bwd_inner_microstep: 3842.22 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.70 [2024-11-13 18:41:49,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.56 | bwd: 3849.79 | bwd_inner: 3842.22 | bwd_allreduce: 7.54 | step: 21.71 2%|▏ | 836/50750 [1:59:11<82:09:54, 5.93s/it] {'loss': 0.0082, 'learning_rate': 2.195666447800394e-05, 'epoch': 0.82} 2%|▏ | 836/50750 [1:59:11<82:09:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:41:55,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:41:55,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.35 | bwd_microstep: 3843.83 | bwd_inner_microstep: 3836.30 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.23 [2024-11-13 18:41:55,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.33 | bwd: 3843.84 | bwd_inner: 3836.30 | bwd_allreduce: 7.51 | step: 21.23 2%|▏ | 837/50750 [1:59:17<82:08:51, 5.92s/it] {'loss': 0.6763, 'learning_rate': 2.1982928430728826e-05, 'epoch': 0.82} 2%|▏ | 837/50750 [1:59:17<82:08:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:42:01,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 18:42:01,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.13 | bwd_microstep: 3851.93 | bwd_inner_microstep: 3844.18 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.46 [2024-11-13 18:42:01,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.13 | bwd: 3851.95 | bwd_inner: 3844.18 | bwd_allreduce: 
7.72 | step: 22.47 2%|▏ | 838/50750 [1:59:22<82:10:25, 5.93s/it] {'loss': 0.0106, 'learning_rate': 2.200919238345371e-05, 'epoch': 0.83} 2%|▏ | 838/50750 [1:59:22<82:10:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:42:06,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:42:06,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.59 | bwd_microstep: 3844.60 | bwd_inner_microstep: 3836.88 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.99 [2024-11-13 18:42:06,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.58 | bwd: 3844.62 | bwd_inner: 3836.88 | bwd_allreduce: 7.70 | step: 22.00 2%|▏ | 839/50750 [1:59:28<82:10:17, 5.93s/it] {'loss': 0.3437, 'learning_rate': 2.2035456336178597e-05, 'epoch': 0.83} 2%|▏ | 839/50750 [1:59:28<82:10:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:42:12,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:42:12,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.40 | bwd_microstep: 3845.20 | bwd_inner_microstep: 3837.67 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.87 [2024-11-13 18:42:12,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.40 | bwd: 3845.21 | bwd_inner: 3837.67 | bwd_allreduce: 7.50 | step: 21.87 2%|▏ | 840/50750 [1:59:34<82:08:17, 5.92s/it] {'loss': 0.0062, 'learning_rate': 2.206172028890348e-05, 'epoch': 0.83} 2%|▏ | 840/50750 [1:59:34<82:08:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:42:18,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:42:18,796] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.70 | bwd_microstep: 3852.79 | bwd_inner_microstep: 3845.20 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.65 [2024-11-13 18:42:18,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.68 | bwd: 3852.80 | bwd_inner: 3845.20 | bwd_allreduce: 7.56 | step: 21.66 2%|▏ | 841/50750 [1:59:40<82:09:08, 5.93s/it] {'loss': 0.3214, 'learning_rate': 2.2087984241628368e-05, 'epoch': 0.83} 2%|▏ | 841/50750 [1:59:40<82:09:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:42:24,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:42:24,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.72 | bwd_microstep: 3852.07 | bwd_inner_microstep: 3844.60 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.06 [2024-11-13 18:42:24,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.71 | bwd: 3852.09 | bwd_inner: 3844.60 | bwd_allreduce: 7.44 | step: 21.07 2%|▏ | 842/50750 [1:59:46<82:10:23, 5.93s/it] {'loss': 0.4743, 'learning_rate': 2.2114248194353252e-05, 'epoch': 0.83} 2%|▏ | 842/50750 [1:59:46<82:10:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:42:30,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:42:30,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.93 | bwd_microstep: 3846.09 | bwd_inner_microstep: 3838.63 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.92 [2024-11-13 18:42:30,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.93 | bwd: 3846.10 | bwd_inner: 3838.63 | bwd_allreduce: 7.43 | step: 20.93 2%|▏ | 843/50750 [1:59:52<82:08:05, 5.92s/it] {'loss': 0.1796, 'learning_rate': 
2.214051214707814e-05, 'epoch': 0.83} 2%|▏ | 843/50750 [1:59:52<82:08:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:42:36,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:42:36,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.25 | bwd_microstep: 3837.99 | bwd_inner_microstep: 3830.49 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.29 [2024-11-13 18:42:36,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.25 | bwd: 3838.00 | bwd_inner: 3830.49 | bwd_allreduce: 7.47 | step: 21.29 2%|▏ | 844/50750 [1:59:58<82:03:39, 5.92s/it] {'loss': 0.6357, 'learning_rate': 2.2166776099803023e-05, 'epoch': 0.83} 2%|▏ | 844/50750 [1:59:58<82:03:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:42:42,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:42:42,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.09 | bwd_microstep: 3845.66 | bwd_inner_microstep: 3838.20 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.93 [2024-11-13 18:42:42,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.09 | bwd: 3845.67 | bwd_inner: 3838.20 | bwd_allreduce: 7.44 | step: 20.94 2%|▏ | 845/50750 [2:00:04<82:02:34, 5.92s/it] {'loss': 0.0021, 'learning_rate': 2.2193040052527906e-05, 'epoch': 0.83} 2%|▏ | 845/50750 [2:00:04<82:02:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:42:48,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.94 [2024-11-13 18:42:48,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.51 | bwd_microstep: 3853.92 
| bwd_inner_microstep: 3846.44 | bwd_allreduce_microstep: 7.44 | step_microstep: 23.57 [2024-11-13 18:42:48,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.51 | bwd: 3853.93 | bwd_inner: 3846.44 | bwd_allreduce: 7.45 | step: 23.58 2%|▏ | 846/50750 [2:00:10<82:06:26, 5.92s/it] {'loss': 0.4644, 'learning_rate': 2.2219304005252794e-05, 'epoch': 0.83} 2%|▏ | 846/50750 [2:00:10<82:06:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:42:54,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:42:54,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.43 | bwd_microstep: 3845.38 | bwd_inner_microstep: 3835.43 | bwd_allreduce_microstep: 9.86 | step_microstep: 22.19 [2024-11-13 18:42:54,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.43 | bwd: 3845.41 | bwd_inner: 3835.43 | bwd_allreduce: 9.90 | step: 22.19 2%|▏ | 847/50750 [2:00:16<82:04:53, 5.92s/it] {'loss': 0.911, 'learning_rate': 2.2245567957977677e-05, 'epoch': 0.83} 2%|▏ | 847/50750 [2:00:16<82:04:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:43:00,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:43:00,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.66 | bwd_microstep: 3847.90 | bwd_inner_microstep: 3840.42 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.06 [2024-11-13 18:43:00,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.66 | bwd: 3847.92 | bwd_inner: 3840.42 | bwd_allreduce: 7.46 | step: 21.06 2%|▏ | 848/50750 [2:00:22<82:04:48, 5.92s/it] {'loss': 0.2152, 'learning_rate': 2.2271831910702565e-05, 'epoch': 0.84} 2%|▏ | 848/50750 [2:00:22<82:04:48, 5.92s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:43:06,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:43:06,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.27 | bwd_microstep: 3842.13 | bwd_inner_microstep: 3834.59 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.08 [2024-11-13 18:43:06,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.27 | bwd: 3842.15 | bwd_inner: 3834.59 | bwd_allreduce: 7.52 | step: 21.08 2%|▏ | 849/50750 [2:00:28<82:01:43, 5.92s/it] {'loss': 0.0454, 'learning_rate': 2.229809586342745e-05, 'epoch': 0.84} 2%|▏ | 849/50750 [2:00:28<82:01:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:43:12,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:43:12,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.16 | bwd_microstep: 3849.08 | bwd_inner_microstep: 3841.59 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.04 [2024-11-13 18:43:12,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.16 | bwd: 3849.10 | bwd_inner: 3841.59 | bwd_allreduce: 7.47 | step: 21.05 2%|▏ | 850/50750 [2:00:34<82:03:41, 5.92s/it] {'loss': 0.003, 'learning_rate': 2.2324359816152335e-05, 'epoch': 0.84} 2%|▏ | 850/50750 [2:00:34<82:03:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:43:18,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.64 | optimizer_step: 4.92 [2024-11-13 18:43:18,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.92 | bwd_microstep: 3850.43 | bwd_inner_microstep: 3842.58 | bwd_allreduce_microstep: 7.80 | step_microstep: 29.40 [2024-11-13 18:43:18,010] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.92 | bwd: 3850.45 | bwd_inner: 3842.58 | bwd_allreduce: 7.82 | step: 29.40 2%|▏ | 851/50750 [2:00:39<82:06:09, 5.92s/it] {'loss': 1.5853, 'learning_rate': 2.235062376887722e-05, 'epoch': 0.84} 2%|▏ | 851/50750 [2:00:39<82:06:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:43:23,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:43:23,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.18 | bwd_microstep: 3848.44 | bwd_inner_microstep: 3839.26 | bwd_allreduce_microstep: 9.10 | step_microstep: 21.78 [2024-11-13 18:43:23,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.16 | bwd: 3848.47 | bwd_inner: 3839.26 | bwd_allreduce: 9.14 | step: 21.77 2%|▏ | 852/50750 [2:00:45<82:06:36, 5.92s/it] {'loss': 0.0672, 'learning_rate': 2.23768877216021e-05, 'epoch': 0.84} 2%|▏ | 852/50750 [2:00:45<82:06:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:43:29,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.97 | optimizer_step: 4.92 [2024-11-13 18:43:29,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.37 | bwd_microstep: 3846.99 | bwd_inner_microstep: 3839.44 | bwd_allreduce_microstep: 7.52 | step_microstep: 23.73 [2024-11-13 18:43:29,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.38 | bwd: 3847.01 | bwd_inner: 3839.44 | bwd_allreduce: 7.53 | step: 23.75 2%|▏ | 853/50750 [2:00:51<82:06:16, 5.92s/it] {'loss': 0.1649, 'learning_rate': 2.2403151674326987e-05, 'epoch': 0.84} 2%|▏ | 853/50750 [2:00:51<82:06:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:43:35,773] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:43:35,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.93 | bwd_microstep: 3837.48 | bwd_inner_microstep: 3830.02 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.90 [2024-11-13 18:43:35,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.91 | bwd: 3837.49 | bwd_inner: 3830.02 | bwd_allreduce: 7.43 | step: 20.90 2%|▏ | 854/50750 [2:00:57<82:03:30, 5.92s/it] {'loss': 0.6775, 'learning_rate': 2.242941562705187e-05, 'epoch': 0.84} 2%|▏ | 854/50750 [2:00:57<82:03:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:43:41,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 18:43:41,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.82 | bwd_microstep: 3851.94 | bwd_inner_microstep: 3844.43 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.47 [2024-11-13 18:43:41,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3851.95 | bwd_inner: 3844.43 | bwd_allreduce: 7.48 | step: 21.47 2%|▏ | 855/50750 [2:01:03<82:04:27, 5.92s/it] {'loss': 0.0283, 'learning_rate': 2.2455679579776758e-05, 'epoch': 0.84} 2%|▏ | 855/50750 [2:01:03<82:04:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:43:47,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 18:43:47,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.78 | bwd_microstep: 3841.85 | bwd_inner_microstep: 3834.07 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.51 [2024-11-13 18:43:47,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.78 | bwd: 3841.87 | bwd_inner: 3834.07 | 
bwd_allreduce: 7.75 | step: 21.52 2%|▏ | 856/50750 [2:01:09<82:02:53, 5.92s/it] {'loss': 0.0091, 'learning_rate': 2.248194353250164e-05, 'epoch': 0.84} 2%|▏ | 856/50750 [2:01:09<82:02:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:43:53,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:43:53,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.90 | bwd_microstep: 3859.98 | bwd_inner_microstep: 3851.55 | bwd_allreduce_microstep: 8.38 | step_microstep: 21.19 [2024-11-13 18:43:53,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.89 | bwd: 3859.99 | bwd_inner: 3851.55 | bwd_allreduce: 8.40 | step: 21.19 2%|▏ | 857/50750 [2:01:15<82:06:21, 5.92s/it] {'loss': 0.0179, 'learning_rate': 2.250820748522653e-05, 'epoch': 0.84} 2%|▏ | 857/50750 [2:01:15<82:06:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:43:59,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:43:59,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.96 | bwd_microstep: 3850.64 | bwd_inner_microstep: 3842.94 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.41 [2024-11-13 18:43:59,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.96 | bwd: 3850.66 | bwd_inner: 3842.94 | bwd_allreduce: 7.67 | step: 21.41 2%|▏ | 858/50750 [2:01:21<82:05:25, 5.92s/it] {'loss': 0.3316, 'learning_rate': 2.2534471437951412e-05, 'epoch': 0.85} 2%|▏ | 858/50750 [2:01:21<82:05:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:44:05,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 
18:44:05,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.35 | bwd_microstep: 3851.90 | bwd_inner_microstep: 3844.37 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-13 18:44:05,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.35 | bwd: 3851.91 | bwd_inner: 3844.37 | bwd_allreduce: 7.50 | step: 21.11 2%|▏ | 859/50750 [2:01:27<82:05:04, 5.92s/it] {'loss': 0.0069, 'learning_rate': 2.25607353906763e-05, 'epoch': 0.85} 2%|▏ | 859/50750 [2:01:27<82:05:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:44:11,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.65 | optimizer_step: 4.92 [2024-11-13 18:44:11,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.12 | bwd_microstep: 3843.66 | bwd_inner_microstep: 3835.71 | bwd_allreduce_microstep: 7.89 | step_microstep: 30.05 [2024-11-13 18:44:11,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.12 | bwd: 3843.68 | bwd_inner: 3835.71 | bwd_allreduce: 7.92 | step: 30.05 2%|▏ | 860/50750 [2:01:33<82:06:10, 5.92s/it] {'loss': 0.0312, 'learning_rate': 2.2586999343401183e-05, 'epoch': 0.85} 2%|▏ | 860/50750 [2:01:33<82:06:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:44:17,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 18:44:17,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.64 | bwd_microstep: 3850.82 | bwd_inner_microstep: 3843.30 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.31 [2024-11-13 18:44:17,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.64 | bwd: 3850.83 | bwd_inner: 3843.30 | bwd_allreduce: 7.49 | step: 21.32 2%|▏ | 861/50750 [2:01:39<82:06:18, 5.92s/it] {'loss': 0.4189, 'learning_rate': 
2.2613263296126067e-05, 'epoch': 0.85} 2%|▏ | 861/50750 [2:01:39<82:06:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:44:23,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.51 | optimizer_step: 4.93 [2024-11-13 18:44:23,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.60 | bwd_microstep: 3841.67 | bwd_inner_microstep: 3834.11 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.66 [2024-11-13 18:44:23,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.60 | bwd: 3841.68 | bwd_inner: 3834.11 | bwd_allreduce: 7.53 | step: 22.66 2%|▏ | 862/50750 [2:01:45<82:03:36, 5.92s/it] {'loss': 0.2552, 'learning_rate': 2.2639527248850954e-05, 'epoch': 0.85} 2%|▏ | 862/50750 [2:01:45<82:03:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:44:29,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:44:29,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3846.86 | bwd_inner_microstep: 3839.10 | bwd_allreduce_microstep: 7.71 | step_microstep: 24.33 [2024-11-13 18:44:29,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3846.88 | bwd_inner: 3839.10 | bwd_allreduce: 7.73 | step: 24.33 2%|▏ | 863/50750 [2:01:51<82:03:46, 5.92s/it] {'loss': 0.0777, 'learning_rate': 2.2665791201575838e-05, 'epoch': 0.85} 2%|▏ | 863/50750 [2:01:51<82:03:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:44:35,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 18:44:35,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.69 | bwd_microstep: 3842.40 
| bwd_inner_microstep: 3834.62 | bwd_allreduce_microstep: 7.73 | step_microstep: 24.29 [2024-11-13 18:44:35,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.69 | bwd: 3842.42 | bwd_inner: 3834.62 | bwd_allreduce: 7.75 | step: 24.29 2%|▏ | 864/50750 [2:01:56<82:03:01, 5.92s/it] {'loss': 0.0086, 'learning_rate': 2.2692055154300725e-05, 'epoch': 0.85} 2%|▏ | 864/50750 [2:01:56<82:03:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:44:40,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:44:40,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 3841.23 | bwd_inner_microstep: 3833.26 | bwd_allreduce_microstep: 7.91 | step_microstep: 23.56 [2024-11-13 18:44:40,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.02 | bwd: 3841.25 | bwd_inner: 3833.26 | bwd_allreduce: 7.93 | step: 23.55 2%|▏ | 865/50750 [2:02:02<82:01:08, 5.92s/it] {'loss': 0.0072, 'learning_rate': 2.271831910702561e-05, 'epoch': 0.85} 2%|▏ | 865/50750 [2:02:02<82:01:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:44:46,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:44:46,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.20 | bwd_microstep: 3843.24 | bwd_inner_microstep: 3835.48 | bwd_allreduce_microstep: 7.70 | step_microstep: 23.83 [2024-11-13 18:44:46,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.20 | bwd: 3843.25 | bwd_inner: 3835.49 | bwd_allreduce: 7.72 | step: 23.83 2%|▏ | 866/50750 [2:02:08<82:00:55, 5.92s/it] {'loss': 0.2011, 'learning_rate': 2.2744583059750496e-05, 'epoch': 0.85} 2%|▏ | 866/50750 [2:02:08<82:00:55, 5.92s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:44:52,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:44:52,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.76 | bwd_microstep: 3846.95 | bwd_inner_microstep: 3839.12 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.79 [2024-11-13 18:44:52,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.76 | bwd: 3846.96 | bwd_inner: 3839.12 | bwd_allreduce: 7.80 | step: 21.80 2%|▏ | 867/50750 [2:02:14<82:01:17, 5.92s/it] {'loss': 0.268, 'learning_rate': 2.277084701247538e-05, 'epoch': 0.85} 2%|▏ | 867/50750 [2:02:14<82:01:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:44:58,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:44:58,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.12 | bwd_microstep: 3840.75 | bwd_inner_microstep: 3833.28 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.96 [2024-11-13 18:44:58,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.10 | bwd: 3840.76 | bwd_inner: 3833.28 | bwd_allreduce: 7.45 | step: 20.96 2%|▏ | 868/50750 [2:02:20<82:00:28, 5.92s/it] {'loss': 0.4804, 'learning_rate': 2.2797110965200267e-05, 'epoch': 0.86} 2%|▏ | 868/50750 [2:02:20<82:00:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:45:04,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:45:04,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.83 | bwd_microstep: 3854.63 | bwd_inner_microstep: 3847.16 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.08 [2024-11-13 18:45:04,600] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.83 | bwd: 3854.64 | bwd_inner: 3847.16 | bwd_allreduce: 7.44 | step: 21.08 2%|▏ | 869/50750 [2:02:26<82:02:59, 5.92s/it] {'loss': 0.013, 'learning_rate': 2.282337491792515e-05, 'epoch': 0.86} 2%|▏ | 869/50750 [2:02:26<82:02:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:45:10,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:45:10,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.60 | bwd_microstep: 3843.62 | bwd_inner_microstep: 3836.14 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.36 [2024-11-13 18:45:10,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.60 | bwd: 3843.63 | bwd_inner: 3836.14 | bwd_allreduce: 7.45 | step: 21.36 2%|▏ | 870/50750 [2:02:32<82:01:43, 5.92s/it] {'loss': 0.4182, 'learning_rate': 2.284963887065003e-05, 'epoch': 0.86} 2%|▏ | 870/50750 [2:02:32<82:01:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:45:16,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:45:16,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.34 | bwd_microstep: 3855.07 | bwd_inner_microstep: 3847.36 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.53 [2024-11-13 18:45:16,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.33 | bwd: 3855.08 | bwd_inner: 3847.36 | bwd_allreduce: 7.68 | step: 21.53 2%|▏ | 871/50750 [2:02:38<82:05:28, 5.92s/it] {'loss': 0.0078, 'learning_rate': 2.2875902823374922e-05, 'epoch': 0.86} 2%|▏ | 871/50750 [2:02:38<82:05:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:45:22,381] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:45:22,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.89 | bwd_microstep: 3849.68 | bwd_inner_microstep: 3842.17 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.21 [2024-11-13 18:45:22,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.87 | bwd: 3849.69 | bwd_inner: 3842.17 | bwd_allreduce: 7.48 | step: 21.21 2%|▏ | 872/50750 [2:02:44<82:06:22, 5.93s/it] {'loss': 0.437, 'learning_rate': 2.2902166776099802e-05, 'epoch': 0.86} 2%|▏ | 872/50750 [2:02:44<82:06:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:45:28,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:45:28,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.88 | bwd_microstep: 3844.77 | bwd_inner_microstep: 3837.26 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.99 [2024-11-13 18:45:28,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.89 | bwd: 3844.78 | bwd_inner: 3837.26 | bwd_allreduce: 7.49 | step: 20.99 2%|▏ | 873/50750 [2:02:50<82:03:02, 5.92s/it] {'loss': 0.0053, 'learning_rate': 2.2928430728824693e-05, 'epoch': 0.86} 2%|▏ | 873/50750 [2:02:50<82:03:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:45:34,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:45:34,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.47 | bwd_microstep: 3847.25 | bwd_inner_microstep: 3839.76 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.06 [2024-11-13 18:45:34,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.47 | bwd: 3847.26 | bwd_inner: 3839.76 | 
bwd_allreduce: 7.47 | step: 21.07 2%|▏ | 874/50750 [2:02:56<82:02:42, 5.92s/it] {'loss': 0.3266, 'learning_rate': 2.2954694681549573e-05, 'epoch': 0.86} 2%|▏ | 874/50750 [2:02:56<82:02:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:45:40,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 5.09 [2024-11-13 18:45:40,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 3847.07 | bwd_inner_microstep: 3839.56 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.34 [2024-11-13 18:45:40,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3847.09 | bwd_inner: 3839.56 | bwd_allreduce: 7.48 | step: 21.35 2%|▏ | 875/50750 [2:03:02<82:01:35, 5.92s/it] {'loss': 0.2452, 'learning_rate': 2.298095863427446e-05, 'epoch': 0.86} 2%|▏ | 875/50750 [2:03:02<82:01:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:45:46,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:45:46,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.66 | bwd_microstep: 3847.01 | bwd_inner_microstep: 3839.56 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.11 [2024-11-13 18:45:46,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.66 | bwd: 3847.03 | bwd_inner: 3839.56 | bwd_allreduce: 7.43 | step: 21.11 2%|▏ | 876/50750 [2:03:08<82:01:02, 5.92s/it] {'loss': 0.2527, 'learning_rate': 2.3007222586999344e-05, 'epoch': 0.86} 2%|▏ | 876/50750 [2:03:08<82:01:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:45:51,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-13 
18:45:51,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.20 | bwd_microstep: 3843.40 | bwd_inner_microstep: 3835.31 | bwd_allreduce_microstep: 8.04 | step_microstep: 22.56 [2024-11-13 18:45:51,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.20 | bwd: 3843.41 | bwd_inner: 3835.31 | bwd_allreduce: 8.06 | step: 22.56 2%|▏ | 877/50750 [2:03:13<82:03:17, 5.92s/it] {'loss': 0.0038, 'learning_rate': 2.303348653972423e-05, 'epoch': 0.86} 2%|▏ | 877/50750 [2:03:13<82:03:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:45:57,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:45:57,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.12 | bwd_microstep: 3866.68 | bwd_inner_microstep: 3859.17 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.33 [2024-11-13 18:45:57,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.11 | bwd: 3866.69 | bwd_inner: 3859.17 | bwd_allreduce: 7.48 | step: 21.33 2%|▏ | 878/50750 [2:03:19<82:08:45, 5.93s/it] {'loss': 0.06, 'learning_rate': 2.3059750492449115e-05, 'epoch': 0.87} 2%|▏ | 878/50750 [2:03:19<82:08:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:46:03,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 18:46:03,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.76 | bwd_microstep: 3854.91 | bwd_inner_microstep: 3847.30 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.29 [2024-11-13 18:46:03,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.74 | bwd: 3854.92 | bwd_inner: 3847.30 | bwd_allreduce: 7.58 | step: 21.30 2%|▏ | 879/50750 [2:03:25<82:08:34, 5.93s/it] {'loss': 0.0007, 'learning_rate': 
2.3086014445174e-05, 'epoch': 0.87} 2%|▏ | 879/50750 [2:03:25<82:08:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:46:09,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:46:09,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.42 | bwd_microstep: 3841.77 | bwd_inner_microstep: 3833.80 | bwd_allreduce_microstep: 7.92 | step_microstep: 24.24 [2024-11-13 18:46:09,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.41 | bwd: 3841.79 | bwd_inner: 3833.79 | bwd_allreduce: 7.94 | step: 24.24 2%|▏ | 880/50750 [2:03:31<82:08:02, 5.93s/it] {'loss': 0.0124, 'learning_rate': 2.3112278397898886e-05, 'epoch': 0.87} 2%|▏ | 880/50750 [2:03:31<82:08:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:46:15,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 18:46:15,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.35 | bwd_microstep: 3858.00 | bwd_inner_microstep: 3850.01 | bwd_allreduce_microstep: 7.93 | step_microstep: 22.10 [2024-11-13 18:46:15,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.34 | bwd: 3858.21 | bwd_inner: 3850.01 | bwd_allreduce: 7.96 | step: 22.10 2%|▏ | 881/50750 [2:03:37<82:09:30, 5.93s/it] {'loss': 0.4491, 'learning_rate': 2.313854235062377e-05, 'epoch': 0.87} 2%|▏ | 881/50750 [2:03:37<82:09:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:46:21,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:46:21,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.07 | bwd_microstep: 3843.04 | 
bwd_inner_microstep: 3835.46 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.27 [2024-11-13 18:46:21,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.05 | bwd: 3843.06 | bwd_inner: 3835.46 | bwd_allreduce: 7.56 | step: 21.27 2%|▏ | 882/50750 [2:03:43<82:08:07, 5.93s/it] {'loss': 0.5719, 'learning_rate': 2.3164806303348657e-05, 'epoch': 0.87} 2%|▏ | 882/50750 [2:03:43<82:08:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:46:27,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-13 18:46:27,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.12 | bwd_microstep: 3841.51 | bwd_inner_microstep: 3833.75 | bwd_allreduce_microstep: 7.70 | step_microstep: 22.72 [2024-11-13 18:46:27,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.12 | bwd: 3841.52 | bwd_inner: 3833.75 | bwd_allreduce: 7.73 | step: 22.72 2%|▏ | 883/50750 [2:03:49<82:05:53, 5.93s/it] {'loss': 0.2163, 'learning_rate': 2.319107025607354e-05, 'epoch': 0.87} 2%|▏ | 883/50750 [2:03:49<82:05:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:46:33,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 18:46:33,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.66 | bwd_microstep: 3853.96 | bwd_inner_microstep: 3846.13 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.07 [2024-11-13 18:46:33,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.63 | bwd: 3853.98 | bwd_inner: 3846.13 | bwd_allreduce: 7.81 | step: 22.07 2%|▏ | 884/50750 [2:03:55<82:08:56, 5.93s/it] {'loss': 0.0196, 'learning_rate': 2.3217334208798428e-05, 'epoch': 0.87} 2%|▏ | 884/50750 [2:03:55<82:08:56, 5.93s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:46:39,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:46:39,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.27 | bwd_microstep: 3855.48 | bwd_inner_microstep: 3847.79 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.93 [2024-11-13 18:46:39,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.25 | bwd: 3855.49 | bwd_inner: 3847.79 | bwd_allreduce: 7.66 | step: 21.93 2%|▏ | 885/50750 [2:04:01<82:10:44, 5.93s/it] {'loss': 0.0031, 'learning_rate': 2.324359816152331e-05, 'epoch': 0.87} 2%|▏ | 885/50750 [2:04:01<82:10:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:46:45,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 18:46:45,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.33 | bwd_microstep: 3852.19 | bwd_inner_microstep: 3844.49 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.01 [2024-11-13 18:46:45,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.33 | bwd: 3852.20 | bwd_inner: 3844.49 | bwd_allreduce: 7.67 | step: 22.02 2%|▏ | 886/50750 [2:04:07<82:09:22, 5.93s/it] {'loss': 1.1608, 'learning_rate': 2.32698621142482e-05, 'epoch': 0.87} 2%|▏ | 886/50750 [2:04:07<82:09:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:46:51,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:46:51,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.60 | bwd_microstep: 3852.53 | bwd_inner_microstep: 3845.04 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.99 [2024-11-13 18:46:51,308] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.59 | bwd: 3852.54 | bwd_inner: 3845.04 | bwd_allreduce: 7.46 | step: 21.00 2%|▏ | 887/50750 [2:04:13<82:10:14, 5.93s/it] {'loss': 0.0037, 'learning_rate': 2.3296126066973082e-05, 'epoch': 0.87} 2%|▏ | 887/50750 [2:04:13<82:10:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:46:57,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.63 | optimizer_step: 4.93 [2024-11-13 18:46:57,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.72 | bwd_microstep: 3863.15 | bwd_inner_microstep: 3854.13 | bwd_allreduce_microstep: 8.96 | step_microstep: 27.27 [2024-11-13 18:46:57,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.71 | bwd: 3863.17 | bwd_inner: 3854.13 | bwd_allreduce: 8.99 | step: 27.29 2%|▏ | 888/50750 [2:04:19<82:14:47, 5.94s/it] {'loss': 0.0121, 'learning_rate': 2.3322390019697966e-05, 'epoch': 0.87} 2%|▏ | 888/50750 [2:04:19<82:14:47, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:47:03,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 18:47:03,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.89 | bwd_microstep: 3864.16 | bwd_inner_microstep: 3856.46 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.78 [2024-11-13 18:47:03,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.87 | bwd: 3864.18 | bwd_inner: 3856.46 | bwd_allreduce: 7.68 | step: 21.78 2%|▏ | 889/50750 [2:04:25<82:17:00, 5.94s/it] {'loss': 0.0044, 'learning_rate': 2.3348653972422853e-05, 'epoch': 0.88} 2%|▏ | 889/50750 [2:04:25<82:17:00, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:47:09,143] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.94 [2024-11-13 18:47:09,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.99 | bwd_microstep: 3865.37 | bwd_inner_microstep: 3857.85 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.31 [2024-11-13 18:47:09,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.97 | bwd: 3865.38 | bwd_inner: 3857.85 | bwd_allreduce: 7.50 | step: 21.31 2%|▏ | 890/50750 [2:04:31<82:15:52, 5.94s/it] {'loss': 0.3388, 'learning_rate': 2.3374917925147734e-05, 'epoch': 0.88} 2%|▏ | 890/50750 [2:04:31<82:15:52, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:47:15,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:47:15,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.26 | bwd_microstep: 3879.80 | bwd_inner_microstep: 3872.23 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.73 [2024-11-13 18:47:15,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.26 | bwd: 3879.81 | bwd_inner: 3872.23 | bwd_allreduce: 7.54 | step: 21.73 2%|▏ | 891/50750 [2:04:37<82:19:29, 5.94s/it] {'loss': 0.4812, 'learning_rate': 2.3401181877872624e-05, 'epoch': 0.88} 2%|▏ | 891/50750 [2:04:37<82:19:29, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:47:21,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:47:21,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | bwd_microstep: 3851.48 | bwd_inner_microstep: 3843.77 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.45 [2024-11-13 18:47:21,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.22 | bwd: 3851.49 | bwd_inner: 3843.77 | 
bwd_allreduce: 7.68 | step: 21.46 2%|▏ | 892/50750 [2:04:42<82:15:43, 5.94s/it] {'loss': 0.0209, 'learning_rate': 2.3427445830597505e-05, 'epoch': 0.88} 2%|▏ | 892/50750 [2:04:42<82:15:43, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:47:26,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:47:26,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.24 | bwd_microstep: 3851.19 | bwd_inner_microstep: 3843.66 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.49 [2024-11-13 18:47:26,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.24 | bwd: 3851.21 | bwd_inner: 3843.66 | bwd_allreduce: 7.51 | step: 21.49 2%|▏ | 893/50750 [2:04:48<82:13:06, 5.94s/it] {'loss': 0.3866, 'learning_rate': 2.3453709783322395e-05, 'epoch': 0.88} 2%|▏ | 893/50750 [2:04:48<82:13:06, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:47:32,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:47:32,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.91 | bwd_microstep: 3841.56 | bwd_inner_microstep: 3833.97 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.74 [2024-11-13 18:47:32,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.91 | bwd: 3841.57 | bwd_inner: 3833.97 | bwd_allreduce: 7.55 | step: 21.75 2%|▏ | 894/50750 [2:04:54<82:07:27, 5.93s/it] {'loss': 0.025, 'learning_rate': 2.3479973736047276e-05, 'epoch': 0.88} 2%|▏ | 894/50750 [2:04:54<82:07:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:47:38,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 
18:47:38,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.50 | bwd_microstep: 3848.41 | bwd_inner_microstep: 3840.89 | bwd_allreduce_microstep: 7.47 | step_microstep: 22.00 [2024-11-13 18:47:38,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.50 | bwd: 3848.42 | bwd_inner: 3840.89 | bwd_allreduce: 7.48 | step: 22.00 2%|▏ | 895/50750 [2:05:00<82:06:55, 5.93s/it] {'loss': 0.0011, 'learning_rate': 2.350623768877216e-05, 'epoch': 0.88} 2%|▏ | 895/50750 [2:05:00<82:06:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:47:44,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:47:44,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.01 | bwd_microstep: 3844.59 | bwd_inner_microstep: 3837.09 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-13 18:47:44,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.01 | bwd: 3844.60 | bwd_inner: 3837.09 | bwd_allreduce: 7.47 | step: 21.10 2%|▏ | 896/50750 [2:05:06<82:02:38, 5.92s/it] {'loss': 0.1391, 'learning_rate': 2.3532501641497046e-05, 'epoch': 0.88} 2%|▏ | 896/50750 [2:05:06<82:02:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:47:50,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.37 | optimizer_step: 4.93 [2024-11-13 18:47:50,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.77 | bwd_microstep: 3850.07 | bwd_inner_microstep: 3842.40 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.95 [2024-11-13 18:47:50,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.77 | bwd: 3850.09 | bwd_inner: 3842.40 | bwd_allreduce: 7.65 | step: 21.95 2%|▏ | 897/50750 [2:05:12<82:03:16, 5.93s/it] {'loss': 0.0832, 'learning_rate': 
2.355876559422193e-05, 'epoch': 0.88} 2%|▏ | 897/50750 [2:05:12<82:03:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:47:56,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:47:56,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.84 | bwd_microstep: 3846.41 | bwd_inner_microstep: 3838.60 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.83 [2024-11-13 18:47:56,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.83 | bwd: 3846.42 | bwd_inner: 3838.60 | bwd_allreduce: 7.78 | step: 21.84 2%|▏ | 898/50750 [2:05:18<82:04:43, 5.93s/it] {'loss': 0.0015, 'learning_rate': 2.3585029546946817e-05, 'epoch': 0.88} 2%|▏ | 898/50750 [2:05:18<82:04:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:48:02,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.92 [2024-11-13 18:48:02,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.05 | bwd_microstep: 3848.50 | bwd_inner_microstep: 3840.74 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.51 [2024-11-13 18:48:02,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.03 | bwd: 3848.51 | bwd_inner: 3840.74 | bwd_allreduce: 7.72 | step: 22.51 2%|▏ | 899/50750 [2:05:24<82:07:14, 5.93s/it] {'loss': 0.1472, 'learning_rate': 2.36112934996717e-05, 'epoch': 0.89} 2%|▏ | 899/50750 [2:05:24<82:07:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:48:08,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.92 [2024-11-13 18:48:08,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.32 | bwd_microstep: 3845.39 | 
bwd_inner_microstep: 3837.52 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.41 [2024-11-13 18:48:08,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.30 | bwd: 3845.41 | bwd_inner: 3837.52 | bwd_allreduce: 7.85 | step: 22.42 2%|▏ | 900/50750 [2:05:30<82:08:23, 5.93s/it] {'loss': 0.3007, 'learning_rate': 2.363755745239659e-05, 'epoch': 0.89} 2%|▏ | 900/50750 [2:05:30<82:08:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:48:14,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 18:48:14,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.82 | bwd_microstep: 3847.34 | bwd_inner_microstep: 3839.20 | bwd_allreduce_microstep: 8.09 | step_microstep: 22.01 [2024-11-13 18:48:14,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.81 | bwd: 3847.35 | bwd_inner: 3839.20 | bwd_allreduce: 8.11 | step: 22.02 2%|▏ | 901/50750 [2:05:36<82:09:45, 5.93s/it] {'loss': 0.0045, 'learning_rate': 2.3663821405121472e-05, 'epoch': 0.89} 2%|▏ | 901/50750 [2:05:36<82:09:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:48:20,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:48:20,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.99 | bwd_microstep: 3845.05 | bwd_inner_microstep: 3837.55 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.03 [2024-11-13 18:48:20,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.97 | bwd: 3845.06 | bwd_inner: 3837.55 | bwd_allreduce: 7.47 | step: 21.03 2%|▏ | 902/50750 [2:05:42<82:06:34, 5.93s/it] {'loss': 0.0025, 'learning_rate': 2.369008535784636e-05, 'epoch': 0.89} 2%|▏ | 902/50750 [2:05:42<82:06:34, 5.93s/it]dynamic ViT batch size: 32, images 
per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:48:26,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:48:26,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.37 | bwd_microstep: 3851.69 | bwd_inner_microstep: 3844.21 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-13 18:48:26,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.37 | bwd: 3851.71 | bwd_inner: 3844.21 | bwd_allreduce: 7.46 | step: 20.88 2%|▏ | 903/50750 [2:05:48<82:05:01, 5.93s/it] {'loss': 1.0781, 'learning_rate': 2.3716349310571243e-05, 'epoch': 0.89} 2%|▏ | 903/50750 [2:05:48<82:05:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:48:32,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:48:32,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.32 | bwd_microstep: 3855.62 | bwd_inner_microstep: 3844.20 | bwd_allreduce_microstep: 11.37 | step_microstep: 21.25 [2024-11-13 18:48:32,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.32 | bwd: 3855.63 | bwd_inner: 3844.20 | bwd_allreduce: 11.39 | step: 21.27 2%|▏ | 904/50750 [2:05:54<82:04:54, 5.93s/it] {'loss': 0.0009, 'learning_rate': 2.3742613263296127e-05, 'epoch': 0.89} 2%|▏ | 904/50750 [2:05:54<82:04:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:48:37,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 18:48:37,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1992.80 | bwd_microstep: 3785.04 | bwd_inner_microstep: 3777.00 | bwd_allreduce_microstep: 7.98 | step_microstep: 23.89 [2024-11-13 18:48:37,984] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1992.78 | bwd: 3785.05 | bwd_inner: 3777.00 | bwd_allreduce: 8.00 | step: 23.89 2%|▏ | 905/50750 [2:05:59<81:40:11, 5.90s/it] {'loss': 0.6028, 'learning_rate': 2.3768877216021014e-05, 'epoch': 0.89} 2%|▏ | 905/50750 [2:05:59<81:40:11, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:48:43,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:48:43,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3858.23 | bwd_inner_microstep: 3850.73 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.03 [2024-11-13 18:48:43,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3858.24 | bwd_inner: 3850.73 | bwd_allreduce: 7.47 | step: 21.03 2%|▏ | 906/50750 [2:06:05<81:47:56, 5.91s/it] {'loss': 0.0381, 'learning_rate': 2.3795141168745898e-05, 'epoch': 0.89} 2%|▏ | 906/50750 [2:06:05<81:47:56, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:48:49,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.92 [2024-11-13 18:48:49,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.55 | bwd_microstep: 3848.44 | bwd_inner_microstep: 3840.68 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.68 [2024-11-13 18:48:49,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.55 | bwd: 3848.45 | bwd_inner: 3840.68 | bwd_allreduce: 7.73 | step: 22.69 2%|▏ | 907/50750 [2:06:11<81:52:24, 5.91s/it] {'loss': 0.0038, 'learning_rate': 2.3821405121470785e-05, 'epoch': 0.89} 2%|▏ | 907/50750 [2:06:11<81:52:24, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:48:55,769] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:48:55,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.94 | bwd_microstep: 3849.26 | bwd_inner_microstep: 3841.72 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.29 [2024-11-13 18:48:55,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.92 | bwd: 3849.27 | bwd_inner: 3841.72 | bwd_allreduce: 7.51 | step: 21.29 2%|▏ | 908/50750 [2:06:17<81:55:59, 5.92s/it] {'loss': 0.6104, 'learning_rate': 2.384766907419567e-05, 'epoch': 0.89} 2%|▏ | 908/50750 [2:06:17<81:55:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:49:01,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:49:01,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.52 | bwd_microstep: 3842.08 | bwd_inner_microstep: 3834.29 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.35 [2024-11-13 18:49:01,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.51 | bwd: 3842.10 | bwd_inner: 3834.29 | bwd_allreduce: 7.76 | step: 21.36 2%|▏ | 909/50750 [2:06:23<81:55:57, 5.92s/it] {'loss': 0.0035, 'learning_rate': 2.3873933026920556e-05, 'epoch': 0.9} 2%|▏ | 909/50750 [2:06:23<81:55:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:49:07,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:49:07,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.70 | bwd_microstep: 3842.04 | bwd_inner_microstep: 3834.33 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.35 [2024-11-13 18:49:07,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3842.06 | bwd_inner: 3834.33 | 
bwd_allreduce: 7.67 | step: 21.35 2%|▏ | 910/50750 [2:06:29<81:55:56, 5.92s/it] {'loss': 0.0029, 'learning_rate': 2.390019697964544e-05, 'epoch': 0.9} 2%|▏ | 910/50750 [2:06:29<81:55:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:49:13,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:49:13,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.49 | bwd_microstep: 3841.78 | bwd_inner_microstep: 3834.10 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.80 [2024-11-13 18:49:13,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3841.79 | bwd_inner: 3834.10 | bwd_allreduce: 7.65 | step: 20.80 2%|▏ | 911/50750 [2:06:35<81:54:23, 5.92s/it] {'loss': 0.0582, 'learning_rate': 2.3926460932370327e-05, 'epoch': 0.9} 2%|▏ | 911/50750 [2:06:35<81:54:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:49:19,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:49:19,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.20 | bwd_microstep: 3841.35 | bwd_inner_microstep: 3833.89 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.99 [2024-11-13 18:49:19,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.20 | bwd: 3841.36 | bwd_inner: 3833.89 | bwd_allreduce: 7.44 | step: 21.00 2%|▏ | 912/50750 [2:06:41<81:53:08, 5.91s/it] {'loss': 0.4805, 'learning_rate': 2.3952724885095207e-05, 'epoch': 0.9} 2%|▏ | 912/50750 [2:06:41<81:53:08, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:49:25,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 
18:49:25,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.49 | bwd_microstep: 3864.81 | bwd_inner_microstep: 3857.34 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.83 [2024-11-13 18:49:25,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.49 | bwd: 3864.82 | bwd_inner: 3857.34 | bwd_allreduce: 7.44 | step: 20.83 2%|▏ | 913/50750 [2:06:47<81:57:27, 5.92s/it] {'loss': 0.6005, 'learning_rate': 2.397898883782009e-05, 'epoch': 0.9} 2%|▏ | 913/50750 [2:06:47<81:57:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:49:31,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:49:31,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.44 | bwd_microstep: 3844.62 | bwd_inner_microstep: 3837.14 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.95 [2024-11-13 18:49:31,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.44 | bwd: 3844.63 | bwd_inner: 3837.14 | bwd_allreduce: 7.45 | step: 20.95 2%|▏ | 914/50750 [2:06:53<81:55:52, 5.92s/it] {'loss': 0.6145, 'learning_rate': 2.4005252790544978e-05, 'epoch': 0.9} 2%|▏ | 914/50750 [2:06:53<81:55:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:49:37,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 18:49:37,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.41 | bwd_microstep: 3849.82 | bwd_inner_microstep: 3842.35 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.82 [2024-11-13 18:49:37,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.41 | bwd: 3849.83 | bwd_inner: 3842.35 | bwd_allreduce: 7.45 | step: 20.82 2%|▏ | 915/50750 [2:06:59<81:55:53, 5.92s/it] {'loss': 0.0232, 'learning_rate': 
2.4031516743269862e-05, 'epoch': 0.9} 2%|▏ | 915/50750 [2:06:59<81:55:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:49:43,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:49:43,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.24 | bwd_microstep: 3841.24 | bwd_inner_microstep: 3833.76 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.87 [2024-11-13 18:49:43,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.24 | bwd: 3841.25 | bwd_inner: 3833.76 | bwd_allreduce: 7.45 | step: 20.88 2%|▏ | 916/50750 [2:07:05<81:53:46, 5.92s/it] {'loss': 0.0109, 'learning_rate': 2.405778069599475e-05, 'epoch': 0.9} 2%|▏ | 916/50750 [2:07:05<81:53:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:49:49,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:49:49,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.83 | bwd_microstep: 3842.59 | bwd_inner_microstep: 3835.13 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.80 [2024-11-13 18:49:49,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.83 | bwd: 3842.60 | bwd_inner: 3835.13 | bwd_allreduce: 7.44 | step: 20.80 2%|▏ | 917/50750 [2:07:10<81:53:35, 5.92s/it] {'loss': 0.1914, 'learning_rate': 2.4084044648719633e-05, 'epoch': 0.9} 2%|▏ | 917/50750 [2:07:10<81:53:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:49:54,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:49:54,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.15 | bwd_microstep: 3846.28 | 
bwd_inner_microstep: 3838.81 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.01 [2024-11-13 18:49:54,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.14 | bwd: 3846.30 | bwd_inner: 3838.81 | bwd_allreduce: 7.45 | step: 21.01 2%|▏ | 918/50750 [2:07:16<81:54:03, 5.92s/it] {'loss': 0.6074, 'learning_rate': 2.411030860144452e-05, 'epoch': 0.9} 2%|▏ | 918/50750 [2:07:16<81:54:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:50:00,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 18:50:00,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.73 | bwd_microstep: 3846.75 | bwd_inner_microstep: 3838.79 | bwd_allreduce_microstep: 7.89 | step_microstep: 24.80 [2024-11-13 18:50:00,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.73 | bwd: 3846.77 | bwd_inner: 3838.79 | bwd_allreduce: 7.92 | step: 24.80 2%|▏ | 919/50750 [2:07:22<81:55:55, 5.92s/it] {'loss': 0.3781, 'learning_rate': 2.4136572554169404e-05, 'epoch': 0.91} 2%|▏ | 919/50750 [2:07:22<81:55:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:50:06,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 5.08 [2024-11-13 18:50:06,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.74 | bwd_microstep: 3854.45 | bwd_inner_microstep: 3846.92 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.20 [2024-11-13 18:50:06,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.72 | bwd: 3854.46 | bwd_inner: 3846.92 | bwd_allreduce: 7.50 | step: 21.20 2%|▏ | 920/50750 [2:07:28<81:59:45, 5.92s/it] {'loss': 0.0276, 'learning_rate': 2.416283650689429e-05, 'epoch': 0.91} 2%|▏ | 920/50750 [2:07:28<81:59:45, 5.92s/it]dynamic ViT batch size: 32, images 
per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:50:12,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:50:12,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.76 | bwd_microstep: 3852.64 | bwd_inner_microstep: 3844.75 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.75 [2024-11-13 18:50:12,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.76 | bwd: 3852.66 | bwd_inner: 3844.75 | bwd_allreduce: 7.85 | step: 22.75 2%|▏ | 921/50750 [2:07:34<82:00:48, 5.93s/it] {'loss': 0.1801, 'learning_rate': 2.4189100459619175e-05, 'epoch': 0.91} 2%|▏ | 921/50750 [2:07:34<82:00:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:50:18,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:50:18,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.29 | bwd_microstep: 3853.26 | bwd_inner_microstep: 3845.77 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.22 [2024-11-13 18:50:18,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.28 | bwd: 3853.27 | bwd_inner: 3845.77 | bwd_allreduce: 7.46 | step: 21.22 2%|▏ | 922/50750 [2:07:40<82:02:19, 5.93s/it] {'loss': 0.2825, 'learning_rate': 2.421536441234406e-05, 'epoch': 0.91} 2%|▏ | 922/50750 [2:07:40<82:02:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:50:24,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:50:24,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.20 | bwd_microstep: 3856.52 | bwd_inner_microstep: 3848.86 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.53 [2024-11-13 18:50:24,596] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.18 | bwd: 3856.53 | bwd_inner: 3848.86 | bwd_allreduce: 7.63 | step: 21.53 2%|▏ | 923/50750 [2:07:46<82:04:40, 5.93s/it] {'loss': 0.8078, 'learning_rate': 2.4241628365068946e-05, 'epoch': 0.91} 2%|▏ | 923/50750 [2:07:46<82:04:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:50:30,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:50:30,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.73 | bwd_microstep: 3844.70 | bwd_inner_microstep: 3837.02 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.49 [2024-11-13 18:50:30,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.74 | bwd: 3844.71 | bwd_inner: 3837.02 | bwd_allreduce: 7.65 | step: 21.49 2%|▏ | 924/50750 [2:07:52<82:00:48, 5.93s/it] {'loss': 0.2083, 'learning_rate': 2.426789231779383e-05, 'epoch': 0.91} 2%|▏ | 924/50750 [2:07:52<82:00:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:50:36,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:50:36,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.62 | bwd_microstep: 3850.01 | bwd_inner_microstep: 3842.36 | bwd_allreduce_microstep: 7.61 | step_microstep: 22.01 [2024-11-13 18:50:36,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.62 | bwd: 3850.03 | bwd_inner: 3842.36 | bwd_allreduce: 7.63 | step: 22.02 2%|▏ | 925/50750 [2:07:58<82:00:40, 5.93s/it] {'loss': 0.4953, 'learning_rate': 2.4294156270518716e-05, 'epoch': 0.91} 2%|▏ | 925/50750 [2:07:58<82:00:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:50:42,358] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:50:42,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.77 | bwd_microstep: 3846.25 | bwd_inner_microstep: 3838.56 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.03 [2024-11-13 18:50:42,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.76 | bwd: 3846.26 | bwd_inner: 3838.56 | bwd_allreduce: 7.66 | step: 21.03 2%|▏ | 926/50750 [2:08:04<81:59:51, 5.92s/it] {'loss': 0.0946, 'learning_rate': 2.43204202232436e-05, 'epoch': 0.91} 2%|▏ | 926/50750 [2:08:04<81:59:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:50:48,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:50:48,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | bwd_microstep: 3853.86 | bwd_inner_microstep: 3845.76 | bwd_allreduce_microstep: 8.04 | step_microstep: 23.61 [2024-11-13 18:50:48,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3853.88 | bwd_inner: 3845.76 | bwd_allreduce: 8.07 | step: 23.61 2%|▏ | 927/50750 [2:08:10<82:01:30, 5.93s/it] {'loss': 0.0407, 'learning_rate': 2.4346684175968487e-05, 'epoch': 0.91} 2%|▏ | 927/50750 [2:08:10<82:01:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:50:54,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:50:54,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.07 | bwd_microstep: 3852.11 | bwd_inner_microstep: 3844.32 | bwd_allreduce_microstep: 7.73 | step_microstep: 23.89 [2024-11-13 18:50:54,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.05 | bwd: 3852.12 | bwd_inner: 3844.32 | bwd_allreduce: 7.75 
| step: 23.89 2%|▏ | 928/50750 [2:08:16<82:03:09, 5.93s/it] {'loss': 0.3022, 'learning_rate': 2.437294812869337e-05, 'epoch': 0.91} 2%|▏ | 928/50750 [2:08:16<82:03:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:51:00,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:51:00,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.82 | bwd_microstep: 3850.92 | bwd_inner_microstep: 3843.46 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.89 [2024-11-13 18:51:00,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.82 | bwd: 3850.93 | bwd_inner: 3843.46 | bwd_allreduce: 7.43 | step: 20.89 2%|▏ | 929/50750 [2:08:22<82:01:39, 5.93s/it] {'loss': 0.3233, 'learning_rate': 2.4399212081418258e-05, 'epoch': 0.92} 2%|▏ | 929/50750 [2:08:22<82:01:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:51:06,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.05 [2024-11-13 18:51:06,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.84 | bwd_microstep: 3854.14 | bwd_inner_microstep: 3846.47 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.75 [2024-11-13 18:51:06,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.84 | bwd: 3854.15 | bwd_inner: 3846.47 | bwd_allreduce: 7.65 | step: 21.75 2%|▏ | 930/50750 [2:08:28<82:03:38, 5.93s/it] {'loss': 0.0477, 'learning_rate': 2.4425476034143142e-05, 'epoch': 0.92} 2%|▏ | 930/50750 [2:08:28<82:03:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:51:12,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 18:51:12,008] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.14 | bwd_microstep: 3852.96 | bwd_inner_microstep: 3845.02 | bwd_allreduce_microstep: 7.90 | step_microstep: 22.02 [2024-11-13 18:51:12,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3852.98 | bwd_inner: 3845.02 | bwd_allreduce: 7.91 | step: 22.03 2%|▏ | 931/50750 [2:08:33<82:04:10, 5.93s/it] {'loss': 0.07, 'learning_rate': 2.4451739986868022e-05, 'epoch': 0.92} 2%|▏ | 931/50750 [2:08:33<82:04:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:51:17,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:51:17,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.13 | bwd_microstep: 3848.48 | bwd_inner_microstep: 3840.99 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.37 [2024-11-13 18:51:17,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.11 | bwd: 3848.49 | bwd_inner: 3840.99 | bwd_allreduce: 7.46 | step: 21.37 2%|▏ | 932/50750 [2:08:39<82:04:53, 5.93s/it] {'loss': 0.033, 'learning_rate': 2.4478003939592913e-05, 'epoch': 0.92} 2%|▏ | 932/50750 [2:08:39<82:04:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:51:23,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 18:51:23,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.64 | bwd_microstep: 3848.93 | bwd_inner_microstep: 3841.46 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.02 [2024-11-13 18:51:23,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.64 | bwd: 3848.94 | bwd_inner: 3841.46 | bwd_allreduce: 7.44 | step: 21.03 2%|▏ | 933/50750 [2:08:45<82:01:53, 5.93s/it] {'loss': 0.0461, 'learning_rate': 2.4504267892317793e-05, 
'epoch': 0.92} 2%|▏ | 933/50750 [2:08:45<82:01:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:51:29,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:51:29,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.48 | bwd_microstep: 3847.70 | bwd_inner_microstep: 3840.18 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.25 [2024-11-13 18:51:29,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.47 | bwd: 3847.72 | bwd_inner: 3840.18 | bwd_allreduce: 7.50 | step: 21.26 2%|▏ | 934/50750 [2:08:51<81:59:13, 5.92s/it] {'loss': 0.0267, 'learning_rate': 2.453053184504268e-05, 'epoch': 0.92} 2%|▏ | 934/50750 [2:08:51<81:59:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:51:35,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:51:35,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.59 | bwd_microstep: 3856.77 | bwd_inner_microstep: 3849.24 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.97 [2024-11-13 18:51:35,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.59 | bwd: 3856.78 | bwd_inner: 3849.24 | bwd_allreduce: 7.50 | step: 20.98 2%|▏ | 935/50750 [2:08:57<82:00:20, 5.93s/it] {'loss': 0.0094, 'learning_rate': 2.4556795797767564e-05, 'epoch': 0.92} 2%|▏ | 935/50750 [2:08:57<82:00:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:51:41,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:51:41,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.24 | bwd_microstep: 3857.38 | bwd_inner_microstep: 
3849.86 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.99 [2024-11-13 18:51:41,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.25 | bwd: 3857.40 | bwd_inner: 3849.86 | bwd_allreduce: 7.50 | step: 20.99 2%|▏ | 936/50750 [2:09:03<82:02:50, 5.93s/it] {'loss': 0.0056, 'learning_rate': 2.458305975049245e-05, 'epoch': 0.92} 2%|▏ | 936/50750 [2:09:03<82:02:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 18:51:47,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 18:51:47,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.88 | bwd_microstep: 3854.16 | bwd_inner_microstep: 3846.65 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.43 [2024-11-13 18:51:47,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.88 | bwd: 3854.18 | bwd_inner: 3846.65 | bwd_allreduce: 7.48 | step: 21.43 2%|▏ | 937/50750 [2:09:09<82:03:30, 5.93s/it] {'loss': 0.226, 'learning_rate': 2.4609323703217335e-05, 'epoch': 0.92} 2%|▏ | 937/50750 [2:09:09<82:03:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:51:53,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 18:51:53,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.68 | bwd_microstep: 3854.47 | bwd_inner_microstep: 3846.57 | bwd_allreduce_microstep: 7.83 | step_microstep: 29.50 [2024-11-13 18:51:53,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.68 | bwd: 3854.49 | bwd_inner: 3846.57 | bwd_allreduce: 7.87 | step: 29.51 2%|▏ | 938/50750 [2:09:15<82:04:50, 5.93s/it] {'loss': 0.0163, 'learning_rate': 2.463558765594222e-05, 'epoch': 0.92} 2%|▏ | 938/50750 [2:09:15<82:04:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2200 [2024-11-13 18:51:59,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:51:59,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.95 | bwd_microstep: 3854.37 | bwd_inner_microstep: 3846.86 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.86 [2024-11-13 18:51:59,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.94 | bwd: 3854.38 | bwd_inner: 3846.86 | bwd_allreduce: 7.48 | step: 20.86 2%|▏ | 939/50750 [2:09:21<82:04:24, 5.93s/it] {'loss': 0.0048, 'learning_rate': 2.4661851608667106e-05, 'epoch': 0.93} 2%|▏ | 939/50750 [2:09:21<82:04:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:52:05,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:52:05,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.38 | bwd_microstep: 3844.96 | bwd_inner_microstep: 3837.44 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.04 [2024-11-13 18:52:05,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.38 | bwd: 3844.98 | bwd_inner: 3837.44 | bwd_allreduce: 7.50 | step: 21.04 2%|▏ | 940/50750 [2:09:27<82:01:26, 5.93s/it] {'loss': 0.0054, 'learning_rate': 2.468811556139199e-05, 'epoch': 0.93} 2%|▏ | 940/50750 [2:09:27<82:01:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:52:11,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 18:52:11,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.74 | bwd_microstep: 3846.45 | bwd_inner_microstep: 3838.64 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.59 [2024-11-13 18:52:11,290] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.74 | bwd: 3846.46 | bwd_inner: 3838.64 | bwd_allreduce: 7.79 | step: 21.60 2%|▏ | 941/50750 [2:09:33<81:58:17, 5.92s/it] {'loss': 0.2193, 'learning_rate': 2.4714379514116877e-05, 'epoch': 0.93} 2%|▏ | 941/50750 [2:09:33<81:58:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:52:17,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.50 | optimizer_step: 4.93 [2024-11-13 18:52:17,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.38 | bwd_microstep: 3850.87 | bwd_inner_microstep: 3843.30 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.57 [2024-11-13 18:52:17,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.37 | bwd: 3850.88 | bwd_inner: 3843.30 | bwd_allreduce: 7.54 | step: 22.58 2%|▏ | 942/50750 [2:09:39<81:58:51, 5.93s/it] {'loss': 0.3593, 'learning_rate': 2.474064346684176e-05, 'epoch': 0.93} 2%|▏ | 942/50750 [2:09:39<81:58:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:52:23,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:52:23,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.37 | bwd_microstep: 3850.86 | bwd_inner_microstep: 3843.26 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.20 [2024-11-13 18:52:23,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.37 | bwd: 3850.87 | bwd_inner: 3843.26 | bwd_allreduce: 7.57 | step: 21.21 2%|▏ | 943/50750 [2:09:45<81:58:35, 5.93s/it] {'loss': 0.0043, 'learning_rate': 2.4766907419566648e-05, 'epoch': 0.93} 2%|▏ | 943/50750 [2:09:45<81:58:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:52:29,072] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 18:52:29,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.08 | bwd_microstep: 3849.67 | bwd_inner_microstep: 3842.17 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.29 [2024-11-13 18:52:29,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.06 | bwd: 3849.68 | bwd_inner: 3842.17 | bwd_allreduce: 7.47 | step: 21.29 2%|▏ | 944/50750 [2:09:51<81:59:38, 5.93s/it] {'loss': 0.0011, 'learning_rate': 2.4793171372291532e-05, 'epoch': 0.93} 2%|▏ | 944/50750 [2:09:51<81:59:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:52:34,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:52:34,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.64 | bwd_microstep: 3854.35 | bwd_inner_microstep: 3846.81 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.03 [2024-11-13 18:52:34,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.64 | bwd: 3854.36 | bwd_inner: 3846.81 | bwd_allreduce: 7.51 | step: 21.04 2%|▏ | 945/50750 [2:09:56<81:59:12, 5.93s/it] {'loss': 0.0015, 'learning_rate': 2.481943532501642e-05, 'epoch': 0.93} 2%|▏ | 945/50750 [2:09:56<81:59:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:52:40,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 18:52:40,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3848.99 | bwd_inner_microstep: 3841.26 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.86 [2024-11-13 18:52:40,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3849.00 | bwd_inner: 3841.26 | bwd_allreduce: 
7.69 | step: 21.86 2%|▏ | 946/50750 [2:10:02<81:58:20, 5.93s/it] {'loss': 0.5109, 'learning_rate': 2.4845699277741303e-05, 'epoch': 0.93} 2%|▏ | 946/50750 [2:10:02<81:58:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:52:46,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.55 | optimizer_step: 4.93 [2024-11-13 18:52:46,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.52 | bwd_microstep: 3858.37 | bwd_inner_microstep: 3850.82 | bwd_allreduce_microstep: 7.52 | step_microstep: 23.35 [2024-11-13 18:52:46,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.50 | bwd: 3858.39 | bwd_inner: 3850.82 | bwd_allreduce: 7.53 | step: 23.37 2%|▏ | 947/50750 [2:10:08<82:01:38, 5.93s/it] {'loss': 0.3626, 'learning_rate': 2.4871963230466186e-05, 'epoch': 0.93} 2%|▏ | 947/50750 [2:10:08<82:01:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:52:52,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:52:52,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.01 | bwd_microstep: 3855.94 | bwd_inner_microstep: 3845.15 | bwd_allreduce_microstep: 10.71 | step_microstep: 21.39 [2024-11-13 18:52:52,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.99 | bwd: 3855.97 | bwd_inner: 3845.15 | bwd_allreduce: 10.75 | step: 21.38 2%|▏ | 948/50750 [2:10:14<82:02:00, 5.93s/it] {'loss': 0.0023, 'learning_rate': 2.4898227183191074e-05, 'epoch': 0.93} 2%|▏ | 948/50750 [2:10:14<82:02:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:52:58,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:52:58,720] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.79 | bwd_microstep: 3854.58 | bwd_inner_microstep: 3847.12 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.52 [2024-11-13 18:52:58,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.77 | bwd: 3854.59 | bwd_inner: 3847.12 | bwd_allreduce: 7.43 | step: 21.54 2%|▏ | 949/50750 [2:10:20<82:02:52, 5.93s/it] {'loss': 0.0097, 'learning_rate': 2.4924491135915957e-05, 'epoch': 0.93} 2%|▏ | 949/50750 [2:10:20<82:02:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:53:04,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 18:53:04,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.87 | bwd_microstep: 3857.56 | bwd_inner_microstep: 3849.78 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.36 [2024-11-13 18:53:04,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.87 | bwd: 3857.57 | bwd_inner: 3849.78 | bwd_allreduce: 7.75 | step: 21.37 2%|▏ | 950/50750 [2:10:26<82:02:55, 5.93s/it] {'loss': 0.421, 'learning_rate': 2.4950755088640845e-05, 'epoch': 0.94} 2%|▏ | 950/50750 [2:10:26<82:02:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:53:10,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.44 | optimizer_step: 4.93 [2024-11-13 18:53:10,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.98 | bwd_microstep: 3856.08 | bwd_inner_microstep: 3848.50 | bwd_allreduce_microstep: 7.53 | step_microstep: 22.96 [2024-11-13 18:53:10,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.97 | bwd: 3856.09 | bwd_inner: 3848.50 | bwd_allreduce: 7.54 | step: 22.97 2%|▏ | 951/50750 [2:10:32<82:03:14, 5.93s/it] {'loss': 0.0001, 'learning_rate': 
2.4977019041365725e-05, 'epoch': 0.94} 2%|▏ | 951/50750 [2:10:32<82:03:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:53:16,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 18:53:16,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.48 | bwd_microstep: 3855.18 | bwd_inner_microstep: 3847.25 | bwd_allreduce_microstep: 7.87 | step_microstep: 22.10 [2024-11-13 18:53:16,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.47 | bwd: 3855.19 | bwd_inner: 3847.25 | bwd_allreduce: 7.90 | step: 22.10 2%|▏ | 952/50750 [2:10:38<82:03:01, 5.93s/it] {'loss': 0.0654, 'learning_rate': 2.5003282994090615e-05, 'epoch': 0.94} 2%|▏ | 952/50750 [2:10:38<82:03:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:53:22,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:53:22,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.34 | bwd_microstep: 3852.23 | bwd_inner_microstep: 3844.72 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.24 [2024-11-13 18:53:22,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.33 | bwd: 3852.24 | bwd_inner: 3844.72 | bwd_allreduce: 7.48 | step: 21.24 2%|▏ | 953/50750 [2:10:44<82:02:09, 5.93s/it] {'loss': 0.6884, 'learning_rate': 2.5029546946815496e-05, 'epoch': 0.94} 2%|▏ | 953/50750 [2:10:44<82:02:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:53:28,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:53:28,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.26 | bwd_microstep: 3856.02 
| bwd_inner_microstep: 3848.32 | bwd_allreduce_microstep: 7.66 | step_microstep: 20.98 [2024-11-13 18:53:28,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.26 | bwd: 3856.03 | bwd_inner: 3848.32 | bwd_allreduce: 7.67 | step: 20.99 2%|▏ | 954/50750 [2:10:50<82:01:26, 5.93s/it] {'loss': 0.0015, 'learning_rate': 2.5055810899540386e-05, 'epoch': 0.94} 2%|▏ | 954/50750 [2:10:50<82:01:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:53:34,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 18:53:34,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3853.61 | bwd_inner_microstep: 3846.08 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.29 [2024-11-13 18:53:34,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3853.63 | bwd_inner: 3846.08 | bwd_allreduce: 7.50 | step: 21.31 2%|▏ | 955/50750 [2:10:56<82:00:45, 5.93s/it] {'loss': 0.0647, 'learning_rate': 2.5082074852265267e-05, 'epoch': 0.94} 2%|▏ | 955/50750 [2:10:56<82:00:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:53:40,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:53:40,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | bwd_microstep: 3844.29 | bwd_inner_microstep: 3836.82 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-13 18:53:40,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.23 | bwd: 3844.31 | bwd_inner: 3836.82 | bwd_allreduce: 7.45 | step: 20.90 2%|▏ | 956/50750 [2:11:02<81:58:24, 5.93s/it] {'loss': 0.0042, 'learning_rate': 2.510833880499015e-05, 'epoch': 0.94} 2%|▏ | 956/50750 [2:11:02<81:58:24, 5.93s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:53:46,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:53:46,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.71 | bwd_microstep: 3854.63 | bwd_inner_microstep: 3847.13 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.84 [2024-11-13 18:53:46,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.71 | bwd: 3854.64 | bwd_inner: 3847.13 | bwd_allreduce: 7.47 | step: 20.84 2%|▏ | 957/50750 [2:11:08<81:58:07, 5.93s/it] {'loss': 0.6115, 'learning_rate': 2.5134602757715038e-05, 'epoch': 0.94} 2%|▏ | 957/50750 [2:11:08<81:58:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:53:52,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:53:52,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3854.28 | bwd_inner_microstep: 3846.71 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.39 [2024-11-13 18:53:52,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3854.29 | bwd_inner: 3846.71 | bwd_allreduce: 7.54 | step: 21.39 2%|▏ | 958/50750 [2:11:14<81:58:07, 5.93s/it] {'loss': 0.6274, 'learning_rate': 2.516086671043992e-05, 'epoch': 0.94} 2%|▏ | 958/50750 [2:11:14<81:58:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:53:58,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:53:58,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.90 | bwd_microstep: 3852.14 | bwd_inner_microstep: 3844.59 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.08 [2024-11-13 18:53:58,008] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.88 | bwd: 3852.16 | bwd_inner: 3844.59 | bwd_allreduce: 7.53 | step: 21.09 2%|▏ | 959/50750 [2:11:19<81:59:08, 5.93s/it] {'loss': 0.0036, 'learning_rate': 2.518713066316481e-05, 'epoch': 0.94} 2%|▏ | 959/50750 [2:11:19<81:59:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:54:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:54:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.86 | bwd_microstep: 3850.82 | bwd_inner_microstep: 3843.28 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.07 [2024-11-13 18:54:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.86 | bwd: 3850.83 | bwd_inner: 3843.28 | bwd_allreduce: 7.51 | step: 21.07 2%|▏ | 960/50750 [2:11:25<81:57:48, 5.93s/it] {'loss': 0.7234, 'learning_rate': 2.5213394615889692e-05, 'epoch': 0.95} 2%|▏ | 960/50750 [2:11:25<81:57:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:54:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 18:54:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.30 | bwd_microstep: 3840.17 | bwd_inner_microstep: 3832.64 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.00 [2024-11-13 18:54:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.30 | bwd: 3840.18 | bwd_inner: 3832.64 | bwd_allreduce: 7.50 | step: 21.00 2%|▏ | 961/50750 [2:11:31<81:53:40, 5.92s/it] {'loss': 0.0035, 'learning_rate': 2.523965856861458e-05, 'epoch': 0.95} 2%|▏ | 961/50750 [2:11:31<81:53:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:54:15,759] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:54:15,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.80 | bwd_microstep: 3849.74 | bwd_inner_microstep: 3842.07 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.87 [2024-11-13 18:54:15,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.80 | bwd: 3849.75 | bwd_inner: 3842.07 | bwd_allreduce: 7.64 | step: 21.88 2%|▏ | 962/50750 [2:11:37<81:53:15, 5.92s/it] {'loss': 0.0023, 'learning_rate': 2.5265922521339463e-05, 'epoch': 0.95} 2%|▏ | 962/50750 [2:11:37<81:53:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:54:21,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:54:21,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.29 | bwd_microstep: 3848.01 | bwd_inner_microstep: 3840.50 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.18 [2024-11-13 18:54:21,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.28 | bwd: 3848.02 | bwd_inner: 3840.50 | bwd_allreduce: 7.49 | step: 21.18 2%|▏ | 963/50750 [2:11:43<81:55:16, 5.92s/it] {'loss': 0.124, 'learning_rate': 2.529218647406435e-05, 'epoch': 0.95} 2%|▏ | 963/50750 [2:11:43<81:55:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:54:27,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 18:54:27,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.47 | bwd_microstep: 3852.53 | bwd_inner_microstep: 3844.96 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.09 [2024-11-13 18:54:27,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.47 | bwd: 3852.55 | bwd_inner: 3844.96 | bwd_allreduce: 
7.54 | step: 21.09 2%|▏ | 964/50750 [2:11:49<81:54:43, 5.92s/it] {'loss': 0.0022, 'learning_rate': 2.5318450426789234e-05, 'epoch': 0.95} 2%|▏ | 964/50750 [2:11:49<81:54:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:54:33,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:54:33,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.87 | bwd_microstep: 3845.86 | bwd_inner_microstep: 3838.27 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.52 [2024-11-13 18:54:33,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.85 | bwd: 3845.87 | bwd_inner: 3838.27 | bwd_allreduce: 7.56 | step: 21.52 2%|▏ | 965/50750 [2:11:55<81:56:22, 5.93s/it] {'loss': 0.1073, 'learning_rate': 2.5344714379514118e-05, 'epoch': 0.95} 2%|▏ | 965/50750 [2:11:55<81:56:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:54:39,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 [2024-11-13 18:54:39,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3855.04 | bwd_inner_microstep: 3847.03 | bwd_allreduce_microstep: 7.95 | step_microstep: 29.80 [2024-11-13 18:54:39,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3855.07 | bwd_inner: 3847.03 | bwd_allreduce: 7.97 | step: 29.80 2%|▏ | 966/50750 [2:12:01<81:59:06, 5.93s/it] {'loss': 0.096, 'learning_rate': 2.5370978332239005e-05, 'epoch': 0.95} 2%|▏ | 966/50750 [2:12:01<81:59:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:54:45,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:54:45,399] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.14 | bwd_microstep: 3849.91 | bwd_inner_microstep: 3842.36 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.13 [2024-11-13 18:54:45,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3849.92 | bwd_inner: 3842.36 | bwd_allreduce: 7.53 | step: 21.14 2%|▏ | 967/50750 [2:12:07<81:57:07, 5.93s/it] {'loss': 0.4213, 'learning_rate': 2.539724228496389e-05, 'epoch': 0.95} 2%|▏ | 967/50750 [2:12:07<81:57:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:54:51,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 18:54:51,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.96 | bwd_microstep: 3848.93 | bwd_inner_microstep: 3841.46 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.22 [2024-11-13 18:54:51,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.96 | bwd: 3848.94 | bwd_inner: 3841.46 | bwd_allreduce: 7.44 | step: 21.22 2%|▏ | 968/50750 [2:12:13<81:56:10, 5.93s/it] {'loss': 1.1703, 'learning_rate': 2.5423506237688776e-05, 'epoch': 0.95} 2%|▏ | 968/50750 [2:12:13<81:56:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:54:57,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.97 [2024-11-13 18:54:57,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.22 | bwd_microstep: 3855.84 | bwd_inner_microstep: 3848.33 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-13 18:54:57,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.22 | bwd: 3855.86 | bwd_inner: 3848.33 | bwd_allreduce: 7.48 | step: 21.09 2%|▏ | 969/50750 [2:12:19<81:57:31, 5.93s/it] {'loss': 0.0027, 'learning_rate': 
2.544977019041366e-05, 'epoch': 0.95} 2%|▏ | 969/50750 [2:12:19<81:57:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:55:03,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.77 | optimizer_step: 4.92 [2024-11-13 18:55:03,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.31 | bwd_microstep: 3854.29 | bwd_inner_microstep: 3846.16 | bwd_allreduce_microstep: 8.06 | step_microstep: 26.99 [2024-11-13 18:55:03,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.30 | bwd: 3854.32 | bwd_inner: 3846.16 | bwd_allreduce: 8.09 | step: 26.99 2%|▏ | 970/50750 [2:12:25<82:01:31, 5.93s/it] {'loss': 0.0151, 'learning_rate': 2.5476034143138547e-05, 'epoch': 0.96} 2%|▏ | 970/50750 [2:12:25<82:01:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:55:09,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.20 | optimizer_gradients: 3.26 | optimizer_step: 5.09 [2024-11-13 18:55:09,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.20 | bwd_microstep: 3853.39 | bwd_inner_microstep: 3845.73 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.89 [2024-11-13 18:55:09,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.20 | bwd: 3853.40 | bwd_inner: 3845.73 | bwd_allreduce: 7.62 | step: 21.90 2%|▏ | 971/50750 [2:12:31<82:01:29, 5.93s/it] {'loss': 0.0021, 'learning_rate': 2.550229809586343e-05, 'epoch': 0.96} 2%|▏ | 971/50750 [2:12:31<82:01:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:55:15,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:55:15,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.14 | bwd_microstep: 3857.11 | 
bwd_inner_microstep: 3849.38 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.12 [2024-11-13 18:55:15,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.14 | bwd: 3857.12 | bwd_inner: 3849.38 | bwd_allreduce: 7.70 | step: 21.12 2%|▏ | 972/50750 [2:12:37<82:01:37, 5.93s/it] {'loss': 0.0026, 'learning_rate': 2.5528562048588318e-05, 'epoch': 0.96} 2%|▏ | 972/50750 [2:12:37<82:01:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:55:20,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:55:20,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.19 | bwd_microstep: 3853.17 | bwd_inner_microstep: 3845.67 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.08 [2024-11-13 18:55:20,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.19 | bwd: 3853.19 | bwd_inner: 3845.67 | bwd_allreduce: 7.48 | step: 21.09 2%|▏ | 973/50750 [2:12:42<82:01:41, 5.93s/it] {'loss': 0.0068, 'learning_rate': 2.55548260013132e-05, 'epoch': 0.96} 2%|▏ | 973/50750 [2:12:42<82:01:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:55:26,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 18:55:26,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.51 | bwd_microstep: 3850.10 | bwd_inner_microstep: 3842.14 | bwd_allreduce_microstep: 7.91 | step_microstep: 21.91 [2024-11-13 18:55:26,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.51 | bwd: 3850.11 | bwd_inner: 3842.14 | bwd_allreduce: 7.93 | step: 21.91 2%|▏ | 974/50750 [2:12:48<81:59:36, 5.93s/it] {'loss': 0.3001, 'learning_rate': 2.5581089954038082e-05, 'epoch': 0.96} 2%|▏ | 974/50750 [2:12:48<81:59:36, 5.93s/it]dynamic ViT batch size: 32, images 
per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:55:32,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.94 [2024-11-13 18:55:32,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.97 | bwd_microstep: 3846.51 | bwd_inner_microstep: 3838.96 | bwd_allreduce_microstep: 7.51 | step_microstep: 22.80 [2024-11-13 18:55:32,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.97 | bwd: 3846.52 | bwd_inner: 3838.96 | bwd_allreduce: 7.52 | step: 22.80 2%|▏ | 975/50750 [2:12:54<81:58:27, 5.93s/it] {'loss': 0.3486, 'learning_rate': 2.560735390676297e-05, 'epoch': 0.96} 2%|▏ | 975/50750 [2:12:54<81:58:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:55:38,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 18:55:38,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.27 | bwd_microstep: 3871.76 | bwd_inner_microstep: 3864.28 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.10 [2024-11-13 18:55:38,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.27 | bwd: 3871.77 | bwd_inner: 3864.28 | bwd_allreduce: 7.45 | step: 21.10 2%|▏ | 976/50750 [2:13:00<82:03:51, 5.94s/it] {'loss': 0.6396, 'learning_rate': 2.5633617859487853e-05, 'epoch': 0.96} 2%|▏ | 976/50750 [2:13:00<82:03:51, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:55:44,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:55:44,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.82 | bwd_microstep: 3843.27 | bwd_inner_microstep: 3835.80 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.13 [2024-11-13 18:55:44,711] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.82 | bwd: 3843.28 | bwd_inner: 3835.80 | bwd_allreduce: 7.44 | step: 21.14 2%|▏ | 977/50750 [2:13:06<81:58:22, 5.93s/it] {'loss': 0.0066, 'learning_rate': 2.565988181221274e-05, 'epoch': 0.96} 2%|▏ | 977/50750 [2:13:06<81:58:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:55:50,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:55:50,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.30 | bwd_microstep: 3848.26 | bwd_inner_microstep: 3840.76 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-13 18:55:50,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.30 | bwd: 3848.27 | bwd_inner: 3840.76 | bwd_allreduce: 7.48 | step: 21.10 2%|▏ | 978/50750 [2:13:12<81:55:14, 5.93s/it] {'loss': 0.0042, 'learning_rate': 2.5686145764937624e-05, 'epoch': 0.96} 2%|▏ | 978/50750 [2:13:12<81:55:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:55:56,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:55:56,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.72 | bwd_microstep: 3852.22 | bwd_inner_microstep: 3844.69 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.39 [2024-11-13 18:55:56,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.70 | bwd: 3852.23 | bwd_inner: 3844.69 | bwd_allreduce: 7.50 | step: 21.39 2%|▏ | 979/50750 [2:13:18<81:55:01, 5.93s/it] {'loss': 0.0031, 'learning_rate': 2.571240971766251e-05, 'epoch': 0.96} 2%|▏ | 979/50750 [2:13:18<81:55:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:56:02,481] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 18:56:02,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.92 | bwd_microstep: 3849.29 | bwd_inner_microstep: 3841.56 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.76 [2024-11-13 18:56:02,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.92 | bwd: 3849.30 | bwd_inner: 3841.56 | bwd_allreduce: 7.70 | step: 21.77 2%|▏ | 980/50750 [2:13:24<81:56:23, 5.93s/it] {'loss': 0.3593, 'learning_rate': 2.5738673670387395e-05, 'epoch': 0.97} 2%|▏ | 980/50750 [2:13:24<81:56:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:56:08,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 18:56:08,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.44 | bwd_microstep: 3849.29 | bwd_inner_microstep: 3841.76 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.36 [2024-11-13 18:56:08,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.43 | bwd: 3849.31 | bwd_inner: 3841.76 | bwd_allreduce: 7.51 | step: 21.36 2%|▏ | 981/50750 [2:13:30<81:54:32, 5.92s/it] {'loss': 0.0138, 'learning_rate': 2.576493762311228e-05, 'epoch': 0.97} 2%|▏ | 981/50750 [2:13:30<81:54:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:56:14,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:56:14,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.70 | bwd_microstep: 3853.60 | bwd_inner_microstep: 3846.04 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.62 [2024-11-13 18:56:14,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.70 | bwd: 3853.61 | bwd_inner: 3846.04 | bwd_allreduce: 
7.53 | step: 21.62 2%|▏ | 982/50750 [2:13:36<81:54:33, 5.92s/it] {'loss': 0.3176, 'learning_rate': 2.5791201575837166e-05, 'epoch': 0.97} 2%|▏ | 982/50750 [2:13:36<81:54:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:56:20,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:56:20,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.27 | bwd_microstep: 3847.41 | bwd_inner_microstep: 3839.52 | bwd_allreduce_microstep: 7.85 | step_microstep: 21.33 [2024-11-13 18:56:20,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.25 | bwd: 3847.43 | bwd_inner: 3839.52 | bwd_allreduce: 7.87 | step: 21.34 2%|▏ | 983/50750 [2:13:42<81:54:17, 5.92s/it] {'loss': 0.241, 'learning_rate': 2.581746552856205e-05, 'epoch': 0.97} 2%|▏ | 983/50750 [2:13:42<81:54:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:56:26,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:56:26,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.44 | bwd_microstep: 3856.18 | bwd_inner_microstep: 3848.66 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.16 [2024-11-13 18:56:26,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.44 | bwd: 3856.19 | bwd_inner: 3848.66 | bwd_allreduce: 7.49 | step: 21.16 2%|▏ | 984/50750 [2:13:48<81:54:47, 5.93s/it] {'loss': 0.0009, 'learning_rate': 2.5843729481286937e-05, 'epoch': 0.97} 2%|▏ | 984/50750 [2:13:48<81:54:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:56:32,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:56:32,095] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.72 | bwd_microstep: 3841.80 | bwd_inner_microstep: 3834.27 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.23 [2024-11-13 18:56:32,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.72 | bwd: 3841.82 | bwd_inner: 3834.27 | bwd_allreduce: 7.50 | step: 21.23 2%|▏ | 985/50750 [2:13:54<81:52:30, 5.92s/it] {'loss': 0.0505, 'learning_rate': 2.586999343401182e-05, 'epoch': 0.97} 2%|▏ | 985/50750 [2:13:54<81:52:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:56:38,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:56:38,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.97 | bwd_microstep: 3849.63 | bwd_inner_microstep: 3842.05 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.65 [2024-11-13 18:56:38,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.96 | bwd: 3849.64 | bwd_inner: 3842.05 | bwd_allreduce: 7.55 | step: 21.66 2%|▏ | 986/50750 [2:13:59<81:53:32, 5.92s/it] {'loss': 0.0021, 'learning_rate': 2.5896257386736708e-05, 'epoch': 0.97} 2%|▏ | 986/50750 [2:13:59<81:53:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:56:43,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 18:56:43,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.14 | bwd_microstep: 3844.37 | bwd_inner_microstep: 3836.89 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.85 [2024-11-13 18:56:43,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.13 | bwd: 3844.39 | bwd_inner: 3836.89 | bwd_allreduce: 7.45 | step: 20.86 2%|▏ | 987/50750 [2:14:05<81:53:03, 5.92s/it] {'loss': 0.0081, 'learning_rate': 
2.592252133946159e-05, 'epoch': 0.97} 2%|▏ | 987/50750 [2:14:05<81:53:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:56:49,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:56:49,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.99 | bwd_microstep: 3843.31 | bwd_inner_microstep: 3835.80 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.34 [2024-11-13 18:56:49,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.99 | bwd: 3843.32 | bwd_inner: 3835.80 | bwd_allreduce: 7.48 | step: 21.34 2%|▏ | 988/50750 [2:14:11<81:52:00, 5.92s/it] {'loss': 0.2029, 'learning_rate': 2.594878529218648e-05, 'epoch': 0.97} 2%|▏ | 988/50750 [2:14:11<81:52:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:56:55,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 18:56:55,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.77 | bwd_microstep: 3843.69 | bwd_inner_microstep: 3836.16 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.77 [2024-11-13 18:56:55,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.77 | bwd: 3843.70 | bwd_inner: 3836.16 | bwd_allreduce: 7.50 | step: 21.77 2%|▏ | 989/50750 [2:14:17<81:50:18, 5.92s/it] {'loss': 0.3813, 'learning_rate': 2.5975049244911362e-05, 'epoch': 0.97} 2%|▏ | 989/50750 [2:14:17<81:50:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2194 [2024-11-13 18:57:01,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-13 18:57:01,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.83 | bwd_microstep: 3843.16 | 
bwd_inner_microstep: 3835.60 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.05 [2024-11-13 18:57:01,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.83 | bwd: 3843.18 | bwd_inner: 3835.60 | bwd_allreduce: 7.54 | step: 22.06 2%|▏ | 990/50750 [2:14:23<81:49:32, 5.92s/it] {'loss': 0.0001, 'learning_rate': 2.6001313197636243e-05, 'epoch': 0.98} 2%|▏ | 990/50750 [2:14:23<81:49:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:57:07,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 18:57:07,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.88 | bwd_microstep: 3854.48 | bwd_inner_microstep: 3846.80 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.74 [2024-11-13 18:57:07,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.86 | bwd: 3854.49 | bwd_inner: 3846.80 | bwd_allreduce: 7.65 | step: 21.75 2%|▏ | 991/50750 [2:14:29<81:52:04, 5.92s/it] {'loss': 0.143, 'learning_rate': 2.6027577150361133e-05, 'epoch': 0.98} 2%|▏ | 991/50750 [2:14:29<81:52:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:57:13,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:57:13,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.71 | bwd_microstep: 3839.84 | bwd_inner_microstep: 3832.32 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.18 [2024-11-13 18:57:13,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.69 | bwd: 3839.86 | bwd_inner: 3832.32 | bwd_allreduce: 7.49 | step: 21.19 2%|▏ | 992/50750 [2:14:35<81:49:46, 5.92s/it] {'loss': 0.2355, 'learning_rate': 2.6053841103086014e-05, 'epoch': 0.98} 2%|▏ | 992/50750 [2:14:35<81:49:46, 5.92s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:57:19,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:57:19,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.43 | bwd_microstep: 3837.51 | bwd_inner_microstep: 3830.00 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.28 [2024-11-13 18:57:19,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.43 | bwd: 3837.53 | bwd_inner: 3830.00 | bwd_allreduce: 7.49 | step: 21.28 2%|▏ | 993/50750 [2:14:41<81:47:48, 5.92s/it] {'loss': 0.0206, 'learning_rate': 2.6080105055810904e-05, 'epoch': 0.98} 2%|▏ | 993/50750 [2:14:41<81:47:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:57:25,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:57:25,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.39 | bwd_microstep: 3841.98 | bwd_inner_microstep: 3834.38 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.69 [2024-11-13 18:57:25,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.39 | bwd: 3841.99 | bwd_inner: 3834.38 | bwd_allreduce: 7.56 | step: 21.70 2%|▏ | 994/50750 [2:14:47<81:46:08, 5.92s/it] {'loss': 0.0095, 'learning_rate': 2.6106369008535785e-05, 'epoch': 0.98} 2%|▏ | 994/50750 [2:14:47<81:46:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:57:31,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:57:31,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.74 | bwd_microstep: 3840.53 | bwd_inner_microstep: 3833.01 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.30 [2024-11-13 
18:57:31,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.74 | bwd: 3840.55 | bwd_inner: 3833.01 | bwd_allreduce: 7.50 | step: 21.30 2%|▏ | 995/50750 [2:14:53<81:44:59, 5.91s/it] {'loss': 0.0626, 'learning_rate': 2.6132632961260672e-05, 'epoch': 0.98} 2%|▏ | 995/50750 [2:14:53<81:44:59, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:57:37,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 18:57:37,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.18 | bwd_microstep: 3838.25 | bwd_inner_microstep: 3830.72 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.29 [2024-11-13 18:57:37,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.18 | bwd: 3838.26 | bwd_inner: 3830.72 | bwd_allreduce: 7.50 | step: 21.30 2%|▏ | 996/50750 [2:14:59<81:43:42, 5.91s/it] {'loss': 0.011, 'learning_rate': 2.6158896913985556e-05, 'epoch': 0.98} 2%|▏ | 996/50750 [2:14:59<81:43:42, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:57:43,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:57:43,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.25 | bwd_microstep: 3847.69 | bwd_inner_microstep: 3840.08 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.36 [2024-11-13 18:57:43,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3847.70 | bwd_inner: 3840.08 | bwd_allreduce: 7.57 | step: 21.36 2%|▏ | 997/50750 [2:15:05<81:46:57, 5.92s/it] {'loss': 0.0071, 'learning_rate': 2.6185160866710443e-05, 'epoch': 0.98} 2%|▏ | 997/50750 [2:15:05<81:46:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:57:49,042] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:57:49,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.51 | bwd_microstep: 3848.29 | bwd_inner_microstep: 3840.74 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.59 [2024-11-13 18:57:49,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.47 | bwd: 3848.31 | bwd_inner: 3840.74 | bwd_allreduce: 7.52 | step: 21.59 2%|▏ | 998/50750 [2:15:11<81:48:54, 5.92s/it] {'loss': 0.2181, 'learning_rate': 2.6211424819435326e-05, 'epoch': 0.98} 2%|▏ | 998/50750 [2:15:11<81:48:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:57:54,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 18:57:54,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.62 | bwd_microstep: 3845.11 | bwd_inner_microstep: 3836.98 | bwd_allreduce_microstep: 8.03 | step_microstep: 27.58 [2024-11-13 18:57:54,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.62 | bwd: 3845.14 | bwd_inner: 3836.98 | bwd_allreduce: 8.08 | step: 27.57 2%|▏ | 999/50750 [2:15:16<81:51:19, 5.92s/it] {'loss': 0.0149, 'learning_rate': 2.623768877216021e-05, 'epoch': 0.98} 2%|▏ | 999/50750 [2:15:16<81:51:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:58:00,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 18:58:00,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.05 | bwd_microstep: 3848.04 | bwd_inner_microstep: 3840.16 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.00 [2024-11-13 18:58:00,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.04 | bwd: 3848.06 | 
bwd_inner: 3840.16 | bwd_allreduce: 7.85 | step: 22.01 2%|▏ | 1000/50750 [2:15:22<81:52:46, 5.92s/it] {'loss': 0.1481, 'learning_rate': 2.6263952724885097e-05, 'epoch': 0.99} 2%|▏ | 1000/50750 [2:15:22<81:52:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:58:06,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:58:06,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.36 | bwd_microstep: 3852.85 | bwd_inner_microstep: 3845.29 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.07 [2024-11-13 18:58:06,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.35 | bwd: 3852.86 | bwd_inner: 3845.29 | bwd_allreduce: 7.53 | step: 21.07 2%|▏ | 1001/50750 [2:15:28<81:55:59, 5.93s/it] {'loss': 0.0071, 'learning_rate': 2.629021667760998e-05, 'epoch': 0.99} 2%|▏ | 1001/50750 [2:15:28<81:55:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:58:12,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.36 | optimizer_step: 4.93 [2024-11-13 18:58:12,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.19 | bwd_microstep: 3845.98 | bwd_inner_microstep: 3838.08 | bwd_allreduce_microstep: 7.85 | step_microstep: 24.18 [2024-11-13 18:58:12,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.16 | bwd: 3845.99 | bwd_inner: 3838.08 | bwd_allreduce: 7.87 | step: 24.19 2%|▏ | 1002/50750 [2:15:34<81:57:11, 5.93s/it] {'loss': 0.042, 'learning_rate': 2.631648063033487e-05, 'epoch': 0.99} 2%|▏ | 1002/50750 [2:15:34<81:57:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:58:18,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | 
optimizer_step: 4.92 [2024-11-13 18:58:18,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.70 | bwd_microstep: 3857.91 | bwd_inner_microstep: 3849.78 | bwd_allreduce_microstep: 8.08 | step_microstep: 21.91 [2024-11-13 18:58:18,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.69 | bwd: 3857.93 | bwd_inner: 3849.78 | bwd_allreduce: 8.10 | step: 21.91 2%|▏ | 1003/50750 [2:15:40<81:59:34, 5.93s/it] {'loss': 0.0173, 'learning_rate': 2.6342744583059752e-05, 'epoch': 0.99} 2%|▏ | 1003/50750 [2:15:40<81:59:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:58:24,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 [2024-11-13 18:58:24,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.51 | bwd_microstep: 3847.57 | bwd_inner_microstep: 3839.55 | bwd_allreduce_microstep: 7.95 | step_microstep: 22.89 [2024-11-13 18:58:24,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.49 | bwd: 3847.59 | bwd_inner: 3839.55 | bwd_allreduce: 7.98 | step: 22.89 2%|▏ | 1004/50750 [2:15:46<81:58:39, 5.93s/it] {'loss': 0.0966, 'learning_rate': 2.636900853578464e-05, 'epoch': 0.99} 2%|▏ | 1004/50750 [2:15:46<81:58:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 18:58:30,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.92 [2024-11-13 18:58:30,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.50 | bwd_microstep: 3848.02 | bwd_inner_microstep: 3840.01 | bwd_allreduce_microstep: 7.96 | step_microstep: 22.69 [2024-11-13 18:58:30,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.48 | bwd: 3848.04 | bwd_inner: 3840.01 | bwd_allreduce: 7.98 | step: 22.69 2%|▏ | 1005/50750 [2:15:52<82:00:22, 
5.93s/it] {'loss': 0.0954, 'learning_rate': 2.6395272488509523e-05, 'epoch': 0.99} 2%|▏ | 1005/50750 [2:15:52<82:00:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:58:36,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 18:58:36,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.48 | bwd_microstep: 3851.37 | bwd_inner_microstep: 3843.25 | bwd_allreduce_microstep: 8.06 | step_microstep: 25.61 [2024-11-13 18:58:36,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.47 | bwd: 3851.40 | bwd_inner: 3843.25 | bwd_allreduce: 8.09 | step: 25.61 2%|▏ | 1006/50750 [2:15:58<82:04:46, 5.94s/it] {'loss': 0.7392, 'learning_rate': 2.642153644123441e-05, 'epoch': 0.99} 2%|▏ | 1006/50750 [2:15:58<82:04:46, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:58:42,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 18:58:42,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.40 | bwd_microstep: 3847.56 | bwd_inner_microstep: 3840.06 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.91 [2024-11-13 18:58:42,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.39 | bwd: 3847.57 | bwd_inner: 3840.06 | bwd_allreduce: 7.46 | step: 20.91 2%|▏ | 1007/50750 [2:16:04<82:00:44, 5.94s/it] {'loss': 0.0194, 'learning_rate': 2.6447800393959294e-05, 'epoch': 0.99} 2%|▏ | 1007/50750 [2:16:04<82:00:44, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:58:48,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 18:58:48,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2023.69 | bwd_microstep: 3845.53 | bwd_inner_microstep: 3838.02 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.44 [2024-11-13 18:58:48,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.69 | bwd: 3845.55 | bwd_inner: 3838.02 | bwd_allreduce: 7.49 | step: 21.44 2%|▏ | 1008/50750 [2:16:10<81:56:04, 5.93s/it] {'loss': 0.0026, 'learning_rate': 2.6474064346684178e-05, 'epoch': 0.99} 2%|▏ | 1008/50750 [2:16:10<81:56:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:58:54,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:58:54,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.44 | bwd_microstep: 3848.96 | bwd_inner_microstep: 3841.48 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.08 [2024-11-13 18:58:54,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.44 | bwd: 3848.97 | bwd_inner: 3841.48 | bwd_allreduce: 7.45 | step: 21.08 2%|▏ | 1009/50750 [2:16:16<81:53:37, 5.93s/it] {'loss': 0.7289, 'learning_rate': 2.6500328299409065e-05, 'epoch': 0.99} 2%|▏ | 1009/50750 [2:16:16<81:53:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 18:59:00,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 18:59:00,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.52 | bwd_microstep: 3847.42 | bwd_inner_microstep: 3839.87 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.43 [2024-11-13 18:59:00,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.52 | bwd: 3847.44 | bwd_inner: 3839.87 | bwd_allreduce: 7.53 | step: 21.43 2%|▏ | 1010/50750 [2:16:22<81:52:22, 5.93s/it] {'loss': 0.0145, 'learning_rate': 2.652659225213395e-05, 'epoch': 1.0} 2%|▏ | 1010/50750 
[2:16:22<81:52:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 18:59:06,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 5.09 [2024-11-13 18:59:06,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.02 | bwd_microstep: 3848.32 | bwd_inner_microstep: 3840.60 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.12 [2024-11-13 18:59:06,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.02 | bwd: 3848.34 | bwd_inner: 3840.60 | bwd_allreduce: 7.69 | step: 22.12 2%|▏ | 1011/50750 [2:16:28<81:52:00, 5.93s/it] {'loss': 0.727, 'learning_rate': 2.6552856204858836e-05, 'epoch': 1.0} 2%|▏ | 1011/50750 [2:16:28<81:52:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 18:59:12,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 18:59:12,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.85 | bwd_microstep: 3851.46 | bwd_inner_microstep: 3843.94 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-13 18:59:12,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.83 | bwd: 3851.47 | bwd_inner: 3843.94 | bwd_allreduce: 7.49 | step: 21.11 2%|▏ | 1012/50750 [2:16:34<81:53:07, 5.93s/it] {'loss': 0.239, 'learning_rate': 2.6579120157583716e-05, 'epoch': 1.0} 2%|▏ | 1012/50750 [2:16:34<81:53:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 18:59:17,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:59:17,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.05 | bwd_microstep: 3839.65 | bwd_inner_microstep: 3832.16 | 
bwd_allreduce_microstep: 7.45 | step_microstep: 20.89 [2024-11-13 18:59:17,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.05 | bwd: 3839.66 | bwd_inner: 3832.16 | bwd_allreduce: 7.46 | step: 20.89 2%|▏ | 1013/50750 [2:16:39<81:48:32, 5.92s/it] {'loss': 0.2949, 'learning_rate': 2.6605384110308607e-05, 'epoch': 1.0} 2%|▏ | 1013/50750 [2:16:39<81:48:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 18:59:23,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 18:59:23,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.14 | bwd_microstep: 3848.06 | bwd_inner_microstep: 3840.41 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.07 [2024-11-13 18:59:23,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.14 | bwd: 3848.08 | bwd_inner: 3840.41 | bwd_allreduce: 7.63 | step: 21.07 2%|▏ | 1014/50750 [2:16:45<81:48:52, 5.92s/it] {'loss': 0.01, 'learning_rate': 2.6631648063033487e-05, 'epoch': 1.0} 2%|▏ | 1014/50750 [2:16:45<81:48:52, 5.92s/it]evaluate! 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 A
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B
Results saved to qa_abcd_lora.csv
Accuracy: 0.9015748031496063
New best accuracy: 0.9015748031496063. Saving model...
[INFO|trainer.py:2936] 2024-11-13 19:35:03,894 >> Saving model checkpoint to work_dirs/QA2/qa_abcd_lora
[INFO|configuration_utils.py:473] 2024-11-13 19:35:03,895 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/config.json
[INFO|configuration_utils.py:594] 2024-11-13 19:35:03,896 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/generation_config.json
[INFO|modeling_utils.py:2501] 2024-11-13 19:35:46,267 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/QA2/qa_abcd_lora/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-11-13 19:35:46,268 >> tokenizer config file saved in work_dirs/QA2/qa_abcd_lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-11-13 19:35:46,268 >> Special tokens file saved in work_dirs/QA2/qa_abcd_lora/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-11-13 19:35:46,268 >> added tokens file saved in work_dirs/QA2/qa_abcd_lora/added_tokens.json
11/13/2024 19:35:48 - INFO - __main__ - Saved LoRA weights to work_dirs/QA2/qa_abcd_lora/lora_weights.pth
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
Replace train sampler!!
petrel_client is not installed. Using PIL to load images.
dynamic ViT batch size: 16, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:35:52,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 19:35:52,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1004.52 | bwd_microstep: 1904.97 | bwd_inner_microstep: 1897.19 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.41 [2024-11-13 19:35:52,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1004.49 | bwd: 1904.99 | bwd_inner: 1897.19 | bwd_allreduce: 7.75 | step: 22.42 2%|▏ | 1015/50750 [2:53:14<9128:13:20, 660.73s/it] {'loss': 0.002, 'learning_rate': 2.6657912015758374e-05, 'epoch': 1.0} 2%|▏ | 1015/50750 [2:53:14<9128:13:20, 660.73s/it][2024-11-13 19:35:55,330] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-13 19:36:00,523] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-13 19:36:05,610] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-13 19:36:10,621] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:36:29,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:36:29,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1996.55 | bwd_microstep: 3812.86 | bwd_inner_microstep: 3805.32 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.54 [2024-11-13 19:36:29,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1996.52 | bwd: 3812.88 | bwd_inner: 3805.32 | bwd_allreduce: 7.52 | step: 21.55 2%|▏ | 1016/50750 [2:53:51<6544:08:08, 473.70s/it] {'loss': 0.0053, 'learning_rate': 2.6684175968483258e-05, 'epoch': 
1.0} 2%|▏ | 1016/50750 [2:53:51<6544:08:08, 473.70s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:36:35,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-13 19:36:35,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2005.61 | bwd_microstep: 3817.45 | bwd_inner_microstep: 3809.73 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.17 [2024-11-13 19:36:35,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2005.59 | bwd: 3817.46 | bwd_inner: 3809.73 | bwd_allreduce: 7.69 | step: 22.17 2%|▏ | 1017/50750 [2:53:57<4605:09:01, 333.35s/it] {'loss': 0.0784, 'learning_rate': 2.6710439921208142e-05, 'epoch': 1.0} 2%|▏ | 1017/50750 [2:53:57<4605:09:01, 333.35s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:36:41,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:36:41,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2010.30 | bwd_microstep: 3836.08 | bwd_inner_microstep: 3828.56 | bwd_allreduce_microstep: 7.47 | step_microstep: 22.10 [2024-11-13 19:36:41,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2010.28 | bwd: 3836.09 | bwd_inner: 3828.56 | bwd_allreduce: 7.49 | step: 22.10 2%|▏ | 1018/50750 [2:54:03<3247:59:28, 235.12s/it] {'loss': 0.0017, 'learning_rate': 2.673670387393303e-05, 'epoch': 1.0} 2%|▏ | 1018/50750 [2:54:03<3247:59:28, 235.12s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:36:47,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 19:36:47,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2016.37 | bwd_microstep: 3846.53 | 
bwd_inner_microstep: 3838.02 | bwd_allreduce_microstep: 8.46 | step_microstep: 22.90 [2024-11-13 19:36:47,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2016.37 | bwd: 3846.54 | bwd_inner: 3838.02 | bwd_allreduce: 8.48 | step: 22.91 2%|▏ | 1019/50750 [2:54:09<2298:04:34, 166.36s/it] {'loss': 0.0012, 'learning_rate': 2.6762967826657913e-05, 'epoch': 1.0} 2%|▏ | 1019/50750 [2:54:09<2298:04:34, 166.36s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:36:53,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 19:36:53,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.15 | bwd_microstep: 3847.43 | bwd_inner_microstep: 3839.74 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.65 [2024-11-13 19:36:53,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.12 | bwd: 3847.44 | bwd_inner: 3839.74 | bwd_allreduce: 7.67 | step: 21.65 2%|▏ | 1020/50750 [2:54:15<1633:08:03, 118.22s/it] {'loss': 0.0014, 'learning_rate': 2.67892317793828e-05, 'epoch': 1.0} 2%|▏ | 1020/50750 [2:54:15<1633:08:03, 118.22s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:36:59,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 19:36:59,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.00 | bwd_microstep: 3853.87 | bwd_inner_microstep: 3844.27 | bwd_allreduce_microstep: 9.55 | step_microstep: 22.10 [2024-11-13 19:36:59,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.99 | bwd: 3853.88 | bwd_inner: 3844.28 | bwd_allreduce: 9.57 | step: 22.10 2%|▏ | 1021/50750 [2:54:21<1167:44:20, 84.54s/it] {'loss': 0.8943, 'learning_rate': 2.6815495732107684e-05, 'epoch': 1.01} 2%|▏ | 1021/50750 [2:54:21<1167:44:20, 84.54s/it]dynamic 
ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:37:05,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 19:37:05,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.06 | bwd_microstep: 3852.87 | bwd_inner_microstep: 3845.37 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-13 19:37:05,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.03 | bwd: 3852.88 | bwd_inner: 3845.37 | bwd_allreduce: 7.48 | step: 20.95 2%|▏ | 1022/50750 [2:54:27<841:58:59, 60.95s/it] {'loss': 0.5029, 'learning_rate': 2.684175968483257e-05, 'epoch': 1.01} 2%|▏ | 1022/50750 [2:54:27<841:58:59, 60.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:37:11,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:37:11,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.47 | bwd_microstep: 3852.78 | bwd_inner_microstep: 3844.99 | bwd_allreduce_microstep: 7.73 | step_microstep: 22.54 [2024-11-13 19:37:11,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.47 | bwd: 3852.80 | bwd_inner: 3844.99 | bwd_allreduce: 7.75 | step: 22.54 2%|▏ | 1023/50750 [2:54:33<613:57:33, 44.45s/it] {'loss': 0.0006, 'learning_rate': 2.6868023637557455e-05, 'epoch': 1.01} 2%|▏ | 1023/50750 [2:54:33<613:57:33, 44.45s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:37:17,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:37:17,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.06 | bwd_microstep: 3852.73 | bwd_inner_microstep: 3845.23 | bwd_allreduce_microstep: 7.46 | step_microstep: 
21.19 [2024-11-13 19:37:17,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.06 | bwd: 3852.75 | bwd_inner: 3845.23 | bwd_allreduce: 7.48 | step: 21.20 2%|▏ | 1024/50750 [2:54:39<454:20:14, 32.89s/it] {'loss': 0.1646, 'learning_rate': 2.689428759028234e-05, 'epoch': 1.01} 2%|▏ | 1024/50750 [2:54:39<454:20:14, 32.89s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:37:23,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 19:37:23,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.55 | bwd_microstep: 3866.84 | bwd_inner_microstep: 3859.15 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.93 [2024-11-13 19:37:23,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.54 | bwd: 3866.85 | bwd_inner: 3859.15 | bwd_allreduce: 7.66 | step: 21.94 2%|▏ | 1025/50750 [2:54:45<342:41:14, 24.81s/it] {'loss': 0.6901, 'learning_rate': 2.6920551543007226e-05, 'epoch': 1.01} 2%|▏ | 1025/50750 [2:54:45<342:41:14, 24.81s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:37:29,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:37:29,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.55 | bwd_microstep: 3826.94 | bwd_inner_microstep: 3819.38 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.51 [2024-11-13 19:37:29,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.53 | bwd: 3826.96 | bwd_inner: 3819.38 | bwd_allreduce: 7.54 | step: 21.52 2%|▏ | 1026/50750 [2:54:50<264:23:25, 19.14s/it] {'loss': 0.0151, 'learning_rate': 2.694681549573211e-05, 'epoch': 1.01} 2%|▏ | 1026/50750 [2:54:50<264:23:25, 19.14s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-13 
19:37:34,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-13 19:37:34,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.65 | bwd_microstep: 3845.45 | bwd_inner_microstep: 3837.83 | bwd_allreduce_microstep: 7.58 | step_microstep: 22.12 [2024-11-13 19:37:34,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.64 | bwd: 3845.46 | bwd_inner: 3837.83 | bwd_allreduce: 7.59 | step: 22.13 2%|▏ | 1027/50750 [2:54:56<209:38:09, 15.18s/it] {'loss': 0.3884, 'learning_rate': 2.6973079448456996e-05, 'epoch': 1.01} 2%|▏ | 1027/50750 [2:54:56<209:38:09, 15.18s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:37:40,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:37:40,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.52 | bwd_microstep: 3837.58 | bwd_inner_microstep: 3830.00 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.63 [2024-11-13 19:37:40,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.50 | bwd: 3837.59 | bwd_inner: 3830.00 | bwd_allreduce: 7.55 | step: 21.64 2%|▏ | 1028/50750 [2:55:02<171:15:39, 12.40s/it] {'loss': 0.4029, 'learning_rate': 2.699934340118188e-05, 'epoch': 1.01} 2%|▏ | 1028/50750 [2:55:02<171:15:39, 12.40s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:37:46,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 19:37:46,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.04 | bwd_microstep: 3847.35 | bwd_inner_microstep: 3839.52 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.69 [2024-11-13 19:37:46,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 2020.04 | bwd: 3847.36 | bwd_inner: 3839.52 | bwd_allreduce: 7.80 | step: 21.70 2%|▏ | 1029/50750 [2:55:08<144:25:03, 10.46s/it] {'loss': 0.017, 'learning_rate': 2.7025607353906767e-05, 'epoch': 1.01} 2%|▏ | 1029/50750 [2:55:08<144:25:03, 10.46s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:37:52,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 19:37:52,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.69 | bwd_microstep: 3840.63 | bwd_inner_microstep: 3832.61 | bwd_allreduce_microstep: 7.97 | step_microstep: 22.56 [2024-11-13 19:37:52,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3840.65 | bwd_inner: 3832.61 | bwd_allreduce: 7.99 | step: 22.56 2%|▏ | 1030/50750 [2:55:14<125:37:57, 9.10s/it] {'loss': 0.1551, 'learning_rate': 2.705187130663165e-05, 'epoch': 1.01} 2%|▏ | 1030/50750 [2:55:14<125:37:57, 9.10s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:37:58,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 19:37:58,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.97 | bwd_microstep: 3841.48 | bwd_inner_microstep: 3834.00 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.79 [2024-11-13 19:37:58,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.95 | bwd: 3841.49 | bwd_inner: 3834.00 | bwd_allreduce: 7.45 | step: 20.80 2%|▏ | 1031/50750 [2:55:20<112:28:29, 8.14s/it] {'loss': 0.0104, 'learning_rate': 2.7078135259356538e-05, 'epoch': 1.02} 2%|▏ | 1031/50750 [2:55:20<112:28:29, 8.14s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:38:04,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.93 [2024-11-13 19:38:04,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.25 | bwd_microstep: 3847.21 | bwd_inner_microstep: 3839.55 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.83 [2024-11-13 19:38:04,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.25 | bwd: 3847.22 | bwd_inner: 3839.55 | bwd_allreduce: 7.63 | step: 21.83 2%|▏ | 1032/50750 [2:55:26<103:16:17, 7.48s/it] {'loss': 0.0273, 'learning_rate': 2.710439921208142e-05, 'epoch': 1.02} 2%|▏ | 1032/50750 [2:55:26<103:16:17, 7.48s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:38:10,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-13 19:38:10,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.81 | bwd_microstep: 3838.97 | bwd_inner_microstep: 3831.21 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.25 [2024-11-13 19:38:10,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.79 | bwd: 3838.99 | bwd_inner: 3831.22 | bwd_allreduce: 7.73 | step: 22.26 2%|▏ | 1033/50750 [2:55:32<96:48:13, 7.01s/it] {'loss': 0.0046, 'learning_rate': 2.7130663164806302e-05, 'epoch': 1.02} 2%|▏ | 1033/50750 [2:55:32<96:48:13, 7.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:38:16,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:38:16,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.37 | bwd_microstep: 3838.32 | bwd_inner_microstep: 3830.79 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.08 [2024-11-13 19:38:16,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.37 | bwd: 3838.33 | bwd_inner: 3830.79 | bwd_allreduce: 7.50 | step: 21.08 2%|▏ | 
1034/50750 [2:55:38<92:17:23, 6.68s/it] {'loss': 0.0336, 'learning_rate': 2.715692711753119e-05, 'epoch': 1.02} 2%|▏ | 1034/50750 [2:55:38<92:17:23, 6.68s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:38:22,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:38:22,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.26 | bwd_microstep: 3864.34 | bwd_inner_microstep: 3856.76 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.78 [2024-11-13 19:38:22,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.26 | bwd: 3864.36 | bwd_inner: 3856.76 | bwd_allreduce: 7.55 | step: 21.78 2%|▏ | 1035/50750 [2:55:44<89:11:43, 6.46s/it] {'loss': 0.0334, 'learning_rate': 2.7183191070256073e-05, 'epoch': 1.02} 2%|▏ | 1035/50750 [2:55:44<89:11:43, 6.46s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:38:28,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 19:38:28,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.83 | bwd_microstep: 3836.17 | bwd_inner_microstep: 3828.65 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.93 [2024-11-13 19:38:28,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.83 | bwd: 3836.18 | bwd_inner: 3828.65 | bwd_allreduce: 7.50 | step: 21.94 2%|▏ | 1036/50750 [2:55:50<86:54:47, 6.29s/it] {'loss': 0.0031, 'learning_rate': 2.720945502298096e-05, 'epoch': 1.02} 2%|▏ | 1036/50750 [2:55:50<86:54:47, 6.29s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:38:34,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:38:34,158] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.67 | bwd_microstep: 3850.03 | bwd_inner_microstep: 3842.28 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.02 [2024-11-13 19:38:34,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.66 | bwd: 3850.05 | bwd_inner: 3842.28 | bwd_allreduce: 7.72 | step: 22.02 2%|▏ | 1037/50750 [2:55:56<85:25:54, 6.19s/it] {'loss': 0.0035, 'learning_rate': 2.7235718975705844e-05, 'epoch': 1.02} 2%|▏ | 1037/50750 [2:55:56<85:25:54, 6.19s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:38:40,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:38:40,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.05 | bwd_microstep: 3843.53 | bwd_inner_microstep: 3836.05 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.00 [2024-11-13 19:38:40,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.04 | bwd: 3843.55 | bwd_inner: 3836.05 | bwd_allreduce: 7.46 | step: 21.00 2%|▏ | 1038/50750 [2:56:02<84:18:21, 6.11s/it] {'loss': 0.4447, 'learning_rate': 2.726198292843073e-05, 'epoch': 1.02} 2%|▏ | 1038/50750 [2:56:02<84:18:21, 6.11s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:38:46,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 19:38:46,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.75 | bwd_microstep: 3841.13 | bwd_inner_microstep: 3833.39 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.82 [2024-11-13 19:38:46,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.75 | bwd: 3841.14 | bwd_inner: 3833.39 | bwd_allreduce: 7.71 | step: 21.83 2%|▏ | 1039/50750 [2:56:07<83:32:10, 6.05s/it] {'loss': 0.016, 'learning_rate': 
2.7288246881155615e-05, 'epoch': 1.02} 2%|▏ | 1039/50750 [2:56:07<83:32:10, 6.05s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:38:51,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:38:51,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.28 | bwd_microstep: 3845.16 | bwd_inner_microstep: 3837.56 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.77 [2024-11-13 19:38:51,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.27 | bwd: 3845.17 | bwd_inner: 3837.56 | bwd_allreduce: 7.57 | step: 21.78 2%|▏ | 1040/50750 [2:56:13<82:59:35, 6.01s/it] {'loss': 0.0095, 'learning_rate': 2.7314510833880502e-05, 'epoch': 1.02} 2%|▏ | 1040/50750 [2:56:13<82:59:35, 6.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:38:57,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 19:38:57,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.02 | bwd_microstep: 3839.88 | bwd_inner_microstep: 3832.30 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.04 [2024-11-13 19:38:57,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.02 | bwd: 3839.90 | bwd_inner: 3832.30 | bwd_allreduce: 7.56 | step: 21.04 2%|▏ | 1041/50750 [2:56:19<82:34:28, 5.98s/it] {'loss': 0.0009, 'learning_rate': 2.7340774786605386e-05, 'epoch': 1.03} 2%|▏ | 1041/50750 [2:56:19<82:34:28, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:39:03,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:39:03,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.14 | bwd_microstep: 
3843.32 | bwd_inner_microstep: 3835.81 | bwd_allreduce_microstep: 7.47 | step_microstep: 22.02 [2024-11-13 19:39:03,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3843.34 | bwd_inner: 3835.81 | bwd_allreduce: 7.49 | step: 22.02 2%|▏ | 1042/50750 [2:56:25<82:18:12, 5.96s/it] {'loss': 0.0007, 'learning_rate': 2.736703873933027e-05, 'epoch': 1.03} 2%|▏ | 1042/50750 [2:56:25<82:18:12, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:39:09,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:39:09,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.43 | bwd_microstep: 3847.31 | bwd_inner_microstep: 3839.74 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.63 [2024-11-13 19:39:09,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.43 | bwd: 3847.32 | bwd_inner: 3839.74 | bwd_allreduce: 7.54 | step: 21.63 2%|▏ | 1043/50750 [2:56:31<82:08:13, 5.95s/it] {'loss': 0.0061, 'learning_rate': 2.7393302692055157e-05, 'epoch': 1.03} 2%|▏ | 1043/50750 [2:56:31<82:08:13, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:39:15,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 19:39:15,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.01 | bwd_microstep: 3842.27 | bwd_inner_microstep: 3834.76 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.30 [2024-11-13 19:39:15,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.01 | bwd: 3842.29 | bwd_inner: 3834.76 | bwd_allreduce: 7.49 | step: 21.30 2%|▏ | 1044/50750 [2:56:37<82:00:58, 5.94s/it] {'loss': 0.5722, 'learning_rate': 2.741956664478004e-05, 'epoch': 1.03} 2%|▏ | 1044/50750 [2:56:37<82:00:58, 5.94s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:39:21,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:39:21,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.98 | bwd_microstep: 3854.14 | bwd_inner_microstep: 3845.35 | bwd_allreduce_microstep: 8.74 | step_microstep: 21.59 [2024-11-13 19:39:21,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.98 | bwd: 3854.15 | bwd_inner: 3845.35 | bwd_allreduce: 8.76 | step: 21.59 2%|▏ | 1045/50750 [2:56:43<81:57:32, 5.94s/it] {'loss': 0.0317, 'learning_rate': 2.7445830597504928e-05, 'epoch': 1.03} 2%|▏ | 1045/50750 [2:56:43<81:57:32, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:39:27,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:39:27,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.19 | bwd_microstep: 3845.80 | bwd_inner_microstep: 3837.97 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.15 [2024-11-13 19:39:27,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.18 | bwd: 3845.81 | bwd_inner: 3837.97 | bwd_allreduce: 7.80 | step: 21.16 2%|▏ | 1046/50750 [2:56:49<81:55:14, 5.93s/it] {'loss': 0.1457, 'learning_rate': 2.7472094550229812e-05, 'epoch': 1.03} 2%|▏ | 1046/50750 [2:56:49<81:55:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:39:33,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:39:33,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.38 | bwd_microstep: 3847.90 | bwd_inner_microstep: 3840.39 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.13 
[2024-11-13 19:39:33,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.36 | bwd: 3847.91 | bwd_inner: 3840.39 | bwd_allreduce: 7.49 | step: 21.14 2%|▏ | 1047/50750 [2:56:55<81:51:52, 5.93s/it] {'loss': 0.0011, 'learning_rate': 2.74983585029547e-05, 'epoch': 1.03} 2%|▏ | 1047/50750 [2:56:55<81:51:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:39:39,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:39:39,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.73 | bwd_microstep: 3846.05 | bwd_inner_microstep: 3838.44 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.76 [2024-11-13 19:39:39,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.73 | bwd: 3846.07 | bwd_inner: 3838.44 | bwd_allreduce: 7.58 | step: 21.76 2%|▏ | 1048/50750 [2:57:01<81:50:09, 5.93s/it] {'loss': 0.0028, 'learning_rate': 2.7524622455679583e-05, 'epoch': 1.03} 2%|▏ | 1048/50750 [2:57:01<81:50:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:39:45,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:39:45,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.86 | bwd_microstep: 3849.30 | bwd_inner_microstep: 3841.78 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.20 [2024-11-13 19:39:45,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.86 | bwd: 3849.31 | bwd_inner: 3841.78 | bwd_allreduce: 7.49 | step: 21.21 2%|▏ | 1049/50750 [2:57:07<81:47:50, 5.92s/it] {'loss': 0.1037, 'learning_rate': 2.755088640840447e-05, 'epoch': 1.03} 2%|▏ | 1049/50750 [2:57:07<81:47:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:39:51,131] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:39:51,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.42 | bwd_microstep: 3850.44 | bwd_inner_microstep: 3842.76 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.05 [2024-11-13 19:39:51,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.42 | bwd: 3850.45 | bwd_inner: 3842.76 | bwd_allreduce: 7.65 | step: 21.06 2%|▏ | 1050/50750 [2:57:13<81:47:23, 5.92s/it] {'loss': 0.91, 'learning_rate': 2.7577150361129354e-05, 'epoch': 1.03} 2%|▏ | 1050/50750 [2:57:13<81:47:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:39:57,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 19:39:57,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.51 | bwd_microstep: 3849.10 | bwd_inner_microstep: 3841.58 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-13 19:39:57,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.51 | bwd: 3849.11 | bwd_inner: 3841.58 | bwd_allreduce: 7.49 | step: 21.10 2%|▏ | 1051/50750 [2:57:19<81:46:03, 5.92s/it] {'loss': 0.3977, 'learning_rate': 2.7603414313854234e-05, 'epoch': 1.04} 2%|▏ | 1051/50750 [2:57:19<81:46:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:40:02,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-13 19:40:02,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.52 | bwd_microstep: 3847.06 | bwd_inner_microstep: 3837.69 | bwd_allreduce_microstep: 9.33 | step_microstep: 22.26 [2024-11-13 19:40:02,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.50 | bwd: 3847.08 
| bwd_inner: 3837.69 | bwd_allreduce: 9.34 | step: 22.26 2%|▏ | 1052/50750 [2:57:24<81:49:54, 5.93s/it] {'loss': 0.5668, 'learning_rate': 2.7629678266579125e-05, 'epoch': 1.04} 2%|▏ | 1052/50750 [2:57:24<81:49:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:40:08,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:40:08,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.89 | bwd_microstep: 3850.75 | bwd_inner_microstep: 3843.22 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.40 [2024-11-13 19:40:08,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.87 | bwd: 3850.76 | bwd_inner: 3843.22 | bwd_allreduce: 7.50 | step: 21.41 2%|▏ | 1053/50750 [2:57:30<81:50:10, 5.93s/it] {'loss': 0.0066, 'learning_rate': 2.7655942219304005e-05, 'epoch': 1.04} 2%|▏ | 1053/50750 [2:57:30<81:50:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 19:40:14,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 19:40:14,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.24 | bwd_microstep: 3850.94 | bwd_inner_microstep: 3843.26 | bwd_allreduce_microstep: 7.63 | step_microstep: 22.88 [2024-11-13 19:40:14,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.24 | bwd: 3850.95 | bwd_inner: 3843.26 | bwd_allreduce: 7.65 | step: 22.89 2%|▏ | 1054/50750 [2:57:36<81:53:09, 5.93s/it] {'loss': 0.3283, 'learning_rate': 2.7682206172028892e-05, 'epoch': 1.04} 2%|▏ | 1054/50750 [2:57:36<81:53:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:40:20,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | 
optimizer_step: 4.93 [2024-11-13 19:40:20,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.84 | bwd_microstep: 3848.52 | bwd_inner_microstep: 3840.97 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.06 [2024-11-13 19:40:20,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.83 | bwd: 3848.54 | bwd_inner: 3840.97 | bwd_allreduce: 7.52 | step: 21.06 2%|▏ | 1055/50750 [2:57:42<81:51:11, 5.93s/it] {'loss': 0.0006, 'learning_rate': 2.7708470124753776e-05, 'epoch': 1.04} 2%|▏ | 1055/50750 [2:57:42<81:51:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:40:26,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:40:26,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 3853.26 | bwd_inner_microstep: 3845.38 | bwd_allreduce_microstep: 7.84 | step_microstep: 21.87 [2024-11-13 19:40:26,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.03 | bwd: 3853.28 | bwd_inner: 3845.38 | bwd_allreduce: 7.85 | step: 21.87 2%|▏ | 1056/50750 [2:57:48<81:50:26, 5.93s/it] {'loss': 0.33, 'learning_rate': 2.7734734077478663e-05, 'epoch': 1.04} 2%|▏ | 1056/50750 [2:57:48<81:50:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:40:32,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:40:32,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.69 | bwd_microstep: 3848.59 | bwd_inner_microstep: 3841.06 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.90 [2024-11-13 19:40:32,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.69 | bwd: 3848.61 | bwd_inner: 3841.06 | bwd_allreduce: 7.51 | step: 20.90 2%|▏ | 1057/50750 [2:57:54<81:50:20, 
5.93s/it] {'loss': 0.0091, 'learning_rate': 2.7760998030203547e-05, 'epoch': 1.04} 2%|▏ | 1057/50750 [2:57:54<81:50:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:40:38,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:40:38,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.44 | bwd_microstep: 3846.53 | bwd_inner_microstep: 3839.05 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.84 [2024-11-13 19:40:38,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.44 | bwd: 3846.54 | bwd_inner: 3839.05 | bwd_allreduce: 7.45 | step: 20.84 2%|▏ | 1058/50750 [2:58:00<81:49:39, 5.93s/it] {'loss': 0.0349, 'learning_rate': 2.7787261982928434e-05, 'epoch': 1.04} 2%|▏ | 1058/50750 [2:58:00<81:49:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:40:44,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:40:44,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.67 | bwd_microstep: 3848.27 | bwd_inner_microstep: 3840.75 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.29 [2024-11-13 19:40:44,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.67 | bwd: 3848.29 | bwd_inner: 3840.75 | bwd_allreduce: 7.50 | step: 21.29 2%|▏ | 1059/50750 [2:58:06<81:48:12, 5.93s/it] {'loss': 0.0048, 'learning_rate': 2.7813525935653318e-05, 'epoch': 1.04} 2%|▏ | 1059/50750 [2:58:06<81:48:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:40:50,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:40:50,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 2030.01 | bwd_microstep: 3850.74 | bwd_inner_microstep: 3842.83 | bwd_allreduce_microstep: 7.86 | step_microstep: 21.78 [2024-11-13 19:40:50,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.01 | bwd: 3850.75 | bwd_inner: 3842.83 | bwd_allreduce: 7.88 | step: 21.78 2%|▏ | 1060/50750 [2:58:12<81:49:16, 5.93s/it] {'loss': 0.6565, 'learning_rate': 2.78397898883782e-05, 'epoch': 1.04} 2%|▏ | 1060/50750 [2:58:12<81:49:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:40:56,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:40:56,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.81 | bwd_microstep: 3849.57 | bwd_inner_microstep: 3842.08 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.51 [2024-11-13 19:40:56,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.81 | bwd: 3849.59 | bwd_inner: 3842.08 | bwd_allreduce: 7.47 | step: 21.51 2%|▏ | 1061/50750 [2:58:18<81:51:05, 5.93s/it] {'loss': 0.0082, 'learning_rate': 2.786605384110309e-05, 'epoch': 1.05} 2%|▏ | 1061/50750 [2:58:18<81:51:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:41:02,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-13 19:41:02,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.89 | bwd_microstep: 3849.79 | bwd_inner_microstep: 3842.27 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.57 [2024-11-13 19:41:02,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.87 | bwd: 3849.81 | bwd_inner: 3842.27 | bwd_allreduce: 7.49 | step: 21.59 2%|▏ | 1062/50750 [2:58:24<81:51:17, 5.93s/it] {'loss': 0.0912, 'learning_rate': 2.7892317793827972e-05, 'epoch': 1.05} 2%|▏ | 1062/50750 
[2:58:24<81:51:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:41:08,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 19:41:08,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.98 | bwd_microstep: 3851.29 | bwd_inner_microstep: 3843.67 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.84 [2024-11-13 19:41:08,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.97 | bwd: 3851.31 | bwd_inner: 3843.67 | bwd_allreduce: 7.60 | step: 21.85 2%|▏ | 1063/50750 [2:58:30<81:50:37, 5.93s/it] {'loss': 0.0358, 'learning_rate': 2.791858174655286e-05, 'epoch': 1.05} 2%|▏ | 1063/50750 [2:58:30<81:50:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:41:14,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 19:41:14,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.76 | bwd_microstep: 3850.83 | bwd_inner_microstep: 3843.22 | bwd_allreduce_microstep: 7.57 | step_microstep: 20.98 [2024-11-13 19:41:14,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.76 | bwd: 3850.84 | bwd_inner: 3843.22 | bwd_allreduce: 7.58 | step: 20.98 2%|▏ | 1064/50750 [2:58:36<81:48:33, 5.93s/it] {'loss': 0.0589, 'learning_rate': 2.7944845699277743e-05, 'epoch': 1.05} 2%|▏ | 1064/50750 [2:58:36<81:48:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:41:20,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.49 | optimizer_step: 4.93 [2024-11-13 19:41:20,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.22 | bwd_microstep: 3844.26 | bwd_inner_microstep: 3836.49 | 
bwd_allreduce_microstep: 7.71 | step_microstep: 25.17 [2024-11-13 19:41:20,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.22 | bwd: 3844.28 | bwd_inner: 3836.49 | bwd_allreduce: 7.73 | step: 25.18 2%|▏ | 1065/50750 [2:58:42<81:46:41, 5.93s/it] {'loss': 0.0062, 'learning_rate': 2.797110965200263e-05, 'epoch': 1.05} 2%|▏ | 1065/50750 [2:58:42<81:46:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:41:25,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.92 [2024-11-13 19:41:25,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.26 | bwd_microstep: 3845.40 | bwd_inner_microstep: 3837.48 | bwd_allreduce_microstep: 7.87 | step_microstep: 22.86 [2024-11-13 19:41:25,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.26 | bwd: 3845.42 | bwd_inner: 3837.48 | bwd_allreduce: 7.89 | step: 22.86 2%|▏ | 1066/50750 [2:58:47<81:45:49, 5.92s/it] {'loss': 0.0052, 'learning_rate': 2.7997373604727514e-05, 'epoch': 1.05} 2%|▏ | 1066/50750 [2:58:47<81:45:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:41:31,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:41:31,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.75 | bwd_microstep: 3850.82 | bwd_inner_microstep: 3841.99 | bwd_allreduce_microstep: 8.78 | step_microstep: 21.65 [2024-11-13 19:41:31,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.75 | bwd: 3850.83 | bwd_inner: 3841.99 | bwd_allreduce: 8.80 | step: 21.65 2%|▏ | 1067/50750 [2:58:53<81:46:44, 5.93s/it] {'loss': 0.0152, 'learning_rate': 2.8023637557452398e-05, 'epoch': 1.05} 2%|▏ | 1067/50750 [2:58:53<81:46:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2198 [2024-11-13 19:41:37,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:41:37,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.20 | bwd_microstep: 3842.70 | bwd_inner_microstep: 3835.16 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.22 [2024-11-13 19:41:37,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3842.72 | bwd_inner: 3835.16 | bwd_allreduce: 7.52 | step: 21.23 2%|▏ | 1068/50750 [2:58:59<81:44:15, 5.92s/it] {'loss': 0.0748, 'learning_rate': 2.8049901510177285e-05, 'epoch': 1.05} 2%|▏ | 1068/50750 [2:58:59<81:44:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:41:43,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:41:43,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.63 | bwd_microstep: 3849.12 | bwd_inner_microstep: 3841.59 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.34 [2024-11-13 19:41:43,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.63 | bwd: 3849.13 | bwd_inner: 3841.59 | bwd_allreduce: 7.50 | step: 21.34 2%|▏ | 1069/50750 [2:59:05<81:43:34, 5.92s/it] {'loss': 0.0069, 'learning_rate': 2.807616546290217e-05, 'epoch': 1.05} 2%|▏ | 1069/50750 [2:59:05<81:43:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:41:49,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.92 [2024-11-13 19:41:49,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.82 | bwd_microstep: 3852.26 | bwd_inner_microstep: 3844.34 | bwd_allreduce_microstep: 7.87 | step_microstep: 21.76 [2024-11-13 19:41:49,670] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3852.28 | bwd_inner: 3844.34 | bwd_allreduce: 7.89 | step: 21.76 2%|▏ | 1070/50750 [2:59:11<81:44:53, 5.92s/it] {'loss': 0.2294, 'learning_rate': 2.8102429415627056e-05, 'epoch': 1.05} 2%|▏ | 1070/50750 [2:59:11<81:44:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:41:55,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 19:41:55,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.87 | bwd_microstep: 3849.72 | bwd_inner_microstep: 3842.17 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.81 [2024-11-13 19:41:55,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.87 | bwd: 3849.73 | bwd_inner: 3842.17 | bwd_allreduce: 7.52 | step: 21.81 2%|▏ | 1071/50750 [2:59:17<81:44:01, 5.92s/it] {'loss': 0.2965, 'learning_rate': 2.8128693368351936e-05, 'epoch': 1.06} 2%|▏ | 1071/50750 [2:59:17<81:44:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:42:01,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 19:42:01,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.56 | bwd_microstep: 3854.77 | bwd_inner_microstep: 3847.17 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.77 [2024-11-13 19:42:01,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.55 | bwd: 3854.79 | bwd_inner: 3847.17 | bwd_allreduce: 7.58 | step: 21.77 2%|▏ | 1072/50750 [2:59:23<81:45:23, 5.92s/it] {'loss': 0.4195, 'learning_rate': 2.8154957321076827e-05, 'epoch': 1.06} 2%|▏ | 1072/50750 [2:59:23<81:45:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:42:07,450] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.12 [2024-11-13 19:42:07,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.16 | bwd_microstep: 3848.66 | bwd_inner_microstep: 3841.15 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.44 [2024-11-13 19:42:07,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.15 | bwd: 3848.68 | bwd_inner: 3841.15 | bwd_allreduce: 7.49 | step: 21.45 2%|▏ | 1073/50750 [2:59:29<81:46:40, 5.93s/it] {'loss': 0.0003, 'learning_rate': 2.8181221273801707e-05, 'epoch': 1.06} 2%|▏ | 1073/50750 [2:59:29<81:46:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:42:13,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.94 [2024-11-13 19:42:13,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.70 | bwd_microstep: 3852.48 | bwd_inner_microstep: 3844.68 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.44 [2024-11-13 19:42:13,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.70 | bwd: 3852.49 | bwd_inner: 3844.68 | bwd_allreduce: 7.76 | step: 21.44 2%|▏ | 1074/50750 [2:59:35<81:47:56, 5.93s/it] {'loss': 0.0026, 'learning_rate': 2.8207485226526598e-05, 'epoch': 1.06} 2%|▏ | 1074/50750 [2:59:35<81:47:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:42:19,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.52 | optimizer_step: 4.93 [2024-11-13 19:42:19,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.80 | bwd_microstep: 3849.80 | bwd_inner_microstep: 3842.05 | bwd_allreduce_microstep: 7.69 | step_microstep: 28.57 [2024-11-13 19:42:19,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.80 | bwd: 3849.81 | bwd_inner: 3842.05 | 
bwd_allreduce: 7.71 | step: 28.57 2%|▏ | 1075/50750 [2:59:41<81:48:04, 5.93s/it] {'loss': 0.0023, 'learning_rate': 2.823374917925148e-05, 'epoch': 1.06} 2%|▏ | 1075/50750 [2:59:41<81:48:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:42:25,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:42:25,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.56 | bwd_microstep: 3851.68 | bwd_inner_microstep: 3843.92 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.60 [2024-11-13 19:42:25,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.56 | bwd: 3851.70 | bwd_inner: 3843.92 | bwd_allreduce: 7.73 | step: 22.60 2%|▏ | 1076/50750 [2:59:47<81:47:49, 5.93s/it] {'loss': 0.0013, 'learning_rate': 2.8260013131976362e-05, 'epoch': 1.06} 2%|▏ | 1076/50750 [2:59:47<81:47:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 19:42:31,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 19:42:31,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.81 | bwd_microstep: 3840.27 | bwd_inner_microstep: 3832.82 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.82 [2024-11-13 19:42:31,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.80 | bwd: 3840.28 | bwd_inner: 3832.82 | bwd_allreduce: 7.43 | step: 20.82 2%|▏ | 1077/50750 [2:59:53<81:46:04, 5.93s/it] {'loss': 0.0001, 'learning_rate': 2.828627708470125e-05, 'epoch': 1.06} 2%|▏ | 1077/50750 [2:59:53<81:46:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:42:37,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 
[2024-11-13 19:42:37,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.43 | bwd_microstep: 3855.20 | bwd_inner_microstep: 3847.46 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.99 [2024-11-13 19:42:37,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.43 | bwd: 3855.21 | bwd_inner: 3847.46 | bwd_allreduce: 7.71 | step: 22.00 2%|▏ | 1078/50750 [2:59:59<81:49:31, 5.93s/it] {'loss': 0.6588, 'learning_rate': 2.8312541037426133e-05, 'epoch': 1.06} 2%|▏ | 1078/50750 [2:59:59<81:49:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:42:43,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:42:43,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.74 | bwd_microstep: 3842.95 | bwd_inner_microstep: 3835.49 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.76 [2024-11-13 19:42:43,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.72 | bwd: 3842.96 | bwd_inner: 3835.49 | bwd_allreduce: 7.43 | step: 20.77 2%|▏ | 1079/50750 [3:00:04<81:48:49, 5.93s/it] {'loss': 0.0493, 'learning_rate': 2.833880499015102e-05, 'epoch': 1.06} 2%|▏ | 1079/50750 [3:00:04<81:48:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:42:48,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 19:42:48,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.61 | bwd_microstep: 3842.47 | bwd_inner_microstep: 3834.90 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.63 [2024-11-13 19:42:48,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.61 | bwd: 3842.49 | bwd_inner: 3834.90 | bwd_allreduce: 7.55 | step: 21.64 2%|▏ | 1080/50750 [3:00:10<81:45:46, 5.93s/it] {'loss': 0.1673, 
'learning_rate': 2.8365068942875904e-05, 'epoch': 1.06} 2%|▏ | 1080/50750 [3:00:10<81:45:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:42:54,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:42:54,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.09 | bwd_microstep: 3846.15 | bwd_inner_microstep: 3838.58 | bwd_allreduce_microstep: 7.53 | step_microstep: 22.03 [2024-11-13 19:42:54,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.08 | bwd: 3846.17 | bwd_inner: 3838.58 | bwd_allreduce: 7.55 | step: 22.03 2%|▏ | 1081/50750 [3:00:16<81:46:20, 5.93s/it] {'loss': 0.0056, 'learning_rate': 2.839133289560079e-05, 'epoch': 1.07} 2%|▏ | 1081/50750 [3:00:16<81:46:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:43:00,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 19:43:00,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.79 | bwd_microstep: 3844.30 | bwd_inner_microstep: 3836.66 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.63 [2024-11-13 19:43:00,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.78 | bwd: 3844.31 | bwd_inner: 3836.66 | bwd_allreduce: 7.61 | step: 21.64 2%|▏ | 1082/50750 [3:00:22<81:45:34, 5.93s/it] {'loss': 0.0275, 'learning_rate': 2.8417596848325675e-05, 'epoch': 1.07} 2%|▏ | 1082/50750 [3:00:22<81:45:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:43:06,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.38 | optimizer_step: 4.93 [2024-11-13 19:43:06,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.61 | 
bwd_microstep: 3859.51 | bwd_inner_microstep: 3851.52 | bwd_allreduce_microstep: 7.94 | step_microstep: 22.43 [2024-11-13 19:43:06,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.56 | bwd: 3859.52 | bwd_inner: 3851.52 | bwd_allreduce: 7.96 | step: 22.44 2%|▏ | 1083/50750 [3:00:28<81:50:49, 5.93s/it] {'loss': 0.2152, 'learning_rate': 2.8443860801050562e-05, 'epoch': 1.07} 2%|▏ | 1083/50750 [3:00:28<81:50:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:43:12,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.92 [2024-11-13 19:43:12,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.63 | bwd_microstep: 3848.51 | bwd_inner_microstep: 3840.83 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.45 [2024-11-13 19:43:12,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.61 | bwd: 3848.53 | bwd_inner: 3840.83 | bwd_allreduce: 7.66 | step: 21.45 2%|▏ | 1084/50750 [3:00:34<81:50:43, 5.93s/it] {'loss': 0.006, 'learning_rate': 2.8470124753775446e-05, 'epoch': 1.07} 2%|▏ | 1084/50750 [3:00:34<81:50:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:43:18,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 19:43:18,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.68 | bwd_microstep: 3845.49 | bwd_inner_microstep: 3838.02 | bwd_allreduce_microstep: 7.43 | step_microstep: 22.15 [2024-11-13 19:43:18,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.67 | bwd: 3845.50 | bwd_inner: 3838.02 | bwd_allreduce: 7.45 | step: 22.15 2%|▏ | 1085/50750 [3:00:40<81:48:24, 5.93s/it] {'loss': 0.0052, 'learning_rate': 2.849638870650033e-05, 'epoch': 1.07} 2%|▏ | 1085/50750 [3:00:40<81:48:24, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:43:24,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:43:24,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.20 | bwd_microstep: 3848.76 | bwd_inner_microstep: 3841.29 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.91 [2024-11-13 19:43:24,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.20 | bwd: 3848.77 | bwd_inner: 3841.29 | bwd_allreduce: 7.44 | step: 20.91 2%|▏ | 1086/50750 [3:00:46<81:46:52, 5.93s/it] {'loss': 0.0031, 'learning_rate': 2.8522652659225217e-05, 'epoch': 1.07} 2%|▏ | 1086/50750 [3:00:46<81:46:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:43:30,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 19:43:30,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.81 | bwd_microstep: 3842.84 | bwd_inner_microstep: 3834.67 | bwd_allreduce_microstep: 8.12 | step_microstep: 23.32 [2024-11-13 19:43:30,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.79 | bwd: 3842.86 | bwd_inner: 3834.67 | bwd_allreduce: 8.14 | step: 23.33 2%|▏ | 1087/50750 [3:00:52<81:47:30, 5.93s/it] {'loss': 0.0058, 'learning_rate': 2.85489166119501e-05, 'epoch': 1.07} 2%|▏ | 1087/50750 [3:00:52<81:47:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:43:36,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 19:43:36,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.09 | bwd_microstep: 3849.14 | bwd_inner_microstep: 3841.12 | bwd_allreduce_microstep: 7.98 | 
step_microstep: 21.71 [2024-11-13 19:43:36,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.08 | bwd: 3849.15 | bwd_inner: 3841.12 | bwd_allreduce: 8.00 | step: 21.72 2%|▏ | 1088/50750 [3:00:58<81:49:09, 5.93s/it] {'loss': 0.0026, 'learning_rate': 2.8575180564674988e-05, 'epoch': 1.07} 2%|▏ | 1088/50750 [3:00:58<81:49:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:43:42,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.58 | optimizer_step: 4.93 [2024-11-13 19:43:42,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.60 | bwd_microstep: 3860.47 | bwd_inner_microstep: 3852.27 | bwd_allreduce_microstep: 8.13 | step_microstep: 28.10 [2024-11-13 19:43:42,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.59 | bwd: 3860.49 | bwd_inner: 3852.27 | bwd_allreduce: 8.16 | step: 28.12 2%|▏ | 1089/50750 [3:01:04<81:54:04, 5.94s/it] {'loss': 0.0035, 'learning_rate': 2.860144451739987e-05, 'epoch': 1.07} 2%|▏ | 1089/50750 [3:01:04<81:54:04, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:43:48,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:43:48,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.11 | bwd_microstep: 3855.38 | bwd_inner_microstep: 3847.50 | bwd_allreduce_microstep: 7.83 | step_microstep: 20.99 [2024-11-13 19:43:48,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.10 | bwd: 3855.39 | bwd_inner: 3847.50 | bwd_allreduce: 7.85 | step: 21.00 2%|▏ | 1090/50750 [3:01:10<81:54:08, 5.94s/it] {'loss': 0.009, 'learning_rate': 2.862770847012476e-05, 'epoch': 1.07} 2%|▏ | 1090/50750 [3:01:10<81:54:08, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
19:43:54,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.64 | optimizer_step: 4.93 [2024-11-13 19:43:54,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.76 | bwd_microstep: 3849.29 | bwd_inner_microstep: 3841.68 | bwd_allreduce_microstep: 7.57 | step_microstep: 23.94 [2024-11-13 19:43:54,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.77 | bwd: 3849.30 | bwd_inner: 3841.68 | bwd_allreduce: 7.58 | step: 23.95 2%|▏ | 1091/50750 [3:01:16<81:55:08, 5.94s/it] {'loss': 0.0052, 'learning_rate': 2.8653972422849642e-05, 'epoch': 1.07} 2%|▏ | 1091/50750 [3:01:16<81:55:08, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:44:00,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 19:44:00,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.84 | bwd_microstep: 3850.54 | bwd_inner_microstep: 3843.05 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.39 [2024-11-13 19:44:00,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.84 | bwd: 3850.56 | bwd_inner: 3843.05 | bwd_allreduce: 7.47 | step: 21.39 2%|▏ | 1092/50750 [3:01:22<81:52:56, 5.94s/it] {'loss': 0.0107, 'learning_rate': 2.868023637557453e-05, 'epoch': 1.08} 2%|▏ | 1092/50750 [3:01:22<81:52:56, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:44:06,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:44:06,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.07 | bwd_microstep: 3858.54 | bwd_inner_microstep: 3850.91 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.34 [2024-11-13 19:44:06,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2028.07 | bwd: 3858.55 | bwd_inner: 3850.91 | bwd_allreduce: 7.60 | step: 21.34 2%|▏ | 1093/50750 [3:01:28<81:52:32, 5.94s/it] {'loss': 0.0132, 'learning_rate': 2.870650032829941e-05, 'epoch': 1.08} 2%|▏ | 1093/50750 [3:01:28<81:52:32, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:44:12,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:44:12,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.05 | bwd_microstep: 3855.92 | bwd_inner_microstep: 3848.39 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.40 [2024-11-13 19:44:12,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.05 | bwd: 3855.93 | bwd_inner: 3848.39 | bwd_allreduce: 7.50 | step: 21.40 2%|▏ | 1094/50750 [3:01:33<81:50:01, 5.93s/it] {'loss': 0.0033, 'learning_rate': 2.8732764281024294e-05, 'epoch': 1.08} 2%|▏ | 1094/50750 [3:01:33<81:50:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:44:17,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.94 [2024-11-13 19:44:17,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.44 | bwd_microstep: 3870.05 | bwd_inner_microstep: 3862.36 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.69 [2024-11-13 19:44:17,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3870.06 | bwd_inner: 3862.36 | bwd_allreduce: 7.66 | step: 22.69 2%|▏ | 1095/50750 [3:01:39<81:53:44, 5.94s/it] {'loss': 0.0004, 'learning_rate': 2.875902823374918e-05, 'epoch': 1.08} 2%|▏ | 1095/50750 [3:01:39<81:53:44, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:44:23,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 19:44:23,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.96 | bwd_microstep: 3845.92 | bwd_inner_microstep: 3838.25 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.36 [2024-11-13 19:44:23,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.94 | bwd: 3845.93 | bwd_inner: 3838.25 | bwd_allreduce: 7.64 | step: 21.36 2%|▏ | 1096/50750 [3:01:45<81:51:18, 5.93s/it] {'loss': 0.6638, 'learning_rate': 2.8785292186474065e-05, 'epoch': 1.08} 2%|▏ | 1096/50750 [3:01:45<81:51:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:44:29,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:44:29,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.71 | bwd_microstep: 3847.68 | bwd_inner_microstep: 3839.64 | bwd_allreduce_microstep: 7.97 | step_microstep: 27.52 [2024-11-13 19:44:29,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.70 | bwd: 3847.70 | bwd_inner: 3839.64 | bwd_allreduce: 8.00 | step: 27.52 2%|▏ | 1097/50750 [3:01:51<81:51:01, 5.93s/it] {'loss': 0.3144, 'learning_rate': 2.8811556139198952e-05, 'epoch': 1.08} 2%|▏ | 1097/50750 [3:01:51<81:51:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:44:35,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:44:35,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.33 | bwd_microstep: 3846.63 | bwd_inner_microstep: 3839.08 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.09 [2024-11-13 19:44:35,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.31 | bwd: 3846.64 | bwd_inner: 3839.08 | bwd_allreduce: 7.52 | step: 21.09 2%|▏ | 
1098/50750 [3:01:57<81:48:20, 5.93s/it] {'loss': 0.8291, 'learning_rate': 2.8837820091923836e-05, 'epoch': 1.08} 2%|▏ | 1098/50750 [3:01:57<81:48:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:44:41,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 19:44:41,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.53 | bwd_microstep: 3849.74 | bwd_inner_microstep: 3842.23 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.76 [2024-11-13 19:44:41,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.53 | bwd: 3849.75 | bwd_inner: 3842.23 | bwd_allreduce: 7.49 | step: 21.76 2%|▏ | 1099/50750 [3:02:03<81:45:52, 5.93s/it] {'loss': 0.0067, 'learning_rate': 2.8864084044648723e-05, 'epoch': 1.08} 2%|▏ | 1099/50750 [3:02:03<81:45:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:44:47,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 19:44:47,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.92 | bwd_microstep: 3849.45 | bwd_inner_microstep: 3841.80 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.72 [2024-11-13 19:44:47,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.92 | bwd: 3849.46 | bwd_inner: 3841.80 | bwd_allreduce: 7.63 | step: 21.72 2%|▏ | 1100/50750 [3:02:09<81:45:37, 5.93s/it] {'loss': 0.0035, 'learning_rate': 2.8890347997373606e-05, 'epoch': 1.08} 2%|▏ | 1100/50750 [3:02:09<81:45:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:44:53,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 19:44:53,533] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.56 | bwd_microstep: 3854.76 | bwd_inner_microstep: 3847.02 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.62 [2024-11-13 19:44:53,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.55 | bwd: 3854.77 | bwd_inner: 3847.02 | bwd_allreduce: 7.71 | step: 21.62 2%|▏ | 1101/50750 [3:02:15<81:46:57, 5.93s/it] {'loss': 0.9853, 'learning_rate': 2.8916611950098494e-05, 'epoch': 1.08} 2%|▏ | 1101/50750 [3:02:15<81:46:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:44:59,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:44:59,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1992.92 | bwd_microstep: 3784.59 | bwd_inner_microstep: 3776.87 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.06 [2024-11-13 19:44:59,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1992.92 | bwd: 3784.60 | bwd_inner: 3776.87 | bwd_allreduce: 7.69 | step: 22.06 2%|▏ | 1102/50750 [3:02:21<81:21:12, 5.90s/it] {'loss': 0.0015, 'learning_rate': 2.8942875902823377e-05, 'epoch': 1.09} 2%|▏ | 1102/50750 [3:02:21<81:21:12, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:45:05,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:45:05,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.39 | bwd_microstep: 3847.24 | bwd_inner_microstep: 3839.54 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.78 [2024-11-13 19:45:05,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.39 | bwd: 3847.25 | bwd_inner: 3839.54 | bwd_allreduce: 7.67 | step: 21.79 2%|▏ | 1103/50750 [3:02:27<81:26:31, 5.91s/it] {'loss': 0.0267, 'learning_rate': 
2.896913985554826e-05, 'epoch': 1.09} 2%|▏ | 1103/50750 [3:02:27<81:26:31, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:45:11,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:45:11,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.12 | bwd_microstep: 3864.13 | bwd_inner_microstep: 3855.78 | bwd_allreduce_microstep: 8.30 | step_microstep: 22.18 [2024-11-13 19:45:11,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.11 | bwd: 3864.14 | bwd_inner: 3855.78 | bwd_allreduce: 8.32 | step: 22.18 2%|▏ | 1104/50750 [3:02:33<81:36:39, 5.92s/it] {'loss': 0.0003, 'learning_rate': 2.8995403808273148e-05, 'epoch': 1.09} 2%|▏ | 1104/50750 [3:02:33<81:36:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:45:17,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 19:45:17,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.00 | bwd_microstep: 3858.52 | bwd_inner_microstep: 3851.01 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.12 [2024-11-13 19:45:17,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.00 | bwd: 3858.53 | bwd_inner: 3851.01 | bwd_allreduce: 7.47 | step: 21.12 2%|▏ | 1105/50750 [3:02:39<81:42:05, 5.92s/it] {'loss': 0.3125, 'learning_rate': 2.9021667760998032e-05, 'epoch': 1.09} 2%|▏ | 1105/50750 [3:02:39<81:42:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:45:23,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.92 [2024-11-13 19:45:23,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.02 | bwd_microstep: 
3849.94 | bwd_inner_microstep: 3842.41 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.29 [2024-11-13 19:45:23,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.02 | bwd: 3849.95 | bwd_inner: 3842.41 | bwd_allreduce: 7.50 | step: 21.29 2%|▏ | 1106/50750 [3:02:45<81:43:27, 5.93s/it] {'loss': 0.3825, 'learning_rate': 2.904793171372292e-05, 'epoch': 1.09} 2%|▏ | 1106/50750 [3:02:45<81:43:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:45:29,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:45:29,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.38 | bwd_microstep: 3856.44 | bwd_inner_microstep: 3848.35 | bwd_allreduce_microstep: 8.01 | step_microstep: 21.70 [2024-11-13 19:45:29,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.38 | bwd: 3856.47 | bwd_inner: 3848.35 | bwd_allreduce: 8.05 | step: 21.69 2%|▏ | 1107/50750 [3:02:50<81:44:09, 5.93s/it] {'loss': 0.0029, 'learning_rate': 2.9074195666447803e-05, 'epoch': 1.09} 2%|▏ | 1107/50750 [3:02:50<81:44:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:45:34,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:45:34,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.99 | bwd_microstep: 3847.12 | bwd_inner_microstep: 3838.19 | bwd_allreduce_microstep: 8.89 | step_microstep: 21.90 [2024-11-13 19:45:34,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.99 | bwd: 3847.14 | bwd_inner: 3838.19 | bwd_allreduce: 8.91 | step: 21.90 2%|▏ | 1108/50750 [3:02:56<81:43:31, 5.93s/it] {'loss': 0.2948, 'learning_rate': 2.910045961917269e-05, 'epoch': 1.09} 2%|▏ | 1108/50750 [3:02:56<81:43:31, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:45:40,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:45:40,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3854.80 | bwd_inner_microstep: 3847.27 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.44 [2024-11-13 19:45:40,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.18 | bwd: 3854.81 | bwd_inner: 3847.27 | bwd_allreduce: 7.50 | step: 21.44 2%|▏ | 1109/50750 [3:03:02<81:43:58, 5.93s/it] {'loss': 0.5171, 'learning_rate': 2.9126723571897574e-05, 'epoch': 1.09} 2%|▏ | 1109/50750 [3:03:02<81:43:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:45:46,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 19:45:46,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.17 | bwd_microstep: 3860.28 | bwd_inner_microstep: 3852.74 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.12 [2024-11-13 19:45:46,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.17 | bwd: 3860.29 | bwd_inner: 3852.74 | bwd_allreduce: 7.51 | step: 21.12 2%|▏ | 1110/50750 [3:03:08<81:45:44, 5.93s/it] {'loss': 0.7326, 'learning_rate': 2.9152987524622454e-05, 'epoch': 1.09} 2%|▏ | 1110/50750 [3:03:08<81:45:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:45:52,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 19:45:52,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | bwd_microstep: 3859.25 | bwd_inner_microstep: 3851.50 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.08 
[2024-11-13 19:45:52,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3859.27 | bwd_inner: 3851.50 | bwd_allreduce: 7.73 | step: 22.08 2%|▏ | 1111/50750 [3:03:14<81:46:56, 5.93s/it] {'loss': 0.0487, 'learning_rate': 2.9179251477347345e-05, 'epoch': 1.09} 2%|▏ | 1111/50750 [3:03:14<81:46:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:45:58,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 19:45:58,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.27 | bwd_microstep: 3861.05 | bwd_inner_microstep: 3853.49 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.90 [2024-11-13 19:45:58,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.26 | bwd: 3861.07 | bwd_inner: 3853.49 | bwd_allreduce: 7.54 | step: 21.90 2%|▏ | 1112/50750 [3:03:20<81:52:40, 5.94s/it] {'loss': 0.1092, 'learning_rate': 2.9205515430072225e-05, 'epoch': 1.1} 2%|▏ | 1112/50750 [3:03:20<81:52:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:46:04,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:46:04,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.59 | bwd_microstep: 3865.35 | bwd_inner_microstep: 3857.81 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.27 [2024-11-13 19:46:04,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.58 | bwd: 3865.36 | bwd_inner: 3857.81 | bwd_allreduce: 7.51 | step: 21.28 2%|▏ | 1113/50750 [3:03:26<81:53:40, 5.94s/it] {'loss': 0.0019, 'learning_rate': 2.9231779382797116e-05, 'epoch': 1.1} 2%|▏ | 1113/50750 [3:03:26<81:53:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:46:10,582] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 19:46:10,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.15 | bwd_microstep: 3856.38 | bwd_inner_microstep: 3848.79 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.96 [2024-11-13 19:46:10,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.16 | bwd: 3856.39 | bwd_inner: 3848.79 | bwd_allreduce: 7.56 | step: 21.97 2%|▏ | 1114/50750 [3:03:32<81:52:33, 5.94s/it] {'loss': 0.0135, 'learning_rate': 2.9258043335521996e-05, 'epoch': 1.1} 2%|▏ | 1114/50750 [3:03:32<81:52:33, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:46:16,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:46:16,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.29 | bwd_microstep: 3853.50 | bwd_inner_microstep: 3845.64 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.06 [2024-11-13 19:46:16,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.27 | bwd: 3853.51 | bwd_inner: 3845.64 | bwd_allreduce: 7.83 | step: 21.07 2%|▏ | 1115/50750 [3:03:38<81:49:59, 5.94s/it] {'loss': 0.001, 'learning_rate': 2.9284307288246883e-05, 'epoch': 1.1} 2%|▏ | 1115/50750 [3:03:38<81:49:59, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:46:22,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:46:22,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.59 | bwd_microstep: 3860.71 | bwd_inner_microstep: 3853.02 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.31 [2024-11-13 19:46:22,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.59 | bwd: 3860.72 
| bwd_inner: 3853.02 | bwd_allreduce: 7.66 | step: 21.32 2%|▏ | 1116/50750 [3:03:44<81:49:41, 5.94s/it] {'loss': 0.5145, 'learning_rate': 2.9310571240971767e-05, 'epoch': 1.1} 2%|▏ | 1116/50750 [3:03:44<81:49:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:46:28,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.99 [2024-11-13 19:46:28,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.68 | bwd_microstep: 3861.98 | bwd_inner_microstep: 3854.42 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.77 [2024-11-13 19:46:28,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.68 | bwd: 3862.00 | bwd_inner: 3854.42 | bwd_allreduce: 7.54 | step: 21.77 2%|▏ | 1117/50750 [3:03:50<81:50:57, 5.94s/it] {'loss': 0.0167, 'learning_rate': 2.9336835193696654e-05, 'epoch': 1.1} 2%|▏ | 1117/50750 [3:03:50<81:50:57, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:46:34,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:46:34,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.17 | bwd_microstep: 3855.51 | bwd_inner_microstep: 3847.99 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.23 [2024-11-13 19:46:34,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.16 | bwd: 3855.52 | bwd_inner: 3847.99 | bwd_allreduce: 7.49 | step: 21.23 2%|▏ | 1118/50750 [3:03:56<81:49:33, 5.94s/it] {'loss': 0.0136, 'learning_rate': 2.9363099146421538e-05, 'epoch': 1.1} 2%|▏ | 1118/50750 [3:03:56<81:49:33, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:46:40,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | 
optimizer_step: 4.93 [2024-11-13 19:46:40,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | bwd_microstep: 3856.73 | bwd_inner_microstep: 3849.20 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-13 19:46:40,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.92 | bwd: 3856.74 | bwd_inner: 3849.20 | bwd_allreduce: 7.50 | step: 21.12 2%|▏ | 1119/50750 [3:04:02<81:47:57, 5.93s/it] {'loss': 0.0018, 'learning_rate': 2.9389363099146422e-05, 'epoch': 1.1} 2%|▏ | 1119/50750 [3:04:02<81:47:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:46:46,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:46:46,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.14 | bwd_microstep: 3854.88 | bwd_inner_microstep: 3847.35 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-13 19:46:46,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.14 | bwd: 3854.89 | bwd_inner: 3847.35 | bwd_allreduce: 7.50 | step: 21.21 2%|▏ | 1120/50750 [3:04:08<81:46:59, 5.93s/it] {'loss': 0.4605, 'learning_rate': 2.941562705187131e-05, 'epoch': 1.1} 2%|▏ | 1120/50750 [3:04:08<81:46:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:46:52,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 19:46:52,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.66 | bwd_microstep: 3862.58 | bwd_inner_microstep: 3855.04 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.47 [2024-11-13 19:46:52,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.66 | bwd: 3862.60 | bwd_inner: 3855.04 | bwd_allreduce: 7.51 | step: 21.47 2%|▏ | 1121/50750 [3:04:14<81:48:06, 
5.93s/it] {'loss': 0.0084, 'learning_rate': 2.9441891004596193e-05, 'epoch': 1.1} 2%|▏ | 1121/50750 [3:04:14<81:48:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:46:58,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 19:46:58,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.11 | bwd_microstep: 3854.95 | bwd_inner_microstep: 3847.44 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.25 [2024-11-13 19:46:58,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.11 | bwd: 3854.97 | bwd_inner: 3847.44 | bwd_allreduce: 7.49 | step: 21.25 2%|▏ | 1122/50750 [3:04:20<81:45:56, 5.93s/it] {'loss': 0.0066, 'learning_rate': 2.946815495732108e-05, 'epoch': 1.11} 2%|▏ | 1122/50750 [3:04:20<81:45:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:47:03,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:47:03,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.86 | bwd_microstep: 3858.53 | bwd_inner_microstep: 3850.88 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.51 [2024-11-13 19:47:03,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.86 | bwd: 3858.54 | bwd_inner: 3850.88 | bwd_allreduce: 7.62 | step: 21.51 2%|▏ | 1123/50750 [3:04:25<81:47:03, 5.93s/it] {'loss': 0.0022, 'learning_rate': 2.9494418910045964e-05, 'epoch': 1.11} 2%|▏ | 1123/50750 [3:04:25<81:47:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:47:09,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 19:47:09,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2026.79 | bwd_microstep: 3855.86 | bwd_inner_microstep: 3848.34 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 19:47:09,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.79 | bwd: 3855.88 | bwd_inner: 3848.34 | bwd_allreduce: 7.50 | step: 21.13 2%|▏ | 1124/50750 [3:04:31<81:46:57, 5.93s/it] {'loss': 0.2054, 'learning_rate': 2.952068286277085e-05, 'epoch': 1.11} 2%|▏ | 1124/50750 [3:04:31<81:46:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:47:15,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:47:15,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.59 | bwd_microstep: 3856.49 | bwd_inner_microstep: 3848.96 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.31 [2024-11-13 19:47:15,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.59 | bwd: 3856.50 | bwd_inner: 3848.96 | bwd_allreduce: 7.51 | step: 21.31 2%|▏ | 1125/50750 [3:04:37<81:46:13, 5.93s/it] {'loss': 0.4757, 'learning_rate': 2.9546946815495735e-05, 'epoch': 1.11} 2%|▏ | 1125/50750 [3:04:37<81:46:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 19:47:21,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:47:21,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.28 | bwd_microstep: 3858.15 | bwd_inner_microstep: 3850.66 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.61 [2024-11-13 19:47:21,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.28 | bwd: 3858.16 | bwd_inner: 3850.66 | bwd_allreduce: 7.46 | step: 20.62 2%|▏ | 1126/50750 [3:04:43<81:47:16, 5.93s/it] {'loss': 0.002, 'learning_rate': 2.9573210768220622e-05, 'epoch': 1.11} 2%|▏ | 1126/50750 
[3:04:43<81:47:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:47:27,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.45 | optimizer_step: 4.93 [2024-11-13 19:47:27,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.93 | bwd_microstep: 3864.19 | bwd_inner_microstep: 3855.81 | bwd_allreduce_microstep: 8.32 | step_microstep: 26.04 [2024-11-13 19:47:27,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.93 | bwd: 3864.21 | bwd_inner: 3855.81 | bwd_allreduce: 8.35 | step: 26.04 2%|▏ | 1127/50750 [3:04:49<81:50:15, 5.94s/it] {'loss': 0.0097, 'learning_rate': 2.9599474720945505e-05, 'epoch': 1.11} 2%|▏ | 1127/50750 [3:04:49<81:50:15, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:47:33,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.48 | optimizer_step: 4.93 [2024-11-13 19:47:33,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.96 | bwd_microstep: 3864.00 | bwd_inner_microstep: 3855.65 | bwd_allreduce_microstep: 8.29 | step_microstep: 25.48 [2024-11-13 19:47:33,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.94 | bwd: 3864.02 | bwd_inner: 3855.65 | bwd_allreduce: 8.32 | step: 25.50 2%|▏ | 1128/50750 [3:04:55<81:57:02, 5.95s/it] {'loss': 0.0228, 'learning_rate': 2.962573867367039e-05, 'epoch': 1.11} 2%|▏ | 1128/50750 [3:04:55<81:57:02, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:47:39,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 19:47:39,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.70 | bwd_microstep: 3859.27 | bwd_inner_microstep: 3851.55 | 
bwd_allreduce_microstep: 7.67 | step_microstep: 22.11 [2024-11-13 19:47:39,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.69 | bwd: 3859.29 | bwd_inner: 3851.55 | bwd_allreduce: 7.69 | step: 22.12 2%|▏ | 1129/50750 [3:05:01<81:57:50, 5.95s/it] {'loss': 0.1059, 'learning_rate': 2.9652002626395276e-05, 'epoch': 1.11} 2%|▏ | 1129/50750 [3:05:01<81:57:50, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:47:45,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 19:47:45,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.83 | bwd_microstep: 3855.22 | bwd_inner_microstep: 3847.52 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.15 [2024-11-13 19:47:45,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.82 | bwd: 3855.23 | bwd_inner: 3847.52 | bwd_allreduce: 7.67 | step: 21.15 2%|▏ | 1130/50750 [3:05:07<81:54:57, 5.94s/it] {'loss': 0.0056, 'learning_rate': 2.967826657912016e-05, 'epoch': 1.11} 2%|▏ | 1130/50750 [3:05:07<81:54:57, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:47:51,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 19:47:51,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.85 | bwd_microstep: 3864.42 | bwd_inner_microstep: 3856.63 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.72 [2024-11-13 19:47:51,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.83 | bwd: 3864.44 | bwd_inner: 3856.63 | bwd_allreduce: 7.76 | step: 21.73 2%|▏ | 1131/50750 [3:05:13<81:55:57, 5.94s/it] {'loss': 0.0103, 'learning_rate': 2.9704530531845047e-05, 'epoch': 1.11} 2%|▏ | 1131/50750 [3:05:13<81:55:57, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2200 [2024-11-13 19:47:57,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:47:57,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.03 | bwd_microstep: 3861.90 | bwd_inner_microstep: 3854.34 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.28 [2024-11-13 19:47:57,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.02 | bwd: 3861.91 | bwd_inner: 3854.34 | bwd_allreduce: 7.53 | step: 21.28 2%|▏ | 1132/50750 [3:05:19<81:56:12, 5.94s/it] {'loss': 0.0075, 'learning_rate': 2.9730794484569928e-05, 'epoch': 1.12} 2%|▏ | 1132/50750 [3:05:19<81:56:12, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:48:03,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:48:03,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.53 | bwd_microstep: 3860.32 | bwd_inner_microstep: 3852.09 | bwd_allreduce_microstep: 8.18 | step_microstep: 23.73 [2024-11-13 19:48:03,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.53 | bwd: 3860.35 | bwd_inner: 3852.09 | bwd_allreduce: 8.20 | step: 23.73 2%|▏ | 1133/50750 [3:05:25<81:56:06, 5.94s/it] {'loss': 0.0106, 'learning_rate': 2.9757058437294818e-05, 'epoch': 1.12} 2%|▏ | 1133/50750 [3:05:25<81:56:06, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:48:09,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 19:48:09,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.95 | bwd_microstep: 3853.26 | bwd_inner_microstep: 3845.77 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.78 [2024-11-13 19:48:09,353] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.92 | bwd: 3853.28 | bwd_inner: 3845.77 | bwd_allreduce: 7.47 | step: 20.78 2%|▏ | 1134/50750 [3:05:31<81:55:06, 5.94s/it] {'loss': 0.1133, 'learning_rate': 2.97833223900197e-05, 'epoch': 1.12} 2%|▏ | 1134/50750 [3:05:31<81:55:06, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:48:15,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:48:15,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.48 | bwd_microstep: 3859.51 | bwd_inner_microstep: 3851.87 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.30 [2024-11-13 19:48:15,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.48 | bwd: 3859.52 | bwd_inner: 3851.87 | bwd_allreduce: 7.61 | step: 21.31 2%|▏ | 1135/50750 [3:05:37<81:52:21, 5.94s/it] {'loss': 0.0049, 'learning_rate': 2.980958634274459e-05, 'epoch': 1.12} 2%|▏ | 1135/50750 [3:05:37<81:52:21, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:48:21,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:48:21,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.41 | bwd_microstep: 3856.77 | bwd_inner_microstep: 3848.85 | bwd_allreduce_microstep: 7.87 | step_microstep: 21.51 [2024-11-13 19:48:21,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.40 | bwd: 3856.79 | bwd_inner: 3848.85 | bwd_allreduce: 7.89 | step: 21.52 2%|▏ | 1136/50750 [3:05:43<81:50:42, 5.94s/it] {'loss': 0.0005, 'learning_rate': 2.983585029546947e-05, 'epoch': 1.12} 2%|▏ | 1136/50750 [3:05:43<81:50:42, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:48:27,159] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:48:27,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.66 | bwd_microstep: 3858.88 | bwd_inner_microstep: 3850.92 | bwd_allreduce_microstep: 7.91 | step_microstep: 21.84 [2024-11-13 19:48:27,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.65 | bwd: 3858.89 | bwd_inner: 3850.92 | bwd_allreduce: 7.93 | step: 21.84 2%|▏ | 1137/50750 [3:05:49<81:52:09, 5.94s/it] {'loss': 0.0127, 'learning_rate': 2.9862114248194353e-05, 'epoch': 1.12} 2%|▏ | 1137/50750 [3:05:49<81:52:09, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:48:33,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 5.04 [2024-11-13 19:48:33,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.15 | bwd_microstep: 3863.30 | bwd_inner_microstep: 3855.11 | bwd_allreduce_microstep: 7.95 | step_microstep: 23.72 [2024-11-13 19:48:33,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.13 | bwd: 3863.30 | bwd_inner: 3855.11 | bwd_allreduce: 7.97 | step: 23.72 2%|▏ | 1138/50750 [3:05:55<81:54:36, 5.94s/it] {'loss': 0.0048, 'learning_rate': 2.988837820091924e-05, 'epoch': 1.12} 2%|▏ | 1138/50750 [3:05:55<81:54:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 19:48:39,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 19:48:39,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.15 | bwd_microstep: 3850.27 | bwd_inner_microstep: 3842.76 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.09 [2024-11-13 19:48:39,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.14 | bwd: 3850.28 | bwd_inner: 3842.76 | 
bwd_allreduce: 7.49 | step: 21.09 2%|▏ | 1139/50750 [3:06:01<81:51:14, 5.94s/it] {'loss': 0.0049, 'learning_rate': 2.9914642153644124e-05, 'epoch': 1.12} 2%|▏ | 1139/50750 [3:06:01<81:51:14, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:48:44,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 19:48:44,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.80 | bwd_microstep: 3848.76 | bwd_inner_microstep: 3840.72 | bwd_allreduce_microstep: 7.97 | step_microstep: 24.56 [2024-11-13 19:48:44,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.81 | bwd: 3848.78 | bwd_inner: 3840.72 | bwd_allreduce: 8.00 | step: 24.56 2%|▏ | 1140/50750 [3:06:06<81:47:58, 5.94s/it] {'loss': 0.2794, 'learning_rate': 2.994090610636901e-05, 'epoch': 1.12} 2%|▏ | 1140/50750 [3:06:06<81:47:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:48:50,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:48:50,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.89 | bwd_microstep: 3848.90 | bwd_inner_microstep: 3841.32 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.76 [2024-11-13 19:48:50,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.89 | bwd: 3848.91 | bwd_inner: 3841.32 | bwd_allreduce: 7.55 | step: 21.76 2%|▏ | 1141/50750 [3:06:12<81:45:11, 5.93s/it] {'loss': 0.0001, 'learning_rate': 2.9967170059093895e-05, 'epoch': 1.12} 2%|▏ | 1141/50750 [3:06:12<81:45:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:48:56,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 
[2024-11-13 19:48:56,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.09 | bwd_microstep: 3858.07 | bwd_inner_microstep: 3850.59 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.90 [2024-11-13 19:48:56,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.09 | bwd: 3858.09 | bwd_inner: 3850.59 | bwd_allreduce: 7.46 | step: 20.90 2%|▏ | 1142/50750 [3:06:18<81:44:50, 5.93s/it] {'loss': 0.002, 'learning_rate': 2.9993434011818782e-05, 'epoch': 1.13} 2%|▏ | 1142/50750 [3:06:18<81:44:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:49:02,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 19:49:02,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.39 | bwd_microstep: 3851.44 | bwd_inner_microstep: 3843.99 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.88 [2024-11-13 19:49:02,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.39 | bwd: 3851.45 | bwd_inner: 3843.99 | bwd_allreduce: 7.42 | step: 20.88 2%|▏ | 1143/50750 [3:06:24<81:42:44, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.0019697964543666e-05, 'epoch': 1.13} 2%|▏ | 1143/50750 [3:06:24<81:42:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:49:08,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 19:49:08,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.16 | bwd_microstep: 3848.02 | bwd_inner_microstep: 3840.53 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.16 [2024-11-13 19:49:08,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.13 | bwd: 3848.04 | bwd_inner: 3840.53 | bwd_allreduce: 7.47 | step: 21.17 2%|▏ | 1144/50750 [3:06:30<81:41:11, 5.93s/it] {'loss': 0.0033, 
'learning_rate': 3.0045961917268553e-05, 'epoch': 1.13} 2%|▏ | 1144/50750 [3:06:30<81:41:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:49:14,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:49:14,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.82 | bwd_microstep: 3850.29 | bwd_inner_microstep: 3842.74 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.94 [2024-11-13 19:49:14,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.82 | bwd: 3850.30 | bwd_inner: 3842.74 | bwd_allreduce: 7.52 | step: 21.94 2%|▏ | 1145/50750 [3:06:36<81:42:46, 5.93s/it] {'loss': 0.7572, 'learning_rate': 3.0072225869993437e-05, 'epoch': 1.13} 2%|▏ | 1145/50750 [3:06:36<81:42:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:49:20,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 19:49:20,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.46 | bwd_microstep: 3854.43 | bwd_inner_microstep: 3846.95 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.36 [2024-11-13 19:49:20,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.45 | bwd: 3854.44 | bwd_inner: 3846.95 | bwd_allreduce: 7.46 | step: 21.37 2%|▏ | 1146/50750 [3:06:42<81:43:31, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.009848982271832e-05, 'epoch': 1.13} 2%|▏ | 1146/50750 [3:06:42<81:43:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:49:26,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 19:49:26,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.26 | 
bwd_microstep: 3851.88 | bwd_inner_microstep: 3844.41 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.07 [2024-11-13 19:49:26,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.26 | bwd: 3851.89 | bwd_inner: 3844.41 | bwd_allreduce: 7.45 | step: 21.08 2%|▏ | 1147/50750 [3:06:48<81:41:49, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.0124753775443208e-05, 'epoch': 1.13} 2%|▏ | 1147/50750 [3:06:48<81:41:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:49:32,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 19:49:32,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.22 | bwd_microstep: 3857.16 | bwd_inner_microstep: 3849.66 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.05 [2024-11-13 19:49:32,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.22 | bwd: 3857.17 | bwd_inner: 3849.66 | bwd_allreduce: 7.48 | step: 21.05 2%|▏ | 1148/50750 [3:06:54<81:40:55, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.0151017728168092e-05, 'epoch': 1.13} 2%|▏ | 1148/50750 [3:06:54<81:40:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:49:38,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 19:49:38,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.69 | bwd_microstep: 3852.28 | bwd_inner_microstep: 3844.74 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.30 [2024-11-13 19:49:38,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.67 | bwd: 3852.30 | bwd_inner: 3844.74 | bwd_allreduce: 7.51 | step: 21.31 2%|▏ | 1149/50750 [3:07:00<81:41:48, 5.93s/it] {'loss': 1.3451, 'learning_rate': 3.017728168089298e-05, 'epoch': 1.13} 2%|▏ | 1149/50750 [3:07:00<81:41:48, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:49:44,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 19:49:44,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.34 | bwd_microstep: 3856.05 | bwd_inner_microstep: 3848.52 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.18 [2024-11-13 19:49:44,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.34 | bwd: 3856.07 | bwd_inner: 3848.52 | bwd_allreduce: 7.51 | step: 21.19 2%|▏ | 1150/50750 [3:07:06<81:42:35, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.0203545633617863e-05, 'epoch': 1.13} 2%|▏ | 1150/50750 [3:07:06<81:42:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:49:50,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 19:49:50,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.88 | bwd_microstep: 3859.80 | bwd_inner_microstep: 3852.20 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.64 [2024-11-13 19:49:50,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.87 | bwd: 3859.82 | bwd_inner: 3852.20 | bwd_allreduce: 7.58 | step: 21.64 2%|▏ | 1151/50750 [3:07:12<81:43:17, 5.93s/it] {'loss': 0.0039, 'learning_rate': 3.022980958634275e-05, 'epoch': 1.13} 2%|▏ | 1151/50750 [3:07:12<81:43:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:49:56,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:49:56,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.72 | bwd_microstep: 3858.79 | bwd_inner_microstep: 3851.31 | bwd_allreduce_microstep: 7.44 | 
step_microstep: 20.94 [2024-11-13 19:49:56,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.70 | bwd: 3858.80 | bwd_inner: 3851.31 | bwd_allreduce: 7.46 | step: 20.94 2%|▏ | 1152/50750 [3:07:18<81:43:40, 5.93s/it] {'loss': 1.1743, 'learning_rate': 3.025607353906763e-05, 'epoch': 1.13} 2%|▏ | 1152/50750 [3:07:18<81:43:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:50:02,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:50:02,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.30 | bwd_microstep: 3853.45 | bwd_inner_microstep: 3845.96 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.17 [2024-11-13 19:50:02,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.30 | bwd: 3853.47 | bwd_inner: 3845.96 | bwd_allreduce: 7.47 | step: 21.17 2%|▏ | 1153/50750 [3:07:24<81:41:42, 5.93s/it] {'loss': 0.0017, 'learning_rate': 3.028233749179252e-05, 'epoch': 1.14} 2%|▏ | 1153/50750 [3:07:24<81:41:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:50:07,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 19:50:07,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.36 | bwd_microstep: 3859.55 | bwd_inner_microstep: 3851.89 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.41 [2024-11-13 19:50:07,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.35 | bwd: 3859.56 | bwd_inner: 3851.89 | bwd_allreduce: 7.63 | step: 21.41 2%|▏ | 1154/50750 [3:07:29<81:43:20, 5.93s/it] {'loss': 0.0188, 'learning_rate': 3.03086014445174e-05, 'epoch': 1.14} 2%|▏ | 1154/50750 [3:07:29<81:43:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
19:50:13,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 19:50:13,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.99 | bwd_microstep: 3859.75 | bwd_inner_microstep: 3852.07 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.33 [2024-11-13 19:50:13,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.98 | bwd: 3859.76 | bwd_inner: 3852.07 | bwd_allreduce: 7.65 | step: 21.34 2%|▏ | 1155/50750 [3:07:35<81:43:32, 5.93s/it] {'loss': 0.4272, 'learning_rate': 3.0334865397242285e-05, 'epoch': 1.14} 2%|▏ | 1155/50750 [3:07:35<81:43:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:50:19,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 19:50:19,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.89 | bwd_microstep: 3851.04 | bwd_inner_microstep: 3843.56 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.00 [2024-11-13 19:50:19,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.87 | bwd: 3851.05 | bwd_inner: 3843.56 | bwd_allreduce: 7.45 | step: 21.00 2%|▏ | 1156/50750 [3:07:41<81:43:12, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.0361129349967172e-05, 'epoch': 1.14} 2%|▏ | 1156/50750 [3:07:41<81:43:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:50:25,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:50:25,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.51 | bwd_microstep: 3855.57 | bwd_inner_microstep: 3847.70 | bwd_allreduce_microstep: 7.83 | step_microstep: 21.43 [2024-11-13 19:50:25,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2028.51 | bwd: 3855.59 | bwd_inner: 3847.70 | bwd_allreduce: 7.85 | step: 21.43 2%|▏ | 1157/50750 [3:07:47<81:42:47, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.0387393302692056e-05, 'epoch': 1.14} 2%|▏ | 1157/50750 [3:07:47<81:42:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:50:31,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.94 [2024-11-13 19:50:31,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.52 | bwd_microstep: 3859.41 | bwd_inner_microstep: 3851.30 | bwd_allreduce_microstep: 8.05 | step_microstep: 22.66 [2024-11-13 19:50:31,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.51 | bwd: 3859.43 | bwd_inner: 3851.30 | bwd_allreduce: 8.07 | step: 22.66 2%|▏ | 1158/50750 [3:07:53<81:44:24, 5.93s/it] {'loss': 0.2683, 'learning_rate': 3.0413657255416943e-05, 'epoch': 1.14} 2%|▏ | 1158/50750 [3:07:53<81:44:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:50:37,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 19:50:37,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.92 | bwd_microstep: 3852.66 | bwd_inner_microstep: 3844.96 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.41 [2024-11-13 19:50:37,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.91 | bwd: 3852.68 | bwd_inner: 3844.96 | bwd_allreduce: 7.67 | step: 21.41 2%|▏ | 1159/50750 [3:07:59<81:42:39, 5.93s/it] {'loss': 0.6581, 'learning_rate': 3.0439921208141827e-05, 'epoch': 1.14} 2%|▏ | 1159/50750 [3:07:59<81:42:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:50:43,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.18 | optimizer_step: 5.05 [2024-11-13 19:50:43,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.75 | bwd_microstep: 3848.78 | bwd_inner_microstep: 3840.95 | bwd_allreduce_microstep: 7.77 | step_microstep: 25.00 [2024-11-13 19:50:43,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.74 | bwd: 3848.80 | bwd_inner: 3840.95 | bwd_allreduce: 7.79 | step: 25.01 2%|▏ | 1160/50750 [3:08:05<81:44:00, 5.93s/it] {'loss': 0.0059, 'learning_rate': 3.0466185160866714e-05, 'epoch': 1.14} 2%|▏ | 1160/50750 [3:08:05<81:44:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:50:49,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 19:50:49,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.85 | bwd_microstep: 3857.74 | bwd_inner_microstep: 3850.05 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.39 [2024-11-13 19:50:49,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.85 | bwd: 3857.76 | bwd_inner: 3850.05 | bwd_allreduce: 7.66 | step: 21.39 2%|▏ | 1161/50750 [3:08:11<81:43:55, 5.93s/it] {'loss': 0.3466, 'learning_rate': 3.0492449113591598e-05, 'epoch': 1.14} 2%|▏ | 1161/50750 [3:08:11<81:43:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:50:55,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:50:55,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 3855.44 | bwd_inner_microstep: 3847.70 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.35 [2024-11-13 19:50:55,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.02 | bwd: 3855.46 | bwd_inner: 3847.70 | bwd_allreduce: 7.71 | step: 21.36 2%|▏ | 
1162/50750 [3:08:17<81:42:20, 5.93s/it] {'loss': 0.0036, 'learning_rate': 3.051871306631648e-05, 'epoch': 1.14} 2%|▏ | 1162/50750 [3:08:17<81:42:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:51:01,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.54 | optimizer_step: 4.92 [2024-11-13 19:51:01,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.47 | bwd_microstep: 3856.39 | bwd_inner_microstep: 3848.51 | bwd_allreduce_microstep: 7.81 | step_microstep: 29.82 [2024-11-13 19:51:01,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.47 | bwd: 3856.41 | bwd_inner: 3848.51 | bwd_allreduce: 7.84 | step: 29.82 2%|▏ | 1163/50750 [3:08:23<81:45:33, 5.94s/it] {'loss': 0.0194, 'learning_rate': 3.0544977019041365e-05, 'epoch': 1.15} 2%|▏ | 1163/50750 [3:08:23<81:45:33, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:51:07,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:51:07,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.71 | bwd_microstep: 3850.46 | bwd_inner_microstep: 3842.83 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.17 [2024-11-13 19:51:07,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.69 | bwd: 3850.47 | bwd_inner: 3842.83 | bwd_allreduce: 7.60 | step: 21.18 2%|▏ | 1164/50750 [3:08:29<81:43:54, 5.93s/it] {'loss': 0.0191, 'learning_rate': 3.057124097176625e-05, 'epoch': 1.15} 2%|▏ | 1164/50750 [3:08:29<81:43:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:51:13,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:51:13,258] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.77 | bwd_microstep: 3850.83 | bwd_inner_microstep: 3843.20 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.10 [2024-11-13 19:51:13,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.76 | bwd: 3850.84 | bwd_inner: 3843.20 | bwd_allreduce: 7.60 | step: 21.10 2%|▏ | 1165/50750 [3:08:35<81:43:31, 5.93s/it] {'loss': 0.3337, 'learning_rate': 3.059750492449114e-05, 'epoch': 1.15} 2%|▏ | 1165/50750 [3:08:35<81:43:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:51:19,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.56 | optimizer_step: 4.93 [2024-11-13 19:51:19,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.53 | bwd_microstep: 3855.09 | bwd_inner_microstep: 3847.34 | bwd_allreduce_microstep: 7.70 | step_microstep: 28.39 [2024-11-13 19:51:19,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.53 | bwd: 3855.10 | bwd_inner: 3847.34 | bwd_allreduce: 7.72 | step: 28.38 2%|▏ | 1166/50750 [3:08:41<81:44:47, 5.94s/it] {'loss': 0.6619, 'learning_rate': 3.062376887721602e-05, 'epoch': 1.15} 2%|▏ | 1166/50750 [3:08:41<81:44:47, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:51:25,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 19:51:25,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.13 | bwd_microstep: 3853.98 | bwd_inner_microstep: 3846.42 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.36 [2024-11-13 19:51:25,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.13 | bwd: 3853.99 | bwd_inner: 3846.42 | bwd_allreduce: 7.54 | step: 21.37 2%|▏ | 1167/50750 [3:08:47<81:44:03, 5.93s/it] {'loss': 0.0025, 'learning_rate': 
3.065003282994091e-05, 'epoch': 1.15} 2%|▏ | 1167/50750 [3:08:47<81:44:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:51:31,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 19:51:31,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.75 | bwd_microstep: 3857.28 | bwd_inner_microstep: 3849.74 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.96 [2024-11-13 19:51:31,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.75 | bwd: 3857.30 | bwd_inner: 3849.74 | bwd_allreduce: 7.51 | step: 21.96 2%|▏ | 1168/50750 [3:08:53<81:43:20, 5.93s/it] {'loss': 0.0191, 'learning_rate': 3.0676296782665794e-05, 'epoch': 1.15} 2%|▏ | 1168/50750 [3:08:53<81:43:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:51:36,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:51:36,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.90 | bwd_microstep: 3851.73 | bwd_inner_microstep: 3844.16 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.36 [2024-11-13 19:51:36,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.90 | bwd: 3851.74 | bwd_inner: 3844.16 | bwd_allreduce: 7.54 | step: 21.36 2%|▏ | 1169/50750 [3:08:58<81:43:19, 5.93s/it] {'loss': 0.444, 'learning_rate': 3.070256073539068e-05, 'epoch': 1.15} 2%|▏ | 1169/50750 [3:08:58<81:43:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:51:42,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.92 [2024-11-13 19:51:42,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.31 | bwd_microstep: 
3851.74 | bwd_inner_microstep: 3843.91 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.63 [2024-11-13 19:51:42,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.30 | bwd: 3851.75 | bwd_inner: 3843.91 | bwd_allreduce: 7.80 | step: 21.63 2%|▏ | 1170/50750 [3:09:04<81:41:29, 5.93s/it] {'loss': 0.1035, 'learning_rate': 3.072882468811556e-05, 'epoch': 1.15} 2%|▏ | 1170/50750 [3:09:04<81:41:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:51:48,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:51:48,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.92 | bwd_microstep: 3849.37 | bwd_inner_microstep: 3841.56 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.69 [2024-11-13 19:51:48,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.90 | bwd: 3849.38 | bwd_inner: 3841.56 | bwd_allreduce: 7.78 | step: 21.70 2%|▏ | 1171/50750 [3:09:10<81:42:42, 5.93s/it] {'loss': 0.1454, 'learning_rate': 3.075508864084045e-05, 'epoch': 1.15} 2%|▏ | 1171/50750 [3:09:10<81:42:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:51:54,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.92 [2024-11-13 19:51:54,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.05 | bwd_microstep: 3853.23 | bwd_inner_microstep: 3845.33 | bwd_allreduce_microstep: 7.85 | step_microstep: 21.89 [2024-11-13 19:51:54,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.03 | bwd: 3853.25 | bwd_inner: 3845.33 | bwd_allreduce: 7.87 | step: 21.90 2%|▏ | 1172/50750 [3:09:16<81:42:29, 5.93s/it] {'loss': 0.0236, 'learning_rate': 3.0781352593565336e-05, 'epoch': 1.15} 2%|▏ | 1172/50750 [3:09:16<81:42:29, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:52:00,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 19:52:00,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.07 | bwd_microstep: 3848.49 | bwd_inner_microstep: 3840.99 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.19 [2024-11-13 19:52:00,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.05 | bwd: 3848.51 | bwd_inner: 3840.99 | bwd_allreduce: 7.47 | step: 21.19 2%|▏ | 1173/50750 [3:09:22<81:41:27, 5.93s/it] {'loss': 0.0026, 'learning_rate': 3.0807616546290216e-05, 'epoch': 1.16} 2%|▏ | 1173/50750 [3:09:22<81:41:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:52:06,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 19:52:06,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.39 | bwd_microstep: 3850.33 | bwd_inner_microstep: 3842.48 | bwd_allreduce_microstep: 7.81 | step_microstep: 22.23 [2024-11-13 19:52:06,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.38 | bwd: 3850.34 | bwd_inner: 3842.48 | bwd_allreduce: 7.83 | step: 22.24 2%|▏ | 1174/50750 [3:09:28<81:43:06, 5.93s/it] {'loss': 0.005, 'learning_rate': 3.0833880499015104e-05, 'epoch': 1.16} 2%|▏ | 1174/50750 [3:09:28<81:43:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:52:12,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 5.22 [2024-11-13 19:52:12,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.51 | bwd_microstep: 3850.04 | bwd_inner_microstep: 3842.54 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.57 [2024-11-13 
19:52:12,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.50 | bwd: 3850.06 | bwd_inner: 3842.54 | bwd_allreduce: 7.48 | step: 22.57 2%|▏ | 1175/50750 [3:09:34<81:43:52, 5.94s/it] {'loss': 0.0027, 'learning_rate': 3.086014445173999e-05, 'epoch': 1.16} 2%|▏ | 1175/50750 [3:09:34<81:43:52, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:52:18,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.94 [2024-11-13 19:52:18,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.17 | bwd_microstep: 3861.34 | bwd_inner_microstep: 3853.83 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-13 19:52:18,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.16 | bwd: 3861.35 | bwd_inner: 3853.83 | bwd_allreduce: 7.48 | step: 21.10 2%|▏ | 1176/50750 [3:09:40<81:45:36, 5.94s/it] {'loss': 0.1002, 'learning_rate': 3.088640840446488e-05, 'epoch': 1.16} 2%|▏ | 1176/50750 [3:09:40<81:45:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:52:24,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 19:52:24,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.49 | bwd_microstep: 3858.30 | bwd_inner_microstep: 3850.46 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.37 [2024-11-13 19:52:24,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.49 | bwd: 3858.32 | bwd_inner: 3850.46 | bwd_allreduce: 7.82 | step: 22.37 2%|▏ | 1177/50750 [3:09:46<81:45:47, 5.94s/it] {'loss': 0.2337, 'learning_rate': 3.091267235718976e-05, 'epoch': 1.16} 2%|▏ | 1177/50750 [3:09:46<81:45:47, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:52:30,420] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:52:30,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.37 | bwd_microstep: 3857.14 | bwd_inner_microstep: 3849.63 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.68 [2024-11-13 19:52:30,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.36 | bwd: 3857.15 | bwd_inner: 3849.63 | bwd_allreduce: 7.48 | step: 20.68 2%|▏ | 1178/50750 [3:09:52<81:46:42, 5.94s/it] {'loss': 0.4288, 'learning_rate': 3.0938936309914645e-05, 'epoch': 1.16} 2%|▏ | 1178/50750 [3:09:52<81:46:42, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:52:36,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.92 [2024-11-13 19:52:36,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.63 | bwd_microstep: 3861.77 | bwd_inner_microstep: 3854.30 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.77 [2024-11-13 19:52:36,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.63 | bwd: 3861.78 | bwd_inner: 3854.30 | bwd_allreduce: 7.45 | step: 20.78 2%|▏ | 1179/50750 [3:09:58<81:47:48, 5.94s/it] {'loss': 0.0195, 'learning_rate': 3.096520026263953e-05, 'epoch': 1.16} 2%|▏ | 1179/50750 [3:09:58<81:47:48, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:52:42,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 19:52:42,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.67 | bwd_microstep: 3856.67 | bwd_inner_microstep: 3848.21 | bwd_allreduce_microstep: 8.42 | step_microstep: 23.26 [2024-11-13 19:52:42,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.67 | bwd: 
3856.68 | bwd_inner: 3848.21 | bwd_allreduce: 8.44 | step: 23.27 2%|▏ | 1180/50750 [3:10:04<81:47:07, 5.94s/it] {'loss': 0.2144, 'learning_rate': 3.099146421536441e-05, 'epoch': 1.16} 2%|▏ | 1180/50750 [3:10:04<81:47:07, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 19:52:48,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 19:52:48,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.40 | bwd_microstep: 3857.42 | bwd_inner_microstep: 3849.83 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.63 [2024-11-13 19:52:48,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.39 | bwd: 3857.44 | bwd_inner: 3849.83 | bwd_allreduce: 7.57 | step: 21.63 2%|▏ | 1181/50750 [3:10:10<81:47:31, 5.94s/it] {'loss': 0.3511, 'learning_rate': 3.10177281680893e-05, 'epoch': 1.16} 2%|▏ | 1181/50750 [3:10:10<81:47:31, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:52:54,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 19:52:54,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.46 | bwd_microstep: 3860.86 | bwd_inner_microstep: 3852.82 | bwd_allreduce_microstep: 7.99 | step_microstep: 21.95 [2024-11-13 19:52:54,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.46 | bwd: 3860.87 | bwd_inner: 3852.82 | bwd_allreduce: 8.01 | step: 21.96 2%|▏ | 1182/50750 [3:10:16<81:46:16, 5.94s/it] {'loss': 0.0018, 'learning_rate': 3.104399212081418e-05, 'epoch': 1.16} 2%|▏ | 1182/50750 [3:10:16<81:46:16, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:53:00,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | 
optimizer_step: 4.93 [2024-11-13 19:53:00,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.26 | bwd_microstep: 3854.17 | bwd_inner_microstep: 3846.03 | bwd_allreduce_microstep: 8.07 | step_microstep: 26.98 [2024-11-13 19:53:00,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.24 | bwd: 3854.19 | bwd_inner: 3846.03 | bwd_allreduce: 8.10 | step: 26.98 2%|▏ | 1183/50750 [3:10:22<81:49:20, 5.94s/it] {'loss': 0.6898, 'learning_rate': 3.1070256073539074e-05, 'epoch': 1.17} 2%|▏ | 1183/50750 [3:10:22<81:49:20, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:53:06,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:53:06,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.77 | bwd_microstep: 3861.67 | bwd_inner_microstep: 3854.12 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.08 [2024-11-13 19:53:06,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.75 | bwd: 3861.68 | bwd_inner: 3854.12 | bwd_allreduce: 7.52 | step: 21.08 2%|▏ | 1184/50750 [3:10:28<81:49:43, 5.94s/it] {'loss': 0.5691, 'learning_rate': 3.1096520026263955e-05, 'epoch': 1.17} 2%|▏ | 1184/50750 [3:10:28<81:49:43, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:53:11,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 19:53:12,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.24 | bwd_microstep: 3850.25 | bwd_inner_microstep: 3842.68 | bwd_allreduce_microstep: 7.53 | step_microstep: 22.23 [2024-11-13 19:53:12,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.24 | bwd: 3850.27 | bwd_inner: 3842.68 | bwd_allreduce: 7.54 | step: 22.23 2%|▏ | 1185/50750 [3:10:33<81:45:45, 
5.94s/it] {'loss': 0.0007, 'learning_rate': 3.112278397898884e-05, 'epoch': 1.17} 2%|▏ | 1185/50750 [3:10:33<81:45:45, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:53:17,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 19:53:17,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.69 | bwd_microstep: 3852.51 | bwd_inner_microstep: 3844.71 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.04 [2024-11-13 19:53:17,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.67 | bwd: 3852.52 | bwd_inner: 3844.71 | bwd_allreduce: 7.77 | step: 22.05 2%|▏ | 1186/50750 [3:10:39<81:44:00, 5.94s/it] {'loss': 0.155, 'learning_rate': 3.114904793171372e-05, 'epoch': 1.17} 2%|▏ | 1186/50750 [3:10:39<81:44:00, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:53:23,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 19:53:23,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.14 | bwd_microstep: 3855.45 | bwd_inner_microstep: 3847.75 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.57 [2024-11-13 19:53:23,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.13 | bwd: 3855.47 | bwd_inner: 3847.75 | bwd_allreduce: 7.67 | step: 21.57 2%|▏ | 1187/50750 [3:10:45<81:41:47, 5.93s/it] {'loss': 0.102, 'learning_rate': 3.117531188443861e-05, 'epoch': 1.17} 2%|▏ | 1187/50750 [3:10:45<81:41:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:53:29,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:53:29,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2031.48 | bwd_microstep: 3851.85 | bwd_inner_microstep: 3844.05 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.10 [2024-11-13 19:53:29,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.47 | bwd: 3851.87 | bwd_inner: 3844.05 | bwd_allreduce: 7.77 | step: 22.10 2%|▏ | 1188/50750 [3:10:51<81:41:54, 5.93s/it] {'loss': 0.0022, 'learning_rate': 3.12015758371635e-05, 'epoch': 1.17} 2%|▏ | 1188/50750 [3:10:51<81:41:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:53:35,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:53:35,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.74 | bwd_microstep: 3855.15 | bwd_inner_microstep: 3847.54 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.68 [2024-11-13 19:53:35,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.73 | bwd: 3855.16 | bwd_inner: 3847.54 | bwd_allreduce: 7.58 | step: 21.68 2%|▏ | 1189/50750 [3:10:57<81:42:19, 5.93s/it] {'loss': 0.0093, 'learning_rate': 3.122783978988838e-05, 'epoch': 1.17} 2%|▏ | 1189/50750 [3:10:57<81:42:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:53:41,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-13 19:53:41,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.96 | bwd_microstep: 3859.32 | bwd_inner_microstep: 3851.62 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.26 [2024-11-13 19:53:41,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.95 | bwd: 3859.33 | bwd_inner: 3851.62 | bwd_allreduce: 7.67 | step: 21.27 2%|▏ | 1190/50750 [3:11:03<81:42:32, 5.94s/it] {'loss': 0.1191, 'learning_rate': 3.1254103742613264e-05, 'epoch': 1.17} 2%|▏ | 1190/50750 
[3:11:03<81:42:32, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:53:47,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:53:47,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3856.89 | bwd_inner_microstep: 3849.38 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.24 [2024-11-13 19:53:47,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3856.90 | bwd_inner: 3849.38 | bwd_allreduce: 7.48 | step: 21.24 2%|▏ | 1191/50750 [3:11:09<81:40:49, 5.93s/it] {'loss': 0.0135, 'learning_rate': 3.128036769533815e-05, 'epoch': 1.17} 2%|▏ | 1191/50750 [3:11:09<81:40:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:53:53,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 19:53:53,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.57 | bwd_microstep: 3861.28 | bwd_inner_microstep: 3853.41 | bwd_allreduce_microstep: 7.83 | step_microstep: 21.83 [2024-11-13 19:53:53,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.57 | bwd: 3861.30 | bwd_inner: 3853.41 | bwd_allreduce: 7.84 | step: 21.83 2%|▏ | 1192/50750 [3:11:15<81:42:52, 5.94s/it] {'loss': 0.0066, 'learning_rate': 3.130663164806304e-05, 'epoch': 1.17} 2%|▏ | 1192/50750 [3:11:15<81:42:52, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:53:59,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:53:59,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.09 | bwd_microstep: 3856.49 | bwd_inner_microstep: 3848.94 | 
bwd_allreduce_microstep: 7.50 | step_microstep: 21.62 [2024-11-13 19:53:59,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.08 | bwd: 3856.50 | bwd_inner: 3848.94 | bwd_allreduce: 7.52 | step: 21.62 2%|▏ | 1193/50750 [3:11:21<81:43:19, 5.94s/it] {'loss': 0.4202, 'learning_rate': 3.133289560078792e-05, 'epoch': 1.18} 2%|▏ | 1193/50750 [3:11:21<81:43:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:54:05,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:54:05,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.77 | bwd_microstep: 3847.54 | bwd_inner_microstep: 3839.94 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.10 [2024-11-13 19:54:05,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.75 | bwd: 3847.55 | bwd_inner: 3839.94 | bwd_allreduce: 7.57 | step: 21.10 2%|▏ | 1194/50750 [3:11:27<81:39:11, 5.93s/it] {'loss': 0.0041, 'learning_rate': 3.1359159553512806e-05, 'epoch': 1.18} 2%|▏ | 1194/50750 [3:11:27<81:39:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2193 [2024-11-13 19:54:11,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.45 | optimizer_step: 4.93 [2024-11-13 19:54:11,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.00 | bwd_microstep: 3847.12 | bwd_inner_microstep: 3838.99 | bwd_allreduce_microstep: 8.06 | step_microstep: 32.09 [2024-11-13 19:54:11,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.99 | bwd: 3847.14 | bwd_inner: 3838.99 | bwd_allreduce: 8.09 | step: 32.09 2%|▏ | 1195/50750 [3:11:33<81:41:34, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.138542350623769e-05, 'epoch': 1.18} 2%|▏ | 1195/50750 [3:11:33<81:41:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-13 19:54:17,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:54:17,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.33 | bwd_microstep: 3861.83 | bwd_inner_microstep: 3854.34 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.04 [2024-11-13 19:54:17,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.33 | bwd: 3861.84 | bwd_inner: 3854.33 | bwd_allreduce: 7.46 | step: 21.05 2%|▏ | 1196/50750 [3:11:39<81:40:48, 5.93s/it] {'loss': 0.447, 'learning_rate': 3.141168745896258e-05, 'epoch': 1.18} 2%|▏ | 1196/50750 [3:11:39<81:40:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:54:23,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.93 [2024-11-13 19:54:23,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3855.68 | bwd_inner_microstep: 3847.87 | bwd_allreduce_microstep: 7.76 | step_microstep: 22.76 [2024-11-13 19:54:23,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.64 | bwd: 3855.70 | bwd_inner: 3847.87 | bwd_allreduce: 7.78 | step: 22.76 2%|▏ | 1197/50750 [3:11:45<81:40:30, 5.93s/it] {'loss': 0.0167, 'learning_rate': 3.143795141168746e-05, 'epoch': 1.18} 2%|▏ | 1197/50750 [3:11:45<81:40:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:54:29,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 19:54:29,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | bwd_microstep: 3851.32 | bwd_inner_microstep: 3843.81 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 19:54:29,138] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.22 | bwd: 3851.33 | bwd_inner: 3843.81 | bwd_allreduce: 7.49 | step: 20.96 2%|▏ | 1198/50750 [3:11:51<81:39:44, 5.93s/it] {'loss': 0.0017, 'learning_rate': 3.146421536441235e-05, 'epoch': 1.18} 2%|▏ | 1198/50750 [3:11:51<81:39:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 19:54:35,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 19:54:35,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.85 | bwd_microstep: 3842.91 | bwd_inner_microstep: 3835.40 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.85 [2024-11-13 19:54:35,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.85 | bwd: 3842.92 | bwd_inner: 3835.40 | bwd_allreduce: 7.48 | step: 20.86 2%|▏ | 1199/50750 [3:11:57<81:35:32, 5.93s/it] {'loss': 0.0137, 'learning_rate': 3.1490479317137235e-05, 'epoch': 1.18} 2%|▏ | 1199/50750 [3:11:57<81:35:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:54:40,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 19:54:40,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.66 | bwd_microstep: 3855.52 | bwd_inner_microstep: 3848.04 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.54 [2024-11-13 19:54:40,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.66 | bwd: 3855.53 | bwd_inner: 3848.04 | bwd_allreduce: 7.45 | step: 21.54 2%|▏ | 1200/50750 [3:12:02<81:37:11, 5.93s/it] {'loss': 0.0911, 'learning_rate': 3.1516743269862116e-05, 'epoch': 1.18} 2%|▏ | 1200/50750 [3:12:02<81:37:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:54:46,922] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 19:54:46,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.06 | bwd_microstep: 3858.34 | bwd_inner_microstep: 3850.48 | bwd_allreduce_microstep: 7.82 | step_microstep: 21.09 [2024-11-13 19:54:46,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.06 | bwd: 3858.35 | bwd_inner: 3850.48 | bwd_allreduce: 7.83 | step: 21.09 2%|▏ | 1201/50750 [3:12:08<81:38:51, 5.93s/it] {'loss': 0.2415, 'learning_rate': 3.1543007222587e-05, 'epoch': 1.18} 2%|▏ | 1201/50750 [3:12:08<81:38:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:54:52,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.33 | optimizer_step: 5.12 [2024-11-13 19:54:52,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.06 | bwd_microstep: 3855.25 | bwd_inner_microstep: 3847.50 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.82 [2024-11-13 19:54:52,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.05 | bwd: 3855.26 | bwd_inner: 3847.50 | bwd_allreduce: 7.72 | step: 21.82 2%|▏ | 1202/50750 [3:12:14<81:41:02, 5.93s/it] {'loss': 0.5973, 'learning_rate': 3.156927117531188e-05, 'epoch': 1.18} 2%|▏ | 1202/50750 [3:12:14<81:41:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:54:58,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 19:54:58,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.12 | bwd_microstep: 3856.50 | bwd_inner_microstep: 3848.52 | bwd_allreduce_microstep: 7.94 | step_microstep: 21.84 [2024-11-13 19:54:58,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.08 | bwd: 3856.52 | bwd_inner: 3848.52 | 
bwd_allreduce: 7.96 | step: 21.84 2%|▏ | 1203/50750 [3:12:20<81:42:49, 5.94s/it] {'loss': 0.6019, 'learning_rate': 3.159553512803678e-05, 'epoch': 1.19} 2%|▏ | 1203/50750 [3:12:20<81:42:49, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:55:04,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:55:04,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.67 | bwd_microstep: 3858.35 | bwd_inner_microstep: 3850.83 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.96 [2024-11-13 19:55:04,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.67 | bwd: 3858.36 | bwd_inner: 3850.83 | bwd_allreduce: 7.48 | step: 20.97 2%|▏ | 1204/50750 [3:12:26<81:42:48, 5.94s/it] {'loss': 0.0064, 'learning_rate': 3.162179908076166e-05, 'epoch': 1.19} 2%|▏ | 1204/50750 [3:12:26<81:42:48, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:55:10,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 19:55:10,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.56 | bwd_microstep: 3861.26 | bwd_inner_microstep: 3850.88 | bwd_allreduce_microstep: 10.33 | step_microstep: 22.40 [2024-11-13 19:55:10,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.56 | bwd: 3861.28 | bwd_inner: 3850.88 | bwd_allreduce: 10.35 | step: 22.40 2%|▏ | 1205/50750 [3:12:32<81:43:39, 5.94s/it] {'loss': 0.1606, 'learning_rate': 3.164806303348654e-05, 'epoch': 1.19} 2%|▏ | 1205/50750 [3:12:32<81:43:39, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 19:55:16,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 
[2024-11-13 19:55:16,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.30 | bwd_microstep: 3858.62 | bwd_inner_microstep: 3851.08 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.37 [2024-11-13 19:55:16,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.29 | bwd: 3858.63 | bwd_inner: 3851.08 | bwd_allreduce: 7.51 | step: 21.37 2%|▏ | 1206/50750 [3:12:38<81:42:02, 5.94s/it] {'loss': 0.0402, 'learning_rate': 3.1674326986211425e-05, 'epoch': 1.19} 2%|▏ | 1206/50750 [3:12:38<81:42:02, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:55:22,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 19:55:22,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.03 | bwd_microstep: 3849.49 | bwd_inner_microstep: 3841.97 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.59 [2024-11-13 19:55:22,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.03 | bwd: 3849.50 | bwd_inner: 3841.97 | bwd_allreduce: 7.48 | step: 21.59 2%|▏ | 1207/50750 [3:12:44<81:39:28, 5.93s/it] {'loss': 0.015, 'learning_rate': 3.170059093893631e-05, 'epoch': 1.19} 2%|▏ | 1207/50750 [3:12:44<81:39:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:55:28,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.92 [2024-11-13 19:55:28,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.71 | bwd_microstep: 3858.87 | bwd_inner_microstep: 3851.09 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.47 [2024-11-13 19:55:28,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.71 | bwd: 3858.89 | bwd_inner: 3851.09 | bwd_allreduce: 7.76 | step: 21.48 2%|▏ | 1208/50750 [3:12:50<81:39:10, 5.93s/it] {'loss': 0.1687, 
'learning_rate': 3.17268548916612e-05, 'epoch': 1.19} 2%|▏ | 1208/50750 [3:12:50<81:39:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:55:34,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 19:55:34,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.21 | bwd_microstep: 3854.95 | bwd_inner_microstep: 3847.08 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.64 [2024-11-13 19:55:34,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.21 | bwd: 3854.97 | bwd_inner: 3847.08 | bwd_allreduce: 7.85 | step: 22.65 2%|▏ | 1209/50750 [3:12:56<81:40:40, 5.94s/it] {'loss': 0.0032, 'learning_rate': 3.175311884438608e-05, 'epoch': 1.19} 2%|▏ | 1209/50750 [3:12:56<81:40:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:55:40,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:55:40,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.67 | bwd_microstep: 3861.76 | bwd_inner_microstep: 3854.14 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.22 [2024-11-13 19:55:40,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.66 | bwd: 3861.78 | bwd_inner: 3854.14 | bwd_allreduce: 7.60 | step: 21.22 2%|▏ | 1210/50750 [3:13:02<81:42:52, 5.94s/it] {'loss': 0.0316, 'learning_rate': 3.177938279711097e-05, 'epoch': 1.19} 2%|▏ | 1210/50750 [3:13:02<81:42:52, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:55:46,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:55:46,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.39 | 
bwd_microstep: 3856.54 | bwd_inner_microstep: 3849.00 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.27 [2024-11-13 19:55:46,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.39 | bwd: 3856.56 | bwd_inner: 3849.00 | bwd_allreduce: 7.52 | step: 21.27 2%|▏ | 1211/50750 [3:13:08<81:41:17, 5.94s/it] {'loss': 0.0039, 'learning_rate': 3.1805646749835854e-05, 'epoch': 1.19} 2%|▏ | 1211/50750 [3:13:08<81:41:17, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:55:52,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:55:52,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.86 | bwd_microstep: 3857.45 | bwd_inner_microstep: 3849.93 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.44 [2024-11-13 19:55:52,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.86 | bwd: 3857.46 | bwd_inner: 3849.93 | bwd_allreduce: 7.49 | step: 21.45 2%|▏ | 1212/50750 [3:13:14<81:39:12, 5.93s/it] {'loss': 0.0951, 'learning_rate': 3.183191070256074e-05, 'epoch': 1.19} 2%|▏ | 1212/50750 [3:13:14<81:39:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:55:58,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 19:55:58,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.11 | bwd_microstep: 3858.81 | bwd_inner_microstep: 3851.23 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.85 [2024-11-13 19:55:58,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.09 | bwd: 3858.83 | bwd_inner: 3851.23 | bwd_allreduce: 7.55 | step: 21.85 2%|▏ | 1213/50750 [3:13:20<81:39:06, 5.93s/it] {'loss': 0.0233, 'learning_rate': 3.185817465528562e-05, 'epoch': 1.2} 2%|▏ | 1213/50750 [3:13:20<81:39:06, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:56:04,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:56:04,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.57 | bwd_microstep: 3855.07 | bwd_inner_microstep: 3847.44 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.04 [2024-11-13 19:56:04,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.57 | bwd: 3855.08 | bwd_inner: 3847.44 | bwd_allreduce: 7.60 | step: 21.04 2%|▏ | 1214/50750 [3:13:26<81:38:18, 5.93s/it] {'loss': 0.0032, 'learning_rate': 3.188443860801051e-05, 'epoch': 1.2} 2%|▏ | 1214/50750 [3:13:26<81:38:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:56:10,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:56:10,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.00 | bwd_microstep: 3851.45 | bwd_inner_microstep: 3843.73 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.76 [2024-11-13 19:56:10,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.00 | bwd: 3851.47 | bwd_inner: 3843.73 | bwd_allreduce: 7.69 | step: 21.77 2%|▏ | 1215/50750 [3:13:31<81:37:03, 5.93s/it] {'loss': 0.0147, 'learning_rate': 3.1910702560735396e-05, 'epoch': 1.2} 2%|▏ | 1215/50750 [3:13:31<81:37:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:56:15,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 19:56:15,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.77 | bwd_microstep: 3862.26 | bwd_inner_microstep: 3854.71 | bwd_allreduce_microstep: 7.51 | 
step_microstep: 21.05 [2024-11-13 19:56:15,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.76 | bwd: 3862.28 | bwd_inner: 3854.71 | bwd_allreduce: 7.52 | step: 21.06 2%|▏ | 1216/50750 [3:13:37<81:40:45, 5.94s/it] {'loss': 0.0161, 'learning_rate': 3.1936966513460276e-05, 'epoch': 1.2} 2%|▏ | 1216/50750 [3:13:37<81:40:45, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:56:21,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.93 [2024-11-13 19:56:21,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.14 | bwd_microstep: 3866.17 | bwd_inner_microstep: 3858.28 | bwd_allreduce_microstep: 7.85 | step_microstep: 22.60 [2024-11-13 19:56:21,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.14 | bwd: 3866.19 | bwd_inner: 3858.28 | bwd_allreduce: 7.86 | step: 22.61 2%|▏ | 1217/50750 [3:13:43<81:44:25, 5.94s/it] {'loss': 0.1079, 'learning_rate': 3.196323046618516e-05, 'epoch': 1.2} 2%|▏ | 1217/50750 [3:13:43<81:44:25, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:56:27,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 19:56:27,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.24 | bwd_microstep: 3860.88 | bwd_inner_microstep: 3852.86 | bwd_allreduce_microstep: 7.95 | step_microstep: 25.63 [2024-11-13 19:56:27,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.23 | bwd: 3860.89 | bwd_inner: 3852.86 | bwd_allreduce: 7.98 | step: 25.62 2%|▏ | 1218/50750 [3:13:49<81:46:18, 5.94s/it] {'loss': 0.0007, 'learning_rate': 3.198949441891005e-05, 'epoch': 1.2} 2%|▏ | 1218/50750 [3:13:49<81:46:18, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 
19:56:33,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:56:33,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.82 | bwd_microstep: 3844.15 | bwd_inner_microstep: 3836.11 | bwd_allreduce_microstep: 8.00 | step_microstep: 21.31 [2024-11-13 19:56:33,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.81 | bwd: 3844.17 | bwd_inner: 3836.11 | bwd_allreduce: 8.02 | step: 21.31 2%|▏ | 1219/50750 [3:13:55<81:41:49, 5.94s/it] {'loss': 0.0003, 'learning_rate': 3.201575837163494e-05, 'epoch': 1.2} 2%|▏ | 1219/50750 [3:13:55<81:41:49, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:56:39,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:56:39,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3854.41 | bwd_inner_microstep: 3846.89 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.06 [2024-11-13 19:56:39,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3854.43 | bwd_inner: 3846.89 | bwd_allreduce: 7.50 | step: 21.06 2%|▏ | 1220/50750 [3:14:01<81:38:56, 5.93s/it] {'loss': 0.3532, 'learning_rate': 3.204202232435982e-05, 'epoch': 1.2} 2%|▏ | 1220/50750 [3:14:01<81:38:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:56:45,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 19:56:45,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.93 | bwd_microstep: 3858.69 | bwd_inner_microstep: 3851.13 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.51 [2024-11-13 19:56:45,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.93 | bwd: 3858.71 | bwd_inner: 3851.13 | bwd_allreduce: 7.53 | step: 21.52 2%|▏ | 1221/50750 [3:14:07<81:38:27, 5.93s/it] {'loss': 0.0029, 'learning_rate': 3.2068286277084705e-05, 'epoch': 1.2} 2%|▏ | 1221/50750 [3:14:07<81:38:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:56:51,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 19:56:51,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.65 | bwd_microstep: 3856.87 | bwd_inner_microstep: 3849.36 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.04 [2024-11-13 19:56:51,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.64 | bwd: 3856.88 | bwd_inner: 3849.36 | bwd_allreduce: 7.48 | step: 21.04 2%|▏ | 1222/50750 [3:14:13<81:37:09, 5.93s/it] {'loss': 0.2684, 'learning_rate': 3.209455022980959e-05, 'epoch': 1.2} 2%|▏ | 1222/50750 [3:14:13<81:37:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:56:57,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.53 | optimizer_step: 4.93 [2024-11-13 19:56:57,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.75 | bwd_microstep: 3853.62 | bwd_inner_microstep: 3845.92 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.00 [2024-11-13 19:56:57,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.75 | bwd: 3853.63 | bwd_inner: 3845.92 | bwd_allreduce: 7.67 | step: 22.01 2%|▏ | 1223/50750 [3:14:19<81:37:12, 5.93s/it] {'loss': 0.0438, 'learning_rate': 3.212081418253447e-05, 'epoch': 1.2} 2%|▏ | 1223/50750 [3:14:19<81:37:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:57:03,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 19:57:03,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.58 | bwd_microstep: 3866.86 | bwd_inner_microstep: 3859.34 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.00 [2024-11-13 19:57:03,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.57 | bwd: 3866.87 | bwd_inner: 3859.34 | bwd_allreduce: 7.49 | step: 21.00 2%|▏ | 1224/50750 [3:14:25<81:39:13, 5.94s/it] {'loss': 0.0005, 'learning_rate': 3.214707813525936e-05, 'epoch': 1.21} 2%|▏ | 1224/50750 [3:14:25<81:39:13, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:57:09,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:57:09,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.89 | bwd_microstep: 3855.14 | bwd_inner_microstep: 3847.31 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.18 [2024-11-13 19:57:09,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.89 | bwd: 3855.16 | bwd_inner: 3847.31 | bwd_allreduce: 7.81 | step: 22.18 2%|▏ | 1225/50750 [3:14:31<81:38:51, 5.94s/it] {'loss': 0.005, 'learning_rate': 3.217334208798424e-05, 'epoch': 1.21} 2%|▏ | 1225/50750 [3:14:31<81:38:51, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:57:15,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 19:57:15,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.04 | bwd_microstep: 3858.31 | bwd_inner_microstep: 3850.48 | bwd_allreduce_microstep: 7.78 | step_microstep: 24.59 [2024-11-13 19:57:15,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.03 | bwd: 3858.33 | bwd_inner: 3850.48 | bwd_allreduce: 7.80 | step: 24.60 2%|▏ | 1226/50750 
[3:14:37<81:39:50, 5.94s/it] {'loss': 0.0012, 'learning_rate': 3.219960604070913e-05, 'epoch': 1.21} 2%|▏ | 1226/50750 [3:14:37<81:39:50, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:57:21,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-13 19:57:21,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | bwd_microstep: 3852.41 | bwd_inner_microstep: 3844.47 | bwd_allreduce_microstep: 7.90 | step_microstep: 21.88 [2024-11-13 19:57:21,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.70 | bwd: 3852.43 | bwd_inner: 3844.47 | bwd_allreduce: 7.92 | step: 21.88 2%|▏ | 1227/50750 [3:14:43<81:37:27, 5.93s/it] {'loss': 0.3678, 'learning_rate': 3.2225869993434015e-05, 'epoch': 1.21} 2%|▏ | 1227/50750 [3:14:43<81:37:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:57:27,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-13 19:57:27,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.08 | bwd_microstep: 3857.98 | bwd_inner_microstep: 3849.92 | bwd_allreduce_microstep: 7.98 | step_microstep: 25.73 [2024-11-13 19:57:27,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.07 | bwd: 3858.00 | bwd_inner: 3849.92 | bwd_allreduce: 8.01 | step: 25.73 2%|▏ | 1228/50750 [3:14:49<81:40:08, 5.94s/it] {'loss': 0.0018, 'learning_rate': 3.22521339461589e-05, 'epoch': 1.21} 2%|▏ | 1228/50750 [3:14:49<81:40:08, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:57:33,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 19:57:33,138] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2030.09 | bwd_microstep: 3856.49 | bwd_inner_microstep: 3848.91 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.40 [2024-11-13 19:57:33,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.07 | bwd: 3856.50 | bwd_inner: 3848.91 | bwd_allreduce: 7.55 | step: 22.40 2%|▏ | 1229/50750 [3:14:55<81:41:12, 5.94s/it] {'loss': 0.0154, 'learning_rate': 3.227839789888378e-05, 'epoch': 1.21} 2%|▏ | 1229/50750 [3:14:55<81:41:12, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:57:39,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:57:39,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.23 | bwd_microstep: 3851.94 | bwd_inner_microstep: 3844.32 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.68 [2024-11-13 19:57:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.23 | bwd: 3851.95 | bwd_inner: 3844.32 | bwd_allreduce: 7.59 | step: 21.68 2%|▏ | 1230/50750 [3:15:01<81:40:56, 5.94s/it] {'loss': 0.0204, 'learning_rate': 3.230466185160867e-05, 'epoch': 1.21} 2%|▏ | 1230/50750 [3:15:01<81:40:56, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:57:45,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:57:45,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.03 | bwd_microstep: 3859.19 | bwd_inner_microstep: 3851.50 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.17 [2024-11-13 19:57:45,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.02 | bwd: 3859.20 | bwd_inner: 3851.49 | bwd_allreduce: 7.67 | step: 21.18 2%|▏ | 1231/50750 [3:15:06<81:42:23, 5.94s/it] {'loss': 0.011, 'learning_rate': 3.2330925804333556e-05, 'epoch': 1.21} 2%|▏ 
| 1231/50750 [3:15:06<81:42:23, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:57:50,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:57:50,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.35 | bwd_microstep: 3850.44 | bwd_inner_microstep: 3841.98 | bwd_allreduce_microstep: 8.40 | step_microstep: 21.79 [2024-11-13 19:57:50,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.35 | bwd: 3850.45 | bwd_inner: 3841.98 | bwd_allreduce: 8.43 | step: 21.79 2%|▏ | 1232/50750 [3:15:12<81:39:58, 5.94s/it] {'loss': 0.6588, 'learning_rate': 3.235718975705844e-05, 'epoch': 1.21} 2%|▏ | 1232/50750 [3:15:12<81:39:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:57:56,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 19:57:56,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.24 | bwd_microstep: 3850.19 | bwd_inner_microstep: 3842.59 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.21 [2024-11-13 19:57:56,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.22 | bwd: 3850.20 | bwd_inner: 3842.59 | bwd_allreduce: 7.57 | step: 22.21 2%|▏ | 1233/50750 [3:15:18<81:38:28, 5.94s/it] {'loss': 0.3928, 'learning_rate': 3.2383453709783324e-05, 'epoch': 1.21} 2%|▏ | 1233/50750 [3:15:18<81:38:28, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:58:02,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 19:58:02,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.21 | bwd_microstep: 3860.74 | bwd_inner_microstep: 3853.24 | 
bwd_allreduce_microstep: 7.46 | step_microstep: 20.94 [2024-11-13 19:58:02,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.21 | bwd: 3860.75 | bwd_inner: 3853.24 | bwd_allreduce: 7.48 | step: 20.94 2%|▏ | 1234/50750 [3:15:24<81:38:07, 5.94s/it] {'loss': 0.0059, 'learning_rate': 3.240971766250821e-05, 'epoch': 1.22} 2%|▏ | 1234/50750 [3:15:24<81:38:07, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:58:08,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:58:08,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.20 | bwd_microstep: 3860.51 | bwd_inner_microstep: 3852.95 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.56 [2024-11-13 19:58:08,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.20 | bwd: 3860.52 | bwd_inner: 3852.95 | bwd_allreduce: 7.54 | step: 21.56 2%|▏ | 1235/50750 [3:15:30<81:38:16, 5.94s/it] {'loss': 0.5037, 'learning_rate': 3.24359816152331e-05, 'epoch': 1.22} 2%|▏ | 1235/50750 [3:15:30<81:38:16, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:58:14,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 19:58:14,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.07 | bwd_microstep: 3848.56 | bwd_inner_microstep: 3841.05 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.00 [2024-11-13 19:58:14,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.05 | bwd: 3848.58 | bwd_inner: 3841.05 | bwd_allreduce: 7.48 | step: 21.01 2%|▏ | 1236/50750 [3:15:36<81:35:39, 5.93s/it] {'loss': 0.6432, 'learning_rate': 3.246224556795798e-05, 'epoch': 1.22} 2%|▏ | 1236/50750 [3:15:36<81:35:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-13 19:58:20,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:58:20,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.01 | bwd_microstep: 3854.54 | bwd_inner_microstep: 3846.95 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.61 [2024-11-13 19:58:20,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.01 | bwd: 3854.56 | bwd_inner: 3846.95 | bwd_allreduce: 7.56 | step: 21.61 2%|▏ | 1237/50750 [3:15:42<81:34:09, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.2488509520682866e-05, 'epoch': 1.22} 2%|▏ | 1237/50750 [3:15:42<81:34:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:58:26,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 19:58:26,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.88 | bwd_microstep: 3846.45 | bwd_inner_microstep: 3838.91 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.14 [2024-11-13 19:58:26,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.88 | bwd: 3846.46 | bwd_inner: 3838.91 | bwd_allreduce: 7.51 | step: 21.14 2%|▏ | 1238/50750 [3:15:48<81:32:22, 5.93s/it] {'loss': 0.1688, 'learning_rate': 3.251477347340775e-05, 'epoch': 1.22} 2%|▏ | 1238/50750 [3:15:48<81:32:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:58:32,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 19:58:32,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.98 | bwd_microstep: 3849.70 | bwd_inner_microstep: 3842.24 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.11 [2024-11-13 19:58:32,456] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.98 | bwd: 3849.72 | bwd_inner: 3842.24 | bwd_allreduce: 7.43 | step: 21.12 2%|▏ | 1239/50750 [3:15:54<81:30:49, 5.93s/it] {'loss': 0.0018, 'learning_rate': 3.254103742613264e-05, 'epoch': 1.22} 2%|▏ | 1239/50750 [3:15:54<81:30:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:58:38,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:58:38,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.01 | bwd_microstep: 3846.96 | bwd_inner_microstep: 3839.48 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.88 [2024-11-13 19:58:38,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.02 | bwd: 3846.97 | bwd_inner: 3839.48 | bwd_allreduce: 7.45 | step: 20.89 2%|▏ | 1240/50750 [3:16:00<81:28:04, 5.92s/it] {'loss': 0.0142, 'learning_rate': 3.256730137885752e-05, 'epoch': 1.22} 2%|▏ | 1240/50750 [3:16:00<81:28:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:58:44,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 19:58:44,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.88 | bwd_microstep: 3848.17 | bwd_inner_microstep: 3840.14 | bwd_allreduce_microstep: 7.98 | step_microstep: 22.54 [2024-11-13 19:58:44,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.88 | bwd: 3848.19 | bwd_inner: 3840.14 | bwd_allreduce: 8.00 | step: 22.56 2%|▏ | 1241/50750 [3:16:06<81:29:11, 5.93s/it] {'loss': 0.013, 'learning_rate': 3.25935653315824e-05, 'epoch': 1.22} 2%|▏ | 1241/50750 [3:16:06<81:29:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:58:50,236] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 19:58:50,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.56 | bwd_microstep: 3854.77 | bwd_inner_microstep: 3846.89 | bwd_allreduce_microstep: 7.84 | step_microstep: 21.68 [2024-11-13 19:58:50,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.55 | bwd: 3854.78 | bwd_inner: 3846.89 | bwd_allreduce: 7.85 | step: 21.68 2%|▏ | 1242/50750 [3:16:12<81:31:58, 5.93s/it] {'loss': 0.0783, 'learning_rate': 3.2619829284307295e-05, 'epoch': 1.22} 2%|▏ | 1242/50750 [3:16:12<81:31:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 19:58:56,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 19:58:56,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.10 | bwd_microstep: 3856.37 | bwd_inner_microstep: 3848.86 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.54 [2024-11-13 19:58:56,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.09 | bwd: 3856.39 | bwd_inner: 3848.86 | bwd_allreduce: 7.48 | step: 21.55 2%|▏ | 1243/50750 [3:16:18<81:34:41, 5.93s/it] {'loss': 0.0052, 'learning_rate': 3.2646093237032175e-05, 'epoch': 1.22} 2%|▏ | 1243/50750 [3:16:18<81:34:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:59:02,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 19:59:02,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.27 | bwd_microstep: 3851.04 | bwd_inner_microstep: 3843.52 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.25 [2024-11-13 19:59:02,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.27 | bwd: 3851.06 | bwd_inner: 3843.52 | 
bwd_allreduce: 7.50 | step: 21.25 2%|▏ | 1244/50750 [3:16:24<81:33:46, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.267235718975706e-05, 'epoch': 1.23} 2%|▏ | 1244/50750 [3:16:24<81:33:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:59:08,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 19:59:08,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.21 | bwd_microstep: 3849.52 | bwd_inner_microstep: 3842.03 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.04 [2024-11-13 19:59:08,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.21 | bwd: 3849.53 | bwd_inner: 3842.03 | bwd_allreduce: 7.46 | step: 21.05 2%|▏ | 1245/50750 [3:16:29<81:30:53, 5.93s/it] {'loss': 0.9717, 'learning_rate': 3.269862114248194e-05, 'epoch': 1.23} 2%|▏ | 1245/50750 [3:16:29<81:30:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:59:13,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 5.06 [2024-11-13 19:59:13,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.57 | bwd_microstep: 3851.33 | bwd_inner_microstep: 3843.87 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.19 [2024-11-13 19:59:13,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.57 | bwd: 3851.35 | bwd_inner: 3843.87 | bwd_allreduce: 7.44 | step: 21.20 2%|▏ | 1246/50750 [3:16:35<81:31:31, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.272488509520683e-05, 'epoch': 1.23} 2%|▏ | 1246/50750 [3:16:35<81:31:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 19:59:19,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 
[2024-11-13 19:59:19,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.86 | bwd_microstep: 3853.43 | bwd_inner_microstep: 3845.94 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.89 [2024-11-13 19:59:19,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.86 | bwd: 3853.45 | bwd_inner: 3845.94 | bwd_allreduce: 7.47 | step: 20.89 2%|▏ | 1247/50750 [3:16:41<81:30:56, 5.93s/it] {'loss': 0.0042, 'learning_rate': 3.275114904793172e-05, 'epoch': 1.23} 2%|▏ | 1247/50750 [3:16:41<81:30:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:59:25,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.69 | optimizer_step: 4.92 [2024-11-13 19:59:25,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.93 | bwd_microstep: 3852.39 | bwd_inner_microstep: 3844.26 | bwd_allreduce_microstep: 8.08 | step_microstep: 22.31 [2024-11-13 19:59:25,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.93 | bwd: 3852.40 | bwd_inner: 3844.26 | bwd_allreduce: 8.10 | step: 22.32 2%|▏ | 1248/50750 [3:16:47<81:31:13, 5.93s/it] {'loss': 0.0044, 'learning_rate': 3.27774130006566e-05, 'epoch': 1.23} 2%|▏ | 1248/50750 [3:16:47<81:31:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:59:31,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 19:59:31,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.92 | bwd_microstep: 3854.14 | bwd_inner_microstep: 3846.55 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.49 [2024-11-13 19:59:31,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.92 | bwd: 3854.15 | bwd_inner: 3846.55 | bwd_allreduce: 7.56 | step: 21.49 2%|▏ | 1249/50750 [3:16:53<81:32:51, 5.93s/it] {'loss': 0.3754, 
'learning_rate': 3.2803676953381485e-05, 'epoch': 1.23} 2%|▏ | 1249/50750 [3:16:53<81:32:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:59:37,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.36 | optimizer_step: 4.93 [2024-11-13 19:59:37,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.14 | bwd_microstep: 3857.24 | bwd_inner_microstep: 3849.49 | bwd_allreduce_microstep: 7.71 | step_microstep: 23.86 [2024-11-13 19:59:37,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.12 | bwd: 3857.25 | bwd_inner: 3849.49 | bwd_allreduce: 7.73 | step: 23.86 2%|▏ | 1250/50750 [3:16:59<81:34:37, 5.93s/it] {'loss': 0.6244, 'learning_rate': 3.282994090610637e-05, 'epoch': 1.23} 2%|▏ | 1250/50750 [3:16:59<81:34:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 19:59:43,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 19:59:43,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.41 | bwd_microstep: 3865.14 | bwd_inner_microstep: 3857.63 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.00 [2024-11-13 19:59:43,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.41 | bwd: 3865.15 | bwd_inner: 3857.63 | bwd_allreduce: 7.48 | step: 21.00 2%|▏ | 1251/50750 [3:17:05<81:37:05, 5.94s/it] {'loss': 0.013, 'learning_rate': 3.285620485883126e-05, 'epoch': 1.23} 2%|▏ | 1251/50750 [3:17:05<81:37:05, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 19:59:49,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 19:59:49,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.35 | 
bwd_microstep: 3860.52 | bwd_inner_microstep: 3852.77 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.27 [2024-11-13 19:59:49,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.35 | bwd: 3860.53 | bwd_inner: 3852.77 | bwd_allreduce: 7.72 | step: 21.27 2%|▏ | 1252/50750 [3:17:11<81:36:29, 5.94s/it] {'loss': 0.0091, 'learning_rate': 3.288246881155614e-05, 'epoch': 1.23} 2%|▏ | 1252/50750 [3:17:11<81:36:29, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 19:59:55,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 19:59:55,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.64 | bwd_microstep: 3860.57 | bwd_inner_microstep: 3853.07 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.15 [2024-11-13 19:59:55,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.64 | bwd: 3860.58 | bwd_inner: 3853.07 | bwd_allreduce: 7.47 | step: 21.16 2%|▏ | 1253/50750 [3:17:17<81:36:58, 5.94s/it] {'loss': 0.1753, 'learning_rate': 3.2908732764281026e-05, 'epoch': 1.23} 2%|▏ | 1253/50750 [3:17:17<81:36:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:00:01,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:00:01,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.00 | bwd_microstep: 3856.89 | bwd_inner_microstep: 3849.07 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.52 [2024-11-13 20:00:01,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.01 | bwd: 3856.90 | bwd_inner: 3849.07 | bwd_allreduce: 7.79 | step: 21.53 2%|▏ | 1254/50750 [3:17:23<81:36:13, 5.94s/it] {'loss': 0.001, 'learning_rate': 3.2934996717005914e-05, 'epoch': 1.24} 2%|▏ | 1254/50750 [3:17:23<81:36:13, 
5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:00:07,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.94 [2024-11-13 20:00:07,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.05 | bwd_microstep: 3851.42 | bwd_inner_microstep: 3843.76 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.18 [2024-11-13 20:00:07,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3851.44 | bwd_inner: 3843.76 | bwd_allreduce: 7.63 | step: 21.18 2%|▏ | 1255/50750 [3:17:29<81:32:46, 5.93s/it] {'loss': 1.3217, 'learning_rate': 3.29612606697308e-05, 'epoch': 1.24} 2%|▏ | 1255/50750 [3:17:29<81:32:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:00:13,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:00:13,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.70 | bwd_microstep: 3854.92 | bwd_inner_microstep: 3847.43 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.88 [2024-11-13 20:00:13,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.70 | bwd: 3854.93 | bwd_inner: 3847.43 | bwd_allreduce: 7.46 | step: 20.88 2%|▏ | 1256/50750 [3:17:35<81:32:33, 5.93s/it] {'loss': 1.2539, 'learning_rate': 3.298752462245568e-05, 'epoch': 1.24} 2%|▏ | 1256/50750 [3:17:35<81:32:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:00:19,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.94 [2024-11-13 20:00:19,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.65 | bwd_microstep: 3860.89 | bwd_inner_microstep: 3853.00 | bwd_allreduce_microstep: 7.85 | 
step_microstep: 21.95 [2024-11-13 20:00:19,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.65 | bwd: 3860.91 | bwd_inner: 3853.00 | bwd_allreduce: 7.87 | step: 21.96 2%|▏ | 1257/50750 [3:17:41<81:34:09, 5.93s/it] {'loss': 0.0111, 'learning_rate': 3.301378857518057e-05, 'epoch': 1.24} 2%|▏ | 1257/50750 [3:17:41<81:34:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:00:25,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:00:25,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.63 | bwd_microstep: 3862.71 | bwd_inner_microstep: 3855.20 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.99 [2024-11-13 20:00:25,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.62 | bwd: 3862.72 | bwd_inner: 3855.20 | bwd_allreduce: 7.48 | step: 21.00 2%|▏ | 1258/50750 [3:17:47<81:35:37, 5.94s/it] {'loss': 0.1893, 'learning_rate': 3.3040052527905455e-05, 'epoch': 1.24} 2%|▏ | 1258/50750 [3:17:47<81:35:37, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:00:31,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.92 [2024-11-13 20:00:31,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.73 | bwd_microstep: 3866.56 | bwd_inner_microstep: 3858.83 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.69 [2024-11-13 20:00:31,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3866.57 | bwd_inner: 3858.83 | bwd_allreduce: 7.70 | step: 21.70 2%|▏ | 1259/50750 [3:17:53<81:37:22, 5.94s/it] {'loss': 0.229, 'learning_rate': 3.3066316480630336e-05, 'epoch': 1.24} 2%|▏ | 1259/50750 [3:17:53<81:37:22, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 
20:00:37,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-13 20:00:37,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.65 | bwd_microstep: 3857.72 | bwd_inner_microstep: 3849.77 | bwd_allreduce_microstep: 7.90 | step_microstep: 22.35 [2024-11-13 20:00:37,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.65 | bwd: 3857.73 | bwd_inner: 3849.77 | bwd_allreduce: 7.92 | step: 22.36 2%|▏ | 1260/50750 [3:17:59<81:40:47, 5.94s/it] {'loss': 0.0517, 'learning_rate': 3.309258043335522e-05, 'epoch': 1.24} 2%|▏ | 1260/50750 [3:17:59<81:40:47, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:00:43,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:00:43,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2036.98 | bwd_microstep: 3854.65 | bwd_inner_microstep: 3846.99 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.37 [2024-11-13 20:00:43,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2036.96 | bwd: 3854.67 | bwd_inner: 3846.99 | bwd_allreduce: 7.64 | step: 21.38 2%|▏ | 1261/50750 [3:18:04<81:41:54, 5.94s/it] {'loss': 0.2432, 'learning_rate': 3.311884438608011e-05, 'epoch': 1.24} 2%|▏ | 1261/50750 [3:18:04<81:41:54, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:00:48,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-13 20:00:48,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.05 | bwd_microstep: 3861.55 | bwd_inner_microstep: 3853.84 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.03 [2024-11-13 20:00:48,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2028.03 | bwd: 3861.56 | bwd_inner: 3853.84 | bwd_allreduce: 7.69 | step: 22.03 2%|▏ | 1262/50750 [3:18:10<81:41:40, 5.94s/it] {'loss': 0.0296, 'learning_rate': 3.3145108338805e-05, 'epoch': 1.24} 2%|▏ | 1262/50750 [3:18:10<81:41:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:00:54,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:00:54,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.20 | bwd_microstep: 3851.21 | bwd_inner_microstep: 3843.76 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.86 [2024-11-13 20:00:54,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.19 | bwd: 3851.23 | bwd_inner: 3843.76 | bwd_allreduce: 7.43 | step: 20.86 2%|▏ | 1263/50750 [3:18:16<81:36:51, 5.94s/it] {'loss': 0.5489, 'learning_rate': 3.317137229152988e-05, 'epoch': 1.24} 2%|▏ | 1263/50750 [3:18:16<81:36:51, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:01:00,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:01:00,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.34 | bwd_microstep: 3853.62 | bwd_inner_microstep: 3846.08 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.15 [2024-11-13 20:01:00,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.34 | bwd: 3853.63 | bwd_inner: 3846.08 | bwd_allreduce: 7.52 | step: 21.15 2%|▏ | 1264/50750 [3:18:22<81:34:37, 5.93s/it] {'loss': 0.0117, 'learning_rate': 3.3197636244254765e-05, 'epoch': 1.25} 2%|▏ | 1264/50750 [3:18:22<81:34:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:01:06,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 20:01:06,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.93 | bwd_microstep: 3857.17 | bwd_inner_microstep: 3849.66 | bwd_allreduce_microstep: 7.47 | step_microstep: 22.11 [2024-11-13 20:01:06,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.93 | bwd: 3857.18 | bwd_inner: 3849.66 | bwd_allreduce: 7.49 | step: 22.12 2%|▏ | 1265/50750 [3:18:28<81:34:52, 5.93s/it] {'loss': 0.324, 'learning_rate': 3.3223900196979645e-05, 'epoch': 1.25} 2%|▏ | 1265/50750 [3:18:28<81:34:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:01:12,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:01:12,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.57 | bwd_microstep: 3857.93 | bwd_inner_microstep: 3850.40 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.92 [2024-11-13 20:01:12,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.57 | bwd: 3857.94 | bwd_inner: 3850.40 | bwd_allreduce: 7.50 | step: 20.93 2%|▏ | 1266/50750 [3:18:34<81:35:02, 5.94s/it] {'loss': 0.0058, 'learning_rate': 3.325016414970453e-05, 'epoch': 1.25} 2%|▏ | 1266/50750 [3:18:34<81:35:02, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:01:18,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:01:18,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3858.64 | bwd_inner_microstep: 3851.13 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-13 20:01:18,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.88 | bwd: 3858.65 | bwd_inner: 3851.13 | bwd_allreduce: 7.48 | step: 21.07 2%|▏ | 
1267/50750 [3:18:40<81:33:35, 5.93s/it] {'loss': 0.0671, 'learning_rate': 3.327642810242942e-05, 'epoch': 1.25} 2%|▏ | 1267/50750 [3:18:40<81:33:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:01:24,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 20:01:24,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.30 | bwd_microstep: 3856.27 | bwd_inner_microstep: 3848.75 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.25 [2024-11-13 20:01:24,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.30 | bwd: 3856.28 | bwd_inner: 3848.75 | bwd_allreduce: 7.50 | step: 21.25 2%|▏ | 1268/50750 [3:18:46<81:32:23, 5.93s/it] {'loss': 0.062, 'learning_rate': 3.33026920551543e-05, 'epoch': 1.25} 2%|▏ | 1268/50750 [3:18:46<81:32:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:01:30,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 20:01:30,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.96 | bwd_microstep: 3867.31 | bwd_inner_microstep: 3859.79 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.03 [2024-11-13 20:01:30,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.94 | bwd: 3867.32 | bwd_inner: 3859.79 | bwd_allreduce: 7.49 | step: 21.03 3%|▎ | 1269/50750 [3:18:52<81:36:44, 5.94s/it] {'loss': 0.6382, 'learning_rate': 3.332895600787919e-05, 'epoch': 1.25} 3%|▎ | 1269/50750 [3:18:52<81:36:44, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:01:36,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.94 [2024-11-13 20:01:36,407] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.26 | bwd_microstep: 3851.46 | bwd_inner_microstep: 3843.94 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.04 [2024-11-13 20:01:36,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.26 | bwd: 3851.47 | bwd_inner: 3843.94 | bwd_allreduce: 7.49 | step: 21.04 3%|▎ | 1270/50750 [3:18:58<81:33:27, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.3355219960604074e-05, 'epoch': 1.25} 3%|▎ | 1270/50750 [3:18:58<81:33:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:01:42,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 20:01:42,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.01 | bwd_microstep: 3850.06 | bwd_inner_microstep: 3842.53 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.18 [2024-11-13 20:01:42,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.01 | bwd: 3850.07 | bwd_inner: 3842.53 | bwd_allreduce: 7.50 | step: 21.18 3%|▎ | 1271/50750 [3:19:04<81:32:13, 5.93s/it] {'loss': 0.4427, 'learning_rate': 3.338148391332896e-05, 'epoch': 1.25} 3%|▎ | 1271/50750 [3:19:04<81:32:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:01:48,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:01:48,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.51 | bwd_microstep: 3853.16 | bwd_inner_microstep: 3845.62 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.03 [2024-11-13 20:01:48,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.51 | bwd: 3853.17 | bwd_inner: 3845.62 | bwd_allreduce: 7.51 | step: 21.04 3%|▎ | 1272/50750 [3:19:10<81:31:46, 5.93s/it] {'loss': 0.0049, 'learning_rate': 
3.340774786605384e-05, 'epoch': 1.25} 3%|▎ | 1272/50750 [3:19:10<81:31:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 20:01:54,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:01:54,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.95 | bwd_microstep: 3856.43 | bwd_inner_microstep: 3848.90 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.10 [2024-11-13 20:01:54,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.95 | bwd: 3856.44 | bwd_inner: 3848.90 | bwd_allreduce: 7.50 | step: 21.10 3%|▎ | 1273/50750 [3:19:16<81:31:25, 5.93s/it] {'loss': 0.4561, 'learning_rate': 3.343401181877873e-05, 'epoch': 1.25} 3%|▎ | 1273/50750 [3:19:16<81:31:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:02:00,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:02:00,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.60 | bwd_microstep: 3858.18 | bwd_inner_microstep: 3850.69 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.15 [2024-11-13 20:02:00,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3858.20 | bwd_inner: 3850.69 | bwd_allreduce: 7.47 | step: 21.15 3%|▎ | 1274/50750 [3:19:22<81:31:06, 5.93s/it] {'loss': 0.092, 'learning_rate': 3.3460275771503616e-05, 'epoch': 1.26} 3%|▎ | 1274/50750 [3:19:22<81:31:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:02:06,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.75 | optimizer_step: 4.93 [2024-11-13 20:02:06,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.95 | bwd_microstep: 
3855.59 | bwd_inner_microstep: 3847.55 | bwd_allreduce_microstep: 7.97 | step_microstep: 27.88 [2024-11-13 20:02:06,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.95 | bwd: 3855.61 | bwd_inner: 3847.55 | bwd_allreduce: 8.00 | step: 27.90 3%|▎ | 1275/50750 [3:19:28<81:31:34, 5.93s/it] {'loss': 0.0703, 'learning_rate': 3.3486539724228496e-05, 'epoch': 1.26} 3%|▎ | 1275/50750 [3:19:28<81:31:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:02:11,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:02:11,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.16 | bwd_microstep: 3856.92 | bwd_inner_microstep: 3849.34 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.11 [2024-11-13 20:02:11,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.16 | bwd: 3856.93 | bwd_inner: 3849.34 | bwd_allreduce: 7.55 | step: 21.11 3%|▎ | 1276/50750 [3:19:33<81:31:03, 5.93s/it] {'loss': 0.0241, 'learning_rate': 3.3512803676953384e-05, 'epoch': 1.26} 3%|▎ | 1276/50750 [3:19:33<81:31:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:02:17,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:02:17,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.74 | bwd_microstep: 3851.13 | bwd_inner_microstep: 3843.34 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.52 [2024-11-13 20:02:17,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.74 | bwd: 3851.14 | bwd_inner: 3843.34 | bwd_allreduce: 7.76 | step: 21.52 3%|▎ | 1277/50750 [3:19:39<81:31:01, 5.93s/it] {'loss': 0.157, 'learning_rate': 3.353906762967827e-05, 'epoch': 1.26} 3%|▎ | 1277/50750 [3:19:39<81:31:01, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:02:23,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:02:23,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.22 | bwd_microstep: 3853.97 | bwd_inner_microstep: 3846.46 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.01 [2024-11-13 20:02:23,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.21 | bwd: 3853.98 | bwd_inner: 3846.46 | bwd_allreduce: 7.48 | step: 21.02 3%|▎ | 1278/50750 [3:19:45<81:30:16, 5.93s/it] {'loss': 0.6591, 'learning_rate': 3.356533158240316e-05, 'epoch': 1.26} 3%|▎ | 1278/50750 [3:19:45<81:30:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:02:29,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:02:29,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.44 | bwd_microstep: 3845.87 | bwd_inner_microstep: 3838.37 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.06 [2024-11-13 20:02:29,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.43 | bwd: 3845.88 | bwd_inner: 3838.37 | bwd_allreduce: 7.48 | step: 21.06 3%|▎ | 1279/50750 [3:19:51<81:27:37, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.359159553512804e-05, 'epoch': 1.26} 3%|▎ | 1279/50750 [3:19:51<81:27:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:02:35,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:02:35,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.81 | bwd_microstep: 3857.17 | bwd_inner_microstep: 3849.65 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 
20:02:35,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.81 | bwd: 3857.19 | bwd_inner: 3849.65 | bwd_allreduce: 7.50 | step: 20.96 3%|▎ | 1280/50750 [3:19:57<81:27:56, 5.93s/it] {'loss': 0.0075, 'learning_rate': 3.3617859487852925e-05, 'epoch': 1.26} 3%|▎ | 1280/50750 [3:19:57<81:27:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:02:41,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 20:02:41,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3849.84 | bwd_inner_microstep: 3842.24 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.37 [2024-11-13 20:02:41,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.03 | bwd: 3849.85 | bwd_inner: 3842.24 | bwd_allreduce: 7.57 | step: 21.38 3%|▎ | 1281/50750 [3:20:03<81:26:22, 5.93s/it] {'loss': 0.1075, 'learning_rate': 3.364412344057781e-05, 'epoch': 1.26} 3%|▎ | 1281/50750 [3:20:03<81:26:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:02:47,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:02:47,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.29 | bwd_microstep: 3856.12 | bwd_inner_microstep: 3848.57 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.42 [2024-11-13 20:02:47,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.28 | bwd: 3856.13 | bwd_inner: 3848.57 | bwd_allreduce: 7.52 | step: 21.42 3%|▎ | 1282/50750 [3:20:09<81:28:34, 5.93s/it] {'loss': 0.0946, 'learning_rate': 3.36703873933027e-05, 'epoch': 1.26} 3%|▎ | 1282/50750 [3:20:09<81:28:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:02:53,504] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:02:53,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.00 | bwd_microstep: 3862.30 | bwd_inner_microstep: 3854.79 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-13 20:02:53,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.99 | bwd: 3862.31 | bwd_inner: 3854.79 | bwd_allreduce: 7.48 | step: 21.07 3%|▎ | 1283/50750 [3:20:15<81:31:29, 5.93s/it] {'loss': 0.0184, 'learning_rate': 3.369665134602758e-05, 'epoch': 1.26} 3%|▎ | 1283/50750 [3:20:15<81:31:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:02:59,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:02:59,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.17 | bwd_microstep: 3853.13 | bwd_inner_microstep: 3845.37 | bwd_allreduce_microstep: 7.71 | step_microstep: 24.82 [2024-11-13 20:02:59,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.17 | bwd: 3853.15 | bwd_inner: 3845.37 | bwd_allreduce: 7.73 | step: 24.81 3%|▎ | 1284/50750 [3:20:21<81:31:26, 5.93s/it] {'loss': 0.006, 'learning_rate': 3.372291529875246e-05, 'epoch': 1.27} 3%|▎ | 1284/50750 [3:20:21<81:31:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:03:05,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 20:03:05,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.68 | bwd_microstep: 3860.89 | bwd_inner_microstep: 3853.07 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.81 [2024-11-13 20:03:05,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.68 | bwd: 3860.91 
| bwd_inner: 3853.07 | bwd_allreduce: 7.80 | step: 21.81 3%|▎ | 1285/50750 [3:20:27<81:33:51, 5.94s/it] {'loss': 0.1, 'learning_rate': 3.374917925147735e-05, 'epoch': 1.27} 3%|▎ | 1285/50750 [3:20:27<81:33:51, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:03:11,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-13 20:03:11,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.16 | bwd_microstep: 3858.92 | bwd_inner_microstep: 3851.06 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.62 [2024-11-13 20:03:11,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3858.93 | bwd_inner: 3851.06 | bwd_allreduce: 7.83 | step: 21.63 3%|▎ | 1286/50750 [3:20:33<81:34:12, 5.94s/it] {'loss': 0.0034, 'learning_rate': 3.3775443204202235e-05, 'epoch': 1.27} 3%|▎ | 1286/50750 [3:20:33<81:34:12, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:03:17,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 20:03:17,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.45 | bwd_microstep: 3854.26 | bwd_inner_microstep: 3846.74 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.97 [2024-11-13 20:03:17,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.44 | bwd: 3854.27 | bwd_inner: 3846.74 | bwd_allreduce: 7.49 | step: 20.97 3%|▎ | 1287/50750 [3:20:39<81:33:42, 5.94s/it] {'loss': 0.4254, 'learning_rate': 3.380170715692712e-05, 'epoch': 1.27} 3%|▎ | 1287/50750 [3:20:39<81:33:42, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:03:23,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | 
optimizer_step: 4.93 [2024-11-13 20:03:23,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.40 | bwd_microstep: 3863.12 | bwd_inner_microstep: 3855.31 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.52 [2024-11-13 20:03:23,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.40 | bwd: 3863.13 | bwd_inner: 3855.31 | bwd_allreduce: 7.78 | step: 21.52 3%|▎ | 1288/50750 [3:20:45<81:33:16, 5.94s/it] {'loss': 0.0039, 'learning_rate': 3.3827971109652e-05, 'epoch': 1.27} 3%|▎ | 1288/50750 [3:20:45<81:33:16, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:03:29,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 20:03:29,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.71 | bwd_microstep: 3861.25 | bwd_inner_microstep: 3853.51 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.62 [2024-11-13 20:03:29,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.70 | bwd: 3861.26 | bwd_inner: 3853.51 | bwd_allreduce: 7.70 | step: 21.63 3%|▎ | 1289/50750 [3:20:51<81:33:40, 5.94s/it] {'loss': 0.0145, 'learning_rate': 3.385423506237689e-05, 'epoch': 1.27} 3%|▎ | 1289/50750 [3:20:51<81:33:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:03:35,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:03:35,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.17 | bwd_microstep: 3854.40 | bwd_inner_microstep: 3846.88 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98 [2024-11-13 20:03:35,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.16 | bwd: 3854.41 | bwd_inner: 3846.88 | bwd_allreduce: 7.49 | step: 20.98 3%|▎ | 1290/50750 [3:20:57<81:34:31, 5.94s/it] 
{'loss': 0.7065, 'learning_rate': 3.388049901510178e-05, 'epoch': 1.27} 3%|▎ | 1290/50750 [3:20:57<81:34:31, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:03:40,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 20:03:40,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.13 | bwd_microstep: 3862.09 | bwd_inner_microstep: 3854.60 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.96 [2024-11-13 20:03:40,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.13 | bwd: 3862.10 | bwd_inner: 3854.60 | bwd_allreduce: 7.47 | step: 20.96 3%|▎ | 1291/50750 [3:21:02<81:33:07, 5.94s/it] {'loss': 0.0141, 'learning_rate': 3.390676296782666e-05, 'epoch': 1.27} 3%|▎ | 1291/50750 [3:21:02<81:33:07, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:03:46,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 20:03:46,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.58 | bwd_microstep: 3860.80 | bwd_inner_microstep: 3853.29 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.76 [2024-11-13 20:03:46,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.56 | bwd: 3860.81 | bwd_inner: 3853.29 | bwd_allreduce: 7.48 | step: 21.76 3%|▎ | 1292/50750 [3:21:08<81:34:03, 5.94s/it] {'loss': 0.0035, 'learning_rate': 3.3933026920551544e-05, 'epoch': 1.27} 3%|▎ | 1292/50750 [3:21:08<81:34:03, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:03:52,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 20:03:52,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2034.66 | bwd_microstep: 3860.02 | bwd_inner_microstep: 3852.49 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.06 [2024-11-13 20:03:52,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.64 | bwd: 3860.03 | bwd_inner: 3852.49 | bwd_allreduce: 7.50 | step: 21.06 3%|▎ | 1293/50750 [3:21:14<81:35:59, 5.94s/it] {'loss': 0.5036, 'learning_rate': 3.395929087327643e-05, 'epoch': 1.27} 3%|▎ | 1293/50750 [3:21:14<81:35:59, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:03:58,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 20:03:58,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.08 | bwd_microstep: 3855.56 | bwd_inner_microstep: 3847.84 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.71 [2024-11-13 20:03:58,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.08 | bwd: 3855.57 | bwd_inner: 3847.84 | bwd_allreduce: 7.69 | step: 21.71 3%|▎ | 1294/50750 [3:21:20<81:33:41, 5.94s/it] {'loss': 0.0395, 'learning_rate': 3.398555482600132e-05, 'epoch': 1.27} 3%|▎ | 1294/50750 [3:21:20<81:33:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:04:04,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 20:04:04,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.94 | bwd_microstep: 3854.10 | bwd_inner_microstep: 3846.20 | bwd_allreduce_microstep: 7.85 | step_microstep: 21.13 [2024-11-13 20:04:04,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.93 | bwd: 3854.11 | bwd_inner: 3846.20 | bwd_allreduce: 7.87 | step: 21.13 3%|▎ | 1295/50750 [3:21:26<81:31:55, 5.94s/it] {'loss': 1.0706, 'learning_rate': 3.40118187787262e-05, 'epoch': 1.28} 3%|▎ | 1295/50750 
[3:21:26<81:31:55, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:04:10,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:04:10,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.57 | bwd_microstep: 3858.59 | bwd_inner_microstep: 3851.07 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.23 [2024-11-13 20:04:10,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.57 | bwd: 3858.60 | bwd_inner: 3851.07 | bwd_allreduce: 7.49 | step: 21.24 3%|▎ | 1296/50750 [3:21:32<81:30:39, 5.93s/it] {'loss': 0.1896, 'learning_rate': 3.4038082731451086e-05, 'epoch': 1.28} 3%|▎ | 1296/50750 [3:21:32<81:30:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:04:16,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:04:16,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.32 | bwd_microstep: 3850.89 | bwd_inner_microstep: 3843.21 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.23 [2024-11-13 20:04:16,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.32 | bwd: 3850.91 | bwd_inner: 3843.21 | bwd_allreduce: 7.66 | step: 21.23 3%|▎ | 1297/50750 [3:21:38<81:27:56, 5.93s/it] {'loss': 0.2165, 'learning_rate': 3.406434668417597e-05, 'epoch': 1.28} 3%|▎ | 1297/50750 [3:21:38<81:27:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:04:22,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:04:22,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1993.86 | bwd_microstep: 3794.17 | bwd_inner_microstep: 3786.64 | 
bwd_allreduce_microstep: 7.48 | step_microstep: 20.98 [2024-11-13 20:04:22,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1993.86 | bwd: 3794.18 | bwd_inner: 3786.64 | bwd_allreduce: 7.50 | step: 20.98 3%|▎ | 1298/50750 [3:21:44<81:04:13, 5.90s/it] {'loss': 0.0014, 'learning_rate': 3.409061063690086e-05, 'epoch': 1.28} 3%|▎ | 1298/50750 [3:21:44<81:04:13, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:04:28,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 20:04:28,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.44 | bwd_microstep: 3861.65 | bwd_inner_microstep: 3854.14 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.02 [2024-11-13 20:04:28,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.44 | bwd: 3861.66 | bwd_inner: 3854.14 | bwd_allreduce: 7.49 | step: 21.03 3%|▎ | 1299/50750 [3:21:50<81:13:34, 5.91s/it] {'loss': 0.2114, 'learning_rate': 3.411687458962574e-05, 'epoch': 1.28} 3%|▎ | 1299/50750 [3:21:50<81:13:34, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:04:34,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 20:04:34,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.99 | bwd_microstep: 3855.53 | bwd_inner_microstep: 3847.97 | bwd_allreduce_microstep: 7.52 | step_microstep: 20.90 [2024-11-13 20:04:34,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.99 | bwd: 3855.54 | bwd_inner: 3847.97 | bwd_allreduce: 7.53 | step: 20.90 3%|▎ | 1300/50750 [3:21:56<81:17:35, 5.92s/it] {'loss': 0.4447, 'learning_rate': 3.414313854235062e-05, 'epoch': 1.28} 3%|▎ | 1300/50750 [3:21:56<81:17:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-13 20:04:40,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:04:40,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.39 | bwd_microstep: 3862.33 | bwd_inner_microstep: 3854.59 | bwd_allreduce_microstep: 7.69 | step_microstep: 23.58 [2024-11-13 20:04:40,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.38 | bwd: 3862.35 | bwd_inner: 3854.59 | bwd_allreduce: 7.71 | step: 23.58 3%|▎ | 1301/50750 [3:22:02<81:24:24, 5.93s/it] {'loss': 0.5964, 'learning_rate': 3.4169402495075515e-05, 'epoch': 1.28} 3%|▎ | 1301/50750 [3:22:02<81:24:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:04:46,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:04:46,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.14 | bwd_microstep: 3862.30 | bwd_inner_microstep: 3854.03 | bwd_allreduce_microstep: 8.19 | step_microstep: 21.95 [2024-11-13 20:04:46,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.14 | bwd: 3862.33 | bwd_inner: 3854.03 | bwd_allreduce: 8.22 | step: 21.94 3%|▎ | 1302/50750 [3:22:08<81:27:33, 5.93s/it] {'loss': 0.6764, 'learning_rate': 3.4195666447800396e-05, 'epoch': 1.28} 3%|▎ | 1302/50750 [3:22:08<81:27:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:04:52,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 20:04:52,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.81 | bwd_microstep: 3859.62 | bwd_inner_microstep: 3851.93 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.71 [2024-11-13 20:04:52,124] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.79 | bwd: 3859.63 | bwd_inner: 3851.93 | bwd_allreduce: 7.66 | step: 21.72 3%|▎ | 1303/50750 [3:22:14<81:28:41, 5.93s/it] {'loss': 0.7853, 'learning_rate': 3.422193040052528e-05, 'epoch': 1.28} 3%|▎ | 1303/50750 [3:22:14<81:28:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:04:58,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 20:04:58,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.83 | bwd_microstep: 3862.17 | bwd_inner_microstep: 3854.49 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.34 [2024-11-13 20:04:58,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.81 | bwd: 3862.18 | bwd_inner: 3854.49 | bwd_allreduce: 7.65 | step: 21.34 3%|▎ | 1304/50750 [3:22:20<81:30:18, 5.93s/it] {'loss': 0.019, 'learning_rate': 3.424819435325016e-05, 'epoch': 1.28} 3%|▎ | 1304/50750 [3:22:20<81:30:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:05:04,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 20:05:04,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.12 | bwd_microstep: 3863.80 | bwd_inner_microstep: 3856.19 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.88 [2024-11-13 20:05:04,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.12 | bwd: 3863.81 | bwd_inner: 3856.19 | bwd_allreduce: 7.58 | step: 21.88 3%|▎ | 1305/50750 [3:22:25<81:32:45, 5.94s/it] {'loss': 0.18, 'learning_rate': 3.427445830597505e-05, 'epoch': 1.29} 3%|▎ | 1305/50750 [3:22:25<81:32:45, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:05:09,947] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 20:05:09,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.05 | bwd_microstep: 3856.50 | bwd_inner_microstep: 3848.75 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.43 [2024-11-13 20:05:09,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.04 | bwd: 3856.51 | bwd_inner: 3848.75 | bwd_allreduce: 7.71 | step: 21.43 3%|▎ | 1306/50750 [3:22:31<81:33:10, 5.94s/it] {'loss': 0.013, 'learning_rate': 3.430072225869994e-05, 'epoch': 1.29} 3%|▎ | 1306/50750 [3:22:31<81:33:10, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:05:15,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.33 | optimizer_step: 4.93 [2024-11-13 20:05:15,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.14 | bwd_microstep: 3859.72 | bwd_inner_microstep: 3851.67 | bwd_allreduce_microstep: 8.00 | step_microstep: 22.62 [2024-11-13 20:05:15,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.13 | bwd: 3859.74 | bwd_inner: 3851.67 | bwd_allreduce: 8.02 | step: 22.63 3%|▎ | 1307/50750 [3:22:37<81:34:56, 5.94s/it] {'loss': 0.2722, 'learning_rate': 3.4326986211424825e-05, 'epoch': 1.29} 3%|▎ | 1307/50750 [3:22:37<81:34:56, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:05:21,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 20:05:21,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.91 | bwd_microstep: 3865.23 | bwd_inner_microstep: 3857.51 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.81 [2024-11-13 20:05:21,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.89 | bwd: 3865.25 | bwd_inner: 3857.51 | bwd_allreduce: 
7.69 | step: 21.82 3%|▎ | 1308/50750 [3:22:43<81:37:08, 5.94s/it] {'loss': 0.1883, 'learning_rate': 3.4353250164149705e-05, 'epoch': 1.29} 3%|▎ | 1308/50750 [3:22:43<81:37:08, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:05:27,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:05:27,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.87 | bwd_microstep: 3863.22 | bwd_inner_microstep: 3855.67 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.67 [2024-11-13 20:05:27,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.86 | bwd: 3863.23 | bwd_inner: 3855.67 | bwd_allreduce: 7.52 | step: 21.67 3%|▎ | 1309/50750 [3:22:49<81:37:49, 5.94s/it] {'loss': 0.0854, 'learning_rate': 3.437951411687459e-05, 'epoch': 1.29} 3%|▎ | 1309/50750 [3:22:49<81:37:49, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:05:33,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:05:33,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.28 | bwd_microstep: 3856.79 | bwd_inner_microstep: 3849.24 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.22 [2024-11-13 20:05:33,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.28 | bwd: 3856.80 | bwd_inner: 3849.24 | bwd_allreduce: 7.52 | step: 21.23 3%|▎ | 1310/50750 [3:22:55<81:34:36, 5.94s/it] {'loss': 0.3981, 'learning_rate': 3.440577806959948e-05, 'epoch': 1.29} 3%|▎ | 1310/50750 [3:22:55<81:34:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:05:39,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.08 [2024-11-13 
20:05:39,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.27 | bwd_microstep: 3856.72 | bwd_inner_microstep: 3849.18 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.65 [2024-11-13 20:05:39,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.26 | bwd: 3856.73 | bwd_inner: 3849.18 | bwd_allreduce: 7.51 | step: 21.65 3%|▎ | 1311/50750 [3:23:01<81:33:42, 5.94s/it] {'loss': 0.0155, 'learning_rate': 3.443204202232436e-05, 'epoch': 1.29} 3%|▎ | 1311/50750 [3:23:01<81:33:42, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:05:45,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:05:45,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.06 | bwd_microstep: 3860.44 | bwd_inner_microstep: 3852.92 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.03 [2024-11-13 20:05:45,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.06 | bwd: 3860.45 | bwd_inner: 3852.92 | bwd_allreduce: 7.49 | step: 21.03 3%|▎ | 1312/50750 [3:23:07<81:31:44, 5.94s/it] {'loss': 0.0166, 'learning_rate': 3.445830597504925e-05, 'epoch': 1.29} 3%|▎ | 1312/50750 [3:23:07<81:31:44, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:05:51,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 20:05:51,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.84 | bwd_microstep: 3858.37 | bwd_inner_microstep: 3850.84 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-13 20:05:51,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.82 | bwd: 3858.39 | bwd_inner: 3850.84 | bwd_allreduce: 7.51 | step: 21.11 3%|▎ | 1313/50750 [3:23:13<81:31:41, 5.94s/it] {'loss': 0.0243, 
'learning_rate': 3.4484569927774134e-05, 'epoch': 1.29} 3%|▎ | 1313/50750 [3:23:13<81:31:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:05:57,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.92 [2024-11-13 20:05:57,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | bwd_microstep: 3858.57 | bwd_inner_microstep: 3850.92 | bwd_allreduce_microstep: 7.61 | step_microstep: 22.05 [2024-11-13 20:05:57,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3858.58 | bwd_inner: 3850.92 | bwd_allreduce: 7.63 | step: 22.05 3%|▎ | 1314/50750 [3:23:19<81:30:33, 5.94s/it] {'loss': 0.0084, 'learning_rate': 3.451083388049902e-05, 'epoch': 1.29} 3%|▎ | 1314/50750 [3:23:19<81:30:33, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:06:03,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.52 | optimizer_step: 4.93 [2024-11-13 20:06:03,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3858.34 | bwd_inner_microstep: 3850.60 | bwd_allreduce_microstep: 7.69 | step_microstep: 31.10 [2024-11-13 20:06:03,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.15 | bwd: 3858.36 | bwd_inner: 3850.60 | bwd_allreduce: 7.71 | step: 31.09 3%|▎ | 1315/50750 [3:23:25<81:32:19, 5.94s/it] {'loss': 0.0018, 'learning_rate': 3.45370978332239e-05, 'epoch': 1.3} 3%|▎ | 1315/50750 [3:23:25<81:32:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:06:09,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:06:09,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.48 | 
bwd_microstep: 3861.32 | bwd_inner_microstep: 3853.41 | bwd_allreduce_microstep: 7.86 | step_microstep: 21.71 [2024-11-13 20:06:09,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.47 | bwd: 3861.33 | bwd_inner: 3853.41 | bwd_allreduce: 7.88 | step: 21.71 3%|▎ | 1316/50750 [3:23:31<81:31:22, 5.94s/it] {'loss': 0.0172, 'learning_rate': 3.456336178594879e-05, 'epoch': 1.3} 3%|▎ | 1316/50750 [3:23:31<81:31:22, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:06:15,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 20:06:15,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.59 | bwd_microstep: 3855.37 | bwd_inner_microstep: 3847.49 | bwd_allreduce_microstep: 7.83 | step_microstep: 21.64 [2024-11-13 20:06:15,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.59 | bwd: 3855.38 | bwd_inner: 3847.49 | bwd_allreduce: 7.85 | step: 21.64 3%|▎ | 1317/50750 [3:23:37<81:31:37, 5.94s/it] {'loss': 0.005, 'learning_rate': 3.4589625738673676e-05, 'epoch': 1.3} 3%|▎ | 1317/50750 [3:23:37<81:31:37, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:06:21,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 20:06:21,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.96 | bwd_microstep: 3852.32 | bwd_inner_microstep: 3844.85 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.85 [2024-11-13 20:06:21,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.96 | bwd: 3852.33 | bwd_inner: 3844.85 | bwd_allreduce: 7.45 | step: 20.86 3%|▎ | 1318/50750 [3:23:43<81:28:31, 5.93s/it] {'loss': 0.6516, 'learning_rate': 3.4615889691398556e-05, 'epoch': 1.3} 3%|▎ | 1318/50750 [3:23:43<81:28:31, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:06:27,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:06:27,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.36 | bwd_microstep: 3853.57 | bwd_inner_microstep: 3845.99 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.15 [2024-11-13 20:06:27,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.36 | bwd: 3853.58 | bwd_inner: 3845.99 | bwd_allreduce: 7.55 | step: 21.16 3%|▎ | 1319/50750 [3:23:49<81:27:07, 5.93s/it] {'loss': 0.1957, 'learning_rate': 3.464215364412344e-05, 'epoch': 1.3} 3%|▎ | 1319/50750 [3:23:49<81:27:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:06:33,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.35 | optimizer_step: 4.93 [2024-11-13 20:06:33,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.93 | bwd_microstep: 3854.80 | bwd_inner_microstep: 3846.54 | bwd_allreduce_microstep: 8.21 | step_microstep: 22.22 [2024-11-13 20:06:33,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.93 | bwd: 3854.81 | bwd_inner: 3846.54 | bwd_allreduce: 8.23 | step: 22.23 3%|▎ | 1320/50750 [3:23:55<81:27:10, 5.93s/it] {'loss': 0.8461, 'learning_rate': 3.466841759684833e-05, 'epoch': 1.3} 3%|▎ | 1320/50750 [3:23:55<81:27:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:06:38,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:06:38,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.96 | bwd_microstep: 3851.28 | bwd_inner_microstep: 3843.52 | bwd_allreduce_microstep: 7.70 | 
step_microstep: 22.44 [2024-11-13 20:06:38,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.95 | bwd: 3851.30 | bwd_inner: 3843.52 | bwd_allreduce: 7.72 | step: 22.44 3%|▎ | 1321/50750 [3:24:00<81:27:29, 5.93s/it] {'loss': 0.2148, 'learning_rate': 3.469468154957322e-05, 'epoch': 1.3} 3%|▎ | 1321/50750 [3:24:00<81:27:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:06:44,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 5.09 [2024-11-13 20:06:44,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.03 | bwd_microstep: 3847.46 | bwd_inner_microstep: 3840.00 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.90 [2024-11-13 20:06:44,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.03 | bwd: 3847.47 | bwd_inner: 3840.00 | bwd_allreduce: 7.43 | step: 20.90 3%|▎ | 1322/50750 [3:24:06<81:24:16, 5.93s/it] {'loss': 0.0095, 'learning_rate': 3.47209455022981e-05, 'epoch': 1.3} 3%|▎ | 1322/50750 [3:24:06<81:24:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:06:50,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 20:06:50,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | bwd_microstep: 3845.76 | bwd_inner_microstep: 3838.31 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.78 [2024-11-13 20:06:50,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.92 | bwd: 3845.77 | bwd_inner: 3838.31 | bwd_allreduce: 7.42 | step: 20.78 3%|▎ | 1323/50750 [3:24:12<81:20:56, 5.93s/it] {'loss': 0.0027, 'learning_rate': 3.4747209455022985e-05, 'epoch': 1.3} 3%|▎ | 1323/50750 [3:24:12<81:20:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
20:06:56,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:06:56,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.02 | bwd_microstep: 3848.82 | bwd_inner_microstep: 3841.34 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.92 [2024-11-13 20:06:56,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.02 | bwd: 3848.84 | bwd_inner: 3841.34 | bwd_allreduce: 7.46 | step: 20.92 3%|▎ | 1324/50750 [3:24:18<81:20:04, 5.92s/it] {'loss': 0.0373, 'learning_rate': 3.4773473407747866e-05, 'epoch': 1.3} 3%|▎ | 1324/50750 [3:24:18<81:20:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:07:02,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:07:02,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.38 | bwd_microstep: 3851.47 | bwd_inner_microstep: 3843.99 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.87 [2024-11-13 20:07:02,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.38 | bwd: 3851.48 | bwd_inner: 3843.99 | bwd_allreduce: 7.45 | step: 20.88 3%|▎ | 1325/50750 [3:24:24<81:19:06, 5.92s/it] {'loss': 0.064, 'learning_rate': 3.479973736047276e-05, 'epoch': 1.31} 3%|▎ | 1325/50750 [3:24:24<81:19:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:07:08,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:07:08,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3850.91 | bwd_inner_microstep: 3843.43 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-13 20:07:08,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2024.03 | bwd: 3850.92 | bwd_inner: 3843.43 | bwd_allreduce: 7.45 | step: 20.88 3%|▎ | 1326/50750 [3:24:30<81:18:33, 5.92s/it] {'loss': 0.2255, 'learning_rate': 3.482600131319764e-05, 'epoch': 1.31} 3%|▎ | 1326/50750 [3:24:30<81:18:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:07:14,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 20:07:14,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.88 | bwd_microstep: 3847.46 | bwd_inner_microstep: 3839.86 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.43 [2024-11-13 20:07:14,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.88 | bwd: 3847.47 | bwd_inner: 3839.86 | bwd_allreduce: 7.57 | step: 21.44 3%|▎ | 1327/50750 [3:24:36<81:19:11, 5.92s/it] {'loss': 0.1667, 'learning_rate': 3.485226526592252e-05, 'epoch': 1.31} 3%|▎ | 1327/50750 [3:24:36<81:19:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:07:20,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 20:07:20,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.41 | bwd_microstep: 3849.71 | bwd_inner_microstep: 3842.13 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.35 [2024-11-13 20:07:20,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.40 | bwd: 3849.72 | bwd_inner: 3842.13 | bwd_allreduce: 7.55 | step: 21.35 3%|▎ | 1328/50750 [3:24:42<81:20:15, 5.92s/it] {'loss': 0.0731, 'learning_rate': 3.487852921864741e-05, 'epoch': 1.31} 3%|▎ | 1328/50750 [3:24:42<81:20:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:07:26,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:07:26,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.04 | bwd_microstep: 3869.57 | bwd_inner_microstep: 3861.82 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.42 [2024-11-13 20:07:26,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.02 | bwd: 3869.59 | bwd_inner: 3861.82 | bwd_allreduce: 7.72 | step: 21.42 3%|▎ | 1329/50750 [3:24:48<81:28:25, 5.93s/it] {'loss': 0.007, 'learning_rate': 3.4904793171372295e-05, 'epoch': 1.31} 3%|▎ | 1329/50750 [3:24:48<81:28:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:07:32,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.92 [2024-11-13 20:07:32,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.93 | bwd_microstep: 3860.28 | bwd_inner_microstep: 3852.44 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.17 [2024-11-13 20:07:32,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.91 | bwd: 3860.29 | bwd_inner: 3852.43 | bwd_allreduce: 7.81 | step: 22.17 3%|▎ | 1330/50750 [3:24:54<81:33:39, 5.94s/it] {'loss': 0.0177, 'learning_rate': 3.493105712409718e-05, 'epoch': 1.31} 3%|▎ | 1330/50750 [3:24:54<81:33:39, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:07:38,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 20:07:38,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.07 | bwd_microstep: 3849.39 | bwd_inner_microstep: 3841.87 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.83 [2024-11-13 20:07:38,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.06 | bwd: 3849.40 | bwd_inner: 3841.87 | bwd_allreduce: 7.49 | step: 20.84 3%|▎ | 
1331/50750 [3:25:00<82:05:01, 5.98s/it] {'loss': 0.0586, 'learning_rate': 3.495732107682206e-05, 'epoch': 1.31} 3%|▎ | 1331/50750 [3:25:00<82:05:01, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:07:44,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:07:44,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.87 | bwd_microstep: 3844.34 | bwd_inner_microstep: 3836.81 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.02 [2024-11-13 20:07:44,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.86 | bwd: 3844.35 | bwd_inner: 3836.81 | bwd_allreduce: 7.50 | step: 21.02 3%|▎ | 1332/50750 [3:25:06<81:49:21, 5.96s/it] {'loss': 0.2677, 'learning_rate': 3.498358502954695e-05, 'epoch': 1.31} 3%|▎ | 1332/50750 [3:25:06<81:49:21, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:07:50,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:07:50,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3845.67 | bwd_inner_microstep: 3838.20 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.98 [2024-11-13 20:07:50,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.22 | bwd: 3845.68 | bwd_inner: 3838.20 | bwd_allreduce: 7.44 | step: 20.98 3%|▎ | 1333/50750 [3:25:12<81:38:57, 5.95s/it] {'loss': 0.0011, 'learning_rate': 3.5009848982271836e-05, 'epoch': 1.31} 3%|▎ | 1333/50750 [3:25:12<81:38:57, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:07:56,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:07:56,187] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.38 | bwd_microstep: 3848.22 | bwd_inner_microstep: 3840.56 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.37 [2024-11-13 20:07:56,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.38 | bwd: 3848.24 | bwd_inner: 3840.56 | bwd_allreduce: 7.63 | step: 21.38 3%|▎ | 1334/50750 [3:25:18<81:32:04, 5.94s/it] {'loss': 0.0058, 'learning_rate': 3.503611293499672e-05, 'epoch': 1.31} 3%|▎ | 1334/50750 [3:25:18<81:32:04, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:08:02,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:08:02,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.72 | bwd_microstep: 3853.62 | bwd_inner_microstep: 3846.10 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.19 [2024-11-13 20:08:02,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.71 | bwd: 3853.63 | bwd_inner: 3846.10 | bwd_allreduce: 7.49 | step: 21.20 3%|▎ | 1335/50750 [3:25:24<81:29:56, 5.94s/it] {'loss': 0.0156, 'learning_rate': 3.5062376887721604e-05, 'epoch': 1.32} 3%|▎ | 1335/50750 [3:25:24<81:29:56, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:08:08,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 20:08:08,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.11 | bwd_microstep: 3855.49 | bwd_inner_microstep: 3847.93 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.86 [2024-11-13 20:08:08,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.11 | bwd: 3855.51 | bwd_inner: 3847.93 | bwd_allreduce: 7.54 | step: 21.86 3%|▎ | 1336/50750 [3:25:30<81:28:57, 5.94s/it] {'loss': 1.01, 'learning_rate': 
3.508864084044649e-05, 'epoch': 1.32} 3%|▎ | 1336/50750 [3:25:30<81:28:57, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:08:13,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:08:13,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.85 | bwd_microstep: 3864.19 | bwd_inner_microstep: 3856.64 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.19 [2024-11-13 20:08:13,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.85 | bwd: 3864.21 | bwd_inner: 3856.64 | bwd_allreduce: 7.52 | step: 21.19 3%|▎ | 1337/50750 [3:25:35<81:30:07, 5.94s/it] {'loss': 0.0149, 'learning_rate': 3.511490479317138e-05, 'epoch': 1.32} 3%|▎ | 1337/50750 [3:25:35<81:30:07, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:08:19,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:08:19,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.07 | bwd_microstep: 3857.23 | bwd_inner_microstep: 3849.70 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.29 [2024-11-13 20:08:19,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.08 | bwd: 3857.24 | bwd_inner: 3849.70 | bwd_allreduce: 7.50 | step: 21.29 3%|▎ | 1338/50750 [3:25:41<81:28:34, 5.94s/it] {'loss': 0.1364, 'learning_rate': 3.514116874589626e-05, 'epoch': 1.32} 3%|▎ | 1338/50750 [3:25:41<81:28:34, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:08:25,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:08:25,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.63 | bwd_microstep: 
3861.41 | bwd_inner_microstep: 3853.88 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.93 [2024-11-13 20:08:25,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.63 | bwd: 3861.42 | bwd_inner: 3853.88 | bwd_allreduce: 7.50 | step: 20.93 3%|▎ | 1339/50750 [3:25:47<81:27:42, 5.94s/it] {'loss': 0.0077, 'learning_rate': 3.5167432698621146e-05, 'epoch': 1.32} 3%|▎ | 1339/50750 [3:25:47<81:27:42, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:08:31,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 20:08:31,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.13 | bwd_microstep: 3857.60 | bwd_inner_microstep: 3849.77 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.57 [2024-11-13 20:08:31,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.11 | bwd: 3857.62 | bwd_inner: 3849.77 | bwd_allreduce: 7.80 | step: 21.57 3%|▎ | 1340/50750 [3:25:53<81:27:24, 5.93s/it] {'loss': 0.0026, 'learning_rate': 3.519369665134603e-05, 'epoch': 1.32} 3%|▎ | 1340/50750 [3:25:53<81:27:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:08:37,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 20:08:37,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.91 | bwd_microstep: 3859.80 | bwd_inner_microstep: 3852.27 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-13 20:08:37,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.90 | bwd: 3859.81 | bwd_inner: 3852.27 | bwd_allreduce: 7.51 | step: 21.21 3%|▎ | 1341/50750 [3:25:59<81:28:38, 5.94s/it] {'loss': 0.0039, 'learning_rate': 3.521996060407092e-05, 'epoch': 1.32} 3%|▎ | 1341/50750 [3:25:59<81:28:38, 5.94s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:08:43,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 20:08:43,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1994.28 | bwd_microstep: 3792.81 | bwd_inner_microstep: 3784.91 | bwd_allreduce_microstep: 7.85 | step_microstep: 21.70 [2024-11-13 20:08:43,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1994.28 | bwd: 3792.82 | bwd_inner: 3784.91 | bwd_allreduce: 7.87 | step: 21.71 3%|▎ | 1342/50750 [3:26:05<81:03:45, 5.91s/it] {'loss': 0.6299, 'learning_rate': 3.52462245567958e-05, 'epoch': 1.32} 3%|▎ | 1342/50750 [3:26:05<81:03:45, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:08:49,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.74 | optimizer_step: 4.93 [2024-11-13 20:08:49,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.37 | bwd_microstep: 3858.26 | bwd_inner_microstep: 3850.76 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.50 [2024-11-13 20:08:49,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.33 | bwd: 3858.27 | bwd_inner: 3850.76 | bwd_allreduce: 7.47 | step: 22.53 3%|▎ | 1343/50750 [3:26:11<81:11:37, 5.92s/it] {'loss': 0.0552, 'learning_rate': 3.527248850952068e-05, 'epoch': 1.32} 3%|▎ | 1343/50750 [3:26:11<81:11:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:08:55,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.35 | optimizer_step: 4.92 [2024-11-13 20:08:55,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.22 | bwd_microstep: 3857.16 | bwd_inner_microstep: 3848.99 | bwd_allreduce_microstep: 8.11 | step_microstep: 26.29 [2024-11-13 
20:08:55,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.22 | bwd: 3857.19 | bwd_inner: 3848.99 | bwd_allreduce: 8.14 | step: 26.29 3%|▎ | 1344/50750 [3:26:17<81:18:22, 5.92s/it] {'loss': 0.0038, 'learning_rate': 3.529875246224557e-05, 'epoch': 1.32} 3%|▎ | 1344/50750 [3:26:17<81:18:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:09:01,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.37 | optimizer_step: 4.93 [2024-11-13 20:09:01,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.67 | bwd_microstep: 3854.64 | bwd_inner_microstep: 3846.92 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.99 [2024-11-13 20:09:01,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.66 | bwd: 3854.66 | bwd_inner: 3846.92 | bwd_allreduce: 7.70 | step: 22.00 3%|▎ | 1345/50750 [3:26:23<81:20:24, 5.93s/it] {'loss': 0.4145, 'learning_rate': 3.5325016414970455e-05, 'epoch': 1.33} 3%|▎ | 1345/50750 [3:26:23<81:20:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:09:07,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:09:07,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.46 | bwd_microstep: 3853.32 | bwd_inner_microstep: 3845.72 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.05 [2024-11-13 20:09:07,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.44 | bwd: 3853.33 | bwd_inner: 3845.72 | bwd_allreduce: 7.57 | step: 21.05 3%|▎ | 1346/50750 [3:26:29<81:21:05, 5.93s/it] {'loss': 0.6102, 'learning_rate': 3.535128036769534e-05, 'epoch': 1.33} 3%|▎ | 1346/50750 [3:26:29<81:21:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:09:13,246] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 20:09:13,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.60 | bwd_microstep: 3854.13 | bwd_inner_microstep: 3846.42 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.64 [2024-11-13 20:09:13,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.60 | bwd: 3854.15 | bwd_inner: 3846.42 | bwd_allreduce: 7.68 | step: 22.64 3%|▎ | 1347/50750 [3:26:35<81:21:52, 5.93s/it] {'loss': 0.0087, 'learning_rate': 3.537754432042022e-05, 'epoch': 1.33} 3%|▎ | 1347/50750 [3:26:35<81:21:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:09:19,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 20:09:19,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.95 | bwd_microstep: 3856.02 | bwd_inner_microstep: 3848.32 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.43 [2024-11-13 20:09:19,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.95 | bwd: 3856.03 | bwd_inner: 3848.32 | bwd_allreduce: 7.67 | step: 21.44 3%|▎ | 1348/50750 [3:26:41<81:22:05, 5.93s/it] {'loss': 0.0039, 'learning_rate': 3.540380827314511e-05, 'epoch': 1.33} 3%|▎ | 1348/50750 [3:26:41<81:22:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:09:25,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 5.11 [2024-11-13 20:09:25,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.18 | bwd_microstep: 3856.89 | bwd_inner_microstep: 3848.81 | bwd_allreduce_microstep: 8.02 | step_microstep: 30.25 [2024-11-13 20:09:25,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.16 | bwd: 3856.91 
| bwd_inner: 3848.81 | bwd_allreduce: 8.05 | step: 30.25 3%|▎ | 1349/50750 [3:26:47<81:28:47, 5.94s/it] {'loss': 0.4585, 'learning_rate': 3.543007222587e-05, 'epoch': 1.33} 3%|▎ | 1349/50750 [3:26:47<81:28:47, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:09:31,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:09:31,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.84 | bwd_microstep: 3847.27 | bwd_inner_microstep: 3839.73 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.94 [2024-11-13 20:09:31,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.84 | bwd: 3847.28 | bwd_inner: 3839.73 | bwd_allreduce: 7.51 | step: 20.94 3%|▎ | 1350/50750 [3:26:53<81:25:08, 5.93s/it] {'loss': 0.0046, 'learning_rate': 3.5456336178594884e-05, 'epoch': 1.33} 3%|▎ | 1350/50750 [3:26:53<81:25:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:09:36,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 20:09:36,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.63 | bwd_microstep: 3856.73 | bwd_inner_microstep: 3849.22 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.99 [2024-11-13 20:09:36,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.64 | bwd: 3856.74 | bwd_inner: 3849.22 | bwd_allreduce: 7.48 | step: 20.99 3%|▎ | 1351/50750 [3:26:58<81:23:31, 5.93s/it] {'loss': 0.4608, 'learning_rate': 3.5482600131319765e-05, 'epoch': 1.33} 3%|▎ | 1351/50750 [3:26:58<81:23:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:09:42,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | 
optimizer_step: 4.92 [2024-11-13 20:09:42,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.68 | bwd_microstep: 3858.77 | bwd_inner_microstep: 3851.28 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.10 [2024-11-13 20:09:42,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.68 | bwd: 3858.78 | bwd_inner: 3851.28 | bwd_allreduce: 7.46 | step: 21.10 3%|▎ | 1352/50750 [3:27:04<81:25:01, 5.93s/it] {'loss': 0.0283, 'learning_rate': 3.550886408404465e-05, 'epoch': 1.33} 3%|▎ | 1352/50750 [3:27:04<81:25:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:09:48,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:09:48,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.46 | bwd_microstep: 3854.39 | bwd_inner_microstep: 3846.87 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 20:09:48,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.46 | bwd: 3854.40 | bwd_inner: 3846.87 | bwd_allreduce: 7.49 | step: 21.12 3%|▎ | 1353/50750 [3:27:10<81:24:23, 5.93s/it] {'loss': 0.5789, 'learning_rate': 3.553512803676954e-05, 'epoch': 1.33} 3%|▎ | 1353/50750 [3:27:10<81:24:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:09:54,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 20:09:54,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.05 | bwd_microstep: 3847.21 | bwd_inner_microstep: 3839.54 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.56 [2024-11-13 20:09:54,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.05 | bwd: 3847.23 | bwd_inner: 3839.54 | bwd_allreduce: 7.64 | step: 21.56 3%|▎ | 1354/50750 [3:27:16<81:21:59, 
5.93s/it] {'loss': 0.0036, 'learning_rate': 3.556139198949442e-05, 'epoch': 1.33} 3%|▎ | 1354/50750 [3:27:16<81:21:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:10:00,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:10:00,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.82 | bwd_microstep: 3851.24 | bwd_inner_microstep: 3843.72 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.23 [2024-11-13 20:10:00,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.81 | bwd: 3851.26 | bwd_inner: 3843.72 | bwd_allreduce: 7.49 | step: 21.23 3%|▎ | 1355/50750 [3:27:22<81:21:44, 5.93s/it] {'loss': 0.0086, 'learning_rate': 3.5587655942219306e-05, 'epoch': 1.33} 3%|▎ | 1355/50750 [3:27:22<81:21:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:10:06,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:10:06,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.90 | bwd_microstep: 3861.29 | bwd_inner_microstep: 3853.60 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.17 [2024-11-13 20:10:06,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.90 | bwd: 3861.31 | bwd_inner: 3853.60 | bwd_allreduce: 7.65 | step: 21.17 3%|▎ | 1356/50750 [3:27:28<81:23:54, 5.93s/it] {'loss': 0.0692, 'learning_rate': 3.5613919894944194e-05, 'epoch': 1.34} 3%|▎ | 1356/50750 [3:27:28<81:23:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:10:12,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.92 [2024-11-13 20:10:12,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2028.62 | bwd_microstep: 3855.50 | bwd_inner_microstep: 3847.72 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.68 [2024-11-13 20:10:12,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.62 | bwd: 3855.51 | bwd_inner: 3847.72 | bwd_allreduce: 7.75 | step: 21.68 3%|▎ | 1357/50750 [3:27:34<81:24:28, 5.93s/it] {'loss': 0.0273, 'learning_rate': 3.564018384766908e-05, 'epoch': 1.34} 3%|▎ | 1357/50750 [3:27:34<81:24:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:10:18,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:10:18,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.15 | bwd_microstep: 3858.64 | bwd_inner_microstep: 3851.12 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 20:10:18,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.13 | bwd: 3858.66 | bwd_inner: 3851.12 | bwd_allreduce: 7.49 | step: 20.96 3%|▎ | 1358/50750 [3:27:40<81:25:53, 5.94s/it] {'loss': 0.0079, 'learning_rate': 3.566644780039396e-05, 'epoch': 1.34} 3%|▎ | 1358/50750 [3:27:40<81:25:53, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:10:24,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.50 | optimizer_step: 4.93 [2024-11-13 20:10:24,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.95 | bwd_microstep: 3855.14 | bwd_inner_microstep: 3847.34 | bwd_allreduce_microstep: 7.75 | step_microstep: 23.03 [2024-11-13 20:10:24,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.95 | bwd: 3855.16 | bwd_inner: 3847.34 | bwd_allreduce: 7.77 | step: 23.03 3%|▎ | 1359/50750 [3:27:46<81:24:18, 5.93s/it] {'loss': 0.511, 'learning_rate': 3.569271175311885e-05, 'epoch': 1.34} 3%|▎ | 1359/50750 
[3:27:46<81:24:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:10:30,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:10:30,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.04 | bwd_microstep: 3864.73 | bwd_inner_microstep: 3857.19 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.35 [2024-11-13 20:10:30,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.04 | bwd: 3864.74 | bwd_inner: 3857.19 | bwd_allreduce: 7.51 | step: 21.36 3%|▎ | 1360/50750 [3:27:52<81:26:41, 5.94s/it] {'loss': 0.3887, 'learning_rate': 3.5718975705843735e-05, 'epoch': 1.34} 3%|▎ | 1360/50750 [3:27:52<81:26:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:10:36,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 20:10:36,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.94 | bwd_microstep: 3855.69 | bwd_inner_microstep: 3847.86 | bwd_allreduce_microstep: 7.76 | step_microstep: 27.57 [2024-11-13 20:10:36,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.94 | bwd: 3855.71 | bwd_inner: 3847.86 | bwd_allreduce: 7.79 | step: 27.58 3%|▎ | 1361/50750 [3:27:58<81:27:31, 5.94s/it] {'loss': 0.0181, 'learning_rate': 3.5745239658568616e-05, 'epoch': 1.34} 3%|▎ | 1361/50750 [3:27:58<81:27:31, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:10:42,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:10:42,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.64 | bwd_microstep: 3855.90 | bwd_inner_microstep: 3848.37 | 
bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 20:10:42,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.62 | bwd: 3855.91 | bwd_inner: 3848.37 | bwd_allreduce: 7.50 | step: 21.13 3%|▎ | 1362/50750 [3:28:04<81:26:54, 5.94s/it] {'loss': 0.1881, 'learning_rate': 3.57715036112935e-05, 'epoch': 1.34} 3%|▎ | 1362/50750 [3:28:04<81:26:54, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:10:48,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 20:10:48,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.37 | bwd_microstep: 3858.96 | bwd_inner_microstep: 3850.69 | bwd_allreduce_microstep: 8.23 | step_microstep: 21.04 [2024-11-13 20:10:48,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.37 | bwd: 3858.97 | bwd_inner: 3850.69 | bwd_allreduce: 8.24 | step: 21.04 3%|▎ | 1363/50750 [3:28:10<81:25:03, 5.93s/it] {'loss': 0.3037, 'learning_rate': 3.579776756401838e-05, 'epoch': 1.34} 3%|▎ | 1363/50750 [3:28:10<81:25:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:10:54,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.94 [2024-11-13 20:10:54,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.98 | bwd_microstep: 3858.63 | bwd_inner_microstep: 3851.09 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.89 [2024-11-13 20:10:54,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.98 | bwd: 3858.64 | bwd_inner: 3851.09 | bwd_allreduce: 7.51 | step: 21.90 3%|▎ | 1364/50750 [3:28:16<81:23:54, 5.93s/it] {'loss': 0.0083, 'learning_rate': 3.582403151674328e-05, 'epoch': 1.34} 3%|▎ | 1364/50750 [3:28:16<81:23:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-13 20:11:00,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:11:00,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3859.10 | bwd_inner_microstep: 3851.40 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.49 [2024-11-13 20:11:00,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.17 | bwd: 3859.11 | bwd_inner: 3851.40 | bwd_allreduce: 7.67 | step: 21.49 3%|▎ | 1365/50750 [3:28:22<81:23:48, 5.93s/it] {'loss': 0.255, 'learning_rate': 3.585029546946816e-05, 'epoch': 1.34} 3%|▎ | 1365/50750 [3:28:22<81:23:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:11:05,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 20:11:05,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.98 | bwd_microstep: 3858.09 | bwd_inner_microstep: 3850.58 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.89 [2024-11-13 20:11:05,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.96 | bwd: 3858.10 | bwd_inner: 3850.58 | bwd_allreduce: 7.48 | step: 20.90 3%|▎ | 1366/50750 [3:28:27<81:22:55, 5.93s/it] {'loss': 0.0247, 'learning_rate': 3.5876559422193045e-05, 'epoch': 1.35} 3%|▎ | 1366/50750 [3:28:27<81:22:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:11:11,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 20:11:11,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.89 | bwd_microstep: 3861.34 | bwd_inner_microstep: 3853.74 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.45 [2024-11-13 20:11:11,925] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.89 | bwd: 3861.35 | bwd_inner: 3853.74 | bwd_allreduce: 7.57 | step: 21.46 3%|▎ | 1367/50750 [3:28:33<81:22:39, 5.93s/it] {'loss': 0.0235, 'learning_rate': 3.5902823374917925e-05, 'epoch': 1.35} 3%|▎ | 1367/50750 [3:28:33<81:22:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:11:17,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 20:11:17,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.70 | bwd_microstep: 3859.25 | bwd_inner_microstep: 3851.56 | bwd_allreduce_microstep: 7.64 | step_microstep: 22.29 [2024-11-13 20:11:17,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.69 | bwd: 3859.26 | bwd_inner: 3851.56 | bwd_allreduce: 7.66 | step: 22.29 3%|▎ | 1368/50750 [3:28:39<81:24:03, 5.93s/it] {'loss': 0.0198, 'learning_rate': 3.592908732764281e-05, 'epoch': 1.35} 3%|▎ | 1368/50750 [3:28:39<81:24:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:11:23,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.63 | optimizer_step: 4.93 [2024-11-13 20:11:23,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.34 | bwd_microstep: 3859.96 | bwd_inner_microstep: 3851.92 | bwd_allreduce_microstep: 7.98 | step_microstep: 29.58 [2024-11-13 20:11:23,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.33 | bwd: 3859.98 | bwd_inner: 3851.92 | bwd_allreduce: 8.00 | step: 29.58 3%|▎ | 1369/50750 [3:28:45<81:28:27, 5.94s/it] {'loss': 0.0298, 'learning_rate': 3.59553512803677e-05, 'epoch': 1.35} 3%|▎ | 1369/50750 [3:28:45<81:28:27, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:11:29,750] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:11:29,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3862.45 | bwd_inner_microstep: 3854.94 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.05 [2024-11-13 20:11:29,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.88 | bwd: 3862.46 | bwd_inner: 3854.94 | bwd_allreduce: 7.48 | step: 21.05 3%|▎ | 1370/50750 [3:28:51<81:26:55, 5.94s/it] {'loss': 0.2086, 'learning_rate': 3.598161523309258e-05, 'epoch': 1.35} 3%|▎ | 1370/50750 [3:28:51<81:26:55, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:11:35,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:11:35,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.38 | bwd_microstep: 3854.53 | bwd_inner_microstep: 3846.98 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.27 [2024-11-13 20:11:35,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.38 | bwd: 3854.54 | bwd_inner: 3846.98 | bwd_allreduce: 7.53 | step: 21.27 3%|▎ | 1371/50750 [3:28:57<81:24:55, 5.94s/it] {'loss': 0.0527, 'learning_rate': 3.600787918581747e-05, 'epoch': 1.35} 3%|▎ | 1371/50750 [3:28:57<81:24:55, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:11:41,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:11:41,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.27 | bwd_microstep: 3860.38 | bwd_inner_microstep: 3852.83 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.52 [2024-11-13 20:11:41,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.27 | bwd: 3860.39 | bwd_inner: 3852.83 | 
bwd_allreduce: 7.52 | step: 21.52 3%|▎ | 1372/50750 [3:29:03<81:24:53, 5.94s/it] {'loss': 0.0025, 'learning_rate': 3.6034143138542354e-05, 'epoch': 1.35} 3%|▎ | 1372/50750 [3:29:03<81:24:53, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:11:47,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 [2024-11-13 20:11:47,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.16 | bwd_microstep: 3850.06 | bwd_inner_microstep: 3842.41 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.95 [2024-11-13 20:11:47,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3850.07 | bwd_inner: 3842.41 | bwd_allreduce: 7.62 | step: 21.95 3%|▎ | 1373/50750 [3:29:09<81:22:48, 5.93s/it] {'loss': 0.0614, 'learning_rate': 3.606040709126724e-05, 'epoch': 1.35} 3%|▎ | 1373/50750 [3:29:09<81:22:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:11:53,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:11:53,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.24 | bwd_microstep: 3859.47 | bwd_inner_microstep: 3851.94 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.89 [2024-11-13 20:11:53,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.23 | bwd: 3859.48 | bwd_inner: 3851.94 | bwd_allreduce: 7.50 | step: 20.89 3%|▎ | 1374/50750 [3:29:15<81:22:57, 5.93s/it] {'loss': 0.297, 'learning_rate': 3.608667104399212e-05, 'epoch': 1.35} 3%|▎ | 1374/50750 [3:29:15<81:22:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:11:59,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 
[2024-11-13 20:11:59,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.32 | bwd_microstep: 3859.64 | bwd_inner_microstep: 3852.14 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.11 [2024-11-13 20:11:59,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.32 | bwd: 3859.65 | bwd_inner: 3852.14 | bwd_allreduce: 7.47 | step: 21.11 3%|▎ | 1375/50750 [3:29:21<81:21:53, 5.93s/it] {'loss': 0.6988, 'learning_rate': 3.611293499671701e-05, 'epoch': 1.35} 3%|▎ | 1375/50750 [3:29:21<81:21:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:12:05,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:12:05,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.41 | bwd_microstep: 3851.45 | bwd_inner_microstep: 3843.98 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.80 [2024-11-13 20:12:05,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.41 | bwd: 3851.46 | bwd_inner: 3843.98 | bwd_allreduce: 7.45 | step: 20.81 3%|▎ | 1376/50750 [3:29:27<81:19:20, 5.93s/it] {'loss': 0.0414, 'learning_rate': 3.6139198949441896e-05, 'epoch': 1.36} 3%|▎ | 1376/50750 [3:29:27<81:19:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:12:11,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 20:12:11,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.83 | bwd_microstep: 3853.73 | bwd_inner_microstep: 3846.25 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-13 20:12:11,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3853.74 | bwd_inner: 3846.25 | bwd_allreduce: 7.45 | step: 20.98 3%|▎ | 1377/50750 [3:29:33<81:18:11, 5.93s/it] {'loss': 0.0746, 
'learning_rate': 3.6165462902166776e-05, 'epoch': 1.36} 3%|▎ | 1377/50750 [3:29:33<81:18:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:12:17,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 20:12:17,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3847.49 | bwd_inner_microstep: 3839.95 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.45 [2024-11-13 20:12:17,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3847.50 | bwd_inner: 3839.95 | bwd_allreduce: 7.51 | step: 21.46 3%|▎ | 1378/50750 [3:29:39<81:15:57, 5.93s/it] {'loss': 0.1295, 'learning_rate': 3.6191726854891664e-05, 'epoch': 1.36} 3%|▎ | 1378/50750 [3:29:39<81:15:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:12:23,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.92 [2024-11-13 20:12:23,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.56 | bwd_microstep: 3856.80 | bwd_inner_microstep: 3849.26 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.74 [2024-11-13 20:12:23,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.56 | bwd: 3856.81 | bwd_inner: 3849.26 | bwd_allreduce: 7.51 | step: 21.74 3%|▎ | 1379/50750 [3:29:45<81:16:50, 5.93s/it] {'loss': 0.0339, 'learning_rate': 3.621799080761655e-05, 'epoch': 1.36} 3%|▎ | 1379/50750 [3:29:45<81:16:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:12:29,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 20:12:29,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.84 | 
bwd_microstep: 3846.65 | bwd_inner_microstep: 3839.14 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.20 [2024-11-13 20:12:29,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.84 | bwd: 3846.66 | bwd_inner: 3839.14 | bwd_allreduce: 7.48 | step: 21.21 3%|▎ | 1380/50750 [3:29:50<81:15:15, 5.92s/it] {'loss': 0.6607, 'learning_rate': 3.624425476034144e-05, 'epoch': 1.36} 3%|▎ | 1380/50750 [3:29:50<81:15:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:12:34,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:12:34,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.17 | bwd_microstep: 3848.39 | bwd_inner_microstep: 3840.89 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-13 20:12:34,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3848.41 | bwd_inner: 3840.89 | bwd_allreduce: 7.47 | step: 21.09 3%|▎ | 1381/50750 [3:29:56<81:14:10, 5.92s/it] {'loss': 0.1675, 'learning_rate': 3.627051871306632e-05, 'epoch': 1.36} 3%|▎ | 1381/50750 [3:29:56<81:14:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:12:40,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:12:40,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.77 | bwd_microstep: 3854.80 | bwd_inner_microstep: 3847.30 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.07 [2024-11-13 20:12:40,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.77 | bwd: 3854.82 | bwd_inner: 3847.30 | bwd_allreduce: 7.47 | step: 21.07 3%|▎ | 1382/50750 [3:30:02<81:14:00, 5.92s/it] {'loss': 0.0968, 'learning_rate': 3.6296782665791205e-05, 'epoch': 1.36} 3%|▎ | 1382/50750 [3:30:02<81:14:00, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:12:46,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.92 [2024-11-13 20:12:46,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.09 | bwd_microstep: 3854.07 | bwd_inner_microstep: 3846.25 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.24 [2024-11-13 20:12:46,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.09 | bwd: 3854.09 | bwd_inner: 3846.25 | bwd_allreduce: 7.80 | step: 22.24 3%|▎ | 1383/50750 [3:30:08<81:16:11, 5.93s/it] {'loss': 0.2223, 'learning_rate': 3.6323046618516086e-05, 'epoch': 1.36} 3%|▎ | 1383/50750 [3:30:08<81:16:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:12:52,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.74 | optimizer_step: 4.93 [2024-11-13 20:12:52,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.90 | bwd_microstep: 3849.83 | bwd_inner_microstep: 3841.62 | bwd_allreduce_microstep: 8.14 | step_microstep: 31.78 [2024-11-13 20:12:52,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.89 | bwd: 3849.85 | bwd_inner: 3841.62 | bwd_allreduce: 8.17 | step: 31.78 3%|▎ | 1384/50750 [3:30:14<81:19:44, 5.93s/it] {'loss': 0.0574, 'learning_rate': 3.634931057124098e-05, 'epoch': 1.36} 3%|▎ | 1384/50750 [3:30:14<81:19:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:12:58,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 20:12:58,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.24 | bwd_microstep: 3854.61 | bwd_inner_microstep: 3847.09 | bwd_allreduce_microstep: 7.48 | 
step_microstep: 21.61 [2024-11-13 20:12:58,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.22 | bwd: 3854.63 | bwd_inner: 3847.09 | bwd_allreduce: 7.50 | step: 21.63 3%|▎ | 1385/50750 [3:30:20<81:20:48, 5.93s/it] {'loss': 0.005, 'learning_rate': 3.637557452396586e-05, 'epoch': 1.36} 3%|▎ | 1385/50750 [3:30:20<81:20:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:13:04,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.94 [2024-11-13 20:13:04,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.40 | bwd_microstep: 3848.24 | bwd_inner_microstep: 3840.78 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.95 [2024-11-13 20:13:04,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.40 | bwd: 3848.26 | bwd_inner: 3840.78 | bwd_allreduce: 7.44 | step: 20.95 3%|▎ | 1386/50750 [3:30:26<81:18:31, 5.93s/it] {'loss': 0.0957, 'learning_rate': 3.640183847669074e-05, 'epoch': 1.37} 3%|▎ | 1386/50750 [3:30:26<81:18:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:13:10,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 20:13:10,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.80 | bwd_microstep: 3850.01 | bwd_inner_microstep: 3842.07 | bwd_allreduce_microstep: 7.88 | step_microstep: 23.85 [2024-11-13 20:13:10,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.80 | bwd: 3850.04 | bwd_inner: 3842.07 | bwd_allreduce: 7.90 | step: 23.85 3%|▎ | 1387/50750 [3:30:32<81:17:39, 5.93s/it] {'loss': 0.0238, 'learning_rate': 3.642810242941563e-05, 'epoch': 1.37} 3%|▎ | 1387/50750 [3:30:32<81:17:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
20:13:16,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 5.05 [2024-11-13 20:13:16,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.52 | bwd_microstep: 3853.42 | bwd_inner_microstep: 3845.90 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.47 [2024-11-13 20:13:16,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.50 | bwd: 3853.44 | bwd_inner: 3845.90 | bwd_allreduce: 7.50 | step: 21.47 3%|▎ | 1388/50750 [3:30:38<81:18:48, 5.93s/it] {'loss': 0.0059, 'learning_rate': 3.6454366382140515e-05, 'epoch': 1.37} 3%|▎ | 1388/50750 [3:30:38<81:18:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:13:22,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 20:13:22,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.62 | bwd_microstep: 3850.45 | bwd_inner_microstep: 3842.94 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.15 [2024-11-13 20:13:22,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.60 | bwd: 3850.46 | bwd_inner: 3842.94 | bwd_allreduce: 7.48 | step: 21.16 3%|▎ | 1389/50750 [3:30:44<81:18:14, 5.93s/it] {'loss': 0.0109, 'learning_rate': 3.64806303348654e-05, 'epoch': 1.37} 3%|▎ | 1389/50750 [3:30:44<81:18:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:13:28,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 20:13:28,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.48 | bwd_microstep: 3854.79 | bwd_inner_microstep: 3847.10 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.00 [2024-11-13 20:13:28,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2024.48 | bwd: 3854.81 | bwd_inner: 3847.10 | bwd_allreduce: 7.66 | step: 22.00 3%|▎ | 1390/50750 [3:30:50<81:18:57, 5.93s/it] {'loss': 0.0093, 'learning_rate': 3.650689428759028e-05, 'epoch': 1.37} 3%|▎ | 1390/50750 [3:30:50<81:18:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2193 [2024-11-13 20:13:34,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.93 [2024-11-13 20:13:34,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.84 | bwd_microstep: 3849.79 | bwd_inner_microstep: 3841.53 | bwd_allreduce_microstep: 8.20 | step_microstep: 22.92 [2024-11-13 20:13:34,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.84 | bwd: 3849.80 | bwd_inner: 3841.53 | bwd_allreduce: 8.23 | step: 22.92 3%|▎ | 1391/50750 [3:30:56<81:18:45, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.653315824031517e-05, 'epoch': 1.37} 3%|▎ | 1391/50750 [3:30:56<81:18:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:13:40,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:13:40,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.68 | bwd_microstep: 3853.08 | bwd_inner_microstep: 3845.62 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.21 [2024-11-13 20:13:40,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.66 | bwd: 3853.09 | bwd_inner: 3845.62 | bwd_allreduce: 7.44 | step: 21.22 3%|▎ | 1392/50750 [3:31:02<81:19:27, 5.93s/it] {'loss': 0.0027, 'learning_rate': 3.655942219304006e-05, 'epoch': 1.37} 3%|▎ | 1392/50750 [3:31:02<81:19:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:13:46,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:13:46,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.96 | bwd_microstep: 3850.08 | bwd_inner_microstep: 3842.60 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.81 [2024-11-13 20:13:46,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.96 | bwd: 3850.09 | bwd_inner: 3842.60 | bwd_allreduce: 7.46 | step: 20.81 3%|▎ | 1393/50750 [3:31:08<81:17:28, 5.93s/it] {'loss': 0.003, 'learning_rate': 3.6585686145764944e-05, 'epoch': 1.37} 3%|▎ | 1393/50750 [3:31:08<81:17:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:13:52,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:13:52,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3851.58 | bwd_inner_microstep: 3844.01 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.59 [2024-11-13 20:13:52,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3851.60 | bwd_inner: 3844.01 | bwd_allreduce: 7.55 | step: 21.59 3%|▎ | 1394/50750 [3:31:14<81:17:21, 5.93s/it] {'loss': 0.0011, 'learning_rate': 3.6611950098489824e-05, 'epoch': 1.37} 3%|▎ | 1394/50750 [3:31:14<81:17:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:13:57,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 20:13:57,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.37 | bwd_microstep: 3848.37 | bwd_inner_microstep: 3840.83 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.91 [2024-11-13 20:13:57,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.37 | bwd: 3848.38 | bwd_inner: 3840.83 | bwd_allreduce: 7.51 | step: 20.92 3%|▎ | 
1395/50750 [3:31:19<81:16:49, 5.93s/it] {'loss': 0.1726, 'learning_rate': 3.663821405121471e-05, 'epoch': 1.37} 3%|▎ | 1395/50750 [3:31:19<81:16:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:14:03,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 20:14:03,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.41 | bwd_microstep: 3859.22 | bwd_inner_microstep: 3851.24 | bwd_allreduce_microstep: 7.92 | step_microstep: 22.14 [2024-11-13 20:14:03,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.41 | bwd: 3859.24 | bwd_inner: 3851.24 | bwd_allreduce: 7.95 | step: 22.14 3%|▎ | 1396/50750 [3:31:25<81:19:46, 5.93s/it] {'loss': 0.0032, 'learning_rate': 3.66644780039396e-05, 'epoch': 1.38} 3%|▎ | 1396/50750 [3:31:25<81:19:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:14:09,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:14:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.17 | bwd_microstep: 3852.29 | bwd_inner_microstep: 3844.75 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.99 [2024-11-13 20:14:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.15 | bwd: 3852.31 | bwd_inner: 3844.75 | bwd_allreduce: 7.51 | step: 20.99 3%|▎ | 1397/50750 [3:31:31<81:18:57, 5.93s/it] {'loss': 0.0017, 'learning_rate': 3.669074195666448e-05, 'epoch': 1.38} 3%|▎ | 1397/50750 [3:31:31<81:18:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:14:15,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:14:15,672] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1994.28 | bwd_microstep: 3790.90 | bwd_inner_microstep: 3783.39 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.90 [2024-11-13 20:14:15,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1994.28 | bwd: 3790.92 | bwd_inner: 3783.39 | bwd_allreduce: 7.49 | step: 20.91 3%|▎ | 1398/50750 [3:31:37<80:54:16, 5.90s/it] {'loss': 0.3424, 'learning_rate': 3.6717005909389366e-05, 'epoch': 1.38} 3%|▎ | 1398/50750 [3:31:37<80:54:16, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 20:14:21,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 20:14:21,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.72 | bwd_microstep: 3849.41 | bwd_inner_microstep: 3841.60 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.96 [2024-11-13 20:14:21,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.71 | bwd: 3849.42 | bwd_inner: 3841.60 | bwd_allreduce: 7.77 | step: 21.97 3%|▎ | 1399/50750 [3:31:43<81:02:41, 5.91s/it] {'loss': 0.0093, 'learning_rate': 3.674326986211425e-05, 'epoch': 1.38} 3%|▎ | 1399/50750 [3:31:43<81:02:41, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:14:27,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:14:27,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.62 | bwd_microstep: 3849.12 | bwd_inner_microstep: 3841.48 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.24 [2024-11-13 20:14:27,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.61 | bwd: 3849.14 | bwd_inner: 3841.48 | bwd_allreduce: 7.61 | step: 21.24 3%|▎ | 1400/50750 [3:31:49<81:06:13, 5.92s/it] {'loss': 0.7564, 'learning_rate': 
3.676953381483914e-05, 'epoch': 1.38} 3%|▎ | 1400/50750 [3:31:49<81:06:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 20:14:33,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.92 [2024-11-13 20:14:33,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.48 | bwd_microstep: 3853.77 | bwd_inner_microstep: 3845.64 | bwd_allreduce_microstep: 8.08 | step_microstep: 22.35 [2024-11-13 20:14:33,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.46 | bwd: 3853.79 | bwd_inner: 3845.64 | bwd_allreduce: 8.10 | step: 22.35 3%|▎ | 1401/50750 [3:31:55<81:12:46, 5.92s/it] {'loss': 0.0068, 'learning_rate': 3.679579776756402e-05, 'epoch': 1.38} 3%|▎ | 1401/50750 [3:31:55<81:12:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:14:39,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.37 | optimizer_step: 5.09 [2024-11-13 20:14:39,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.54 | bwd_microstep: 3858.09 | bwd_inner_microstep: 3850.48 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.87 [2024-11-13 20:14:39,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.52 | bwd: 3858.10 | bwd_inner: 3850.48 | bwd_allreduce: 7.58 | step: 21.87 3%|▎ | 1402/50750 [3:32:01<81:16:26, 5.93s/it] {'loss': 0.5366, 'learning_rate': 3.682206172028891e-05, 'epoch': 1.38} 3%|▎ | 1402/50750 [3:32:01<81:16:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 20:14:45,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:14:45,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.46 | bwd_microstep: 
3858.01 | bwd_inner_microstep: 3850.17 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.12 [2024-11-13 20:14:45,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.45 | bwd: 3858.03 | bwd_inner: 3850.17 | bwd_allreduce: 7.81 | step: 22.12 3%|▎ | 1403/50750 [3:32:07<81:18:38, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.684832567301379e-05, 'epoch': 1.38} 3%|▎ | 1403/50750 [3:32:07<81:18:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:14:51,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 20:14:51,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.54 | bwd_microstep: 3848.80 | bwd_inner_microstep: 3841.21 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.17 [2024-11-13 20:14:51,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.54 | bwd: 3848.81 | bwd_inner: 3841.21 | bwd_allreduce: 7.56 | step: 21.18 3%|▎ | 1404/50750 [3:32:13<81:17:22, 5.93s/it] {'loss': 0.0054, 'learning_rate': 3.6874589625738675e-05, 'epoch': 1.38} 3%|▎ | 1404/50750 [3:32:13<81:17:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:14:57,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 20:14:57,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.77 | bwd_microstep: 3858.28 | bwd_inner_microstep: 3850.78 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.14 [2024-11-13 20:14:57,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.75 | bwd: 3858.29 | bwd_inner: 3850.78 | bwd_allreduce: 7.47 | step: 21.14 3%|▎ | 1405/50750 [3:32:19<81:19:26, 5.93s/it] {'loss': 0.0522, 'learning_rate': 3.690085357846356e-05, 'epoch': 1.38} 3%|▎ | 1405/50750 [3:32:19<81:19:26, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:15:03,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:15:03,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.05 | bwd_microstep: 3851.27 | bwd_inner_microstep: 3843.30 | bwd_allreduce_microstep: 7.93 | step_microstep: 21.17 [2024-11-13 20:15:03,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.05 | bwd: 3851.28 | bwd_inner: 3843.30 | bwd_allreduce: 7.94 | step: 21.18 3%|▎ | 1406/50750 [3:32:25<81:18:39, 5.93s/it] {'loss': 0.9077, 'learning_rate': 3.692711753118844e-05, 'epoch': 1.39} 3%|▎ | 1406/50750 [3:32:25<81:18:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:15:09,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.78 | optimizer_step: 4.96 [2024-11-13 20:15:09,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.84 | bwd_microstep: 3851.62 | bwd_inner_microstep: 3844.08 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.95 [2024-11-13 20:15:09,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.84 | bwd: 3851.63 | bwd_inner: 3844.08 | bwd_allreduce: 7.51 | step: 22.97 3%|▎ | 1407/50750 [3:32:31<81:18:10, 5.93s/it] {'loss': 0.1576, 'learning_rate': 3.695338148391333e-05, 'epoch': 1.39} 3%|▎ | 1407/50750 [3:32:31<81:18:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:15:15,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 20:15:15,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.58 | bwd_microstep: 3850.95 | bwd_inner_microstep: 3843.29 | bwd_allreduce_microstep: 7.61 | step_microstep: 22.01 [2024-11-13 
20:15:15,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.58 | bwd: 3850.96 | bwd_inner: 3843.29 | bwd_allreduce: 7.63 | step: 22.01 3%|▎ | 1408/50750 [3:32:36<81:16:24, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.697964543663822e-05, 'epoch': 1.39} 3%|▎ | 1408/50750 [3:32:36<81:16:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:15:20,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 20:15:20,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.24 | bwd_microstep: 3855.06 | bwd_inner_microstep: 3847.52 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.73 [2024-11-13 20:15:20,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.23 | bwd: 3855.07 | bwd_inner: 3847.52 | bwd_allreduce: 7.51 | step: 22.73 3%|▎ | 1409/50750 [3:32:42<81:17:37, 5.93s/it] {'loss': 0.7059, 'learning_rate': 3.7005909389363105e-05, 'epoch': 1.39} 3%|▎ | 1409/50750 [3:32:42<81:17:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-13 20:15:26,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:15:26,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.19 | bwd_microstep: 3852.45 | bwd_inner_microstep: 3844.97 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.96 [2024-11-13 20:15:26,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.19 | bwd: 3852.46 | bwd_inner: 3844.97 | bwd_allreduce: 7.45 | step: 20.97 3%|▎ | 1410/50750 [3:32:48<81:17:15, 5.93s/it] {'loss': 0.787, 'learning_rate': 3.7032173342087985e-05, 'epoch': 1.39} 3%|▎ | 1410/50750 [3:32:48<81:17:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:15:32,806] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 20:15:32,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.90 | bwd_microstep: 3856.88 | bwd_inner_microstep: 3849.40 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.12 [2024-11-13 20:15:32,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.90 | bwd: 3856.89 | bwd_inner: 3849.40 | bwd_allreduce: 7.46 | step: 21.12 3%|▎ | 1411/50750 [3:32:54<81:17:33, 5.93s/it] {'loss': 0.0239, 'learning_rate': 3.705843729481287e-05, 'epoch': 1.39} 3%|▎ | 1411/50750 [3:32:54<81:17:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:15:38,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 20:15:38,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.71 | bwd_microstep: 3860.34 | bwd_inner_microstep: 3852.43 | bwd_allreduce_microstep: 7.86 | step_microstep: 22.03 [2024-11-13 20:15:38,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.71 | bwd: 3860.36 | bwd_inner: 3852.43 | bwd_allreduce: 7.88 | step: 22.03 3%|▎ | 1412/50750 [3:33:00<81:18:14, 5.93s/it] {'loss': 0.2985, 'learning_rate': 3.708470124753776e-05, 'epoch': 1.39} 3%|▎ | 1412/50750 [3:33:00<81:18:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:15:44,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.94 [2024-11-13 20:15:44,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.37 | bwd_microstep: 3847.91 | bwd_inner_microstep: 3840.17 | bwd_allreduce_microstep: 7.69 | step_microstep: 24.19 [2024-11-13 20:15:44,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.36 | bwd: 3847.93 
| bwd_inner: 3840.17 | bwd_allreduce: 7.71 | step: 24.18 3%|▎ | 1413/50750 [3:33:06<81:16:40, 5.93s/it] {'loss': 0.2507, 'learning_rate': 3.711096520026264e-05, 'epoch': 1.39} 3%|▎ | 1413/50750 [3:33:06<81:16:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:15:50,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 20:15:50,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.91 | bwd_microstep: 3848.71 | bwd_inner_microstep: 3841.00 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.17 [2024-11-13 20:15:50,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.91 | bwd: 3848.72 | bwd_inner: 3841.00 | bwd_allreduce: 7.68 | step: 21.17 3%|▎ | 1414/50750 [3:33:12<81:15:04, 5.93s/it] {'loss': 0.0297, 'learning_rate': 3.713722915298753e-05, 'epoch': 1.39} 3%|▎ | 1414/50750 [3:33:12<81:15:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:15:56,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 20:15:56,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.00 | bwd_microstep: 3848.99 | bwd_inner_microstep: 3841.47 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.15 [2024-11-13 20:15:56,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.00 | bwd: 3849.01 | bwd_inner: 3841.47 | bwd_allreduce: 7.50 | step: 21.16 3%|▎ | 1415/50750 [3:33:18<81:12:55, 5.93s/it] {'loss': 0.6038, 'learning_rate': 3.7163493105712414e-05, 'epoch': 1.39} 3%|▎ | 1415/50750 [3:33:18<81:12:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:16:02,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | 
optimizer_step: 4.93 [2024-11-13 20:16:02,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.69 | bwd_microstep: 3851.18 | bwd_inner_microstep: 3843.65 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.02 [2024-11-13 20:16:02,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.69 | bwd: 3851.19 | bwd_inner: 3843.65 | bwd_allreduce: 7.50 | step: 21.03 3%|▎ | 1416/50750 [3:33:24<81:11:38, 5.92s/it] {'loss': 0.3233, 'learning_rate': 3.71897570584373e-05, 'epoch': 1.4} 3%|▎ | 1416/50750 [3:33:24<81:11:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:16:08,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:16:08,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.28 | bwd_microstep: 3851.76 | bwd_inner_microstep: 3844.25 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.12 [2024-11-13 20:16:08,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.28 | bwd: 3851.77 | bwd_inner: 3844.25 | bwd_allreduce: 7.49 | step: 21.13 3%|▎ | 1417/50750 [3:33:30<81:11:47, 5.93s/it] {'loss': 0.4445, 'learning_rate': 3.721602101116218e-05, 'epoch': 1.4} 3%|▎ | 1417/50750 [3:33:30<81:11:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:16:14,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.94 [2024-11-13 20:16:14,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.93 | bwd_microstep: 3847.98 | bwd_inner_microstep: 3840.42 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.18 [2024-11-13 20:16:14,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.93 | bwd: 3847.99 | bwd_inner: 3840.42 | bwd_allreduce: 7.53 | step: 21.19 3%|▎ | 1418/50750 [3:33:36<81:11:16, 5.92s/it] 
{'loss': 0.0488, 'learning_rate': 3.724228496388707e-05, 'epoch': 1.4} 3%|▎ | 1418/50750 [3:33:36<81:11:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:16:20,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.84 | optimizer_step: 4.92 [2024-11-13 20:16:20,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.08 | bwd_microstep: 3852.05 | bwd_inner_microstep: 3844.38 | bwd_allreduce_microstep: 7.62 | step_microstep: 23.03 [2024-11-13 20:16:20,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.08 | bwd: 3852.06 | bwd_inner: 3844.38 | bwd_allreduce: 7.64 | step: 23.05 3%|▎ | 1419/50750 [3:33:42<81:14:10, 5.93s/it] {'loss': 0.0996, 'learning_rate': 3.7268548916611956e-05, 'epoch': 1.4} 3%|▎ | 1419/50750 [3:33:42<81:14:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:16:26,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 20:16:26,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.32 | bwd_microstep: 3848.92 | bwd_inner_microstep: 3841.22 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.56 [2024-11-13 20:16:26,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.31 | bwd: 3848.93 | bwd_inner: 3841.22 | bwd_allreduce: 7.67 | step: 21.57 3%|▎ | 1420/50750 [3:33:48<81:15:35, 5.93s/it] {'loss': 0.0133, 'learning_rate': 3.7294812869336836e-05, 'epoch': 1.4} 3%|▎ | 1420/50750 [3:33:48<81:15:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:16:32,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:16:32,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2027.45 | bwd_microstep: 3854.67 | bwd_inner_microstep: 3846.98 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.34 [2024-11-13 20:16:32,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.44 | bwd: 3854.68 | bwd_inner: 3846.98 | bwd_allreduce: 7.66 | step: 21.34 3%|▎ | 1421/50750 [3:33:54<81:15:48, 5.93s/it] {'loss': 0.002, 'learning_rate': 3.732107682206172e-05, 'epoch': 1.4} 3%|▎ | 1421/50750 [3:33:54<81:15:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:16:38,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 20:16:38,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.60 | bwd_microstep: 3853.59 | bwd_inner_microstep: 3846.09 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.31 [2024-11-13 20:16:38,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.60 | bwd: 3853.60 | bwd_inner: 3846.09 | bwd_allreduce: 7.47 | step: 21.31 3%|▎ | 1422/50750 [3:33:59<81:14:19, 5.93s/it] {'loss': 0.0129, 'learning_rate': 3.7347340774786604e-05, 'epoch': 1.4} 3%|▎ | 1422/50750 [3:33:59<81:14:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:16:43,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:16:43,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.45 | bwd_microstep: 3857.03 | bwd_inner_microstep: 3849.22 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.58 [2024-11-13 20:16:43,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.44 | bwd: 3857.04 | bwd_inner: 3849.22 | bwd_allreduce: 7.78 | step: 21.59 3%|▎ | 1423/50750 [3:34:05<81:16:42, 5.93s/it] {'loss': 0.2091, 'learning_rate': 3.73736047275115e-05, 'epoch': 1.4} 3%|▎ | 1423/50750 
[3:34:05<81:16:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:16:49,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:16:49,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.05 | bwd_microstep: 3851.01 | bwd_inner_microstep: 3843.34 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.62 [2024-11-13 20:16:49,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.03 | bwd: 3851.02 | bwd_inner: 3843.34 | bwd_allreduce: 7.64 | step: 21.62 3%|▎ | 1424/50750 [3:34:11<81:16:05, 5.93s/it] {'loss': 0.02, 'learning_rate': 3.739986868023638e-05, 'epoch': 1.4} 3%|▎ | 1424/50750 [3:34:11<81:16:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:16:55,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:16:55,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.81 | bwd_microstep: 3847.99 | bwd_inner_microstep: 3840.54 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.06 [2024-11-13 20:16:55,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.81 | bwd: 3848.00 | bwd_inner: 3840.54 | bwd_allreduce: 7.43 | step: 21.07 3%|▎ | 1425/50750 [3:34:17<81:13:45, 5.93s/it] {'loss': 0.1131, 'learning_rate': 3.7426132632961265e-05, 'epoch': 1.4} 3%|▎ | 1425/50750 [3:34:17<81:13:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:17:01,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:17:01,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3859.18 | bwd_inner_microstep: 3851.70 | 
bwd_allreduce_microstep: 7.44 | step_microstep: 20.94 [2024-11-13 20:17:01,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3859.19 | bwd_inner: 3851.70 | bwd_allreduce: 7.46 | step: 20.94 3%|▎ | 1426/50750 [3:34:23<81:13:42, 5.93s/it] {'loss': 0.0082, 'learning_rate': 3.7452396585686146e-05, 'epoch': 1.4} 3%|▎ | 1426/50750 [3:34:23<81:13:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:17:07,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 20:17:07,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3852.84 | bwd_inner_microstep: 3845.10 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.26 [2024-11-13 20:17:07,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.31 | bwd: 3852.85 | bwd_inner: 3845.10 | bwd_allreduce: 7.71 | step: 21.26 3%|▎ | 1427/50750 [3:34:29<81:12:34, 5.93s/it] {'loss': 0.0145, 'learning_rate': 3.747866053841103e-05, 'epoch': 1.41} 3%|▎ | 1427/50750 [3:34:29<81:12:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:17:13,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-13 20:17:13,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.83 | bwd_microstep: 3848.21 | bwd_inner_microstep: 3840.46 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.87 [2024-11-13 20:17:13,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.83 | bwd: 3848.22 | bwd_inner: 3840.46 | bwd_allreduce: 7.72 | step: 21.88 3%|▎ | 1428/50750 [3:34:35<81:11:54, 5.93s/it] {'loss': 0.2594, 'learning_rate': 3.750492449113592e-05, 'epoch': 1.41} 3%|▎ | 1428/50750 [3:34:35<81:11:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-13 20:17:19,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.46 | optimizer_step: 4.92 [2024-11-13 20:17:19,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.96 | bwd_microstep: 3853.42 | bwd_inner_microstep: 3845.02 | bwd_allreduce_microstep: 8.34 | step_microstep: 30.26 [2024-11-13 20:17:19,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.95 | bwd: 3853.44 | bwd_inner: 3845.02 | bwd_allreduce: 8.37 | step: 30.26 3%|▎ | 1429/50750 [3:34:41<81:15:55, 5.93s/it] {'loss': 0.083, 'learning_rate': 3.75311884438608e-05, 'epoch': 1.41} 3%|▎ | 1429/50750 [3:34:41<81:15:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:17:25,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-13 20:17:25,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.99 | bwd_microstep: 3851.66 | bwd_inner_microstep: 3843.93 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.66 [2024-11-13 20:17:25,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.97 | bwd: 3851.67 | bwd_inner: 3843.93 | bwd_allreduce: 7.70 | step: 21.66 3%|▎ | 1430/50750 [3:34:47<81:16:05, 5.93s/it] {'loss': 0.0106, 'learning_rate': 3.755745239658569e-05, 'epoch': 1.41} 3%|▎ | 1430/50750 [3:34:47<81:16:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:17:31,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 20:17:31,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.83 | bwd_microstep: 3853.64 | bwd_inner_microstep: 3845.34 | bwd_allreduce_microstep: 8.04 | step_microstep: 24.99 [2024-11-13 20:17:31,391] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.81 | bwd: 3853.66 | bwd_inner: 3845.34 | bwd_allreduce: 8.07 | step: 24.98 3%|▎ | 1431/50750 [3:34:53<81:17:15, 5.93s/it] {'loss': 0.4197, 'learning_rate': 3.7583716349310575e-05, 'epoch': 1.41} 3%|▎ | 1431/50750 [3:34:53<81:17:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:17:37,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 20:17:37,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.41 | bwd_microstep: 3856.13 | bwd_inner_microstep: 3848.61 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.06 [2024-11-13 20:17:37,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.40 | bwd: 3856.14 | bwd_inner: 3848.61 | bwd_allreduce: 7.49 | step: 21.07 3%|▎ | 1432/50750 [3:34:59<81:18:24, 5.94s/it] {'loss': 0.0141, 'learning_rate': 3.760998030203546e-05, 'epoch': 1.41} 3%|▎ | 1432/50750 [3:34:59<81:18:24, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:17:43,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:17:43,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.42 | bwd_microstep: 3852.66 | bwd_inner_microstep: 3845.13 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.30 [2024-11-13 20:17:43,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.42 | bwd: 3852.67 | bwd_inner: 3845.13 | bwd_allreduce: 7.50 | step: 21.30 3%|▎ | 1433/50750 [3:35:05<81:15:30, 5.93s/it] {'loss': 0.0181, 'learning_rate': 3.763624425476034e-05, 'epoch': 1.41} 3%|▎ | 1433/50750 [3:35:05<81:15:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:17:49,176] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:17:49,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.10 | bwd_microstep: 3846.12 | bwd_inner_microstep: 3838.57 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.14 [2024-11-13 20:17:49,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.10 | bwd: 3846.13 | bwd_inner: 3838.57 | bwd_allreduce: 7.52 | step: 21.14 3%|▎ | 1434/50750 [3:35:11<81:12:49, 5.93s/it] {'loss': 0.008, 'learning_rate': 3.766250820748523e-05, 'epoch': 1.41} 3%|▎ | 1434/50750 [3:35:11<81:12:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:17:55,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:17:55,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.14 | bwd_microstep: 3848.80 | bwd_inner_microstep: 3841.07 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.32 [2024-11-13 20:17:55,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.14 | bwd: 3848.81 | bwd_inner: 3841.07 | bwd_allreduce: 7.70 | step: 21.31 3%|▎ | 1435/50750 [3:35:17<81:11:05, 5.93s/it] {'loss': 0.3993, 'learning_rate': 3.7688772160210116e-05, 'epoch': 1.41} 3%|▎ | 1435/50750 [3:35:17<81:11:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:18:01,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.69 | optimizer_step: 4.93 [2024-11-13 20:18:01,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.11 | bwd_microstep: 3844.69 | bwd_inner_microstep: 3837.21 | bwd_allreduce_microstep: 7.44 | step_microstep: 23.11 [2024-11-13 20:18:01,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.11 | bwd: 3844.70 | bwd_inner: 3837.21 | 
bwd_allreduce: 7.45 | step: 23.13 3%|▎ | 1436/50750 [3:35:22<81:09:41, 5.92s/it] {'loss': 0.0115, 'learning_rate': 3.7715036112935004e-05, 'epoch': 1.41} 3%|▎ | 1436/50750 [3:35:22<81:09:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:18:06,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.82 | optimizer_step: 4.93 [2024-11-13 20:18:06,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.69 | bwd_microstep: 3845.66 | bwd_inner_microstep: 3838.09 | bwd_allreduce_microstep: 7.52 | step_microstep: 26.06 [2024-11-13 20:18:06,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.67 | bwd: 3845.67 | bwd_inner: 3838.09 | bwd_allreduce: 7.54 | step: 26.07 3%|▎ | 1437/50750 [3:35:28<81:09:16, 5.92s/it] {'loss': 0.0173, 'learning_rate': 3.7741300065659884e-05, 'epoch': 1.42} 3%|▎ | 1437/50750 [3:35:28<81:09:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:18:12,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 20:18:12,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.40 | bwd_microstep: 3862.13 | bwd_inner_microstep: 3852.47 | bwd_allreduce_microstep: 9.58 | step_microstep: 21.76 [2024-11-13 20:18:12,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.40 | bwd: 3862.16 | bwd_inner: 3852.47 | bwd_allreduce: 9.62 | step: 21.75 3%|▎ | 1438/50750 [3:35:34<81:14:27, 5.93s/it] {'loss': 0.0106, 'learning_rate': 3.776756401838477e-05, 'epoch': 1.42} 3%|▎ | 1438/50750 [3:35:34<81:14:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:18:18,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 
[2024-11-13 20:18:18,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.47 | bwd_microstep: 3852.46 | bwd_inner_microstep: 3844.79 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.45 [2024-11-13 20:18:18,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.47 | bwd: 3852.47 | bwd_inner: 3844.79 | bwd_allreduce: 7.64 | step: 21.46 3%|▎ | 1439/50750 [3:35:40<81:14:23, 5.93s/it] {'loss': 0.0141, 'learning_rate': 3.779382797110966e-05, 'epoch': 1.42} 3%|▎ | 1439/50750 [3:35:40<81:14:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:18:24,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 20:18:24,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.94 | bwd_microstep: 3852.21 | bwd_inner_microstep: 3844.55 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.92 [2024-11-13 20:18:24,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.94 | bwd: 3852.22 | bwd_inner: 3844.55 | bwd_allreduce: 7.63 | step: 21.92 3%|▎ | 1440/50750 [3:35:46<81:16:24, 5.93s/it] {'loss': 0.0057, 'learning_rate': 3.782009192383454e-05, 'epoch': 1.42} 3%|▎ | 1440/50750 [3:35:46<81:16:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:18:30,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 5.09 [2024-11-13 20:18:30,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.39 | bwd_microstep: 3846.35 | bwd_inner_microstep: 3838.76 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.04 [2024-11-13 20:18:30,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.38 | bwd: 3846.36 | bwd_inner: 3838.76 | bwd_allreduce: 7.56 | step: 22.05 3%|▎ | 1441/50750 [3:35:52<81:14:59, 5.93s/it] {'loss': 0.0508, 
'learning_rate': 3.7846355876559426e-05, 'epoch': 1.42} 3%|▎ | 1441/50750 [3:35:52<81:14:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:18:36,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:18:36,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.28 | bwd_microstep: 3851.55 | bwd_inner_microstep: 3843.60 | bwd_allreduce_microstep: 7.90 | step_microstep: 21.76 [2024-11-13 20:18:36,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.26 | bwd: 3851.56 | bwd_inner: 3843.60 | bwd_allreduce: 7.92 | step: 21.77 3%|▎ | 1442/50750 [3:35:58<81:16:16, 5.93s/it] {'loss': 0.0138, 'learning_rate': 3.7872619829284306e-05, 'epoch': 1.42} 3%|▎ | 1442/50750 [3:35:58<81:16:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:18:42,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 5.08 [2024-11-13 20:18:42,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.31 | bwd_microstep: 3853.95 | bwd_inner_microstep: 3846.20 | bwd_allreduce_microstep: 7.70 | step_microstep: 22.25 [2024-11-13 20:18:42,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.29 | bwd: 3853.97 | bwd_inner: 3846.20 | bwd_allreduce: 7.72 | step: 22.25 3%|▎ | 1443/50750 [3:36:04<81:16:48, 5.93s/it] {'loss': 0.2971, 'learning_rate': 3.78988837820092e-05, 'epoch': 1.42} 3%|▎ | 1443/50750 [3:36:04<81:16:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:18:48,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:18:48,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.91 | 
bwd_microstep: 3858.42 | bwd_inner_microstep: 3850.89 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.90 [2024-11-13 20:18:48,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.89 | bwd: 3858.44 | bwd_inner: 3850.89 | bwd_allreduce: 7.50 | step: 20.90 3%|▎ | 1444/50750 [3:36:10<81:17:01, 5.93s/it] {'loss': 0.1039, 'learning_rate': 3.792514773473408e-05, 'epoch': 1.42} 3%|▎ | 1444/50750 [3:36:10<81:17:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:18:54,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:18:54,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.89 | bwd_microstep: 3854.73 | bwd_inner_microstep: 3847.19 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.10 [2024-11-13 20:18:54,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.89 | bwd: 3854.74 | bwd_inner: 3847.19 | bwd_allreduce: 7.51 | step: 21.10 3%|▎ | 1445/50750 [3:36:16<81:16:29, 5.93s/it] {'loss': 0.4921, 'learning_rate': 3.795141168745897e-05, 'epoch': 1.42} 3%|▎ | 1445/50750 [3:36:16<81:16:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:19:00,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.92 [2024-11-13 20:19:00,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.25 | bwd_microstep: 3847.62 | bwd_inner_microstep: 3839.51 | bwd_allreduce_microstep: 8.04 | step_microstep: 27.93 [2024-11-13 20:19:00,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.25 | bwd: 3847.64 | bwd_inner: 3839.51 | bwd_allreduce: 8.06 | step: 27.92 3%|▎ | 1446/50750 [3:36:22<81:15:15, 5.93s/it] {'loss': 0.0069, 'learning_rate': 3.797767564018385e-05, 'epoch': 1.42} 3%|▎ | 1446/50750 [3:36:22<81:15:15, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:19:06,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.81 | optimizer_step: 4.93 [2024-11-13 20:19:06,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.10 | bwd_microstep: 3857.94 | bwd_inner_microstep: 3850.45 | bwd_allreduce_microstep: 7.44 | step_microstep: 23.07 [2024-11-13 20:19:06,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.09 | bwd: 3857.95 | bwd_inner: 3850.45 | bwd_allreduce: 7.46 | step: 23.09 3%|▎ | 1447/50750 [3:36:28<81:15:44, 5.93s/it] {'loss': 0.537, 'learning_rate': 3.8003939592908735e-05, 'epoch': 1.43} 3%|▎ | 1447/50750 [3:36:28<81:15:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:19:12,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 20:19:12,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.37 | bwd_microstep: 3849.03 | bwd_inner_microstep: 3841.43 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.66 [2024-11-13 20:19:12,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.36 | bwd: 3849.05 | bwd_inner: 3841.43 | bwd_allreduce: 7.58 | step: 21.66 3%|▎ | 1448/50750 [3:36:34<81:14:52, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.803020354563362e-05, 'epoch': 1.43} 3%|▎ | 1448/50750 [3:36:34<81:14:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:19:18,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 20:19:18,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.35 | bwd_microstep: 3847.56 | bwd_inner_microstep: 3840.02 | bwd_allreduce_microstep: 7.49 | 
step_microstep: 21.24 [2024-11-13 20:19:18,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.35 | bwd: 3847.57 | bwd_inner: 3840.02 | bwd_allreduce: 7.50 | step: 21.25 3%|▎ | 1449/50750 [3:36:40<81:12:12, 5.93s/it] {'loss': 0.0013, 'learning_rate': 3.80564674983585e-05, 'epoch': 1.43} 3%|▎ | 1449/50750 [3:36:40<81:12:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:19:24,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:19:24,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.46 | bwd_microstep: 3849.42 | bwd_inner_microstep: 3841.90 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.37 [2024-11-13 20:19:24,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3849.43 | bwd_inner: 3841.90 | bwd_allreduce: 7.49 | step: 21.37 3%|▎ | 1450/50750 [3:36:46<81:09:55, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.808273145108339e-05, 'epoch': 1.43} 3%|▎ | 1450/50750 [3:36:46<81:09:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:19:29,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 20:19:29,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.60 | bwd_microstep: 3853.94 | bwd_inner_microstep: 3846.41 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.26 [2024-11-13 20:19:29,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.60 | bwd: 3853.95 | bwd_inner: 3846.41 | bwd_allreduce: 7.50 | step: 21.26 3%|▎ | 1451/50750 [3:36:51<81:10:16, 5.93s/it] {'loss': 1.3568, 'learning_rate': 3.810899540380828e-05, 'epoch': 1.43} 3%|▎ | 1451/50750 [3:36:51<81:10:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
20:19:35,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.97 [2024-11-13 20:19:35,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.99 | bwd_microstep: 3845.79 | bwd_inner_microstep: 3838.28 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.34 [2024-11-13 20:19:35,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.99 | bwd: 3845.80 | bwd_inner: 3838.28 | bwd_allreduce: 7.48 | step: 21.34 3%|▎ | 1452/50750 [3:36:57<81:08:03, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.8135259356533164e-05, 'epoch': 1.43} 3%|▎ | 1452/50750 [3:36:57<81:08:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:19:41,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:19:41,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.13 | bwd_microstep: 3848.37 | bwd_inner_microstep: 3840.84 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.06 [2024-11-13 20:19:41,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.13 | bwd: 3848.39 | bwd_inner: 3840.84 | bwd_allreduce: 7.50 | step: 21.07 3%|▎ | 1453/50750 [3:37:03<81:07:20, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.8161523309258045e-05, 'epoch': 1.43} 3%|▎ | 1453/50750 [3:37:03<81:07:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:19:47,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:19:47,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.10 | bwd_microstep: 3859.50 | bwd_inner_microstep: 3851.99 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.23 [2024-11-13 20:19:47,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.10 | bwd: 3859.51 | bwd_inner: 3851.99 | bwd_allreduce: 7.48 | step: 21.23 3%|▎ | 1454/50750 [3:37:09<81:08:40, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.818778726198293e-05, 'epoch': 1.43} 3%|▎ | 1454/50750 [3:37:09<81:08:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:19:53,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:19:53,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.10 | bwd_microstep: 3850.32 | bwd_inner_microstep: 3842.80 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-13 20:19:53,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.10 | bwd: 3850.33 | bwd_inner: 3842.80 | bwd_allreduce: 7.50 | step: 21.07 3%|▎ | 1455/50750 [3:37:15<81:08:29, 5.93s/it] {'loss': 0.0159, 'learning_rate': 3.821405121470782e-05, 'epoch': 1.43} 3%|▎ | 1455/50750 [3:37:15<81:08:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:19:59,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:19:59,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.38 | bwd_microstep: 3850.91 | bwd_inner_microstep: 3843.35 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.10 [2024-11-13 20:19:59,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.38 | bwd: 3850.93 | bwd_inner: 3843.35 | bwd_allreduce: 7.53 | step: 21.10 3%|▎ | 1456/50750 [3:37:21<81:07:40, 5.92s/it] {'loss': 0.1558, 'learning_rate': 3.82403151674327e-05, 'epoch': 1.43} 3%|▎ | 1456/50750 [3:37:21<81:07:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:20:05,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:20:05,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.32 | bwd_microstep: 3853.47 | bwd_inner_microstep: 3845.95 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.01 [2024-11-13 20:20:05,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.32 | bwd: 3853.48 | bwd_inner: 3845.95 | bwd_allreduce: 7.50 | step: 21.01 3%|▎ | 1457/50750 [3:37:27<81:08:42, 5.93s/it] {'loss': 0.0114, 'learning_rate': 3.8266579120157586e-05, 'epoch': 1.44} 3%|▎ | 1457/50750 [3:37:27<81:08:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:20:11,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:20:11,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.76 | bwd_microstep: 3851.38 | bwd_inner_microstep: 3843.88 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.20 [2024-11-13 20:20:11,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.77 | bwd: 3851.39 | bwd_inner: 3843.88 | bwd_allreduce: 7.47 | step: 22.21 3%|▎ | 1458/50750 [3:37:33<81:09:08, 5.93s/it] {'loss': 0.3825, 'learning_rate': 3.8292843072882474e-05, 'epoch': 1.44} 3%|▎ | 1458/50750 [3:37:33<81:09:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 20:20:17,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 20:20:17,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.91 | bwd_microstep: 3854.07 | bwd_inner_microstep: 3846.48 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.36 [2024-11-13 20:20:17,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.91 | bwd: 3854.09 | bwd_inner: 3846.48 | bwd_allreduce: 7.57 | step: 21.36 3%|▎ | 
1459/50750 [3:37:39<81:09:58, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.831910702560736e-05, 'epoch': 1.44} 3%|▎ | 1459/50750 [3:37:39<81:09:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:20:23,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:20:23,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.14 | bwd_microstep: 3855.17 | bwd_inner_microstep: 3847.70 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.94 [2024-11-13 20:20:23,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.13 | bwd: 3855.18 | bwd_inner: 3847.70 | bwd_allreduce: 7.44 | step: 20.94 3%|▎ | 1460/50750 [3:37:45<81:09:59, 5.93s/it] {'loss': 0.0019, 'learning_rate': 3.834537097833224e-05, 'epoch': 1.44} 3%|▎ | 1460/50750 [3:37:45<81:09:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:20:29,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:20:29,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.34 | bwd_microstep: 3843.51 | bwd_inner_microstep: 3836.03 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.92 [2024-11-13 20:20:29,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.34 | bwd: 3843.52 | bwd_inner: 3836.03 | bwd_allreduce: 7.45 | step: 20.92 3%|▎ | 1461/50750 [3:37:51<81:06:26, 5.92s/it] {'loss': 0.0144, 'learning_rate': 3.837163493105713e-05, 'epoch': 1.44} 3%|▎ | 1461/50750 [3:37:51<81:06:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:20:35,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:20:35,174] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.93 | bwd_microstep: 3849.70 | bwd_inner_microstep: 3842.23 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.77 [2024-11-13 20:20:35,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.93 | bwd: 3849.71 | bwd_inner: 3842.23 | bwd_allreduce: 7.44 | step: 20.78 3%|▎ | 1462/50750 [3:37:57<81:06:29, 5.92s/it] {'loss': 0.0043, 'learning_rate': 3.8397898883782015e-05, 'epoch': 1.44} 3%|▎ | 1462/50750 [3:37:57<81:06:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:20:41,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:20:41,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.31 | bwd_microstep: 3846.28 | bwd_inner_microstep: 3838.84 | bwd_allreduce_microstep: 7.40 | step_microstep: 21.20 [2024-11-13 20:20:41,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.31 | bwd: 3846.29 | bwd_inner: 3838.84 | bwd_allreduce: 7.41 | step: 21.21 3%|▎ | 1463/50750 [3:38:03<81:04:11, 5.92s/it] {'loss': 0.0027, 'learning_rate': 3.8424162836506896e-05, 'epoch': 1.44} 3%|▎ | 1463/50750 [3:38:03<81:04:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:20:47,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:20:47,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.49 | bwd_microstep: 3849.88 | bwd_inner_microstep: 3842.23 | bwd_allreduce_microstep: 7.62 | step_microstep: 20.90 [2024-11-13 20:20:47,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.49 | bwd: 3849.90 | bwd_inner: 3842.23 | bwd_allreduce: 7.63 | step: 20.90 3%|▎ | 1464/50750 [3:38:08<81:04:01, 5.92s/it] {'loss': 0.4687, 'learning_rate': 
3.845042678923178e-05, 'epoch': 1.44} 3%|▎ | 1464/50750 [3:38:08<81:04:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:20:52,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:20:52,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.41 | bwd_microstep: 3845.50 | bwd_inner_microstep: 3838.05 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.85 [2024-11-13 20:20:52,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.41 | bwd: 3845.51 | bwd_inner: 3838.05 | bwd_allreduce: 7.43 | step: 20.85 3%|▎ | 1465/50750 [3:38:14<81:02:21, 5.92s/it] {'loss': 0.6957, 'learning_rate': 3.847669074195666e-05, 'epoch': 1.44} 3%|▎ | 1465/50750 [3:38:14<81:02:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:20:58,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 20:20:58,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.55 | bwd_microstep: 3847.14 | bwd_inner_microstep: 3839.54 | bwd_allreduce_microstep: 7.56 | step_microstep: 20.90 [2024-11-13 20:20:58,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.55 | bwd: 3847.15 | bwd_inner: 3839.54 | bwd_allreduce: 7.57 | step: 20.90 3%|▎ | 1466/50750 [3:38:20<81:01:41, 5.92s/it] {'loss': 0.0038, 'learning_rate': 3.850295469468155e-05, 'epoch': 1.44} 3%|▎ | 1466/50750 [3:38:20<81:01:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:21:04,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:21:04,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.99 | bwd_microstep: 
3843.39 | bwd_inner_microstep: 3835.93 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.84 [2024-11-13 20:21:04,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.97 | bwd: 3843.41 | bwd_inner: 3835.93 | bwd_allreduce: 7.44 | step: 20.85 3%|▎ | 1467/50750 [3:38:26<81:00:43, 5.92s/it] {'loss': 0.8472, 'learning_rate': 3.852921864740644e-05, 'epoch': 1.45} 3%|▎ | 1467/50750 [3:38:26<81:00:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:21:10,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-13 20:21:10,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.62 | bwd_microstep: 3859.67 | bwd_inner_microstep: 3852.03 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.40 [2024-11-13 20:21:10,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.63 | bwd: 3859.68 | bwd_inner: 3852.03 | bwd_allreduce: 7.62 | step: 21.40 3%|▎ | 1468/50750 [3:38:32<81:05:20, 5.92s/it] {'loss': 0.3937, 'learning_rate': 3.8555482600131325e-05, 'epoch': 1.45} 3%|▎ | 1468/50750 [3:38:32<81:05:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:21:16,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 20:21:16,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.36 | bwd_microstep: 3847.38 | bwd_inner_microstep: 3839.86 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.50 [2024-11-13 20:21:16,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.35 | bwd: 3847.40 | bwd_inner: 3839.86 | bwd_allreduce: 7.49 | step: 21.51 3%|▎ | 1469/50750 [3:38:38<81:06:49, 5.93s/it] {'loss': 0.1888, 'learning_rate': 3.8581746552856205e-05, 'epoch': 1.45} 3%|▎ | 1469/50750 [3:38:38<81:06:49, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:21:22,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 20:21:22,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.50 | bwd_microstep: 3851.39 | bwd_inner_microstep: 3843.92 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.05 [2024-11-13 20:21:22,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.50 | bwd: 3851.40 | bwd_inner: 3843.91 | bwd_allreduce: 7.45 | step: 21.05 3%|▎ | 1470/50750 [3:38:44<81:06:37, 5.93s/it] {'loss': 0.0104, 'learning_rate': 3.860801050558109e-05, 'epoch': 1.45} 3%|▎ | 1470/50750 [3:38:44<81:06:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:21:28,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:21:28,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.65 | bwd_microstep: 3851.55 | bwd_inner_microstep: 3844.07 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.19 [2024-11-13 20:21:28,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.65 | bwd: 3851.56 | bwd_inner: 3844.07 | bwd_allreduce: 7.45 | step: 21.19 3%|▎ | 1471/50750 [3:38:50<81:05:42, 5.92s/it] {'loss': 0.0026, 'learning_rate': 3.863427445830598e-05, 'epoch': 1.45} 3%|▎ | 1471/50750 [3:38:50<81:05:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:21:34,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 20:21:34,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.17 | bwd_microstep: 3859.55 | bwd_inner_microstep: 3851.89 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.62 [2024-11-13 
20:21:34,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.17 | bwd: 3859.57 | bwd_inner: 3851.89 | bwd_allreduce: 7.64 | step: 21.62 3%|▎ | 1472/50750 [3:38:56<81:08:48, 5.93s/it] {'loss': 0.0048, 'learning_rate': 3.866053841103086e-05, 'epoch': 1.45} 3%|▎ | 1472/50750 [3:38:56<81:08:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:21:40,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:21:40,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.86 | bwd_microstep: 3852.67 | bwd_inner_microstep: 3844.90 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.64 [2024-11-13 20:21:40,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.85 | bwd: 3852.68 | bwd_inner: 3844.90 | bwd_allreduce: 7.74 | step: 21.64 3%|▎ | 1473/50750 [3:39:02<81:09:07, 5.93s/it] {'loss': 0.0034, 'learning_rate': 3.868680236375575e-05, 'epoch': 1.45} 3%|▎ | 1473/50750 [3:39:02<81:09:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:21:46,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:21:46,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.98 | bwd_microstep: 3857.04 | bwd_inner_microstep: 3849.52 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.03 [2024-11-13 20:21:46,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.96 | bwd: 3857.05 | bwd_inner: 3849.52 | bwd_allreduce: 7.49 | step: 21.04 3%|▎ | 1474/50750 [3:39:08<81:10:05, 5.93s/it] {'loss': 0.0054, 'learning_rate': 3.8713066316480634e-05, 'epoch': 1.45} 3%|▎ | 1474/50750 [3:39:08<81:10:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:21:52,196] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:21:52,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.21 | bwd_microstep: 3851.19 | bwd_inner_microstep: 3843.67 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.15 [2024-11-13 20:21:52,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.20 | bwd: 3851.21 | bwd_inner: 3843.68 | bwd_allreduce: 7.49 | step: 21.15 3%|▎ | 1475/50750 [3:39:14<81:08:49, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.873933026920552e-05, 'epoch': 1.45} 3%|▎ | 1475/50750 [3:39:14<81:08:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2195 [2024-11-13 20:21:58,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:21:58,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3844.85 | bwd_inner_microstep: 3837.39 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.16 [2024-11-13 20:21:58,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.22 | bwd: 3844.87 | bwd_inner: 3837.39 | bwd_allreduce: 7.44 | step: 21.16 3%|▎ | 1476/50750 [3:39:20<81:05:38, 5.92s/it] {'loss': 0.0071, 'learning_rate': 3.87655942219304e-05, 'epoch': 1.45} 3%|▎ | 1476/50750 [3:39:20<81:05:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:22:04,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:22:04,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.64 | bwd_microstep: 3848.66 | bwd_inner_microstep: 3841.14 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.38 [2024-11-13 20:22:04,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.64 | bwd: 3848.67 
| bwd_inner: 3841.14 | bwd_allreduce: 7.49 | step: 21.39 3%|▎ | 1477/50750 [3:39:25<81:05:04, 5.92s/it] {'loss': 0.535, 'learning_rate': 3.879185817465529e-05, 'epoch': 1.46} 3%|▎ | 1477/50750 [3:39:26<81:05:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:22:09,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 20:22:09,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.25 | bwd_microstep: 3850.82 | bwd_inner_microstep: 3843.33 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.80 [2024-11-13 20:22:09,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.25 | bwd: 3850.83 | bwd_inner: 3843.33 | bwd_allreduce: 7.46 | step: 20.81 3%|▎ | 1478/50750 [3:39:31<81:04:07, 5.92s/it] {'loss': 0.0068, 'learning_rate': 3.8818122127380176e-05, 'epoch': 1.46} 3%|▎ | 1478/50750 [3:39:31<81:04:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:22:15,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 20:22:15,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.69 | bwd_microstep: 3847.19 | bwd_inner_microstep: 3839.67 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.04 [2024-11-13 20:22:15,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.69 | bwd: 3847.20 | bwd_inner: 3839.67 | bwd_allreduce: 7.50 | step: 21.05 3%|▎ | 1479/50750 [3:39:37<81:02:44, 5.92s/it] {'loss': 0.3261, 'learning_rate': 3.884438608010506e-05, 'epoch': 1.46} 3%|▎ | 1479/50750 [3:39:37<81:02:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:22:21,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | 
optimizer_step: 4.93 [2024-11-13 20:22:21,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.76 | bwd_microstep: 3846.93 | bwd_inner_microstep: 3839.45 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.22 [2024-11-13 20:22:21,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.76 | bwd: 3846.94 | bwd_inner: 3839.45 | bwd_allreduce: 7.46 | step: 21.22 3%|▎ | 1480/50750 [3:39:43<81:01:57, 5.92s/it] {'loss': 0.007, 'learning_rate': 3.8870650032829944e-05, 'epoch': 1.46} 3%|▎ | 1480/50750 [3:39:43<81:01:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:22:27,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:22:27,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.35 | bwd_microstep: 3847.92 | bwd_inner_microstep: 3840.46 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.78 [2024-11-13 20:22:27,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.35 | bwd: 3847.93 | bwd_inner: 3840.46 | bwd_allreduce: 7.43 | step: 20.79 3%|▎ | 1481/50750 [3:39:49<81:02:11, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.8896913985554824e-05, 'epoch': 1.46} 3%|▎ | 1481/50750 [3:39:49<81:02:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:22:33,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:22:33,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.60 | bwd_microstep: 3848.76 | bwd_inner_microstep: 3841.25 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.78 [2024-11-13 20:22:33,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.60 | bwd: 3848.77 | bwd_inner: 3841.25 | bwd_allreduce: 7.48 | step: 20.79 3%|▎ | 1482/50750 [3:39:55<81:01:33, 
5.92s/it] {'loss': 0.5134, 'learning_rate': 3.892317793827972e-05, 'epoch': 1.46} 3%|▎ | 1482/50750 [3:39:55<81:01:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:22:39,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 20:22:39,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.94 | bwd_microstep: 3849.23 | bwd_inner_microstep: 3841.61 | bwd_allreduce_microstep: 7.56 | step_microstep: 22.44 [2024-11-13 20:22:39,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.94 | bwd: 3849.24 | bwd_inner: 3841.61 | bwd_allreduce: 7.59 | step: 22.44 3%|▎ | 1483/50750 [3:40:01<81:02:01, 5.92s/it] {'loss': 0.6349, 'learning_rate': 3.89494418910046e-05, 'epoch': 1.46} 3%|▎ | 1483/50750 [3:40:01<81:02:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:22:45,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 20:22:45,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.12 | bwd_microstep: 3858.86 | bwd_inner_microstep: 3851.28 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.25 [2024-11-13 20:22:45,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.10 | bwd: 3858.87 | bwd_inner: 3851.28 | bwd_allreduce: 7.55 | step: 21.26 3%|▎ | 1484/50750 [3:40:07<81:06:27, 5.93s/it] {'loss': 0.0497, 'learning_rate': 3.8975705843729485e-05, 'epoch': 1.46} 3%|▎ | 1484/50750 [3:40:07<81:06:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:22:51,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:22:51,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2027.13 | bwd_microstep: 3862.64 | bwd_inner_microstep: 3855.05 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.54 [2024-11-13 20:22:51,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.12 | bwd: 3862.65 | bwd_inner: 3855.05 | bwd_allreduce: 7.56 | step: 21.55 3%|▎ | 1485/50750 [3:40:13<81:09:58, 5.93s/it] {'loss': 0.0119, 'learning_rate': 3.9001969796454366e-05, 'epoch': 1.46} 3%|▎ | 1485/50750 [3:40:13<81:09:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:22:57,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 20:22:57,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.65 | bwd_microstep: 3855.37 | bwd_inner_microstep: 3847.57 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.59 [2024-11-13 20:22:57,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.63 | bwd: 3855.38 | bwd_inner: 3847.57 | bwd_allreduce: 7.77 | step: 21.60 3%|▎ | 1486/50750 [3:40:19<81:11:29, 5.93s/it] {'loss': 0.0036, 'learning_rate': 3.902823374917925e-05, 'epoch': 1.46} 3%|▎ | 1486/50750 [3:40:19<81:11:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:23:03,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 20:23:03,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.19 | bwd_microstep: 3846.68 | bwd_inner_microstep: 3838.96 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.14 [2024-11-13 20:23:03,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.18 | bwd: 3846.70 | bwd_inner: 3838.96 | bwd_allreduce: 7.69 | step: 21.14 3%|▎ | 1487/50750 [3:40:25<81:08:14, 5.93s/it] {'loss': 0.1901, 'learning_rate': 3.905449770190414e-05, 'epoch': 1.47} 3%|▎ | 1487/50750 
[3:40:25<81:08:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:23:09,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:23:09,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.31 | bwd_microstep: 3860.97 | bwd_inner_microstep: 3853.46 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.99 [2024-11-13 20:23:09,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.31 | bwd: 3860.98 | bwd_inner: 3853.46 | bwd_allreduce: 7.49 | step: 20.99 3%|▎ | 1488/50750 [3:40:31<81:08:34, 5.93s/it] {'loss': 0.3623, 'learning_rate': 3.908076165462903e-05, 'epoch': 1.47} 3%|▎ | 1488/50750 [3:40:31<81:08:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:23:15,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:23:15,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.84 | bwd_microstep: 3859.79 | bwd_inner_microstep: 3852.28 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.18 [2024-11-13 20:23:15,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.84 | bwd: 3859.81 | bwd_inner: 3852.28 | bwd_allreduce: 7.49 | step: 21.18 3%|▎ | 1489/50750 [3:40:37<81:09:27, 5.93s/it] {'loss': 0.4389, 'learning_rate': 3.910702560735391e-05, 'epoch': 1.47} 3%|▎ | 1489/50750 [3:40:37<81:09:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:23:21,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:23:21,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.30 | bwd_microstep: 3847.45 | bwd_inner_microstep: 3839.91 | 
bwd_allreduce_microstep: 7.50 | step_microstep: 21.09 [2024-11-13 20:23:21,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.30 | bwd: 3847.47 | bwd_inner: 3839.91 | bwd_allreduce: 7.52 | step: 21.09 3%|▎ | 1490/50750 [3:40:43<81:05:56, 5.93s/it] {'loss': 0.1247, 'learning_rate': 3.9133289560078795e-05, 'epoch': 1.47} 3%|▎ | 1490/50750 [3:40:43<81:05:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:23:26,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:23:27,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.48 | bwd_microstep: 3849.40 | bwd_inner_microstep: 3841.88 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-13 20:23:27,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.48 | bwd: 3849.41 | bwd_inner: 3841.88 | bwd_allreduce: 7.50 | step: 21.10 3%|▎ | 1491/50750 [3:40:48<81:04:38, 5.93s/it] {'loss': 0.4107, 'learning_rate': 3.915955351280368e-05, 'epoch': 1.47} 3%|▎ | 1491/50750 [3:40:48<81:04:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:23:32,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.76 | optimizer_step: 4.92 [2024-11-13 20:23:32,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.97 | bwd_microstep: 3855.18 | bwd_inner_microstep: 3847.66 | bwd_allreduce_microstep: 7.48 | step_microstep: 22.89 [2024-11-13 20:23:32,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.97 | bwd: 3855.19 | bwd_inner: 3847.66 | bwd_allreduce: 7.49 | step: 22.91 3%|▎ | 1492/50750 [3:40:54<81:05:20, 5.93s/it] {'loss': 0.0627, 'learning_rate': 3.918581746552856e-05, 'epoch': 1.47} 3%|▎ | 1492/50750 [3:40:54<81:05:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2202 [2024-11-13 20:23:38,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.92 [2024-11-13 20:23:38,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.55 | bwd_microstep: 3857.60 | bwd_inner_microstep: 3849.01 | bwd_allreduce_microstep: 8.54 | step_microstep: 26.51 [2024-11-13 20:23:38,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.53 | bwd: 3857.61 | bwd_inner: 3849.01 | bwd_allreduce: 8.56 | step: 26.51 3%|▎ | 1493/50750 [3:41:00<81:09:42, 5.93s/it] {'loss': 0.1312, 'learning_rate': 3.921208141825345e-05, 'epoch': 1.47} 3%|▎ | 1493/50750 [3:41:00<81:09:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:23:44,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:23:44,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.40 | bwd_microstep: 3850.81 | bwd_inner_microstep: 3843.30 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.95 [2024-11-13 20:23:44,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.39 | bwd: 3850.83 | bwd_inner: 3843.30 | bwd_allreduce: 7.48 | step: 20.95 3%|▎ | 1494/50750 [3:41:06<81:09:31, 5.93s/it] {'loss': 0.0102, 'learning_rate': 3.923834537097834e-05, 'epoch': 1.47} 3%|▎ | 1494/50750 [3:41:06<81:09:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:23:50,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 5.08 [2024-11-13 20:23:50,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.27 | bwd_microstep: 3849.85 | bwd_inner_microstep: 3842.25 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.76 [2024-11-13 20:23:50,732] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.27 | bwd: 3849.86 | bwd_inner: 3842.25 | bwd_allreduce: 7.57 | step: 21.77 3%|▎ | 1495/50750 [3:41:12<81:08:46, 5.93s/it] {'loss': 0.3102, 'learning_rate': 3.9264609323703224e-05, 'epoch': 1.47} 3%|▎ | 1495/50750 [3:41:12<81:08:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:23:56,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:23:56,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.20 | bwd_microstep: 3850.18 | bwd_inner_microstep: 3842.66 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.05 [2024-11-13 20:23:56,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.19 | bwd: 3850.19 | bwd_inner: 3842.66 | bwd_allreduce: 7.49 | step: 21.05 3%|▎ | 1496/50750 [3:41:18<81:07:56, 5.93s/it] {'loss': 0.0216, 'learning_rate': 3.9290873276428104e-05, 'epoch': 1.47} 3%|▎ | 1496/50750 [3:41:18<81:07:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:24:02,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 20:24:02,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.52 | bwd_microstep: 3848.37 | bwd_inner_microstep: 3840.61 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.50 [2024-11-13 20:24:02,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.52 | bwd: 3848.38 | bwd_inner: 3840.61 | bwd_allreduce: 7.73 | step: 21.51 3%|▎ | 1497/50750 [3:41:24<81:06:21, 5.93s/it] {'loss': 0.0121, 'learning_rate': 3.931713722915299e-05, 'epoch': 1.47} 3%|▎ | 1497/50750 [3:41:24<81:06:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:24:08,514] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 20:24:08,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.62 | bwd_microstep: 3855.75 | bwd_inner_microstep: 3848.10 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.28 [2024-11-13 20:24:08,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.60 | bwd: 3855.76 | bwd_inner: 3848.10 | bwd_allreduce: 7.62 | step: 21.29 3%|▎ | 1498/50750 [3:41:30<81:06:49, 5.93s/it] {'loss': 0.0289, 'learning_rate': 3.934340118187788e-05, 'epoch': 1.48} 3%|▎ | 1498/50750 [3:41:30<81:06:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:24:14,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-13 20:24:14,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.22 | bwd_microstep: 3852.52 | bwd_inner_microstep: 3844.83 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.66 [2024-11-13 20:24:14,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.21 | bwd: 3852.54 | bwd_inner: 3844.83 | bwd_allreduce: 7.67 | step: 21.67 3%|▎ | 1499/50750 [3:41:36<81:07:50, 5.93s/it] {'loss': 0.1646, 'learning_rate': 3.936966513460276e-05, 'epoch': 1.48} 3%|▎ | 1499/50750 [3:41:36<81:07:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:24:20,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 20:24:20,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.53 | bwd_microstep: 3854.51 | bwd_inner_microstep: 3846.56 | bwd_allreduce_microstep: 7.90 | step_microstep: 21.87 [2024-11-13 20:24:20,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.51 | bwd: 3854.53 | bwd_inner: 3846.56 | 
bwd_allreduce: 7.92 | step: 21.88 3%|▎ | 1500/50750 [3:41:42<81:09:13, 5.93s/it] {'loss': 0.2189, 'learning_rate': 3.9395929087327646e-05, 'epoch': 1.48} 3%|▎ | 1500/50750 [3:41:42<81:09:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:24:26,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 20:24:26,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.15 | bwd_microstep: 3847.18 | bwd_inner_microstep: 3839.36 | bwd_allreduce_microstep: 7.76 | step_microstep: 25.23 [2024-11-13 20:24:26,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.14 | bwd: 3847.20 | bwd_inner: 3839.36 | bwd_allreduce: 7.79 | step: 25.23 3%|▎ | 1501/50750 [3:41:48<81:08:06, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.942219304005253e-05, 'epoch': 1.48} 3%|▎ | 1501/50750 [3:41:48<81:08:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:24:32,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:24:32,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3849.69 | bwd_inner_microstep: 3842.04 | bwd_allreduce_microstep: 7.61 | step_microstep: 20.94 [2024-11-13 20:24:32,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3849.70 | bwd_inner: 3842.04 | bwd_allreduce: 7.63 | step: 20.95 3%|▎ | 1502/50750 [3:41:54<81:05:16, 5.93s/it] {'loss': 0.0658, 'learning_rate': 3.944845699277742e-05, 'epoch': 1.48} 3%|▎ | 1502/50750 [3:41:54<81:05:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:24:38,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 
20:24:38,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.16 | bwd_microstep: 3849.49 | bwd_inner_microstep: 3842.03 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-13 20:24:38,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.16 | bwd: 3849.50 | bwd_inner: 3842.03 | bwd_allreduce: 7.43 | step: 20.91 3%|▎ | 1503/50750 [3:42:00<81:03:11, 5.93s/it] {'loss': 0.0427, 'learning_rate': 3.94747209455023e-05, 'epoch': 1.48} 3%|▎ | 1503/50750 [3:42:00<81:03:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:24:44,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 20:24:44,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.53 | bwd_microstep: 3843.55 | bwd_inner_microstep: 3836.07 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.75 [2024-11-13 20:24:44,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.53 | bwd: 3843.56 | bwd_inner: 3836.07 | bwd_allreduce: 7.45 | step: 20.75 3%|▎ | 1504/50750 [3:42:06<81:00:06, 5.92s/it] {'loss': 0.0059, 'learning_rate': 3.950098489822719e-05, 'epoch': 1.48} 3%|▎ | 1504/50750 [3:42:06<81:00:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:24:49,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:24:49,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.53 | bwd_microstep: 3848.61 | bwd_inner_microstep: 3841.14 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.95 [2024-11-13 20:24:49,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.53 | bwd: 3848.62 | bwd_inner: 3841.14 | bwd_allreduce: 7.44 | step: 20.96 3%|▎ | 1505/50750 [3:42:11<80:59:07, 5.92s/it] {'loss': 0.6257, 
'learning_rate': 3.952724885095207e-05, 'epoch': 1.48} 3%|▎ | 1505/50750 [3:42:11<80:59:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:24:55,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:24:55,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.69 | bwd_microstep: 3846.40 | bwd_inner_microstep: 3838.92 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.05 [2024-11-13 20:24:55,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.69 | bwd: 3846.41 | bwd_inner: 3838.92 | bwd_allreduce: 7.45 | step: 21.06 3%|▎ | 1506/50750 [3:42:17<80:57:50, 5.92s/it] {'loss': 0.0058, 'learning_rate': 3.9553512803676955e-05, 'epoch': 1.48} 3%|▎ | 1506/50750 [3:42:17<80:57:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:25:01,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:25:01,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.77 | bwd_microstep: 3845.03 | bwd_inner_microstep: 3837.54 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.82 [2024-11-13 20:25:01,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.77 | bwd: 3845.04 | bwd_inner: 3837.54 | bwd_allreduce: 7.46 | step: 20.82 3%|▎ | 1507/50750 [3:42:23<80:56:44, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.957977675640184e-05, 'epoch': 1.48} 3%|▎ | 1507/50750 [3:42:23<80:56:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:25:07,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 20:25:07,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.49 | 
bwd_microstep: 3846.93 | bwd_inner_microstep: 3839.46 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.83 [2024-11-13 20:25:07,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.49 | bwd: 3846.94 | bwd_inner: 3839.46 | bwd_allreduce: 7.45 | step: 20.83 3%|▎ | 1508/50750 [3:42:29<80:56:48, 5.92s/it] {'loss': 0.0024, 'learning_rate': 3.960604070912672e-05, 'epoch': 1.49} 3%|▎ | 1508/50750 [3:42:29<80:56:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:25:13,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.65 | optimizer_step: 4.93 [2024-11-13 20:25:13,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.19 | bwd_microstep: 3847.33 | bwd_inner_microstep: 3839.34 | bwd_allreduce_microstep: 7.94 | step_microstep: 27.31 [2024-11-13 20:25:13,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.19 | bwd: 3847.35 | bwd_inner: 3839.34 | bwd_allreduce: 7.96 | step: 27.33 3%|▎ | 1509/50750 [3:42:35<80:59:57, 5.92s/it] {'loss': 0.0764, 'learning_rate': 3.963230466185161e-05, 'epoch': 1.49} 3%|▎ | 1509/50750 [3:42:35<80:59:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:25:19,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 20:25:19,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.60 | bwd_microstep: 3849.74 | bwd_inner_microstep: 3842.25 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.11 [2024-11-13 20:25:19,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3849.75 | bwd_inner: 3842.25 | bwd_allreduce: 7.46 | step: 20.93 3%|▎ | 1510/50750 [3:42:41<81:00:33, 5.92s/it] {'loss': 0.0913, 'learning_rate': 3.96585686145765e-05, 'epoch': 1.49} 3%|▎ | 1510/50750 [3:42:41<81:00:33, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 20:25:25,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:25:25,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.26 | bwd_microstep: 3855.20 | bwd_inner_microstep: 3847.75 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.04 [2024-11-13 20:25:25,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.26 | bwd: 3855.22 | bwd_inner: 3847.75 | bwd_allreduce: 7.43 | step: 21.05 3%|▎ | 1511/50750 [3:42:47<81:02:03, 5.92s/it] {'loss': 0.0043, 'learning_rate': 3.9684832567301384e-05, 'epoch': 1.49} 3%|▎ | 1511/50750 [3:42:47<81:02:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:25:31,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 20:25:31,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.85 | bwd_microstep: 3849.28 | bwd_inner_microstep: 3841.72 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.32 [2024-11-13 20:25:31,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.85 | bwd: 3849.29 | bwd_inner: 3841.72 | bwd_allreduce: 7.53 | step: 21.32 3%|▎ | 1512/50750 [3:42:53<81:01:36, 5.92s/it] {'loss': 0.0621, 'learning_rate': 3.9711096520026265e-05, 'epoch': 1.49} 3%|▎ | 1512/50750 [3:42:53<81:01:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:25:37,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 20:25:37,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3851.26 | bwd_inner_microstep: 3843.75 | bwd_allreduce_microstep: 7.47 | 
step_microstep: 21.11 [2024-11-13 20:25:37,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.22 | bwd: 3851.27 | bwd_inner: 3843.75 | bwd_allreduce: 7.48 | step: 21.11 3%|▎ | 1513/50750 [3:42:59<81:02:46, 5.93s/it] {'loss': 0.5097, 'learning_rate': 3.973736047275115e-05, 'epoch': 1.49} 3%|▎ | 1513/50750 [3:42:59<81:02:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:25:43,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 20:25:43,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.95 | bwd_microstep: 3859.06 | bwd_inner_microstep: 3851.33 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.08 [2024-11-13 20:25:43,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.94 | bwd: 3859.08 | bwd_inner: 3851.33 | bwd_allreduce: 7.70 | step: 22.08 3%|▎ | 1514/50750 [3:43:05<81:05:52, 5.93s/it] {'loss': 0.0524, 'learning_rate': 3.976362442547604e-05, 'epoch': 1.49} 3%|▎ | 1514/50750 [3:43:05<81:05:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 20:25:49,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:25:49,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.03 | bwd_microstep: 3853.35 | bwd_inner_microstep: 3845.87 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.01 [2024-11-13 20:25:49,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.03 | bwd: 3853.36 | bwd_inner: 3845.87 | bwd_allreduce: 7.45 | step: 21.01 3%|▎ | 1515/50750 [3:43:11<81:06:24, 5.93s/it] {'loss': 0.013, 'learning_rate': 3.978988837820092e-05, 'epoch': 1.49} 3%|▎ | 1515/50750 [3:43:11<81:06:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 
20:25:55,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 20:25:55,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.83 | bwd_microstep: 3857.27 | bwd_inner_microstep: 3849.62 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.80 [2024-11-13 20:25:55,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.83 | bwd: 3857.29 | bwd_inner: 3849.62 | bwd_allreduce: 7.44 | step: 20.80 3%|▎ | 1516/50750 [3:43:17<81:06:03, 5.93s/it] {'loss': 0.0117, 'learning_rate': 3.981615233092581e-05, 'epoch': 1.49} 3%|▎ | 1516/50750 [3:43:17<81:06:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 20:26:01,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 20:26:01,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.07 | bwd_microstep: 3851.16 | bwd_inner_microstep: 3843.70 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.03 [2024-11-13 20:26:01,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.07 | bwd: 3851.17 | bwd_inner: 3843.70 | bwd_allreduce: 7.44 | step: 21.04 3%|▎ | 1517/50750 [3:43:23<81:04:02, 5.93s/it] {'loss': 0.2447, 'learning_rate': 3.9842416283650694e-05, 'epoch': 1.49} 3%|▎ | 1517/50750 [3:43:23<81:04:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 20:26:07,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:26:07,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.34 | bwd_microstep: 3855.51 | bwd_inner_microstep: 3847.99 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.13 [2024-11-13 20:26:07,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2024.34 | bwd: 3855.53 | bwd_inner: 3847.99 | bwd_allreduce: 7.50 | step: 21.13 3%|▎ | 1518/50750 [3:43:28<81:03:42, 5.93s/it] {'loss': 0.0088, 'learning_rate': 3.986868023637558e-05, 'epoch': 1.5} 3%|▎ | 1518/50750 [3:43:28<81:03:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 20:26:12,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 20:26:12,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.05 | bwd_microstep: 3849.95 | bwd_inner_microstep: 3842.19 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.81 [2024-11-13 20:26:12,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3849.96 | bwd_inner: 3842.19 | bwd_allreduce: 7.73 | step: 21.81 3%|▎ | 1519/50750 [3:43:34<81:02:51, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.989494418910046e-05, 'epoch': 1.5} 3%|▎ | 1519/50750 [3:43:34<81:02:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 20:26:18,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 20:26:18,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.54 | bwd_microstep: 3855.55 | bwd_inner_microstep: 3848.08 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.90 [2024-11-13 20:26:18,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.52 | bwd: 3855.56 | bwd_inner: 3848.08 | bwd_allreduce: 7.44 | step: 20.90 3%|▎ | 1520/50750 [3:43:40<81:05:06, 5.93s/it] {'loss': 0.0038, 'learning_rate': 3.992120814182535e-05, 'epoch': 1.5} 3%|▎ | 1520/50750 [3:43:40<81:05:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 20:26:24,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 20:26:24,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.81 | bwd_microstep: 3852.36 | bwd_inner_microstep: 3844.79 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.77 [2024-11-13 20:26:24,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.81 | bwd: 3852.37 | bwd_inner: 3844.79 | bwd_allreduce: 7.54 | step: 21.78 3%|▎ | 1521/50750 [3:43:46<81:04:53, 5.93s/it] {'loss': 0.0106, 'learning_rate': 3.9947472094550236e-05, 'epoch': 1.5} 3%|▎ | 1521/50750 [3:43:46<81:04:53, 5.93s/it]evaluate! dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per 
sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B Results saved to qa_abcd_lora.csv Accuracy: 0.9114173228346457 New best accuracy: 0.9114173228346457. Saving model... 
[INFO|trainer.py:2936] 2024-11-13 21:01:34,588 >> Saving model checkpoint to work_dirs/QA2/qa_abcd_lora
[INFO|configuration_utils.py:473] 2024-11-13 21:01:34,590 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/config.json
[INFO|configuration_utils.py:594] 2024-11-13 21:01:34,591 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/generation_config.json
[INFO|modeling_utils.py:2501] 2024-11-13 21:02:18,693 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at work_dirs/QA2/qa_abcd_lora/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-11-13 21:02:18,695 >> tokenizer config file saved in work_dirs/QA2/qa_abcd_lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-11-13 21:02:18,695 >> Special tokens file saved in work_dirs/QA2/qa_abcd_lora/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-11-13 21:02:18,695 >> added tokens file saved in work_dirs/QA2/qa_abcd_lora/added_tokens.json
11/13/2024 21:02:20 - INFO - __main__ - Saved LoRA weights to work_dirs/QA2/qa_abcd_lora/lora_weights.pth
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197
[2024-11-13 21:02:26,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93
[2024-11-13 21:02:26,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2003.32 | bwd_microstep: 3813.09 | bwd_inner_microstep: 3805.57 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.35
[2024-11-13 21:02:26,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2003.29 | bwd: 3813.10 | bwd_inner: 3805.58 | bwd_allreduce: 7.49 | step: 21.35
3%|▎ | 1522/50750 [4:19:48<8924:15:28, 652.62s/it]
{'loss': 0.0007, 'learning_rate': 3.997373604727512e-05, 'epoch': 1.5}
3%|▎ | 1522/50750 [4:19:48<8924:15:28, 652.62s/it]
dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:02:32,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 21:02:32,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2008.22 | bwd_microstep: 3815.03 | bwd_inner_microstep: 3807.53 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.38 [2024-11-13 21:02:32,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2008.22 | bwd: 3815.05 | bwd_inner: 3807.53 | bwd_allreduce: 7.48 | step: 21.38 3%|▎ | 1523/50750 [4:19:54<6270:56:23, 458.60s/it] {'loss': 0.0537, 'learning_rate': 4e-05, 'epoch': 1.5} 3%|▎ | 1523/50750 [4:19:54<6270:56:23, 458.60s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:02:38,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-13 21:02:38,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2010.45 | bwd_microstep: 3832.24 | bwd_inner_microstep: 3824.53 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.69 [2024-11-13 21:02:38,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2010.45 | bwd: 3832.25 | bwd_inner: 3824.53 | bwd_allreduce: 7.68 | step: 21.70 3%|▎ | 1524/50750 [4:20:00<4413:44:40, 322.79s/it] {'loss': 0.1675, 'learning_rate': 3.9999999959272016e-05, 'epoch': 1.5} 3%|▎ | 1524/50750 [4:20:00<4413:44:40, 322.79s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:02:44,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:02:44,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2013.27 | bwd_microstep: 3839.49 | bwd_inner_microstep: 3832.00 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.26 [2024-11-13 
21:02:44,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2013.27 | bwd: 3839.51 | bwd_inner: 3832.00 | bwd_allreduce: 7.47 | step: 21.27 3%|▎ | 1525/50750 [4:20:06<3113:45:45, 227.72s/it] {'loss': 0.0006, 'learning_rate': 3.9999999837088036e-05, 'epoch': 1.5} 3%|▎ | 1525/50750 [4:20:06<3113:45:45, 227.72s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:02:49,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:02:49,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.06 | bwd_microstep: 3842.50 | bwd_inner_microstep: 3835.01 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.92 [2024-11-13 21:02:49,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.06 | bwd: 3842.51 | bwd_inner: 3835.01 | bwd_allreduce: 7.47 | step: 20.92 3%|▎ | 1526/50750 [4:20:11<2203:49:10, 161.18s/it] {'loss': 0.3234, 'learning_rate': 3.999999963344807e-05, 'epoch': 1.5} 3%|▎ | 1526/50750 [4:20:11<2203:49:10, 161.18s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:02:55,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:02:55,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.35 | bwd_microstep: 3845.98 | bwd_inner_microstep: 3838.47 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.52 [2024-11-13 21:02:55,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.35 | bwd: 3845.99 | bwd_inner: 3838.47 | bwd_allreduce: 7.48 | step: 21.52 3%|▎ | 1527/50750 [4:20:17<1566:54:14, 114.60s/it] {'loss': 0.3765, 'learning_rate': 3.999999934835212e-05, 'epoch': 1.5} 3%|▎ | 1527/50750 [4:20:17<1566:54:14, 114.60s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:03:01,798] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.74 | optimizer_step: 4.93 [2024-11-13 21:03:01,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.80 | bwd_microstep: 3845.61 | bwd_inner_microstep: 3837.44 | bwd_allreduce_microstep: 8.10 | step_microstep: 28.79 [2024-11-13 21:03:01,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.79 | bwd: 3845.63 | bwd_inner: 3837.44 | bwd_allreduce: 8.13 | step: 28.80 3%|▎ | 1528/50750 [4:20:23<1121:08:33, 82.00s/it] {'loss': 0.0121, 'learning_rate': 3.9999998981800197e-05, 'epoch': 1.51} 3%|▎ | 1528/50750 [4:20:23<1121:08:33, 82.00s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:03:07,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 21:03:07,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.17 | bwd_microstep: 3846.86 | bwd_inner_microstep: 3839.05 | bwd_allreduce_microstep: 7.76 | step_microstep: 22.55 [2024-11-13 21:03:07,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.14 | bwd: 3846.88 | bwd_inner: 3839.05 | bwd_allreduce: 7.78 | step: 22.55 3%|▎ | 1529/50750 [4:20:29<809:06:33, 59.18s/it] {'loss': 0.0204, 'learning_rate': 3.999999853379229e-05, 'epoch': 1.51} 3%|▎ | 1529/50750 [4:20:29<809:06:33, 59.18s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:03:13,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:03:13,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.09 | bwd_microstep: 3818.65 | bwd_inner_microstep: 3810.96 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.77 [2024-11-13 21:03:13,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2017.08 | bwd: 3818.67 | bwd_inner: 3810.96 | bwd_allreduce: 7.67 | step: 21.78 3%|▎ | 1530/50750 [4:20:35<590:30:41, 43.19s/it] {'loss': 0.0001, 'learning_rate': 3.999999800432839e-05, 'epoch': 1.51} 3%|▎ | 1530/50750 [4:20:35<590:30:41, 43.19s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:03:19,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:03:19,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.89 | bwd_microstep: 3827.72 | bwd_inner_microstep: 3820.08 | bwd_allreduce_microstep: 7.59 | step_microstep: 22.04 [2024-11-13 21:03:19,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.89 | bwd: 3827.73 | bwd_inner: 3820.08 | bwd_allreduce: 7.61 | step: 22.04 3%|▎ | 1531/50750 [4:20:41<437:33:34, 32.00s/it] {'loss': 0.0014, 'learning_rate': 3.9999997393408527e-05, 'epoch': 1.51} 3%|▎ | 1531/50750 [4:20:41<437:33:34, 32.00s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:03:25,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:03:25,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.81 | bwd_microstep: 3826.85 | bwd_inner_microstep: 3819.23 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.61 [2024-11-13 21:03:25,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.80 | bwd: 3826.87 | bwd_inner: 3819.23 | bwd_allreduce: 7.60 | step: 21.62 3%|▎ | 1532/50750 [4:20:47<330:27:56, 24.17s/it] {'loss': 0.2667, 'learning_rate': 3.999999670103269e-05, 'epoch': 1.51} 3%|▎ | 1532/50750 [4:20:47<330:27:56, 24.17s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:03:31,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 21:03:31,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.19 | bwd_microstep: 3822.68 | bwd_inner_microstep: 3814.85 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.44 [2024-11-13 21:03:31,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.19 | bwd: 3822.70 | bwd_inner: 3814.85 | bwd_allreduce: 7.80 | step: 22.44 3%|▎ | 1533/50750 [4:20:53<255:29:04, 18.69s/it] {'loss': 0.0081, 'learning_rate': 3.999999592720087e-05, 'epoch': 1.51} 3%|▎ | 1533/50750 [4:20:53<255:29:04, 18.69s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:03:37,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:03:37,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.55 | bwd_microstep: 3835.58 | bwd_inner_microstep: 3827.83 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.95 [2024-11-13 21:03:37,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.53 | bwd: 3835.59 | bwd_inner: 3827.83 | bwd_allreduce: 7.71 | step: 21.94 3%|▎ | 1534/50750 [4:20:59<203:06:46, 14.86s/it] {'loss': 0.803, 'learning_rate': 3.999999507191309e-05, 'epoch': 1.51} 3%|▎ | 1534/50750 [4:20:59<203:06:46, 14.86s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:03:43,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 21:03:43,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.93 | bwd_microstep: 3826.32 | bwd_inner_microstep: 3818.76 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.53 [2024-11-13 21:03:43,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.93 | bwd: 3826.33 | bwd_inner: 3818.76 | bwd_allreduce: 7.53 | step: 21.53 3%|▎ | 
1535/50750 [4:21:05<166:21:28, 12.17s/it] {'loss': 0.0422, 'learning_rate': 3.999999413516935e-05, 'epoch': 1.51} 3%|▎ | 1535/50750 [4:21:05<166:21:28, 12.17s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:03:49,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:03:49,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.39 | bwd_microstep: 3833.92 | bwd_inner_microstep: 3826.37 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.42 [2024-11-13 21:03:49,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.38 | bwd: 3833.93 | bwd_inner: 3826.37 | bwd_allreduce: 7.52 | step: 21.43 3%|▎ | 1536/50750 [4:21:10<140:39:25, 10.29s/it] {'loss': 0.0132, 'learning_rate': 3.999999311696964e-05, 'epoch': 1.51} 3%|▎ | 1536/50750 [4:21:10<140:39:25, 10.29s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:03:54,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:03:54,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.98 | bwd_microstep: 3831.81 | bwd_inner_microstep: 3824.23 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.71 [2024-11-13 21:03:54,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.97 | bwd: 3831.83 | bwd_inner: 3824.23 | bwd_allreduce: 7.55 | step: 21.72 3%|▎ | 1537/50750 [4:21:16<122:39:58, 8.97s/it] {'loss': 0.0208, 'learning_rate': 3.999999201731397e-05, 'epoch': 1.51} 3%|▎ | 1537/50750 [4:21:16<122:39:58, 8.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:04:00,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:04:00,829] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.57 | bwd_microstep: 3832.13 | bwd_inner_microstep: 3824.58 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.53 [2024-11-13 21:04:00,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.57 | bwd: 3832.15 | bwd_inner: 3824.58 | bwd_allreduce: 7.52 | step: 21.54 3%|▎ | 1538/50750 [4:21:22<110:03:15, 8.05s/it] {'loss': 0.0012, 'learning_rate': 3.999999083620235e-05, 'epoch': 1.52} 3%|▎ | 1538/50750 [4:21:22<110:03:15, 8.05s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:04:06,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-13 21:04:06,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.76 | bwd_microstep: 3831.78 | bwd_inner_microstep: 3824.25 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.92 [2024-11-13 21:04:06,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.76 | bwd: 3831.79 | bwd_inner: 3824.25 | bwd_allreduce: 7.50 | step: 21.92 3%|▎ | 1539/50750 [4:21:28<101:13:33, 7.41s/it] {'loss': 0.0001, 'learning_rate': 3.999998957363478e-05, 'epoch': 1.52} 3%|▎ | 1539/50750 [4:21:28<101:13:33, 7.41s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:04:12,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:04:12,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.65 | bwd_microstep: 3835.65 | bwd_inner_microstep: 3828.09 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.39 [2024-11-13 21:04:12,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.65 | bwd: 3835.66 | bwd_inner: 3828.09 | bwd_allreduce: 7.52 | step: 21.39 3%|▎ | 1540/50750 [4:21:34<95:04:14, 6.95s/it] {'loss': 0.0214, 'learning_rate': 
3.9999988229611276e-05, 'epoch': 1.52} 3%|▎ | 1540/50750 [4:21:34<95:04:14, 6.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:04:18,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:04:18,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.60 | bwd_microstep: 3842.38 | bwd_inner_microstep: 3834.63 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.32 [2024-11-13 21:04:18,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.60 | bwd: 3842.39 | bwd_inner: 3834.63 | bwd_allreduce: 7.73 | step: 22.32 3%|▎ | 1541/50750 [4:21:40<90:47:54, 6.64s/it] {'loss': 0.4438, 'learning_rate': 3.9999986804131824e-05, 'epoch': 1.52} 3%|▎ | 1541/50750 [4:21:40<90:47:54, 6.64s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:04:24,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:04:24,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.36 | bwd_microstep: 3850.76 | bwd_inner_microstep: 3843.16 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.94 [2024-11-13 21:04:24,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.35 | bwd: 3850.77 | bwd_inner: 3843.16 | bwd_allreduce: 7.57 | step: 21.94 3%|▎ | 1542/50750 [4:21:46<87:50:09, 6.43s/it] {'loss': 0.011, 'learning_rate': 3.9999985297196453e-05, 'epoch': 1.52} 3%|▎ | 1542/50750 [4:21:46<87:50:09, 6.43s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:04:30,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:04:30,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.96 | bwd_microstep: 
3846.62 | bwd_inner_microstep: 3839.02 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.48 [2024-11-13 21:04:30,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.96 | bwd: 3846.63 | bwd_inner: 3839.02 | bwd_allreduce: 7.57 | step: 21.49 3%|▎ | 1543/50750 [4:21:52<85:45:06, 6.27s/it] {'loss': 0.0029, 'learning_rate': 3.999998370880515e-05, 'epoch': 1.52} 3%|▎ | 1543/50750 [4:21:52<85:45:06, 6.27s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:04:36,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:04:36,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.93 | bwd_microstep: 3847.00 | bwd_inner_microstep: 3839.44 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.31 [2024-11-13 21:04:36,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.93 | bwd: 3847.02 | bwd_inner: 3839.44 | bwd_allreduce: 7.54 | step: 21.32 3%|▎ | 1544/50750 [4:21:58<84:17:31, 6.17s/it] {'loss': 0.1315, 'learning_rate': 3.999998203895792e-05, 'epoch': 1.52} 3%|▎ | 1544/50750 [4:21:58<84:17:31, 6.17s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:04:42,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 21:04:42,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.78 | bwd_microstep: 3846.48 | bwd_inner_microstep: 3838.62 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.51 [2024-11-13 21:04:42,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.78 | bwd: 3846.50 | bwd_inner: 3838.62 | bwd_allreduce: 7.82 | step: 22.51 3%|▎ | 1545/50750 [4:22:04<83:16:39, 6.09s/it] {'loss': 0.1235, 'learning_rate': 3.999998028765479e-05, 'epoch': 1.52} 3%|▎ | 1545/50750 [4:22:04<83:16:39, 6.09s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:04:48,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.64 | optimizer_step: 4.93 [2024-11-13 21:04:48,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.79 | bwd_microstep: 3845.22 | bwd_inner_microstep: 3837.25 | bwd_allreduce_microstep: 7.90 | step_microstep: 30.01 [2024-11-13 21:04:48,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.78 | bwd: 3845.24 | bwd_inner: 3837.25 | bwd_allreduce: 7.93 | step: 30.01 3%|▎ | 1546/50750 [4:22:10<82:37:42, 6.05s/it] {'loss': 0.085, 'learning_rate': 3.999997845489575e-05, 'epoch': 1.52} 3%|▎ | 1546/50750 [4:22:10<82:37:42, 6.05s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:04:54,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 21:04:54,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.75 | bwd_microstep: 3848.82 | bwd_inner_microstep: 3840.89 | bwd_allreduce_microstep: 7.88 | step_microstep: 22.34 [2024-11-13 21:04:54,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.74 | bwd: 3848.84 | bwd_inner: 3840.89 | bwd_allreduce: 7.90 | step: 22.35 3%|▎ | 1547/50750 [4:22:16<82:09:00, 6.01s/it] {'loss': 0.0264, 'learning_rate': 3.9999976540680806e-05, 'epoch': 1.52} 3%|▎ | 1547/50750 [4:22:16<82:09:00, 6.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:05:00,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:05:00,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.28 | bwd_microstep: 3850.91 | bwd_inner_microstep: 3843.34 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.55 [2024-11-13 
21:05:00,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.26 | bwd: 3850.92 | bwd_inner: 3843.34 | bwd_allreduce: 7.55 | step: 21.55 3%|▎ | 1548/50750 [4:22:21<81:49:02, 5.99s/it] {'loss': 0.0003, 'learning_rate': 3.999997454500998e-05, 'epoch': 1.53} 3%|▎ | 1548/50750 [4:22:21<81:49:02, 5.99s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:05:05,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 21:05:05,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.64 | bwd_microstep: 3854.48 | bwd_inner_microstep: 3846.90 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.68 [2024-11-13 21:05:05,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.64 | bwd: 3854.49 | bwd_inner: 3846.90 | bwd_allreduce: 7.55 | step: 21.68 3%|▎ | 1549/50750 [4:22:27<81:34:13, 5.97s/it] {'loss': 0.0002, 'learning_rate': 3.999997246788327e-05, 'epoch': 1.53} 3%|▎ | 1549/50750 [4:22:27<81:34:13, 5.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:05:11,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:05:11,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.43 | bwd_microstep: 3855.25 | bwd_inner_microstep: 3847.74 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.05 [2024-11-13 21:05:11,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.42 | bwd: 3855.26 | bwd_inner: 3847.74 | bwd_allreduce: 7.47 | step: 21.05 3%|▎ | 1550/50750 [4:22:33<81:25:17, 5.96s/it] {'loss': 0.474, 'learning_rate': 3.999997030930069e-05, 'epoch': 1.53} 3%|▎ | 1550/50750 [4:22:33<81:25:17, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:05:17,802] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:05:17,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.34 | bwd_microstep: 3856.34 | bwd_inner_microstep: 3848.85 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.10 [2024-11-13 21:05:17,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.34 | bwd: 3856.35 | bwd_inner: 3848.85 | bwd_allreduce: 7.47 | step: 21.10 3%|▎ | 1551/50750 [4:22:39<81:17:16, 5.95s/it] {'loss': 0.2868, 'learning_rate': 3.9999968069262246e-05, 'epoch': 1.53} 3%|▎ | 1551/50750 [4:22:39<81:17:16, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:05:23,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:05:23,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.07 | bwd_microstep: 3858.11 | bwd_inner_microstep: 3850.58 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 21:05:23,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.07 | bwd: 3858.12 | bwd_inner: 3850.58 | bwd_allreduce: 7.50 | step: 20.97 3%|▎ | 1552/50750 [4:22:45<81:12:03, 5.94s/it] {'loss': 0.7533, 'learning_rate': 3.999996574776794e-05, 'epoch': 1.53} 3%|▎ | 1552/50750 [4:22:45<81:12:03, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:05:29,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:05:29,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.94 | bwd_microstep: 3853.81 | bwd_inner_microstep: 3846.29 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.19 [2024-11-13 21:05:29,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.94 | bwd: 
3853.82 | bwd_inner: 3846.29 | bwd_allreduce: 7.49 | step: 21.20 3%|▎ | 1553/50750 [4:22:51<81:07:59, 5.94s/it] {'loss': 0.0005, 'learning_rate': 3.999996334481779e-05, 'epoch': 1.53} 3%|▎ | 1553/50750 [4:22:51<81:07:59, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:05:35,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:05:35,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.45 | bwd_microstep: 3858.05 | bwd_inner_microstep: 3850.53 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.00 [2024-11-13 21:05:35,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.44 | bwd: 3858.06 | bwd_inner: 3850.53 | bwd_allreduce: 7.50 | step: 21.00 3%|▎ | 1554/50750 [4:22:57<81:05:59, 5.93s/it] {'loss': 0.3816, 'learning_rate': 3.999996086041181e-05, 'epoch': 1.53} 3%|▎ | 1554/50750 [4:22:57<81:05:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:05:41,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 21:05:41,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.07 | bwd_microstep: 3851.02 | bwd_inner_microstep: 3843.51 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-13 21:05:41,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.06 | bwd: 3851.04 | bwd_inner: 3843.51 | bwd_allreduce: 7.49 | step: 21.07 3%|▎ | 1555/50750 [4:23:03<81:02:18, 5.93s/it] {'loss': 0.2114, 'learning_rate': 3.999995829455e-05, 'epoch': 1.53} 3%|▎ | 1555/50750 [4:23:03<81:02:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:05:47,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | 
optimizer_step: 4.93 [2024-11-13 21:05:47,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.65 | bwd_microstep: 3860.29 | bwd_inner_microstep: 3852.76 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.85 [2024-11-13 21:05:47,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.65 | bwd: 3860.30 | bwd_inner: 3852.76 | bwd_allreduce: 7.50 | step: 20.85 3%|▎ | 1556/50750 [4:23:09<81:02:03, 5.93s/it] {'loss': 0.0951, 'learning_rate': 3.9999955647232376e-05, 'epoch': 1.53} 3%|▎ | 1556/50750 [4:23:09<81:02:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:05:53,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:05:53,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.79 | bwd_microstep: 3857.58 | bwd_inner_microstep: 3849.76 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.94 [2024-11-13 21:05:53,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.79 | bwd: 3857.60 | bwd_inner: 3849.76 | bwd_allreduce: 7.79 | step: 21.95 3%|▎ | 1557/50750 [4:23:15<81:02:00, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.9999952918458945e-05, 'epoch': 1.53} 3%|▎ | 1557/50750 [4:23:15<81:02:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:05:59,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:05:59,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.77 | bwd_microstep: 3856.97 | bwd_inner_microstep: 3849.47 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98 [2024-11-13 21:05:59,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.76 | bwd: 3856.99 | bwd_inner: 3849.47 | bwd_allreduce: 7.48 | step: 20.99 3%|▎ | 1558/50750 [4:23:21<81:02:31, 
5.93s/it] {'loss': 0.0009, 'learning_rate': 3.9999950108229716e-05, 'epoch': 1.53} 3%|▎ | 1558/50750 [4:23:21<81:02:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:06:05,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:06:05,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.67 | bwd_microstep: 3853.30 | bwd_inner_microstep: 3845.79 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.03 [2024-11-13 21:06:05,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.67 | bwd: 3853.31 | bwd_inner: 3845.79 | bwd_allreduce: 7.48 | step: 21.04 3%|▎ | 1559/50750 [4:23:27<81:00:07, 5.93s/it] {'loss': 0.4125, 'learning_rate': 3.9999947216544717e-05, 'epoch': 1.54} 3%|▎ | 1559/50750 [4:23:27<81:00:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:06:11,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:06:11,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.42 | bwd_microstep: 3856.25 | bwd_inner_microstep: 3848.76 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.99 [2024-11-13 21:06:11,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.42 | bwd: 3856.27 | bwd_inner: 3848.76 | bwd_allreduce: 7.47 | step: 20.99 3%|▎ | 1560/50750 [4:23:33<80:58:59, 5.93s/it] {'loss': 0.5377, 'learning_rate': 3.999994424340394e-05, 'epoch': 1.54} 3%|▎ | 1560/50750 [4:23:33<80:58:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:06:17,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.97 [2024-11-13 21:06:17,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2022.93 | bwd_microstep: 3862.90 | bwd_inner_microstep: 3854.91 | bwd_allreduce_microstep: 7.92 | step_microstep: 26.45 [2024-11-13 21:06:17,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.90 | bwd: 3862.92 | bwd_inner: 3854.91 | bwd_allreduce: 7.95 | step: 26.44 3%|▎ | 1561/50750 [4:23:39<81:02:32, 5.93s/it] {'loss': 0.4748, 'learning_rate': 3.99999411888074e-05, 'epoch': 1.54} 3%|▎ | 1561/50750 [4:23:39<81:02:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:06:23,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.92 [2024-11-13 21:06:23,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.40 | bwd_microstep: 3857.89 | bwd_inner_microstep: 3849.82 | bwd_allreduce_microstep: 8.02 | step_microstep: 21.89 [2024-11-13 21:06:23,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.39 | bwd: 3857.90 | bwd_inner: 3849.82 | bwd_allreduce: 8.04 | step: 21.89 3%|▎ | 1562/50750 [4:23:44<81:03:04, 5.93s/it] {'loss': 0.0142, 'learning_rate': 3.999993805275512e-05, 'epoch': 1.54} 3%|▎ | 1562/50750 [4:23:44<81:03:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:06:28,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 21:06:28,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.32 | bwd_microstep: 3860.21 | bwd_inner_microstep: 3852.36 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.78 [2024-11-13 21:06:28,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.30 | bwd: 3860.22 | bwd_inner: 3852.36 | bwd_allreduce: 7.82 | step: 21.78 3%|▎ | 1563/50750 [4:23:50<81:04:40, 5.93s/it] {'loss': 0.1358, 'learning_rate': 3.999993483524711e-05, 'epoch': 1.54} 3%|▎ | 1563/50750 
[4:23:50<81:04:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:06:34,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-13 21:06:34,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.53 | bwd_microstep: 3858.05 | bwd_inner_microstep: 3850.32 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.29 [2024-11-13 21:06:34,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.52 | bwd: 3858.06 | bwd_inner: 3850.32 | bwd_allreduce: 7.70 | step: 21.30 3%|▎ | 1564/50750 [4:23:56<81:03:32, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.999993153628338e-05, 'epoch': 1.54} 3%|▎ | 1564/50750 [4:23:56<81:03:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:06:40,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:06:40,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.89 | bwd_microstep: 3860.01 | bwd_inner_microstep: 3852.53 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.83 [2024-11-13 21:06:40,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3860.02 | bwd_inner: 3852.53 | bwd_allreduce: 7.46 | step: 20.84 3%|▎ | 1565/50750 [4:24:02<81:04:47, 5.93s/it] {'loss': 0.0786, 'learning_rate': 3.999992815586394e-05, 'epoch': 1.54} 3%|▎ | 1565/50750 [4:24:02<81:04:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:06:46,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:06:46,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.67 | bwd_microstep: 3857.18 | bwd_inner_microstep: 3849.67 | 
bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-13 21:06:46,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.67 | bwd: 3857.19 | bwd_inner: 3849.67 | bwd_allreduce: 7.49 | step: 21.07 3%|▎ | 1566/50750 [4:24:08<81:05:27, 5.94s/it] {'loss': 0.5396, 'learning_rate': 3.999992469398881e-05, 'epoch': 1.54} 3%|▎ | 1566/50750 [4:24:08<81:05:27, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:06:52,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:06:52,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.82 | bwd_microstep: 3860.67 | bwd_inner_microstep: 3852.96 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.91 [2024-11-13 21:06:52,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.82 | bwd: 3860.68 | bwd_inner: 3852.96 | bwd_allreduce: 7.68 | step: 21.91 3%|▎ | 1567/50750 [4:24:14<81:04:36, 5.93s/it] {'loss': 0.051, 'learning_rate': 3.9999921150658e-05, 'epoch': 1.54} 3%|▎ | 1567/50750 [4:24:14<81:04:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:06:58,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 21:06:58,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.57 | bwd_microstep: 3858.36 | bwd_inner_microstep: 3850.70 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.66 [2024-11-13 21:06:58,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.56 | bwd: 3858.37 | bwd_inner: 3850.70 | bwd_allreduce: 7.63 | step: 21.66 3%|▎ | 1568/50750 [4:24:20<81:05:20, 5.94s/it] {'loss': 0.0315, 'learning_rate': 3.999991752587152e-05, 'epoch': 1.54} 3%|▎ | 1568/50750 [4:24:20<81:05:20, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2199 [2024-11-13 21:07:04,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-13 21:07:04,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.52 | bwd_microstep: 3857.80 | bwd_inner_microstep: 3849.93 | bwd_allreduce_microstep: 7.83 | step_microstep: 21.85 [2024-11-13 21:07:04,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.51 | bwd: 3857.81 | bwd_inner: 3849.93 | bwd_allreduce: 7.84 | step: 21.85 3%|▎ | 1569/50750 [4:24:26<81:06:04, 5.94s/it] {'loss': 0.2734, 'learning_rate': 3.99999138196294e-05, 'epoch': 1.55} 3%|▎ | 1569/50750 [4:24:26<81:06:04, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:07:10,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:07:10,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.85 | bwd_microstep: 3861.50 | bwd_inner_microstep: 3853.33 | bwd_allreduce_microstep: 8.12 | step_microstep: 21.89 [2024-11-13 21:07:10,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.84 | bwd: 3861.51 | bwd_inner: 3853.33 | bwd_allreduce: 8.14 | step: 21.89 3%|▎ | 1570/50750 [4:24:32<81:07:00, 5.94s/it] {'loss': 0.465, 'learning_rate': 3.999991003193164e-05, 'epoch': 1.55} 3%|▎ | 1570/50750 [4:24:32<81:07:00, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:07:16,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 21:07:16,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.36 | bwd_microstep: 3857.11 | bwd_inner_microstep: 3849.43 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.98 [2024-11-13 21:07:16,443] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.36 | bwd: 3857.13 | bwd_inner: 3849.43 | bwd_allreduce: 7.66 | step: 21.98 3%|▎ | 1571/50750 [4:24:38<81:05:22, 5.94s/it] {'loss': 0.3288, 'learning_rate': 3.999990616277826e-05, 'epoch': 1.55} 3%|▎ | 1571/50750 [4:24:38<81:05:22, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:07:22,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:07:22,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.75 | bwd_microstep: 3858.60 | bwd_inner_microstep: 3851.06 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.04 [2024-11-13 21:07:22,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.73 | bwd: 3858.61 | bwd_inner: 3851.06 | bwd_allreduce: 7.51 | step: 21.04 3%|▎ | 1572/50750 [4:24:44<81:04:07, 5.93s/it] {'loss': 0.5405, 'learning_rate': 3.999990221216928e-05, 'epoch': 1.55} 3%|▎ | 1572/50750 [4:24:44<81:04:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:07:28,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:07:28,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.18 | bwd_microstep: 3860.75 | bwd_inner_microstep: 3853.22 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.15 [2024-11-13 21:07:28,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.18 | bwd: 3860.76 | bwd_inner: 3853.22 | bwd_allreduce: 7.50 | step: 21.17 3%|▎ | 1573/50750 [4:24:50<81:02:54, 5.93s/it] {'loss': 0.0381, 'learning_rate': 3.999989818010471e-05, 'epoch': 1.55} 3%|▎ | 1573/50750 [4:24:50<81:02:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:07:34,234] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:07:34,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.03 | bwd_microstep: 3857.22 | bwd_inner_microstep: 3849.72 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.89 [2024-11-13 21:07:34,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.03 | bwd: 3857.23 | bwd_inner: 3849.72 | bwd_allreduce: 7.48 | step: 20.89 3%|▎ | 1574/50750 [4:24:56<81:01:58, 5.93s/it] {'loss': 0.003, 'learning_rate': 3.999989406658457e-05, 'epoch': 1.55} 3%|▎ | 1574/50750 [4:24:56<81:01:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:07:40,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.93 [2024-11-13 21:07:40,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.91 | bwd_microstep: 3860.52 | bwd_inner_microstep: 3852.63 | bwd_allreduce_microstep: 7.83 | step_microstep: 23.69 [2024-11-13 21:07:40,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.91 | bwd: 3860.54 | bwd_inner: 3852.63 | bwd_allreduce: 7.86 | step: 23.69 3%|▎ | 1575/50750 [4:25:02<81:05:05, 5.94s/it] {'loss': 0.0872, 'learning_rate': 3.9999889871608876e-05, 'epoch': 1.55} 3%|▎ | 1575/50750 [4:25:02<81:05:05, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:07:46,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:07:46,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.82 | bwd_microstep: 3856.21 | bwd_inner_microstep: 3848.70 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.23 [2024-11-13 21:07:46,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.82 | bwd: 3856.22 | bwd_inner: 3848.70 | 
bwd_allreduce: 7.49 | step: 21.24 3%|▎ | 1576/50750 [4:25:08<81:03:13, 5.93s/it] {'loss': 0.4637, 'learning_rate': 3.999988559517765e-05, 'epoch': 1.55} 3%|▎ | 1576/50750 [4:25:08<81:03:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:07:52,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:07:52,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.54 | bwd_microstep: 3850.21 | bwd_inner_microstep: 3842.72 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.20 [2024-11-13 21:07:52,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.54 | bwd: 3850.22 | bwd_inner: 3842.72 | bwd_allreduce: 7.46 | step: 21.20 3%|▎ | 1577/50750 [4:25:13<80:59:55, 5.93s/it] {'loss': 0.2035, 'learning_rate': 3.99998812372909e-05, 'epoch': 1.55} 3%|▎ | 1577/50750 [4:25:13<80:59:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:07:57,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:07:57,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.16 | bwd_microstep: 3848.35 | bwd_inner_microstep: 3840.72 | bwd_allreduce_microstep: 7.59 | step_microstep: 20.92 [2024-11-13 21:07:57,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.16 | bwd: 3848.36 | bwd_inner: 3840.72 | bwd_allreduce: 7.60 | step: 20.92 3%|▎ | 1578/50750 [4:25:19<80:56:48, 5.93s/it] {'loss': 0.1184, 'learning_rate': 3.999987679794865e-05, 'epoch': 1.55} 3%|▎ | 1578/50750 [4:25:19<80:56:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:08:03,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 
[2024-11-13 21:08:03,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.02 | bwd_microstep: 3845.54 | bwd_inner_microstep: 3838.09 | bwd_allreduce_microstep: 7.40 | step_microstep: 20.96 [2024-11-13 21:08:03,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.02 | bwd: 3845.55 | bwd_inner: 3838.09 | bwd_allreduce: 7.42 | step: 20.96 3%|▎ | 1579/50750 [4:25:25<80:54:36, 5.92s/it] {'loss': 0.4194, 'learning_rate': 3.9999872277150915e-05, 'epoch': 1.56} 3%|▎ | 1579/50750 [4:25:25<80:54:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:08:09,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:08:09,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.84 | bwd_microstep: 3842.47 | bwd_inner_microstep: 3834.80 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.59 [2024-11-13 21:08:09,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.83 | bwd: 3842.49 | bwd_inner: 3834.80 | bwd_allreduce: 7.64 | step: 21.60 3%|▎ | 1580/50750 [4:25:31<80:52:22, 5.92s/it] {'loss': 0.0502, 'learning_rate': 3.999986767489772e-05, 'epoch': 1.56} 3%|▎ | 1580/50750 [4:25:31<80:52:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:08:15,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 21:08:15,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.34 | bwd_microstep: 3845.04 | bwd_inner_microstep: 3837.58 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.94 [2024-11-13 21:08:15,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.34 | bwd: 3845.05 | bwd_inner: 3837.58 | bwd_allreduce: 7.44 | step: 20.94 3%|▎ | 1581/50750 [4:25:37<80:52:05, 5.92s/it] {'loss': 0.1008, 
'learning_rate': 3.999986299118908e-05, 'epoch': 1.56} 3%|▎ | 1581/50750 [4:25:37<80:52:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:08:21,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:08:21,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.57 | bwd_microstep: 3848.87 | bwd_inner_microstep: 3841.40 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.90 [2024-11-13 21:08:21,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.57 | bwd: 3848.88 | bwd_inner: 3841.40 | bwd_allreduce: 7.44 | step: 20.91 3%|▎ | 1582/50750 [4:25:43<80:51:57, 5.92s/it] {'loss': 0.8442, 'learning_rate': 3.9999858226025e-05, 'epoch': 1.56} 3%|▎ | 1582/50750 [4:25:43<80:51:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:08:27,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:08:27,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.40 | bwd_microstep: 3853.51 | bwd_inner_microstep: 3846.01 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.98 [2024-11-13 21:08:27,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.40 | bwd: 3853.52 | bwd_inner: 3846.01 | bwd_allreduce: 7.47 | step: 20.99 3%|▎ | 1583/50750 [4:25:49<80:52:33, 5.92s/it] {'loss': 0.4314, 'learning_rate': 3.999985337940552e-05, 'epoch': 1.56} 3%|▎ | 1583/50750 [4:25:49<80:52:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:08:33,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:08:33,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | 
bwd_microstep: 3845.06 | bwd_inner_microstep: 3837.41 | bwd_allreduce_microstep: 7.59 | step_microstep: 20.80 [2024-11-13 21:08:33,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3845.07 | bwd_inner: 3837.41 | bwd_allreduce: 7.61 | step: 20.79 3%|▎ | 1584/50750 [4:25:55<80:51:05, 5.92s/it] {'loss': 0.0637, 'learning_rate': 3.999984845133065e-05, 'epoch': 1.56} 3%|▎ | 1584/50750 [4:25:55<80:51:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:08:39,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:08:39,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.33 | bwd_microstep: 3849.84 | bwd_inner_microstep: 3842.36 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-13 21:08:39,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.33 | bwd: 3849.85 | bwd_inner: 3842.36 | bwd_allreduce: 7.45 | step: 20.98 3%|▎ | 1585/50750 [4:26:01<80:50:34, 5.92s/it] {'loss': 0.3433, 'learning_rate': 3.999984344180042e-05, 'epoch': 1.56} 3%|▎ | 1585/50750 [4:26:01<80:50:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:08:45,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:08:45,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.47 | bwd_microstep: 3849.30 | bwd_inner_microstep: 3841.82 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.80 [2024-11-13 21:08:45,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.47 | bwd: 3849.31 | bwd_inner: 3841.82 | bwd_allreduce: 7.45 | step: 20.81 3%|▎ | 1586/50750 [4:26:07<80:50:11, 5.92s/it] {'loss': 0.5967, 'learning_rate': 3.999983835081483e-05, 'epoch': 1.56} 3%|▎ | 1586/50750 [4:26:07<80:50:11, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:08:51,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:08:51,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.75 | bwd_microstep: 3848.32 | bwd_inner_microstep: 3840.84 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.98 [2024-11-13 21:08:51,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.75 | bwd: 3848.34 | bwd_inner: 3840.84 | bwd_allreduce: 7.46 | step: 20.98 3%|▎ | 1587/50750 [4:26:13<80:50:13, 5.92s/it] {'loss': 0.1189, 'learning_rate': 3.999983317837392e-05, 'epoch': 1.56} 3%|▎ | 1587/50750 [4:26:13<80:50:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:08:57,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:08:57,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.29 | bwd_microstep: 3846.27 | bwd_inner_microstep: 3838.78 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.89 [2024-11-13 21:08:57,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.29 | bwd: 3846.29 | bwd_inner: 3838.78 | bwd_allreduce: 7.46 | step: 20.89 3%|▎ | 1588/50750 [4:26:19<80:50:45, 5.92s/it] {'loss': 0.18, 'learning_rate': 3.99998279244777e-05, 'epoch': 1.56} 3%|▎ | 1588/50750 [4:26:19<80:50:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:09:03,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:09:03,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.51 | bwd_microstep: 3843.63 | bwd_inner_microstep: 3836.14 | bwd_allreduce_microstep: 7.44 | 
step_microstep: 20.88 [2024-11-13 21:09:03,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.51 | bwd: 3843.64 | bwd_inner: 3836.14 | bwd_allreduce: 7.46 | step: 20.89 3%|▎ | 1589/50750 [4:26:25<80:49:14, 5.92s/it] {'loss': 0.8603, 'learning_rate': 3.99998225891262e-05, 'epoch': 1.57} 3%|▎ | 1589/50750 [4:26:25<80:49:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:09:08,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.35 | optimizer_step: 4.93 [2024-11-13 21:09:08,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.70 | bwd_microstep: 3851.56 | bwd_inner_microstep: 3843.71 | bwd_allreduce_microstep: 7.80 | step_microstep: 22.06 [2024-11-13 21:09:08,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.70 | bwd: 3851.57 | bwd_inner: 3843.71 | bwd_allreduce: 7.82 | step: 22.07 3%|▎ | 1590/50750 [4:26:30<80:51:24, 5.92s/it] {'loss': 0.0599, 'learning_rate': 3.999981717231944e-05, 'epoch': 1.57} 3%|▎ | 1590/50750 [4:26:30<80:51:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:09:14,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:09:14,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.80 | bwd_microstep: 3852.20 | bwd_inner_microstep: 3844.66 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.22 [2024-11-13 21:09:14,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.80 | bwd: 3852.21 | bwd_inner: 3844.66 | bwd_allreduce: 7.51 | step: 21.23 3%|▎ | 1591/50750 [4:26:36<80:53:12, 5.92s/it] {'loss': 0.0655, 'learning_rate': 3.999981167405743e-05, 'epoch': 1.57} 3%|▎ | 1591/50750 [4:26:36<80:53:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
21:09:20,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:09:20,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.14 | bwd_microstep: 3855.15 | bwd_inner_microstep: 3847.64 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.95 [2024-11-13 21:09:20,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.13 | bwd: 3855.17 | bwd_inner: 3847.64 | bwd_allreduce: 7.49 | step: 20.95 3%|▎ | 1592/50750 [4:26:42<80:54:56, 5.93s/it] {'loss': 0.0739, 'learning_rate': 3.9999806094340205e-05, 'epoch': 1.57} 3%|▎ | 1592/50750 [4:26:42<80:54:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:09:26,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:09:26,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.16 | bwd_microstep: 3857.33 | bwd_inner_microstep: 3849.84 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.94 [2024-11-13 21:09:26,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3857.34 | bwd_inner: 3849.84 | bwd_allreduce: 7.46 | step: 20.95 3%|▎ | 1593/50750 [4:26:48<80:55:43, 5.93s/it] {'loss': 0.1391, 'learning_rate': 3.999980043316779e-05, 'epoch': 1.57} 3%|▎ | 1593/50750 [4:26:48<80:55:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:09:32,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:09:32,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.98 | bwd_microstep: 3847.58 | bwd_inner_microstep: 3839.90 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.65 [2024-11-13 21:09:32,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2024.97 | bwd: 3847.60 | bwd_inner: 3839.90 | bwd_allreduce: 7.65 | step: 21.65 3%|▎ | 1594/50750 [4:26:54<80:54:01, 5.92s/it] {'loss': 0.0166, 'learning_rate': 3.9999794690540195e-05, 'epoch': 1.57} 3%|▎ | 1594/50750 [4:26:54<80:54:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:09:38,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 21:09:38,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.83 | bwd_microstep: 3859.03 | bwd_inner_microstep: 3851.49 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.55 [2024-11-13 21:09:38,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.83 | bwd: 3859.05 | bwd_inner: 3851.49 | bwd_allreduce: 7.52 | step: 21.56 3%|▎ | 1595/50750 [4:27:00<80:55:00, 5.93s/it] {'loss': 0.0081, 'learning_rate': 3.999978886645745e-05, 'epoch': 1.57} 3%|▎ | 1595/50750 [4:27:00<80:55:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:09:44,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:09:44,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.97 | bwd_microstep: 3853.08 | bwd_inner_microstep: 3845.01 | bwd_allreduce_microstep: 8.00 | step_microstep: 22.13 [2024-11-13 21:09:44,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.96 | bwd: 3853.14 | bwd_inner: 3845.01 | bwd_allreduce: 8.03 | step: 22.13 3%|▎ | 1596/50750 [4:27:06<80:57:30, 5.93s/it] {'loss': 0.0245, 'learning_rate': 3.999978296091959e-05, 'epoch': 1.57} 3%|▎ | 1596/50750 [4:27:06<80:57:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:09:50,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:09:50,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.61 | bwd_microstep: 3846.59 | bwd_inner_microstep: 3839.14 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.88 [2024-11-13 21:09:50,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.61 | bwd: 3846.60 | bwd_inner: 3839.14 | bwd_allreduce: 7.42 | step: 20.88 3%|▎ | 1597/50750 [4:27:12<80:54:00, 5.93s/it] {'loss': 0.0284, 'learning_rate': 3.999977697392662e-05, 'epoch': 1.57} 3%|▎ | 1597/50750 [4:27:12<80:54:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:09:56,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:09:56,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.61 | bwd_microstep: 3852.00 | bwd_inner_microstep: 3844.26 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.38 [2024-11-13 21:09:56,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.61 | bwd: 3852.01 | bwd_inner: 3844.26 | bwd_allreduce: 7.71 | step: 21.39 3%|▎ | 1598/50750 [4:27:18<80:54:03, 5.93s/it] {'loss': 0.0145, 'learning_rate': 3.999977090547858e-05, 'epoch': 1.57} 3%|▎ | 1598/50750 [4:27:18<80:54:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:10:02,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 21:10:02,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.11 | bwd_microstep: 3860.42 | bwd_inner_microstep: 3850.23 | bwd_allreduce_microstep: 10.10 | step_microstep: 21.80 [2024-11-13 21:10:02,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.11 | bwd: 3860.45 | bwd_inner: 3850.23 | bwd_allreduce: 10.14 | step: 21.79 3%|▎ | 
1599/50750 [4:27:24<80:57:06, 5.93s/it] {'loss': 0.459, 'learning_rate': 3.999976475557548e-05, 'epoch': 1.58} 3%|▎ | 1599/50750 [4:27:24<80:57:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:10:08,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:10:08,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3848.10 | bwd_inner_microstep: 3840.54 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.34 [2024-11-13 21:10:08,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.31 | bwd: 3848.11 | bwd_inner: 3840.54 | bwd_allreduce: 7.53 | step: 21.34 3%|▎ | 1600/50750 [4:27:30<80:55:23, 5.93s/it] {'loss': 0.8293, 'learning_rate': 3.9999758524217356e-05, 'epoch': 1.58} 3%|▎ | 1600/50750 [4:27:30<80:55:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:10:14,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:10:14,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.98 | bwd_microstep: 3845.96 | bwd_inner_microstep: 3838.31 | bwd_allreduce_microstep: 7.60 | step_microstep: 22.38 [2024-11-13 21:10:14,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3845.98 | bwd_inner: 3838.31 | bwd_allreduce: 7.62 | step: 22.38 3%|▎ | 1601/50750 [4:27:36<80:54:36, 5.93s/it] {'loss': 0.5661, 'learning_rate': 3.999975221140423e-05, 'epoch': 1.58} 3%|▎ | 1601/50750 [4:27:36<80:54:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:10:20,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:10:20,103] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.70 | bwd_microstep: 3847.39 | bwd_inner_microstep: 3839.86 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.42 [2024-11-13 21:10:20,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.69 | bwd: 3847.40 | bwd_inner: 3839.86 | bwd_allreduce: 7.50 | step: 21.43 3%|▎ | 1602/50750 [4:27:42<80:53:02, 5.92s/it] {'loss': 0.0106, 'learning_rate': 3.9999745817136126e-05, 'epoch': 1.58} 3%|▎ | 1602/50750 [4:27:42<80:53:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:10:26,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:10:26,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.28 | bwd_microstep: 3846.20 | bwd_inner_microstep: 3838.51 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.14 [2024-11-13 21:10:26,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.27 | bwd: 3846.21 | bwd_inner: 3838.51 | bwd_allreduce: 7.67 | step: 21.14 3%|▎ | 1603/50750 [4:27:47<80:51:03, 5.92s/it] {'loss': 0.1727, 'learning_rate': 3.999973934141308e-05, 'epoch': 1.58} 3%|▎ | 1603/50750 [4:27:47<80:51:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:10:31,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 21:10:31,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3851.72 | bwd_inner_microstep: 3844.20 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.51 [2024-11-13 21:10:31,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.03 | bwd: 3851.73 | bwd_inner: 3844.20 | bwd_allreduce: 7.49 | step: 21.51 3%|▎ | 1604/50750 [4:27:53<80:51:32, 5.92s/it] {'loss': 0.0394, 'learning_rate': 
3.99997327842351e-05, 'epoch': 1.58} 3%|▎ | 1604/50750 [4:27:53<80:51:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:10:37,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:10:37,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.85 | bwd_microstep: 3847.86 | bwd_inner_microstep: 3840.36 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.12 [2024-11-13 21:10:37,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.85 | bwd: 3847.87 | bwd_inner: 3840.36 | bwd_allreduce: 7.46 | step: 21.13 3%|▎ | 1605/50750 [4:27:59<80:52:12, 5.92s/it] {'loss': 0.1709, 'learning_rate': 3.9999726145602226e-05, 'epoch': 1.58} 3%|▎ | 1605/50750 [4:27:59<80:52:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:10:43,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:10:43,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.21 | bwd_microstep: 3856.15 | bwd_inner_microstep: 3848.62 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.98 [2024-11-13 21:10:43,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.21 | bwd: 3856.16 | bwd_inner: 3848.62 | bwd_allreduce: 7.50 | step: 20.98 3%|▎ | 1606/50750 [4:28:05<80:54:44, 5.93s/it] {'loss': 0.2697, 'learning_rate': 3.999971942551448e-05, 'epoch': 1.58} 3%|▎ | 1606/50750 [4:28:05<80:54:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:10:49,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:10:49,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.00 | bwd_microstep: 
3857.11 | bwd_inner_microstep: 3849.64 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.02 [2024-11-13 21:10:49,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.00 | bwd: 3857.12 | bwd_inner: 3849.64 | bwd_allreduce: 7.44 | step: 21.02 3%|▎ | 1607/50750 [4:28:11<80:55:02, 5.93s/it] {'loss': 0.0108, 'learning_rate': 3.99997126239719e-05, 'epoch': 1.58} 3%|▎ | 1607/50750 [4:28:11<80:55:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:10:55,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:10:55,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.36 | bwd_microstep: 3849.76 | bwd_inner_microstep: 3842.05 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.42 [2024-11-13 21:10:55,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.36 | bwd: 3849.78 | bwd_inner: 3842.05 | bwd_allreduce: 7.69 | step: 21.42 3%|▎ | 1608/50750 [4:28:17<80:54:06, 5.93s/it] {'loss': 0.5852, 'learning_rate': 3.99997057409745e-05, 'epoch': 1.58} 3%|▎ | 1608/50750 [4:28:17<80:54:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:11:01,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:11:01,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 3845.05 | bwd_inner_microstep: 3837.56 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.16 [2024-11-13 21:11:01,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.03 | bwd: 3845.06 | bwd_inner: 3837.56 | bwd_allreduce: 7.45 | step: 21.17 3%|▎ | 1609/50750 [4:28:23<80:52:24, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.999969877652231e-05, 'epoch': 1.59} 3%|▎ | 1609/50750 [4:28:23<80:52:24, 5.92s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:11:07,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:11:07,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.96 | bwd_microstep: 3848.83 | bwd_inner_microstep: 3841.32 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.83 [2024-11-13 21:11:07,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.96 | bwd: 3848.84 | bwd_inner: 3841.32 | bwd_allreduce: 7.48 | step: 20.83 3%|▎ | 1610/50750 [4:28:29<80:51:07, 5.92s/it] {'loss': 0.032, 'learning_rate': 3.999969173061536e-05, 'epoch': 1.59} 3%|▎ | 1610/50750 [4:28:29<80:51:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:11:13,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:11:13,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3849.80 | bwd_inner_microstep: 3842.31 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.12 [2024-11-13 21:11:13,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.88 | bwd: 3849.81 | bwd_inner: 3842.31 | bwd_allreduce: 7.46 | step: 21.12 3%|▎ | 1611/50750 [4:28:35<80:50:26, 5.92s/it] {'loss': 0.0135, 'learning_rate': 3.999968460325369e-05, 'epoch': 1.59} 3%|▎ | 1611/50750 [4:28:35<80:50:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:11:19,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 21:11:19,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.30 | bwd_microstep: 3851.99 | bwd_inner_microstep: 3844.50 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.30 [2024-11-13 
21:11:19,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.30 | bwd: 3852.00 | bwd_inner: 3844.50 | bwd_allreduce: 7.46 | step: 21.30 3%|▎ | 1612/50750 [4:28:41<80:50:13, 5.92s/it] {'loss': 0.0034, 'learning_rate': 3.999967739443731e-05, 'epoch': 1.59} 3%|▎ | 1612/50750 [4:28:41<80:50:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:11:25,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:11:25,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.57 | bwd_microstep: 3844.22 | bwd_inner_microstep: 3836.75 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.91 [2024-11-13 21:11:25,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.57 | bwd: 3844.23 | bwd_inner: 3836.75 | bwd_allreduce: 7.45 | step: 20.91 3%|▎ | 1613/50750 [4:28:47<80:48:41, 5.92s/it] {'loss': 0.2473, 'learning_rate': 3.999967010416627e-05, 'epoch': 1.59} 3%|▎ | 1613/50750 [4:28:47<80:48:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:11:31,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:11:31,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.46 | bwd_microstep: 3847.81 | bwd_inner_microstep: 3840.35 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.85 [2024-11-13 21:11:31,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.46 | bwd: 3847.82 | bwd_inner: 3840.35 | bwd_allreduce: 7.43 | step: 20.85 3%|▎ | 1614/50750 [4:28:53<80:48:03, 5.92s/it] {'loss': 0.016, 'learning_rate': 3.9999662732440576e-05, 'epoch': 1.59} 3%|▎ | 1614/50750 [4:28:53<80:48:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:11:37,096] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:11:37,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.39 | bwd_microstep: 3847.50 | bwd_inner_microstep: 3839.75 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.44 [2024-11-13 21:11:37,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.39 | bwd: 3847.51 | bwd_inner: 3839.75 | bwd_allreduce: 7.72 | step: 21.44 3%|▎ | 1615/50750 [4:28:59<80:48:14, 5.92s/it] {'loss': 0.009, 'learning_rate': 3.999965527926027e-05, 'epoch': 1.59} 3%|▎ | 1615/50750 [4:28:59<80:48:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:11:43,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 21:11:43,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.96 | bwd_microstep: 3846.95 | bwd_inner_microstep: 3839.24 | bwd_allreduce_microstep: 7.66 | step_microstep: 22.92 [2024-11-13 21:11:43,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.96 | bwd: 3846.96 | bwd_inner: 3839.24 | bwd_allreduce: 7.68 | step: 22.92 3%|▎ | 1616/50750 [4:29:04<80:49:43, 5.92s/it] {'loss': 0.0142, 'learning_rate': 3.9999647744625394e-05, 'epoch': 1.59} 3%|▎ | 1616/50750 [4:29:04<80:49:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:11:48,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 21:11:48,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.44 | bwd_microstep: 3847.17 | bwd_inner_microstep: 3839.29 | bwd_allreduce_microstep: 7.84 | step_microstep: 21.75 [2024-11-13 21:11:48,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.43 | bwd: 3847.18 
| bwd_inner: 3839.29 | bwd_allreduce: 7.86 | step: 21.75 3%|▎ | 1617/50750 [4:29:10<80:52:36, 5.93s/it] {'loss': 0.1291, 'learning_rate': 3.999964012853596e-05, 'epoch': 1.59} 3%|▎ | 1617/50750 [4:29:10<80:52:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:11:54,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 4.06 | optimizer_step: 4.93 [2024-11-13 21:11:54,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.15 | bwd_microstep: 3851.88 | bwd_inner_microstep: 3843.64 | bwd_allreduce_microstep: 8.17 | step_microstep: 31.07 [2024-11-13 21:11:54,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.14 | bwd: 3851.89 | bwd_inner: 3843.65 | bwd_allreduce: 8.19 | step: 31.07 3%|▎ | 1618/50750 [4:29:16<80:59:03, 5.93s/it] {'loss': 0.1886, 'learning_rate': 3.999963243099201e-05, 'epoch': 1.59} 3%|▎ | 1618/50750 [4:29:16<80:59:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:12:00,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:12:00,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.52 | bwd_microstep: 3849.22 | bwd_inner_microstep: 3841.44 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.91 [2024-11-13 21:12:00,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.50 | bwd: 3849.23 | bwd_inner: 3841.44 | bwd_allreduce: 7.75 | step: 21.91 3%|▎ | 1619/50750 [4:29:22<80:58:44, 5.93s/it] {'loss': 0.1381, 'learning_rate': 3.999962465199357e-05, 'epoch': 1.6} 3%|▎ | 1619/50750 [4:29:22<80:58:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:12:06,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | 
optimizer_step: 4.92 [2024-11-13 21:12:06,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.60 | bwd_microstep: 3846.20 | bwd_inner_microstep: 3838.64 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.36 [2024-11-13 21:12:06,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.60 | bwd: 3846.22 | bwd_inner: 3838.64 | bwd_allreduce: 7.53 | step: 21.37 3%|▎ | 1620/50750 [4:29:28<80:54:48, 5.93s/it] {'loss': 0.0185, 'learning_rate': 3.999961679154067e-05, 'epoch': 1.6} 3%|▎ | 1620/50750 [4:29:28<80:54:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:12:12,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:12:12,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.14 | bwd_microstep: 3842.62 | bwd_inner_microstep: 3835.07 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.38 [2024-11-13 21:12:12,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.12 | bwd: 3842.64 | bwd_inner: 3835.08 | bwd_allreduce: 7.52 | step: 21.39 3%|▎ | 1621/50750 [4:29:34<80:51:52, 5.93s/it] {'loss': 0.452, 'learning_rate': 3.999960884963335e-05, 'epoch': 1.6} 3%|▎ | 1621/50750 [4:29:34<80:51:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:12:18,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:12:18,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.37 | bwd_microstep: 3854.94 | bwd_inner_microstep: 3846.16 | bwd_allreduce_microstep: 8.73 | step_microstep: 21.90 [2024-11-13 21:12:18,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.37 | bwd: 3854.95 | bwd_inner: 3846.16 | bwd_allreduce: 8.75 | step: 21.90 3%|▎ | 1622/50750 [4:29:40<80:53:00, 5.93s/it] 
{'loss': 0.0018, 'learning_rate': 3.999960082627163e-05, 'epoch': 1.6} 3%|▎ | 1622/50750 [4:29:40<80:53:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:12:24,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:12:24,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.93 | bwd_microstep: 3863.40 | bwd_inner_microstep: 3855.87 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.18 [2024-11-13 21:12:24,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.92 | bwd: 3863.41 | bwd_inner: 3855.87 | bwd_allreduce: 7.50 | step: 21.19 3%|▎ | 1623/50750 [4:29:46<80:54:40, 5.93s/it] {'loss': 0.0035, 'learning_rate': 3.9999592721455556e-05, 'epoch': 1.6} 3%|▎ | 1623/50750 [4:29:46<80:54:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2193 [2024-11-13 21:12:30,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 21:12:30,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.63 | bwd_microstep: 3855.18 | bwd_inner_microstep: 3847.49 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.24 [2024-11-13 21:12:30,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.62 | bwd: 3855.19 | bwd_inner: 3847.49 | bwd_allreduce: 7.66 | step: 21.24 3%|▎ | 1624/50750 [4:29:52<80:53:42, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.9999584535185156e-05, 'epoch': 1.6} 3%|▎ | 1624/50750 [4:29:52<80:53:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:12:36,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:12:36,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2025.30 | bwd_microstep: 3856.76 | bwd_inner_microstep: 3849.25 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.36 [2024-11-13 21:12:36,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.30 | bwd: 3856.77 | bwd_inner: 3849.25 | bwd_allreduce: 7.48 | step: 21.37 3%|▎ | 1625/50750 [4:29:58<80:54:09, 5.93s/it] {'loss': 0.6577, 'learning_rate': 3.999957626746046e-05, 'epoch': 1.6} 3%|▎ | 1625/50750 [4:29:58<80:54:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:12:42,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:12:42,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3859.17 | bwd_inner_microstep: 3851.65 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.21 [2024-11-13 21:12:42,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3859.19 | bwd_inner: 3851.65 | bwd_allreduce: 7.49 | step: 21.22 3%|▎ | 1626/50750 [4:30:04<80:54:31, 5.93s/it] {'loss': 1.3523, 'learning_rate': 3.999956791828151e-05, 'epoch': 1.6} 3%|▎ | 1626/50750 [4:30:04<80:54:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:12:48,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 21:12:48,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.23 | bwd_microstep: 3856.33 | bwd_inner_microstep: 3848.81 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.38 [2024-11-13 21:12:48,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.23 | bwd: 3856.34 | bwd_inner: 3848.81 | bwd_allreduce: 7.50 | step: 21.38 3%|▎ | 1627/50750 [4:30:10<80:53:54, 5.93s/it] {'loss': 0.4476, 'learning_rate': 3.999955948764833e-05, 'epoch': 1.6} 3%|▎ | 1627/50750 [4:30:10<80:53:54, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:12:54,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 21:12:54,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3855.01 | bwd_inner_microstep: 3847.31 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.61 [2024-11-13 21:12:54,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3855.02 | bwd_inner: 3847.31 | bwd_allreduce: 7.67 | step: 21.61 3%|▎ | 1628/50750 [4:30:16<80:54:42, 5.93s/it] {'loss': 0.6863, 'learning_rate': 3.999955097556096e-05, 'epoch': 1.6} 3%|▎ | 1628/50750 [4:30:16<80:54:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:13:00,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:13:00,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.00 | bwd_microstep: 3856.70 | bwd_inner_microstep: 3849.07 | bwd_allreduce_microstep: 7.59 | step_microstep: 22.50 [2024-11-13 21:13:00,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3856.71 | bwd_inner: 3849.07 | bwd_allreduce: 7.60 | step: 22.50 3%|▎ | 1629/50750 [4:30:22<80:54:52, 5.93s/it] {'loss': 0.0186, 'learning_rate': 3.999954238201944e-05, 'epoch': 1.6} 3%|▎ | 1629/50750 [4:30:22<80:54:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:13:06,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:13:06,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.58 | bwd_microstep: 3850.34 | bwd_inner_microstep: 3842.75 | bwd_allreduce_microstep: 7.54 | 
step_microstep: 21.84 [2024-11-13 21:13:06,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.58 | bwd: 3850.35 | bwd_inner: 3842.75 | bwd_allreduce: 7.55 | step: 21.85 3%|▎ | 1630/50750 [4:30:28<80:52:35, 5.93s/it] {'loss': 0.0041, 'learning_rate': 3.999953370702379e-05, 'epoch': 1.61} 3%|▎ | 1630/50750 [4:30:28<80:52:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:13:11,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:13:11,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3857.96 | bwd_inner_microstep: 3850.47 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.15 [2024-11-13 21:13:11,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3857.98 | bwd_inner: 3850.47 | bwd_allreduce: 7.47 | step: 21.15 3%|▎ | 1631/50750 [4:30:33<80:52:41, 5.93s/it] {'loss': 0.1981, 'learning_rate': 3.999952495057405e-05, 'epoch': 1.61} 3%|▎ | 1631/50750 [4:30:33<80:52:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:13:17,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:13:17,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.82 | bwd_microstep: 3860.42 | bwd_inner_microstep: 3852.84 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.52 [2024-11-13 21:13:17,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.82 | bwd: 3860.44 | bwd_inner: 3852.84 | bwd_allreduce: 7.55 | step: 21.53 3%|▎ | 1632/50750 [4:30:39<80:54:40, 5.93s/it] {'loss': 0.0346, 'learning_rate': 3.999951611267027e-05, 'epoch': 1.61} 3%|▎ | 1632/50750 [4:30:39<80:54:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
21:13:23,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.46 | optimizer_step: 4.93 [2024-11-13 21:13:23,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.50 | bwd_microstep: 3859.04 | bwd_inner_microstep: 3851.47 | bwd_allreduce_microstep: 7.53 | step_microstep: 22.42 [2024-11-13 21:13:23,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.48 | bwd: 3859.05 | bwd_inner: 3851.47 | bwd_allreduce: 7.54 | step: 22.43 3%|▎ | 1633/50750 [4:30:45<80:55:54, 5.93s/it] {'loss': 0.0129, 'learning_rate': 3.999950719331246e-05, 'epoch': 1.61} 3%|▎ | 1633/50750 [4:30:45<80:55:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:13:29,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:13:29,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.64 | bwd_microstep: 3856.14 | bwd_inner_microstep: 3848.62 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.93 [2024-11-13 21:13:29,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.64 | bwd: 3856.16 | bwd_inner: 3848.62 | bwd_allreduce: 7.50 | step: 20.94 3%|▎ | 1634/50750 [4:30:51<80:54:34, 5.93s/it] {'loss': 0.0019, 'learning_rate': 3.999949819250069e-05, 'epoch': 1.61} 3%|▎ | 1634/50750 [4:30:51<80:54:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:13:35,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-13 21:13:35,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.99 | bwd_microstep: 3860.86 | bwd_inner_microstep: 3853.13 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.04 [2024-11-13 21:13:35,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2026.99 | bwd: 3860.88 | bwd_inner: 3853.13 | bwd_allreduce: 7.70 | step: 22.04 3%|▎ | 1635/50750 [4:30:57<80:56:42, 5.93s/it] {'loss': 0.0022, 'learning_rate': 3.999948911023497e-05, 'epoch': 1.61} 3%|▎ | 1635/50750 [4:30:57<80:56:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:13:41,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:13:41,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.07 | bwd_microstep: 3852.18 | bwd_inner_microstep: 3844.46 | bwd_allreduce_microstep: 7.67 | step_microstep: 23.82 [2024-11-13 21:13:41,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3852.20 | bwd_inner: 3844.46 | bwd_allreduce: 7.69 | step: 23.81 3%|▎ | 1636/50750 [4:31:03<80:55:42, 5.93s/it] {'loss': 0.008, 'learning_rate': 3.9999479946515345e-05, 'epoch': 1.61} 3%|▎ | 1636/50750 [4:31:03<80:55:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:13:47,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:13:47,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3843.74 | bwd_inner_microstep: 3836.19 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.18 [2024-11-13 21:13:47,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.26 | bwd: 3843.75 | bwd_inner: 3836.19 | bwd_allreduce: 7.52 | step: 21.18 3%|▎ | 1637/50750 [4:31:09<80:51:52, 5.93s/it] {'loss': 0.4141, 'learning_rate': 3.999947070134186e-05, 'epoch': 1.61} 3%|▎ | 1637/50750 [4:31:09<80:51:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:13:53,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:13:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.92 | bwd_microstep: 3851.24 | bwd_inner_microstep: 3843.72 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.95 [2024-11-13 21:13:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | bwd: 3851.25 | bwd_inner: 3843.72 | bwd_allreduce: 7.49 | step: 20.97 3%|▎ | 1638/50750 [4:31:15<80:50:58, 5.93s/it] {'loss': 0.0839, 'learning_rate': 3.999946137471454e-05, 'epoch': 1.61} 3%|▎ | 1638/50750 [4:31:15<80:50:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:13:59,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:13:59,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.52 | bwd_microstep: 3845.13 | bwd_inner_microstep: 3837.64 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.14 [2024-11-13 21:13:59,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.52 | bwd: 3845.15 | bwd_inner: 3837.64 | bwd_allreduce: 7.47 | step: 21.14 3%|▎ | 1639/50750 [4:31:21<80:48:05, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.9999451966633425e-05, 'epoch': 1.61} 3%|▎ | 1639/50750 [4:31:21<80:48:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:14:05,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:14:05,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.09 | bwd_microstep: 3851.99 | bwd_inner_microstep: 3844.47 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.97 [2024-11-13 21:14:05,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.09 | bwd: 3852.00 | bwd_inner: 3844.47 | bwd_allreduce: 7.49 | step: 20.98 3%|▎ | 
1640/50750 [4:31:27<80:47:51, 5.92s/it] {'loss': 0.678, 'learning_rate': 3.9999442477098555e-05, 'epoch': 1.62} 3%|▎ | 1640/50750 [4:31:27<80:47:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:14:11,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:14:11,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.75 | bwd_microstep: 3849.66 | bwd_inner_microstep: 3842.14 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.09 [2024-11-13 21:14:11,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.75 | bwd: 3849.67 | bwd_inner: 3842.14 | bwd_allreduce: 7.49 | step: 21.10 3%|▎ | 1641/50750 [4:31:33<80:46:44, 5.92s/it] {'loss': 0.0232, 'learning_rate': 3.999943290610998e-05, 'epoch': 1.62} 3%|▎ | 1641/50750 [4:31:33<80:46:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:14:17,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:14:17,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3848.66 | bwd_inner_microstep: 3841.15 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.40 [2024-11-13 21:14:17,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.85 | bwd: 3848.67 | bwd_inner: 3841.15 | bwd_allreduce: 7.49 | step: 21.40 3%|▎ | 1642/50750 [4:31:39<80:46:20, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.999942325366772e-05, 'epoch': 1.62} 3%|▎ | 1642/50750 [4:31:39<80:46:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:14:23,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:14:23,084] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.62 | bwd_microstep: 3855.58 | bwd_inner_microstep: 3848.02 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.40 [2024-11-13 21:14:23,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.62 | bwd: 3855.60 | bwd_inner: 3848.02 | bwd_allreduce: 7.53 | step: 21.40 3%|▎ | 1643/50750 [4:31:45<80:48:02, 5.92s/it] {'loss': 0.0787, 'learning_rate': 3.9999413519771825e-05, 'epoch': 1.62} 3%|▎ | 1643/50750 [4:31:45<80:48:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:14:29,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:14:29,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.81 | bwd_microstep: 3844.43 | bwd_inner_microstep: 3836.86 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.64 [2024-11-13 21:14:29,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.80 | bwd: 3844.45 | bwd_inner: 3836.86 | bwd_allreduce: 7.54 | step: 21.65 3%|▎ | 1644/50750 [4:31:50<80:47:24, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9999403704422343e-05, 'epoch': 1.62} 3%|▎ | 1644/50750 [4:31:50<80:47:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:14:34,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:14:34,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.70 | bwd_microstep: 3845.29 | bwd_inner_microstep: 3837.81 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.48 [2024-11-13 21:14:34,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.68 | bwd: 3845.30 | bwd_inner: 3837.81 | bwd_allreduce: 7.46 | step: 21.49 3%|▎ | 1645/50750 [4:31:56<80:47:57, 5.92s/it] {'loss': 0.0023, 'learning_rate': 
3.9999393807619294e-05, 'epoch': 1.62} 3%|▎ | 1645/50750 [4:31:56<80:47:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:14:40,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.99 [2024-11-13 21:14:40,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.23 | bwd_microstep: 3847.42 | bwd_inner_microstep: 3839.67 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.69 [2024-11-13 21:14:40,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.20 | bwd: 3847.43 | bwd_inner: 3839.67 | bwd_allreduce: 7.71 | step: 21.69 3%|▎ | 1646/50750 [4:32:02<80:48:41, 5.92s/it] {'loss': 0.0084, 'learning_rate': 3.999938382936274e-05, 'epoch': 1.62} 3%|▎ | 1646/50750 [4:32:02<80:48:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:14:46,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:14:46,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.07 | bwd_microstep: 3848.20 | bwd_inner_microstep: 3840.61 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.40 [2024-11-13 21:14:46,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.06 | bwd: 3848.21 | bwd_inner: 3840.61 | bwd_allreduce: 7.56 | step: 21.40 3%|▎ | 1647/50750 [4:32:08<80:49:28, 5.93s/it] {'loss': 0.7528, 'learning_rate': 3.9999373769652694e-05, 'epoch': 1.62} 3%|▎ | 1647/50750 [4:32:08<80:49:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:14:52,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:14:52,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.60 | bwd_microstep: 
3849.39 | bwd_inner_microstep: 3838.52 | bwd_allreduce_microstep: 10.82 | step_microstep: 21.27 [2024-11-13 21:14:52,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3849.40 | bwd_inner: 3838.52 | bwd_allreduce: 10.84 | step: 21.27 3%|▎ | 1648/50750 [4:32:14<80:48:52, 5.93s/it] {'loss': 0.0043, 'learning_rate': 3.999936362848922e-05, 'epoch': 1.62} 3%|▎ | 1648/50750 [4:32:14<80:48:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:14:58,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:14:58,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.28 | bwd_microstep: 3849.04 | bwd_inner_microstep: 3841.53 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.13 [2024-11-13 21:14:58,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.26 | bwd: 3849.05 | bwd_inner: 3841.53 | bwd_allreduce: 7.48 | step: 21.13 3%|▎ | 1649/50750 [4:32:20<80:49:35, 5.93s/it] {'loss': 0.6122, 'learning_rate': 3.999935340587236e-05, 'epoch': 1.62} 3%|▎ | 1649/50750 [4:32:20<80:49:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:15:04,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:15:04,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.88 | bwd_microstep: 3856.07 | bwd_inner_microstep: 3848.55 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.18 [2024-11-13 21:15:04,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.87 | bwd: 3856.08 | bwd_inner: 3848.55 | bwd_allreduce: 7.49 | step: 21.19 3%|▎ | 1650/50750 [4:32:26<80:49:18, 5.93s/it] {'loss': 0.0218, 'learning_rate': 3.999934310180214e-05, 'epoch': 1.63} 3%|▎ | 1650/50750 [4:32:26<80:49:18, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:15:10,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.60 | optimizer_step: 4.93 [2024-11-13 21:15:10,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.17 | bwd_microstep: 3853.80 | bwd_inner_microstep: 3845.78 | bwd_allreduce_microstep: 7.96 | step_microstep: 29.99 [2024-11-13 21:15:10,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.17 | bwd: 3853.82 | bwd_inner: 3845.78 | bwd_allreduce: 7.99 | step: 29.98 3%|▎ | 1651/50750 [4:32:32<80:51:52, 5.93s/it] {'loss': 0.4707, 'learning_rate': 3.999933271627861e-05, 'epoch': 1.63} 3%|▎ | 1651/50750 [4:32:32<80:51:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:15:16,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:15:16,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.65 | bwd_microstep: 3857.76 | bwd_inner_microstep: 3850.21 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.55 [2024-11-13 21:15:16,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.64 | bwd: 3857.77 | bwd_inner: 3850.21 | bwd_allreduce: 7.53 | step: 21.56 3%|▎ | 1652/50750 [4:32:38<80:55:27, 5.93s/it] {'loss': 0.0018, 'learning_rate': 3.9999322249301815e-05, 'epoch': 1.63} 3%|▎ | 1652/50750 [4:32:38<80:55:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:15:22,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.39 | optimizer_step: 4.93 [2024-11-13 21:15:22,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.64 | bwd_microstep: 3856.45 | bwd_inner_microstep: 3848.96 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.94 [2024-11-13 
21:15:22,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.62 | bwd: 3856.46 | bwd_inner: 3848.96 | bwd_allreduce: 7.46 | step: 21.94 3%|▎ | 1653/50750 [4:32:44<80:55:48, 5.93s/it] {'loss': 0.0468, 'learning_rate': 3.999931170087179e-05, 'epoch': 1.63} 3%|▎ | 1653/50750 [4:32:44<80:55:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:15:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:15:28,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.21 | bwd_microstep: 3849.37 | bwd_inner_microstep: 3841.72 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.55 [2024-11-13 21:15:28,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.21 | bwd: 3849.38 | bwd_inner: 3841.72 | bwd_allreduce: 7.62 | step: 21.56 3%|▎ | 1654/50750 [4:32:50<80:53:14, 5.93s/it] {'loss': 0.3241, 'learning_rate': 3.999930107098859e-05, 'epoch': 1.63} 3%|▎ | 1654/50750 [4:32:50<80:53:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:15:34,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.47 | optimizer_step: 4.93 [2024-11-13 21:15:34,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.39 | bwd_microstep: 3849.84 | bwd_inner_microstep: 3842.25 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.68 [2024-11-13 21:15:34,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.38 | bwd: 3849.86 | bwd_inner: 3842.25 | bwd_allreduce: 7.56 | step: 22.68 3%|▎ | 1655/50750 [4:32:56<80:54:24, 5.93s/it] {'loss': 0.1777, 'learning_rate': 3.999929035965225e-05, 'epoch': 1.63} 3%|▎ | 1655/50750 [4:32:56<80:54:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:15:40,158] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:15:40,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.74 | bwd_microstep: 3843.74 | bwd_inner_microstep: 3836.27 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.78 [2024-11-13 21:15:40,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.74 | bwd: 3843.75 | bwd_inner: 3836.27 | bwd_allreduce: 7.44 | step: 20.78 3%|▎ | 1656/50750 [4:33:02<80:50:36, 5.93s/it] {'loss': 0.3202, 'learning_rate': 3.999927956686281e-05, 'epoch': 1.63} 3%|▎ | 1656/50750 [4:33:02<80:50:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:15:46,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:15:46,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.84 | bwd_microstep: 3851.25 | bwd_inner_microstep: 3843.76 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-13 21:15:46,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.84 | bwd: 3851.27 | bwd_inner: 3843.76 | bwd_allreduce: 7.47 | step: 21.11 3%|▎ | 1657/50750 [4:33:08<80:50:36, 5.93s/it] {'loss': 0.1018, 'learning_rate': 3.9999268692620326e-05, 'epoch': 1.63} 3%|▎ | 1657/50750 [4:33:08<80:50:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:15:52,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.90 | optimizer_step: 4.93 [2024-11-13 21:15:52,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.31 | bwd_microstep: 3845.75 | bwd_inner_microstep: 3838.17 | bwd_allreduce_microstep: 7.54 | step_microstep: 23.87 [2024-11-13 21:15:52,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.31 | bwd: 
3845.77 | bwd_inner: 3838.17 | bwd_allreduce: 7.55 | step: 23.90 3%|▎ | 1658/50750 [4:33:13<80:49:56, 5.93s/it] {'loss': 0.0079, 'learning_rate': 3.999925773692483e-05, 'epoch': 1.63} 3%|▎ | 1658/50750 [4:33:13<80:49:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:15:57,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 21:15:57,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.30 | bwd_microstep: 3859.08 | bwd_inner_microstep: 3851.49 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.85 [2024-11-13 21:15:57,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.30 | bwd: 3859.09 | bwd_inner: 3851.49 | bwd_allreduce: 7.56 | step: 21.85 3%|▎ | 1659/50750 [4:33:19<80:52:45, 5.93s/it] {'loss': 0.3956, 'learning_rate': 3.999924669977637e-05, 'epoch': 1.63} 3%|▎ | 1659/50750 [4:33:19<80:52:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:16:03,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:16:03,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.12 | bwd_microstep: 3854.01 | bwd_inner_microstep: 3846.50 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98 [2024-11-13 21:16:03,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.12 | bwd: 3854.02 | bwd_inner: 3846.50 | bwd_allreduce: 7.49 | step: 20.99 3%|▎ | 1660/50750 [4:33:25<80:52:12, 5.93s/it] {'loss': 0.1282, 'learning_rate': 3.9999235581175e-05, 'epoch': 1.64} 3%|▎ | 1660/50750 [4:33:25<80:52:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:16:09,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | 
optimizer_step: 4.93 [2024-11-13 21:16:09,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.87 | bwd_microstep: 3851.10 | bwd_inner_microstep: 3842.67 | bwd_allreduce_microstep: 8.38 | step_microstep: 21.93 [2024-11-13 21:16:09,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.87 | bwd: 3851.11 | bwd_inner: 3842.67 | bwd_allreduce: 8.40 | step: 21.93 3%|▎ | 1661/50750 [4:33:31<80:50:58, 5.93s/it] {'loss': 0.0203, 'learning_rate': 3.9999224381120754e-05, 'epoch': 1.64} 3%|▎ | 1661/50750 [4:33:31<80:50:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:16:15,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:16:15,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.13 | bwd_microstep: 3847.21 | bwd_inner_microstep: 3839.66 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.74 [2024-11-13 21:16:15,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.11 | bwd: 3847.23 | bwd_inner: 3839.66 | bwd_allreduce: 7.52 | step: 20.74 3%|▎ | 1662/50750 [4:33:37<80:51:40, 5.93s/it] {'loss': 0.0045, 'learning_rate': 3.999921309961369e-05, 'epoch': 1.64} 3%|▎ | 1662/50750 [4:33:37<80:51:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:16:21,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:16:21,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.90 | bwd_microstep: 3848.06 | bwd_inner_microstep: 3840.56 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.18 [2024-11-13 21:16:21,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.86 | bwd: 3848.07 | bwd_inner: 3840.55 | bwd_allreduce: 7.48 | step: 21.18 3%|▎ | 1663/50750 [4:33:43<80:52:04, 
5.93s/it] {'loss': 0.3925, 'learning_rate': 3.999920173665383e-05, 'epoch': 1.64} 3%|▎ | 1663/50750 [4:33:43<80:52:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:16:27,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 21:16:27,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.73 | bwd_microstep: 3848.80 | bwd_inner_microstep: 3841.02 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.68 [2024-11-13 21:16:27,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.73 | bwd: 3848.82 | bwd_inner: 3841.02 | bwd_allreduce: 7.75 | step: 21.69 3%|▎ | 1664/50750 [4:33:49<80:50:07, 5.93s/it] {'loss': 0.6515, 'learning_rate': 3.999919029224124e-05, 'epoch': 1.64} 3%|▎ | 1664/50750 [4:33:49<80:50:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:16:33,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:16:33,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.80 | bwd_microstep: 3849.44 | bwd_inner_microstep: 3841.91 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.28 [2024-11-13 21:16:33,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.80 | bwd: 3849.45 | bwd_inner: 3841.91 | bwd_allreduce: 7.50 | step: 21.28 3%|▎ | 1665/50750 [4:33:55<80:49:07, 5.93s/it] {'loss': 0.132, 'learning_rate': 3.9999178766375964e-05, 'epoch': 1.64} 3%|▎ | 1665/50750 [4:33:55<80:49:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:16:39,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 21:16:39,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2026.42 | bwd_microstep: 3848.30 | bwd_inner_microstep: 3840.39 | bwd_allreduce_microstep: 7.86 | step_microstep: 21.61 [2024-11-13 21:16:39,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.42 | bwd: 3848.31 | bwd_inner: 3840.40 | bwd_allreduce: 7.87 | step: 21.62 3%|▎ | 1666/50750 [4:34:01<80:48:09, 5.93s/it] {'loss': 0.0256, 'learning_rate': 3.999916715905805e-05, 'epoch': 1.64} 3%|▎ | 1666/50750 [4:34:01<80:48:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:16:45,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:16:45,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.34 | bwd_microstep: 3848.75 | bwd_inner_microstep: 3840.95 | bwd_allreduce_microstep: 7.74 | step_microstep: 24.24 [2024-11-13 21:16:45,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.32 | bwd: 3848.77 | bwd_inner: 3840.95 | bwd_allreduce: 7.76 | step: 24.24 3%|▎ | 1667/50750 [4:34:07<80:49:36, 5.93s/it] {'loss': 0.169, 'learning_rate': 3.9999155470287544e-05, 'epoch': 1.64} 3%|▎ | 1667/50750 [4:34:07<80:49:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:16:51,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:16:51,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.70 | bwd_microstep: 3844.96 | bwd_inner_microstep: 3837.44 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.09 [2024-11-13 21:16:51,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.70 | bwd: 3844.97 | bwd_inner: 3837.44 | bwd_allreduce: 7.49 | step: 21.09 3%|▎ | 1668/50750 [4:34:13<80:48:25, 5.93s/it] {'loss': 0.66, 'learning_rate': 3.99991437000645e-05, 'epoch': 1.64} 3%|▎ | 1668/50750 
[4:34:13<80:48:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:16:57,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 21:16:57,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.32 | bwd_microstep: 3854.41 | bwd_inner_microstep: 3846.91 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.25 [2024-11-13 21:16:57,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.32 | bwd: 3854.43 | bwd_inner: 3846.91 | bwd_allreduce: 7.48 | step: 21.26 3%|▎ | 1669/50750 [4:34:19<80:48:34, 5.93s/it] {'loss': 0.0235, 'learning_rate': 3.999913184838894e-05, 'epoch': 1.64} 3%|▎ | 1669/50750 [4:34:19<80:48:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:17:03,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:17:03,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.84 | bwd_microstep: 3850.53 | bwd_inner_microstep: 3842.99 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.04 [2024-11-13 21:17:03,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.84 | bwd: 3850.54 | bwd_inner: 3842.99 | bwd_allreduce: 7.51 | step: 21.04 3%|▎ | 1670/50750 [4:34:25<80:47:10, 5.93s/it] {'loss': 0.0658, 'learning_rate': 3.9999119915260936e-05, 'epoch': 1.65} 3%|▎ | 1670/50750 [4:34:25<80:47:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:17:09,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:17:09,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.18 | bwd_microstep: 3857.35 | bwd_inner_microstep: 3849.82 | 
bwd_allreduce_microstep: 7.48 | step_microstep: 21.27 [2024-11-13 21:17:09,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.18 | bwd: 3857.36 | bwd_inner: 3849.82 | bwd_allreduce: 7.50 | step: 21.28 3%|▎ | 1671/50750 [4:34:31<80:49:17, 5.93s/it] {'loss': 0.1096, 'learning_rate': 3.999910790068054e-05, 'epoch': 1.65} 3%|▎ | 1671/50750 [4:34:31<80:49:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:17:15,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:17:15,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.17 | bwd_microstep: 3851.18 | bwd_inner_microstep: 3843.65 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.08 [2024-11-13 21:17:15,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.17 | bwd: 3851.19 | bwd_inner: 3843.65 | bwd_allreduce: 7.50 | step: 21.10 3%|▎ | 1672/50750 [4:34:36<80:47:54, 5.93s/it] {'loss': 0.3582, 'learning_rate': 3.999909580464778e-05, 'epoch': 1.65} 3%|▎ | 1672/50750 [4:34:36<80:47:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:17:20,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:17:20,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.85 | bwd_microstep: 3849.03 | bwd_inner_microstep: 3841.56 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.49 [2024-11-13 21:17:20,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.85 | bwd: 3849.05 | bwd_inner: 3841.56 | bwd_allreduce: 7.45 | step: 21.49 3%|▎ | 1673/50750 [4:34:42<80:46:48, 5.93s/it] {'loss': 0.0433, 'learning_rate': 3.9999083627162725e-05, 'epoch': 1.65} 3%|▎ | 1673/50750 [4:34:42<80:46:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-13 21:17:26,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 21:17:26,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.21 | bwd_microstep: 3845.46 | bwd_inner_microstep: 3837.98 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.07 [2024-11-13 21:17:26,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.21 | bwd: 3845.47 | bwd_inner: 3837.98 | bwd_allreduce: 7.45 | step: 21.07 3%|▎ | 1674/50750 [4:34:48<80:44:32, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.999907136822542e-05, 'epoch': 1.65} 3%|▎ | 1674/50750 [4:34:48<80:44:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:17:32,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:17:32,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.51 | bwd_microstep: 3851.27 | bwd_inner_microstep: 3843.80 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.90 [2024-11-13 21:17:32,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.51 | bwd: 3851.28 | bwd_inner: 3843.80 | bwd_allreduce: 7.45 | step: 20.90 3%|▎ | 1675/50750 [4:34:54<80:44:49, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.99990590278359e-05, 'epoch': 1.65} 3%|▎ | 1675/50750 [4:34:54<80:44:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2195 [2024-11-13 21:17:38,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:17:38,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.69 | bwd_microstep: 3847.34 | bwd_inner_microstep: 3839.81 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.02 [2024-11-13 21:17:38,698] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.69 | bwd: 3847.35 | bwd_inner: 3839.81 | bwd_allreduce: 7.50 | step: 21.02 3%|▎ | 1676/50750 [4:35:00<80:45:33, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.999904660599423e-05, 'epoch': 1.65} 3%|▎ | 1676/50750 [4:35:00<80:45:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:17:44,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:17:44,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.86 | bwd_microstep: 3850.77 | bwd_inner_microstep: 3843.17 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.10 [2024-11-13 21:17:44,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.85 | bwd: 3850.79 | bwd_inner: 3843.17 | bwd_allreduce: 7.57 | step: 22.11 3%|▎ | 1677/50750 [4:35:06<80:47:00, 5.93s/it] {'loss': 0.0041, 'learning_rate': 3.999903410270046e-05, 'epoch': 1.65} 3%|▎ | 1677/50750 [4:35:06<80:47:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:17:50,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 21:17:50,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.16 | bwd_microstep: 3849.13 | bwd_inner_microstep: 3841.61 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.30 [2024-11-13 21:17:50,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3849.14 | bwd_inner: 3841.61 | bwd_allreduce: 7.50 | step: 21.30 3%|▎ | 1678/50750 [4:35:12<80:45:56, 5.93s/it] {'loss': 0.0476, 'learning_rate': 3.999902151795465e-05, 'epoch': 1.65} 3%|▎ | 1678/50750 [4:35:12<80:45:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:17:56,467] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:17:56,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.08 | bwd_microstep: 3844.23 | bwd_inner_microstep: 3836.46 | bwd_allreduce_microstep: 7.70 | step_microstep: 22.85 [2024-11-13 21:17:56,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.08 | bwd: 3844.25 | bwd_inner: 3836.46 | bwd_allreduce: 7.73 | step: 22.86 3%|▎ | 1679/50750 [4:35:18<80:43:18, 5.92s/it] {'loss': 0.0501, 'learning_rate': 3.9999008851756826e-05, 'epoch': 1.65} 3%|▎ | 1679/50750 [4:35:18<80:43:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:18:02,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.92 [2024-11-13 21:18:02,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3839.12 | bwd_inner_microstep: 3831.39 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.70 [2024-11-13 21:18:02,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3839.13 | bwd_inner: 3831.39 | bwd_allreduce: 7.70 | step: 21.70 3%|▎ | 1680/50750 [4:35:24<80:40:56, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.9998996104107054e-05, 'epoch': 1.66} 3%|▎ | 1680/50750 [4:35:24<80:40:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:18:08,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:18:08,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.65 | bwd_microstep: 3847.66 | bwd_inner_microstep: 3840.14 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.06 [2024-11-13 21:18:08,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.63 | bwd: 3847.67 | bwd_inner: 3840.14 | 
bwd_allreduce: 7.48 | step: 21.07 3%|▎ | 1681/50750 [4:35:30<80:41:17, 5.92s/it] {'loss': 0.4454, 'learning_rate': 3.999898327500539e-05, 'epoch': 1.66} 3%|▎ | 1681/50750 [4:35:30<80:41:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:18:14,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:18:14,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.02 | bwd_microstep: 3851.11 | bwd_inner_microstep: 3843.58 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.97 [2024-11-13 21:18:14,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.02 | bwd: 3851.12 | bwd_inner: 3843.58 | bwd_allreduce: 7.50 | step: 20.97 3%|▎ | 1682/50750 [4:35:36<80:42:15, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.999897036445188e-05, 'epoch': 1.66} 3%|▎ | 1682/50750 [4:35:36<80:42:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:18:20,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:18:20,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.02 | bwd_microstep: 3847.52 | bwd_inner_microstep: 3839.77 | bwd_allreduce_microstep: 7.70 | step_microstep: 25.30 [2024-11-13 21:18:20,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.02 | bwd: 3847.54 | bwd_inner: 3839.77 | bwd_allreduce: 7.72 | step: 25.29 3%|▎ | 1683/50750 [4:35:42<80:42:41, 5.92s/it] {'loss': 0.6581, 'learning_rate': 3.9998957372446584e-05, 'epoch': 1.66} 3%|▎ | 1683/50750 [4:35:42<80:42:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:18:26,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 
[2024-11-13 21:18:26,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.56 | bwd_microstep: 3850.12 | bwd_inner_microstep: 3842.62 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-13 21:18:26,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.56 | bwd: 3850.13 | bwd_inner: 3842.62 | bwd_allreduce: 7.47 | step: 20.95 3%|▎ | 1684/50750 [4:35:48<80:42:21, 5.92s/it] {'loss': 0.1435, 'learning_rate': 3.999894429898954e-05, 'epoch': 1.66} 3%|▎ | 1684/50750 [4:35:48<80:42:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:18:31,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:18:31,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.67 | bwd_microstep: 3843.57 | bwd_inner_microstep: 3836.05 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.97 [2024-11-13 21:18:31,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.67 | bwd: 3843.58 | bwd_inner: 3836.05 | bwd_allreduce: 7.49 | step: 20.97 3%|▎ | 1685/50750 [4:35:53<80:40:18, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.999893114408081e-05, 'epoch': 1.66} 3%|▎ | 1685/50750 [4:35:53<80:40:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:18:37,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:18:37,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.10 | bwd_microstep: 3845.02 | bwd_inner_microstep: 3837.48 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.97 [2024-11-13 21:18:37,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.10 | bwd: 3845.03 | bwd_inner: 3837.48 | bwd_allreduce: 7.50 | step: 20.98 3%|▎ | 1686/50750 [4:35:59<80:39:24, 5.92s/it] {'loss': 1.94, 
'learning_rate': 3.999891790772045e-05, 'epoch': 1.66} 3%|▎ | 1686/50750 [4:35:59<80:39:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:18:43,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:18:43,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.01 | bwd_microstep: 3845.98 | bwd_inner_microstep: 3838.43 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.32 [2024-11-13 21:18:43,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.01 | bwd: 3845.99 | bwd_inner: 3838.43 | bwd_allreduce: 7.52 | step: 21.33 3%|▎ | 1687/50750 [4:36:05<80:38:56, 5.92s/it] {'loss': 0.0986, 'learning_rate': 3.999890458990852e-05, 'epoch': 1.66} 3%|▎ | 1687/50750 [4:36:05<80:38:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:18:49,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:18:49,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.87 | bwd_microstep: 3843.44 | bwd_inner_microstep: 3835.90 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.96 [2024-11-13 21:18:49,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3843.45 | bwd_inner: 3835.90 | bwd_allreduce: 7.51 | step: 20.96 3%|▎ | 1688/50750 [4:36:11<80:38:07, 5.92s/it] {'loss': 0.0104, 'learning_rate': 3.9998891190645056e-05, 'epoch': 1.66} 3%|▎ | 1688/50750 [4:36:11<80:38:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:18:55,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:18:55,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.02 | 
bwd_microstep: 3848.18 | bwd_inner_microstep: 3840.64 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.23 [2024-11-13 21:18:55,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.02 | bwd: 3848.19 | bwd_inner: 3840.64 | bwd_allreduce: 7.51 | step: 21.23 3%|▎ | 1689/50750 [4:36:17<80:39:31, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.999887770993013e-05, 'epoch': 1.66} 3%|▎ | 1689/50750 [4:36:17<80:39:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:19:01,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:19:01,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.99 | bwd_microstep: 3851.17 | bwd_inner_microstep: 3843.66 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.97 [2024-11-13 21:19:01,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.99 | bwd: 3851.18 | bwd_inner: 3843.66 | bwd_allreduce: 7.48 | step: 20.98 3%|▎ | 1690/50750 [4:36:23<80:40:25, 5.92s/it] {'loss': 0.0121, 'learning_rate': 3.999886414776378e-05, 'epoch': 1.67} 3%|▎ | 1690/50750 [4:36:23<80:40:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:19:07,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:19:07,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.62 | bwd_microstep: 3848.24 | bwd_inner_microstep: 3840.72 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.28 [2024-11-13 21:19:07,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.62 | bwd: 3848.25 | bwd_inner: 3840.72 | bwd_allreduce: 7.49 | step: 21.29 3%|▎ | 1691/50750 [4:36:29<80:39:55, 5.92s/it] {'loss': 0.4756, 'learning_rate': 3.999885050414608e-05, 'epoch': 1.67} 3%|▎ | 1691/50750 [4:36:29<80:39:55, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:19:13,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:19:13,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.07 | bwd_microstep: 3845.23 | bwd_inner_microstep: 3837.71 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.97 [2024-11-13 21:19:13,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.07 | bwd: 3845.24 | bwd_inner: 3837.71 | bwd_allreduce: 7.49 | step: 20.97 3%|▎ | 1692/50750 [4:36:35<80:38:55, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.999883677907707e-05, 'epoch': 1.67} 3%|▎ | 1692/50750 [4:36:35<80:38:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:19:19,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 21:19:19,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.11 | bwd_microstep: 3853.33 | bwd_inner_microstep: 3845.82 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.95 [2024-11-13 21:19:19,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.11 | bwd: 3853.34 | bwd_inner: 3845.82 | bwd_allreduce: 7.48 | step: 20.95 3%|▎ | 1693/50750 [4:36:41<80:39:39, 5.92s/it] {'loss': 0.0104, 'learning_rate': 3.999882297255682e-05, 'epoch': 1.67} 3%|▎ | 1693/50750 [4:36:41<80:39:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:19:25,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:19:25,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.76 | bwd_microstep: 3847.84 | bwd_inner_microstep: 3840.33 | bwd_allreduce_microstep: 7.46 | 
step_microstep: 20.91 [2024-11-13 21:19:25,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.76 | bwd: 3847.85 | bwd_inner: 3840.33 | bwd_allreduce: 7.48 | step: 20.92 3%|▎ | 1694/50750 [4:36:47<80:39:39, 5.92s/it] {'loss': 0.0107, 'learning_rate': 3.999880908458537e-05, 'epoch': 1.67} 3%|▎ | 1694/50750 [4:36:47<80:39:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:19:31,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:19:31,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.50 | bwd_microstep: 3844.31 | bwd_inner_microstep: 3836.79 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.10 [2024-11-13 21:19:31,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.50 | bwd: 3844.32 | bwd_inner: 3836.79 | bwd_allreduce: 7.49 | step: 21.11 3%|▎ | 1695/50750 [4:36:53<80:38:33, 5.92s/it] {'loss': 0.6593, 'learning_rate': 3.9998795115162796e-05, 'epoch': 1.67} 3%|▎ | 1695/50750 [4:36:53<80:38:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:19:37,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:19:37,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.56 | bwd_microstep: 3845.94 | bwd_inner_microstep: 3838.14 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.65 [2024-11-13 21:19:37,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.55 | bwd: 3845.96 | bwd_inner: 3838.14 | bwd_allreduce: 7.78 | step: 21.66 3%|▎ | 1696/50750 [4:36:59<80:39:08, 5.92s/it] {'loss': 0.1092, 'learning_rate': 3.999878106428913e-05, 'epoch': 1.67} 3%|▎ | 1696/50750 [4:36:59<80:39:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
21:19:43,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:19:43,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.15 | bwd_microstep: 3848.62 | bwd_inner_microstep: 3841.12 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.18 [2024-11-13 21:19:43,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 3848.64 | bwd_inner: 3841.12 | bwd_allreduce: 7.47 | step: 21.18 3%|▎ | 1697/50750 [4:37:04<80:39:13, 5.92s/it] {'loss': 0.0031, 'learning_rate': 3.999876693196445e-05, 'epoch': 1.67} 3%|▎ | 1697/50750 [4:37:04<80:39:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:19:48,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:19:48,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.50 | bwd_microstep: 3851.33 | bwd_inner_microstep: 3843.82 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.06 [2024-11-13 21:19:48,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.50 | bwd: 3851.34 | bwd_inner: 3843.82 | bwd_allreduce: 7.49 | step: 21.08 3%|▎ | 1698/50750 [4:37:10<80:39:57, 5.92s/it] {'loss': 0.0112, 'learning_rate': 3.999875271818881e-05, 'epoch': 1.67} 3%|▎ | 1698/50750 [4:37:10<80:39:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:19:54,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:19:54,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.83 | bwd_microstep: 3851.39 | bwd_inner_microstep: 3843.87 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-13 21:19:54,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.83 | bwd: 3851.40 | bwd_inner: 3843.87 | bwd_allreduce: 7.49 | step: 21.07 3%|▎ | 1699/50750 [4:37:16<80:40:22, 5.92s/it] {'loss': 0.0028, 'learning_rate': 3.999873842296226e-05, 'epoch': 1.67} 3%|▎ | 1699/50750 [4:37:16<80:40:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:20:00,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.92 [2024-11-13 21:20:00,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.10 | bwd_microstep: 3844.72 | bwd_inner_microstep: 3836.75 | bwd_allreduce_microstep: 7.92 | step_microstep: 24.87 [2024-11-13 21:20:00,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.10 | bwd: 3844.74 | bwd_inner: 3836.75 | bwd_allreduce: 7.94 | step: 24.87 3%|▎ | 1700/50750 [4:37:22<80:40:33, 5.92s/it] {'loss': 0.0093, 'learning_rate': 3.999872404628487e-05, 'epoch': 1.67} 3%|▎ | 1700/50750 [4:37:22<80:40:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:20:06,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:20:06,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.89 | bwd_microstep: 3851.95 | bwd_inner_microstep: 3844.35 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.28 [2024-11-13 21:20:06,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.87 | bwd: 3851.96 | bwd_inner: 3844.35 | bwd_allreduce: 7.57 | step: 21.28 3%|▎ | 1701/50750 [4:37:28<80:42:04, 5.92s/it] {'loss': 0.0059, 'learning_rate': 3.999870958815668e-05, 'epoch': 1.68} 3%|▎ | 1701/50750 [4:37:28<80:42:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:20:12,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:20:12,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.11 | bwd_microstep: 3847.50 | bwd_inner_microstep: 3839.97 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 21:20:12,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.11 | bwd: 3847.51 | bwd_inner: 3839.97 | bwd_allreduce: 7.49 | step: 20.97 3%|▎ | 1702/50750 [4:37:34<80:40:15, 5.92s/it] {'loss': 0.5678, 'learning_rate': 3.999869504857776e-05, 'epoch': 1.68} 3%|▎ | 1702/50750 [4:37:34<80:40:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:20:18,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 21:20:18,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.23 | bwd_microstep: 3848.47 | bwd_inner_microstep: 3840.96 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.26 [2024-11-13 21:20:18,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3848.49 | bwd_inner: 3840.96 | bwd_allreduce: 7.48 | step: 21.26 3%|▎ | 1703/50750 [4:37:40<80:39:52, 5.92s/it] {'loss': 0.0024, 'learning_rate': 3.999868042754818e-05, 'epoch': 1.68} 3%|▎ | 1703/50750 [4:37:40<80:39:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:20:24,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:20:24,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 3856.92 | bwd_inner_microstep: 3849.41 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.99 [2024-11-13 21:20:24,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3856.94 | bwd_inner: 3849.41 | bwd_allreduce: 7.49 | step: 20.99 3%|▎ | 
1704/50750 [4:37:46<80:41:17, 5.92s/it] {'loss': 0.0034, 'learning_rate': 3.999866572506798e-05, 'epoch': 1.68} 3%|▎ | 1704/50750 [4:37:46<80:41:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:20:30,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:20:30,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.90 | bwd_microstep: 3850.78 | bwd_inner_microstep: 3843.03 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.68 [2024-11-13 21:20:30,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.89 | bwd: 3850.79 | bwd_inner: 3843.03 | bwd_allreduce: 7.71 | step: 22.68 3%|▎ | 1705/50750 [4:37:52<80:41:19, 5.92s/it] {'loss': 1.3219, 'learning_rate': 3.9998650941137234e-05, 'epoch': 1.68} 3%|▎ | 1705/50750 [4:37:52<80:41:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:20:36,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:20:36,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.64 | bwd_microstep: 3852.31 | bwd_inner_microstep: 3844.75 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.09 [2024-11-13 21:20:36,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.64 | bwd: 3852.32 | bwd_inner: 3844.75 | bwd_allreduce: 7.53 | step: 21.09 3%|▎ | 1706/50750 [4:37:58<80:42:36, 5.92s/it] {'loss': 0.2929, 'learning_rate': 3.9998636075755996e-05, 'epoch': 1.68} 3%|▎ | 1706/50750 [4:37:58<80:42:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:20:42,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:20:42,242] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.57 | bwd_microstep: 3854.82 | bwd_inner_microstep: 3847.10 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.07 [2024-11-13 21:20:42,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.57 | bwd: 3854.84 | bwd_inner: 3847.10 | bwd_allreduce: 7.69 | step: 21.07 3%|▎ | 1707/50750 [4:38:04<80:42:55, 5.92s/it] {'loss': 0.0137, 'learning_rate': 3.999862112892433e-05, 'epoch': 1.68} 3%|▎ | 1707/50750 [4:38:04<80:42:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:20:48,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:20:48,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.33 | bwd_microstep: 3851.00 | bwd_inner_microstep: 3843.50 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-13 21:20:48,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.33 | bwd: 3851.02 | bwd_inner: 3843.50 | bwd_allreduce: 7.48 | step: 20.95 3%|▎ | 1708/50750 [4:38:10<80:41:48, 5.92s/it] {'loss': 0.0114, 'learning_rate': 3.999860610064229e-05, 'epoch': 1.68} 3%|▎ | 1708/50750 [4:38:10<80:41:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:20:54,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 21:20:54,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.50 | bwd_microstep: 3851.51 | bwd_inner_microstep: 3843.84 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.58 [2024-11-13 21:20:54,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.50 | bwd: 3851.52 | bwd_inner: 3843.84 | bwd_allreduce: 7.64 | step: 21.58 3%|▎ | 1709/50750 [4:38:16<80:42:55, 5.93s/it] {'loss': 0.0187, 'learning_rate': 
3.999859099090994e-05, 'epoch': 1.68} 3%|▎ | 1709/50750 [4:38:16<80:42:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:21:00,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:21:00,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.63 | bwd_microstep: 3850.55 | bwd_inner_microstep: 3840.36 | bwd_allreduce_microstep: 10.09 | step_microstep: 21.59 [2024-11-13 21:21:00,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.61 | bwd: 3850.57 | bwd_inner: 3840.36 | bwd_allreduce: 10.13 | step: 21.58 3%|▎ | 1710/50750 [4:38:21<80:43:03, 5.93s/it] {'loss': 0.0258, 'learning_rate': 3.9998575799727355e-05, 'epoch': 1.68} 3%|▎ | 1710/50750 [4:38:21<80:43:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:21:05,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:21:05,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.96 | bwd_microstep: 3844.98 | bwd_inner_microstep: 3837.51 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.09 [2024-11-13 21:21:05,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.96 | bwd: 3845.00 | bwd_inner: 3837.51 | bwd_allreduce: 7.45 | step: 21.09 3%|▎ | 1711/50750 [4:38:27<80:41:43, 5.92s/it] {'loss': 0.0791, 'learning_rate': 3.9998560527094574e-05, 'epoch': 1.69} 3%|▎ | 1711/50750 [4:38:27<80:41:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:21:11,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.51 | optimizer_step: 4.93 [2024-11-13 21:21:11,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.71 | bwd_microstep: 
3845.70 | bwd_inner_microstep: 3838.15 | bwd_allreduce_microstep: 7.51 | step_microstep: 22.72 [2024-11-13 21:21:11,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.71 | bwd: 3845.72 | bwd_inner: 3838.15 | bwd_allreduce: 7.52 | step: 22.73 3%|▎ | 1712/50750 [4:38:33<80:40:57, 5.92s/it] {'loss': 0.0209, 'learning_rate': 3.999854517301167e-05, 'epoch': 1.69} 3%|▎ | 1712/50750 [4:38:33<80:40:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:21:17,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:21:17,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.36 | bwd_microstep: 3854.30 | bwd_inner_microstep: 3846.73 | bwd_allreduce_microstep: 7.53 | step_microstep: 22.52 [2024-11-13 21:21:17,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.33 | bwd: 3854.32 | bwd_inner: 3846.73 | bwd_allreduce: 7.54 | step: 22.53 3%|▎ | 1713/50750 [4:38:39<80:42:20, 5.92s/it] {'loss': 0.1059, 'learning_rate': 3.9998529737478715e-05, 'epoch': 1.69} 3%|▎ | 1713/50750 [4:38:39<80:42:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:21:23,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:21:23,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.29 | bwd_microstep: 3847.55 | bwd_inner_microstep: 3840.00 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.03 [2024-11-13 21:21:23,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.29 | bwd: 3847.57 | bwd_inner: 3840.00 | bwd_allreduce: 7.52 | step: 21.03 3%|▎ | 1714/50750 [4:38:45<80:42:10, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.999851422049576e-05, 'epoch': 1.69} 3%|▎ | 1714/50750 [4:38:45<80:42:10, 5.92s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:21:29,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:21:29,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3848.80 | bwd_inner_microstep: 3841.26 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.40 [2024-11-13 21:21:29,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3848.82 | bwd_inner: 3841.26 | bwd_allreduce: 7.52 | step: 21.40 3%|▎ | 1715/50750 [4:38:51<80:41:46, 5.92s/it] {'loss': 0.0021, 'learning_rate': 3.999849862206287e-05, 'epoch': 1.69} 3%|▎ | 1715/50750 [4:38:51<80:41:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:21:35,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 21:21:35,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.03 | bwd_microstep: 3849.13 | bwd_inner_microstep: 3841.63 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.15 [2024-11-13 21:21:35,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.03 | bwd: 3849.14 | bwd_inner: 3841.63 | bwd_allreduce: 7.48 | step: 21.15 3%|▎ | 1716/50750 [4:38:57<80:42:12, 5.93s/it] {'loss': 0.0913, 'learning_rate': 3.99984829421801e-05, 'epoch': 1.69} 3%|▎ | 1716/50750 [4:38:57<80:42:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:21:41,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.92 [2024-11-13 21:21:41,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.06 | bwd_microstep: 3854.35 | bwd_inner_microstep: 3846.87 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.69 [2024-11-13 
21:21:41,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3854.36 | bwd_inner: 3846.87 | bwd_allreduce: 7.45 | step: 21.69 3%|▎ | 1717/50750 [4:39:03<80:42:17, 5.93s/it] {'loss': 0.0031, 'learning_rate': 3.9998467180847534e-05, 'epoch': 1.69} 3%|▎ | 1717/50750 [4:39:03<80:42:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:21:47,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:21:47,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.29 | bwd_microstep: 3851.34 | bwd_inner_microstep: 3843.86 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.49 [2024-11-13 21:21:47,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.29 | bwd: 3851.35 | bwd_inner: 3843.86 | bwd_allreduce: 7.45 | step: 21.50 3%|▎ | 1718/50750 [4:39:09<80:41:53, 5.92s/it] {'loss': 0.4533, 'learning_rate': 3.999845133806522e-05, 'epoch': 1.69} 3%|▎ | 1718/50750 [4:39:09<80:41:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:21:53,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 5.06 [2024-11-13 21:21:53,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.48 | bwd_microstep: 3846.13 | bwd_inner_microstep: 3838.41 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.38 [2024-11-13 21:21:53,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.48 | bwd: 3846.14 | bwd_inner: 3838.41 | bwd_allreduce: 7.69 | step: 21.38 3%|▎ | 1719/50750 [4:39:15<80:40:14, 5.92s/it] {'loss': 0.7284, 'learning_rate': 3.9998435413833236e-05, 'epoch': 1.69} 3%|▎ | 1719/50750 [4:39:15<80:40:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:21:59,260] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 21:21:59,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.88 | bwd_microstep: 3849.08 | bwd_inner_microstep: 3841.07 | bwd_allreduce_microstep: 7.94 | step_microstep: 24.92 [2024-11-13 21:21:59,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.89 | bwd: 3849.10 | bwd_inner: 3841.07 | bwd_allreduce: 7.97 | step: 24.91 3%|▎ | 1720/50750 [4:39:21<80:42:15, 5.93s/it] {'loss': 0.0206, 'learning_rate': 3.9998419408151626e-05, 'epoch': 1.69} 3%|▎ | 1720/50750 [4:39:21<80:42:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:22:05,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 21:22:05,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.03 | bwd_microstep: 3848.34 | bwd_inner_microstep: 3840.41 | bwd_allreduce_microstep: 7.86 | step_microstep: 24.40 [2024-11-13 21:22:05,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.99 | bwd: 3848.36 | bwd_inner: 3840.41 | bwd_allreduce: 7.89 | step: 24.40 3%|▎ | 1721/50750 [4:39:27<80:44:25, 5.93s/it] {'loss': 0.0203, 'learning_rate': 3.999840332102048e-05, 'epoch': 1.7} 3%|▎ | 1721/50750 [4:39:27<80:44:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:22:11,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.94 [2024-11-13 21:22:11,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.51 | bwd_microstep: 3846.88 | bwd_inner_microstep: 3839.11 | bwd_allreduce_microstep: 7.72 | step_microstep: 25.24 [2024-11-13 21:22:11,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.49 | bwd: 3846.90 
| bwd_inner: 3839.11 | bwd_allreduce: 7.74 | step: 25.24 3%|▎ | 1722/50750 [4:39:33<80:44:01, 5.93s/it] {'loss': 0.0027, 'learning_rate': 3.9998387152439846e-05, 'epoch': 1.7} 3%|▎ | 1722/50750 [4:39:33<80:44:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:22:17,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 21:22:17,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.45 | bwd_microstep: 3847.86 | bwd_inner_microstep: 3840.29 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.33 [2024-11-13 21:22:17,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.45 | bwd: 3847.87 | bwd_inner: 3840.29 | bwd_allreduce: 7.54 | step: 21.33 3%|▎ | 1723/50750 [4:39:39<80:43:19, 5.93s/it] {'loss': 0.5704, 'learning_rate': 3.9998370902409804e-05, 'epoch': 1.7} 3%|▎ | 1723/50750 [4:39:39<80:43:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:22:22,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:22:22,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.51 | bwd_microstep: 3843.15 | bwd_inner_microstep: 3835.61 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.05 [2024-11-13 21:22:22,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.50 | bwd: 3843.16 | bwd_inner: 3835.61 | bwd_allreduce: 7.51 | step: 21.06 3%|▎ | 1724/50750 [4:39:44<80:40:58, 5.92s/it] {'loss': 0.0017, 'learning_rate': 3.99983545709304e-05, 'epoch': 1.7} 3%|▎ | 1724/50750 [4:39:44<80:40:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:22:28,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | 
optimizer_step: 4.92 [2024-11-13 21:22:28,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.09 | bwd_microstep: 3847.86 | bwd_inner_microstep: 3840.32 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.11 [2024-11-13 21:22:28,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.09 | bwd: 3847.87 | bwd_inner: 3840.32 | bwd_allreduce: 7.51 | step: 21.12 3%|▎ | 1725/50750 [4:39:50<80:38:57, 5.92s/it] {'loss': 0.0168, 'learning_rate': 3.999833815800172e-05, 'epoch': 1.7} 3%|▎ | 1725/50750 [4:39:50<80:38:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:22:34,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 5.07 [2024-11-13 21:22:34,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.88 | bwd_microstep: 3846.59 | bwd_inner_microstep: 3839.05 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.29 [2024-11-13 21:22:34,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.88 | bwd: 3846.60 | bwd_inner: 3839.05 | bwd_allreduce: 7.52 | step: 21.29 3%|▎ | 1726/50750 [4:39:56<80:38:04, 5.92s/it] {'loss': 0.1086, 'learning_rate': 3.9998321663623816e-05, 'epoch': 1.7} 3%|▎ | 1726/50750 [4:39:56<80:38:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:22:40,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 21:22:40,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.43 | bwd_microstep: 3845.03 | bwd_inner_microstep: 3836.85 | bwd_allreduce_microstep: 8.10 | step_microstep: 21.85 [2024-11-13 21:22:40,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.43 | bwd: 3845.06 | bwd_inner: 3836.85 | bwd_allreduce: 8.13 | step: 21.84 3%|▎ | 1727/50750 [4:40:02<80:37:12, 
5.92s/it] {'loss': 0.0037, 'learning_rate': 3.999830508779677e-05, 'epoch': 1.7} 3%|▎ | 1727/50750 [4:40:02<80:37:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:22:46,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 21:22:46,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.57 | bwd_microstep: 3847.58 | bwd_inner_microstep: 3839.85 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.06 [2024-11-13 21:22:46,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.57 | bwd: 3847.59 | bwd_inner: 3839.85 | bwd_allreduce: 7.70 | step: 21.07 3%|▎ | 1728/50750 [4:40:08<80:36:25, 5.92s/it] {'loss': 0.8025, 'learning_rate': 3.9998288430520637e-05, 'epoch': 1.7} 3%|▎ | 1728/50750 [4:40:08<80:36:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:22:52,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:22:52,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.57 | bwd_microstep: 3842.81 | bwd_inner_microstep: 3835.30 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-13 21:22:52,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.57 | bwd: 3842.82 | bwd_inner: 3835.30 | bwd_allreduce: 7.48 | step: 21.02 3%|▎ | 1729/50750 [4:40:14<80:34:14, 5.92s/it] {'loss': 0.5269, 'learning_rate': 3.9998271691795485e-05, 'epoch': 1.7} 3%|▎ | 1729/50750 [4:40:14<80:34:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:22:58,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 21:22:58,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2031.02 | bwd_microstep: 3853.41 | bwd_inner_microstep: 3845.54 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.72 [2024-11-13 21:22:58,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.02 | bwd: 3853.42 | bwd_inner: 3845.54 | bwd_allreduce: 7.83 | step: 21.72 3%|▎ | 1730/50750 [4:40:20<80:38:00, 5.92s/it] {'loss': 0.0111, 'learning_rate': 3.99982548716214e-05, 'epoch': 1.7} 3%|▎ | 1730/50750 [4:40:20<80:38:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:23:04,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:23:04,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.53 | bwd_microstep: 3850.23 | bwd_inner_microstep: 3842.68 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.25 [2024-11-13 21:23:04,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.50 | bwd: 3850.24 | bwd_inner: 3842.68 | bwd_allreduce: 7.52 | step: 21.26 3%|▎ | 1731/50750 [4:40:26<80:40:50, 5.93s/it] {'loss': 0.0082, 'learning_rate': 3.999823796999843e-05, 'epoch': 1.71} 3%|▎ | 1731/50750 [4:40:26<80:40:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:23:10,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:23:10,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.80 | bwd_microstep: 3857.44 | bwd_inner_microstep: 3849.93 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.66 [2024-11-13 21:23:10,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.80 | bwd: 3857.45 | bwd_inner: 3849.93 | bwd_allreduce: 7.48 | step: 21.67 3%|▎ | 1732/50750 [4:40:32<80:43:30, 5.93s/it] {'loss': 0.4931, 'learning_rate': 3.999822098692665e-05, 'epoch': 1.71} 3%|▎ | 1732/50750 
[4:40:32<80:43:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:23:16,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:23:16,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.65 | bwd_microstep: 3846.32 | bwd_inner_microstep: 3838.76 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.36 [2024-11-13 21:23:16,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.65 | bwd: 3846.33 | bwd_inner: 3838.76 | bwd_allreduce: 7.53 | step: 21.37 3%|▎ | 1733/50750 [4:40:38<80:41:43, 5.93s/it] {'loss': 0.032, 'learning_rate': 3.999820392240613e-05, 'epoch': 1.71} 3%|▎ | 1733/50750 [4:40:38<80:41:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:23:22,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:23:22,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.33 | bwd_microstep: 3849.55 | bwd_inner_microstep: 3841.65 | bwd_allreduce_microstep: 7.85 | step_microstep: 21.38 [2024-11-13 21:23:22,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.32 | bwd: 3849.56 | bwd_inner: 3841.65 | bwd_allreduce: 7.87 | step: 21.38 3%|▎ | 1734/50750 [4:40:44<80:40:03, 5.92s/it] {'loss': 0.0583, 'learning_rate': 3.999818677643694e-05, 'epoch': 1.71} 3%|▎ | 1734/50750 [4:40:44<80:40:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:23:28,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:23:28,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.46 | bwd_microstep: 3845.29 | bwd_inner_microstep: 3837.82 | 
bwd_allreduce_microstep: 7.43 | step_microstep: 21.15 [2024-11-13 21:23:28,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.46 | bwd: 3845.30 | bwd_inner: 3837.82 | bwd_allreduce: 7.45 | step: 21.16 3%|▎ | 1735/50750 [4:40:50<80:38:06, 5.92s/it] {'loss': 0.009, 'learning_rate': 3.999816954901915e-05, 'epoch': 1.71} 3%|▎ | 1735/50750 [4:40:50<80:38:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:23:34,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-13 21:23:34,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.45 | bwd_microstep: 3846.13 | bwd_inner_microstep: 3837.98 | bwd_allreduce_microstep: 8.10 | step_microstep: 21.89 [2024-11-13 21:23:34,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.43 | bwd: 3846.14 | bwd_inner: 3837.98 | bwd_allreduce: 8.12 | step: 21.89 3%|▎ | 1736/50750 [4:40:55<80:37:38, 5.92s/it] {'loss': 0.0866, 'learning_rate': 3.999815224015283e-05, 'epoch': 1.71} 3%|▎ | 1736/50750 [4:40:55<80:37:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:23:39,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:23:39,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.60 | bwd_microstep: 3850.62 | bwd_inner_microstep: 3843.10 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.20 [2024-11-13 21:23:39,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.60 | bwd: 3850.63 | bwd_inner: 3843.10 | bwd_allreduce: 7.49 | step: 21.21 3%|▎ | 1737/50750 [4:41:01<80:37:33, 5.92s/it] {'loss': 0.2475, 'learning_rate': 3.999813484983805e-05, 'epoch': 1.71} 3%|▎ | 1737/50750 [4:41:01<80:37:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2202 [2024-11-13 21:23:45,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 21:23:45,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.75 | bwd_microstep: 3845.59 | bwd_inner_microstep: 3837.65 | bwd_allreduce_microstep: 7.90 | step_microstep: 21.98 [2024-11-13 21:23:45,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.75 | bwd: 3845.61 | bwd_inner: 3837.65 | bwd_allreduce: 7.92 | step: 21.98 3%|▎ | 1738/50750 [4:41:07<80:38:22, 5.92s/it] {'loss': 0.0029, 'learning_rate': 3.9998117378074876e-05, 'epoch': 1.71} 3%|▎ | 1738/50750 [4:41:07<80:38:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:23:51,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:23:51,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.82 | bwd_microstep: 3848.69 | bwd_inner_microstep: 3840.99 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.32 [2024-11-13 21:23:51,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.81 | bwd: 3848.71 | bwd_inner: 3840.99 | bwd_allreduce: 7.67 | step: 21.32 3%|▎ | 1739/50750 [4:41:13<80:39:31, 5.92s/it] {'loss': 0.0048, 'learning_rate': 3.9998099824863394e-05, 'epoch': 1.71} 3%|▎ | 1739/50750 [4:41:13<80:39:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:23:57,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:23:57,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.37 | bwd_microstep: 3850.72 | bwd_inner_microstep: 3843.19 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.94 [2024-11-13 21:23:57,741] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.34 | bwd: 3850.74 | bwd_inner: 3843.19 | bwd_allreduce: 7.51 | step: 20.95 3%|▎ | 1740/50750 [4:41:19<80:41:00, 5.93s/it] {'loss': 0.1722, 'learning_rate': 3.999808219020366e-05, 'epoch': 1.71} 3%|▎ | 1740/50750 [4:41:19<80:41:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:24:03,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 21:24:03,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.84 | bwd_microstep: 3844.12 | bwd_inner_microstep: 3836.32 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.95 [2024-11-13 21:24:03,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.84 | bwd: 3844.14 | bwd_inner: 3836.32 | bwd_allreduce: 7.78 | step: 21.96 3%|▎ | 1741/50750 [4:41:25<80:40:47, 5.93s/it] {'loss': 0.005, 'learning_rate': 3.999806447409575e-05, 'epoch': 1.72} 3%|▎ | 1741/50750 [4:41:25<80:40:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:24:09,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:24:09,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.84 | bwd_microstep: 3850.05 | bwd_inner_microstep: 3842.34 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.05 [2024-11-13 21:24:09,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.83 | bwd: 3850.07 | bwd_inner: 3842.35 | bwd_allreduce: 7.67 | step: 22.05 3%|▎ | 1742/50750 [4:41:31<80:41:33, 5.93s/it] {'loss': 0.0167, 'learning_rate': 3.999804667653974e-05, 'epoch': 1.72} 3%|▎ | 1742/50750 [4:41:31<80:41:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:24:15,515] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:24:15,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.49 | bwd_microstep: 3848.83 | bwd_inner_microstep: 3841.04 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.88 [2024-11-13 21:24:15,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.49 | bwd: 3848.85 | bwd_inner: 3841.04 | bwd_allreduce: 7.76 | step: 21.89 3%|▎ | 1743/50750 [4:41:37<80:40:23, 5.93s/it] {'loss': 0.0976, 'learning_rate': 3.9998028797535705e-05, 'epoch': 1.72} 3%|▎ | 1743/50750 [4:41:37<80:40:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:24:21,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 21:24:21,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.19 | bwd_microstep: 3846.57 | bwd_inner_microstep: 3838.77 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.69 [2024-11-13 21:24:21,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.17 | bwd: 3846.59 | bwd_inner: 3838.77 | bwd_allreduce: 7.77 | step: 21.70 3%|▎ | 1744/50750 [4:41:43<80:40:44, 5.93s/it] {'loss': 0.0018, 'learning_rate': 3.9998010837083705e-05, 'epoch': 1.72} 3%|▎ | 1744/50750 [4:41:43<80:40:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:24:27,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.53 | optimizer_step: 4.93 [2024-11-13 21:24:27,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.32 | bwd_microstep: 3848.88 | bwd_inner_microstep: 3841.33 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.83 [2024-11-13 21:24:27,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.32 | bwd: 3848.89 | bwd_inner: 3841.33 | 
bwd_allreduce: 7.52 | step: 22.83 3%|▎ | 1745/50750 [4:41:49<80:40:04, 5.93s/it] {'loss': 0.0521, 'learning_rate': 3.999799279518382e-05, 'epoch': 1.72} 3%|▎ | 1745/50750 [4:41:49<80:40:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:24:33,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:24:33,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.93 | bwd_microstep: 3851.27 | bwd_inner_microstep: 3843.73 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.27 [2024-11-13 21:24:33,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.92 | bwd: 3851.28 | bwd_inner: 3843.73 | bwd_allreduce: 7.52 | step: 21.28 3%|▎ | 1746/50750 [4:41:55<80:40:22, 5.93s/it] {'loss': 0.0359, 'learning_rate': 3.999797467183613e-05, 'epoch': 1.72} 3%|▎ | 1746/50750 [4:41:55<80:40:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:24:39,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.94 [2024-11-13 21:24:39,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.87 | bwd_microstep: 3847.94 | bwd_inner_microstep: 3840.43 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.04 [2024-11-13 21:24:39,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.87 | bwd: 3847.95 | bwd_inner: 3840.43 | bwd_allreduce: 7.49 | step: 21.05 3%|▎ | 1747/50750 [4:42:01<80:38:12, 5.92s/it] {'loss': 0.3459, 'learning_rate': 3.999795646704071e-05, 'epoch': 1.72} 3%|▎ | 1747/50750 [4:42:01<80:38:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:24:45,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 
[2024-11-13 21:24:45,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.01 | bwd_microstep: 3844.74 | bwd_inner_microstep: 3837.26 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.11 [2024-11-13 21:24:45,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.01 | bwd: 3844.75 | bwd_inner: 3837.26 | bwd_allreduce: 7.45 | step: 21.12 3%|▎ | 1748/50750 [4:42:07<80:36:19, 5.92s/it] {'loss': 0.0221, 'learning_rate': 3.999793818079762e-05, 'epoch': 1.72} 3%|▎ | 1748/50750 [4:42:07<80:36:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:24:51,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:24:51,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.28 | bwd_microstep: 3846.65 | bwd_inner_microstep: 3839.17 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.04 [2024-11-13 21:24:51,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.28 | bwd: 3846.66 | bwd_inner: 3839.17 | bwd_allreduce: 7.45 | step: 21.04 3%|▎ | 1749/50750 [4:42:13<80:34:57, 5.92s/it] {'loss': 0.0424, 'learning_rate': 3.999791981310694e-05, 'epoch': 1.72} 3%|▎ | 1749/50750 [4:42:13<80:34:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:24:56,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:24:56,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.97 | bwd_microstep: 3848.57 | bwd_inner_microstep: 3841.07 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.16 [2024-11-13 21:24:56,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.97 | bwd: 3848.58 | bwd_inner: 3841.07 | bwd_allreduce: 7.47 | step: 21.16 3%|▎ | 1750/50750 [4:42:18<80:34:25, 5.92s/it] {'loss': 0.0178, 
'learning_rate': 3.9997901363968744e-05, 'epoch': 1.72} 3%|▎ | 1750/50750 [4:42:18<80:34:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:25:02,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:25:02,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.84 | bwd_microstep: 3847.95 | bwd_inner_microstep: 3840.39 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.70 [2024-11-13 21:25:02,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.84 | bwd: 3847.96 | bwd_inner: 3840.39 | bwd_allreduce: 7.53 | step: 21.70 3%|▎ | 1751/50750 [4:42:24<80:34:17, 5.92s/it] {'loss': 0.0042, 'learning_rate': 3.999788283338312e-05, 'epoch': 1.73} 3%|▎ | 1751/50750 [4:42:24<80:34:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:25:08,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:25:08,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.53 | bwd_microstep: 3853.06 | bwd_inner_microstep: 3845.23 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.70 [2024-11-13 21:25:08,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.52 | bwd: 3853.08 | bwd_inner: 3845.23 | bwd_allreduce: 7.80 | step: 21.70 3%|▎ | 1752/50750 [4:42:30<80:37:05, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.9997864221350126e-05, 'epoch': 1.73} 3%|▎ | 1752/50750 [4:42:30<80:37:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:25:14,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 21:25:14,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | 
bwd_microstep: 3855.68 | bwd_inner_microstep: 3848.18 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.05 [2024-11-13 21:25:14,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.22 | bwd: 3855.69 | bwd_inner: 3848.18 | bwd_allreduce: 7.47 | step: 21.06 3%|▎ | 1753/50750 [4:42:36<80:39:05, 5.93s/it] {'loss': 0.4545, 'learning_rate': 3.999784552786985e-05, 'epoch': 1.73} 3%|▎ | 1753/50750 [4:42:36<80:39:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:25:20,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:25:20,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.85 | bwd_microstep: 3849.07 | bwd_inner_microstep: 3841.60 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.01 [2024-11-13 21:25:20,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.83 | bwd: 3849.08 | bwd_inner: 3841.60 | bwd_allreduce: 7.44 | step: 21.02 3%|▎ | 1754/50750 [4:42:42<80:37:46, 5.92s/it] {'loss': 0.0206, 'learning_rate': 3.9997826752942355e-05, 'epoch': 1.73} 3%|▎ | 1754/50750 [4:42:42<80:37:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:25:26,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 21:25:26,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.74 | bwd_microstep: 3848.72 | bwd_inner_microstep: 3841.00 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.78 [2024-11-13 21:25:26,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.74 | bwd: 3848.74 | bwd_inner: 3841.00 | bwd_allreduce: 7.70 | step: 21.79 3%|▎ | 1755/50750 [4:42:48<80:37:08, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.9997807896567737e-05, 'epoch': 1.73} 3%|▎ | 1755/50750 [4:42:48<80:37:08, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:25:32,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:25:32,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.38 | bwd_microstep: 3851.25 | bwd_inner_microstep: 3843.73 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.20 [2024-11-13 21:25:32,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.38 | bwd: 3851.26 | bwd_inner: 3843.73 | bwd_allreduce: 7.49 | step: 21.21 3%|▎ | 1756/50750 [4:42:54<80:37:22, 5.92s/it] {'loss': 0.6914, 'learning_rate': 3.999778895874606e-05, 'epoch': 1.73} 3%|▎ | 1756/50750 [4:42:54<80:37:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:25:38,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 21:25:38,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.48 | bwd_microstep: 3844.03 | bwd_inner_microstep: 3836.55 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.83 [2024-11-13 21:25:38,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.48 | bwd: 3844.04 | bwd_inner: 3836.55 | bwd_allreduce: 7.45 | step: 20.83 3%|▎ | 1757/50750 [4:43:00<80:35:36, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.999776993947739e-05, 'epoch': 1.73} 3%|▎ | 1757/50750 [4:43:00<80:35:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:25:44,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:25:44,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.19 | bwd_microstep: 3851.50 | bwd_inner_microstep: 3843.98 | bwd_allreduce_microstep: 7.49 | 
step_microstep: 21.11 [2024-11-13 21:25:44,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.18 | bwd: 3851.52 | bwd_inner: 3843.98 | bwd_allreduce: 7.50 | step: 21.11 3%|▎ | 1758/50750 [4:43:06<80:35:39, 5.92s/it] {'loss': 0.0032, 'learning_rate': 3.9997750838761835e-05, 'epoch': 1.73} 3%|▎ | 1758/50750 [4:43:06<80:35:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:25:50,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:25:50,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.69 | bwd_microstep: 3847.08 | bwd_inner_microstep: 3839.58 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.81 [2024-11-13 21:25:50,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.68 | bwd: 3847.09 | bwd_inner: 3839.58 | bwd_allreduce: 7.47 | step: 20.81 3%|▎ | 1759/50750 [4:43:12<80:35:00, 5.92s/it] {'loss': 0.0348, 'learning_rate': 3.9997731656599446e-05, 'epoch': 1.73} 3%|▎ | 1759/50750 [4:43:12<80:35:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:25:56,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:25:56,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3847.92 | bwd_inner_microstep: 3840.44 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.93 [2024-11-13 21:25:56,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3847.93 | bwd_inner: 3840.44 | bwd_allreduce: 7.45 | step: 20.94 3%|▎ | 1760/50750 [4:43:18<80:34:53, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.9997712392990305e-05, 'epoch': 1.73} 3%|▎ | 1760/50750 [4:43:18<80:34:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 
[2024-11-13 21:26:02,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:26:02,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.18 | bwd_microstep: 3849.46 | bwd_inner_microstep: 3841.98 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.67 [2024-11-13 21:26:02,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.16 | bwd: 3849.47 | bwd_inner: 3841.98 | bwd_allreduce: 7.45 | step: 21.68 3%|▎ | 1761/50750 [4:43:24<80:36:44, 5.92s/it] {'loss': 0.0395, 'learning_rate': 3.999769304793451e-05, 'epoch': 1.73} 3%|▎ | 1761/50750 [4:43:24<80:36:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:26:08,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:26:08,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.13 | bwd_microstep: 3852.67 | bwd_inner_microstep: 3844.95 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.22 [2024-11-13 21:26:08,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.13 | bwd: 3852.69 | bwd_inner: 3844.95 | bwd_allreduce: 7.69 | step: 21.22 3%|▎ | 1762/50750 [4:43:30<80:37:34, 5.93s/it] {'loss': 0.1429, 'learning_rate': 3.999767362143211e-05, 'epoch': 1.74} 3%|▎ | 1762/50750 [4:43:30<80:37:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:26:13,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:26:13,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.02 | bwd_microstep: 3857.91 | bwd_inner_microstep: 3848.93 | bwd_allreduce_microstep: 8.94 | step_microstep: 21.08 [2024-11-13 21:26:13,991] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd: 2027.02 | bwd: 3857.93 | bwd_inner: 3848.93 | bwd_allreduce: 8.95 | step: 21.09 3%|▎ | 1763/50750 [4:43:35<80:39:16, 5.93s/it] {'loss': 0.0233, 'learning_rate': 3.999765411348321e-05, 'epoch': 1.74} 3%|▎ | 1763/50750 [4:43:35<80:39:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:26:19,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:26:19,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.84 | bwd_microstep: 3859.35 | bwd_inner_microstep: 3851.82 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.34 [2024-11-13 21:26:19,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.82 | bwd: 3859.36 | bwd_inner: 3851.82 | bwd_allreduce: 7.50 | step: 21.34 3%|▎ | 1764/50750 [4:43:41<80:41:01, 5.93s/it] {'loss': 0.6315, 'learning_rate': 3.999763452408788e-05, 'epoch': 1.74} 3%|▎ | 1764/50750 [4:43:41<80:41:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:26:25,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:26:25,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.07 | bwd_microstep: 3854.31 | bwd_inner_microstep: 3846.84 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.64 [2024-11-13 21:26:25,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.07 | bwd: 3854.32 | bwd_inner: 3846.84 | bwd_allreduce: 7.44 | step: 20.64 3%|▎ | 1765/50750 [4:43:47<80:40:44, 5.93s/it] {'loss': 0.2132, 'learning_rate': 3.999761485324619e-05, 'epoch': 1.74} 3%|▎ | 1765/50750 [4:43:47<80:40:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:26:31,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.12 | optimizer_step: 4.99 [2024-11-13 21:26:31,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.22 | bwd_microstep: 3861.59 | bwd_inner_microstep: 3853.82 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.65 [2024-11-13 21:26:31,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.22 | bwd: 3861.60 | bwd_inner: 3853.82 | bwd_allreduce: 7.74 | step: 21.65 3%|▎ | 1766/50750 [4:43:53<80:42:56, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.9997595100958234e-05, 'epoch': 1.74} 3%|▎ | 1766/50750 [4:43:53<80:42:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:26:37,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:26:37,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.81 | bwd_microstep: 3860.06 | bwd_inner_microstep: 3852.54 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.94 [2024-11-13 21:26:37,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.80 | bwd: 3860.07 | bwd_inner: 3852.54 | bwd_allreduce: 7.50 | step: 20.95 3%|▎ | 1767/50750 [4:43:59<80:43:45, 5.93s/it] {'loss': 0.022, 'learning_rate': 3.999757526722409e-05, 'epoch': 1.74} 3%|▎ | 1767/50750 [4:43:59<80:43:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:26:43,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 21:26:43,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.12 | bwd_microstep: 3859.24 | bwd_inner_microstep: 3851.71 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-13 21:26:43,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.13 | bwd: 3859.25 | bwd_inner: 3851.71 | bwd_allreduce: 7.50 | step: 21.10 3%|▎ | 
1768/50750 [4:44:05<80:43:30, 5.93s/it] {'loss': 0.0026, 'learning_rate': 3.999755535204383e-05, 'epoch': 1.74} 3%|▎ | 1768/50750 [4:44:05<80:43:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:26:49,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 21:26:49,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.68 | bwd_microstep: 3860.71 | bwd_inner_microstep: 3852.62 | bwd_allreduce_microstep: 8.04 | step_microstep: 22.08 [2024-11-13 21:26:49,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.68 | bwd: 3860.72 | bwd_inner: 3852.62 | bwd_allreduce: 8.06 | step: 22.09 3%|▎ | 1769/50750 [4:44:11<80:45:10, 5.94s/it] {'loss': 0.0109, 'learning_rate': 3.9997535355417546e-05, 'epoch': 1.74} 3%|▎ | 1769/50750 [4:44:11<80:45:10, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:26:55,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.96 | optimizer_step: 4.92 [2024-11-13 21:26:55,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.29 | bwd_microstep: 3852.03 | bwd_inner_microstep: 3844.43 | bwd_allreduce_microstep: 7.55 | step_microstep: 23.69 [2024-11-13 21:26:55,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.28 | bwd: 3852.04 | bwd_inner: 3844.43 | bwd_allreduce: 7.57 | step: 23.71 3%|▎ | 1770/50750 [4:44:17<80:45:22, 5.94s/it] {'loss': 0.0071, 'learning_rate': 3.999751527734531e-05, 'epoch': 1.74} 3%|▎ | 1770/50750 [4:44:17<80:45:22, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:27:01,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:27:01,473] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.98 | bwd_microstep: 3860.61 | bwd_inner_microstep: 3853.08 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.37 [2024-11-13 21:27:01,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.98 | bwd: 3860.62 | bwd_inner: 3853.08 | bwd_allreduce: 7.50 | step: 21.37 3%|▎ | 1771/50750 [4:44:23<80:45:14, 5.94s/it] {'loss': 0.0015, 'learning_rate': 3.999749511782721e-05, 'epoch': 1.74} 3%|▎ | 1771/50750 [4:44:23<80:45:14, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:27:07,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.49 | optimizer_step: 4.92 [2024-11-13 21:27:07,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.01 | bwd_microstep: 3852.17 | bwd_inner_microstep: 3844.62 | bwd_allreduce_microstep: 7.51 | step_microstep: 22.93 [2024-11-13 21:27:07,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.01 | bwd: 3852.18 | bwd_inner: 3844.62 | bwd_allreduce: 7.52 | step: 22.93 3%|▎ | 1772/50750 [4:44:29<80:43:57, 5.93s/it] {'loss': 0.0109, 'learning_rate': 3.999747487686333e-05, 'epoch': 1.75} 3%|▎ | 1772/50750 [4:44:29<80:43:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:27:13,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:27:13,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.15 | bwd_microstep: 3850.94 | bwd_inner_microstep: 3843.41 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 21:27:13,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 3850.95 | bwd_inner: 3843.41 | bwd_allreduce: 7.50 | step: 21.12 3%|▎ | 1773/50750 [4:44:35<80:40:57, 5.93s/it] {'loss': 0.0008, 'learning_rate': 
3.9997454554453746e-05, 'epoch': 1.75} 3%|▎ | 1773/50750 [4:44:35<80:40:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:27:19,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:27:19,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.30 | bwd_microstep: 3852.14 | bwd_inner_microstep: 3844.61 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.23 [2024-11-13 21:27:19,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.30 | bwd: 3852.15 | bwd_inner: 3844.61 | bwd_allreduce: 7.50 | step: 21.24 3%|▎ | 1774/50750 [4:44:41<80:38:53, 5.93s/it] {'loss': 0.8235, 'learning_rate': 3.9997434150598545e-05, 'epoch': 1.75} 3%|▎ | 1774/50750 [4:44:41<80:38:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:27:25,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.93 [2024-11-13 21:27:25,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.52 | bwd_microstep: 3846.88 | bwd_inner_microstep: 3839.35 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.22 [2024-11-13 21:27:25,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.52 | bwd: 3846.90 | bwd_inner: 3839.35 | bwd_allreduce: 7.51 | step: 21.22 3%|▎ | 1775/50750 [4:44:47<80:37:13, 5.93s/it] {'loss': 0.0285, 'learning_rate': 3.999741366529781e-05, 'epoch': 1.75} 3%|▎ | 1775/50750 [4:44:47<80:37:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:27:31,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 21:27:31,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.05 | bwd_microstep: 
3846.90 | bwd_inner_microstep: 3838.97 | bwd_allreduce_microstep: 7.89 | step_microstep: 21.56 [2024-11-13 21:27:31,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.05 | bwd: 3846.92 | bwd_inner: 3838.97 | bwd_allreduce: 7.90 | step: 21.56 3%|▎ | 1776/50750 [4:44:53<80:37:31, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.9997393098551625e-05, 'epoch': 1.75} 3%|▎ | 1776/50750 [4:44:53<80:37:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:27:37,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:27:37,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.27 | bwd_microstep: 3850.03 | bwd_inner_microstep: 3842.28 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.73 [2024-11-13 21:27:37,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.25 | bwd: 3850.04 | bwd_inner: 3842.28 | bwd_allreduce: 7.72 | step: 21.74 4%|▎ | 1777/50750 [4:44:59<80:40:07, 5.93s/it] {'loss': 0.0027, 'learning_rate': 3.999737245036007e-05, 'epoch': 1.75} 4%|▎ | 1777/50750 [4:44:59<80:40:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:27:42,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:27:42,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.93 | bwd_microstep: 3852.12 | bwd_inner_microstep: 3844.41 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.67 [2024-11-13 21:27:42,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.92 | bwd: 3852.14 | bwd_inner: 3844.41 | bwd_allreduce: 7.67 | step: 21.67 4%|▎ | 1778/50750 [4:45:04<80:41:33, 5.93s/it] {'loss': 0.0175, 'learning_rate': 3.999735172072322e-05, 'epoch': 1.75} 4%|▎ | 1778/50750 [4:45:04<80:41:33, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:27:48,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:27:48,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.61 | bwd_microstep: 3849.19 | bwd_inner_microstep: 3840.99 | bwd_allreduce_microstep: 8.12 | step_microstep: 22.19 [2024-11-13 21:27:48,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.61 | bwd: 3849.22 | bwd_inner: 3840.99 | bwd_allreduce: 8.16 | step: 22.18 4%|▎ | 1779/50750 [4:45:10<80:39:48, 5.93s/it] {'loss': 0.6318, 'learning_rate': 3.9997330909641184e-05, 'epoch': 1.75} 4%|▎ | 1779/50750 [4:45:10<80:39:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:27:54,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:27:54,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.33 | bwd_microstep: 3863.28 | bwd_inner_microstep: 3855.78 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.00 [2024-11-13 21:27:54,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.33 | bwd: 3863.29 | bwd_inner: 3855.78 | bwd_allreduce: 7.48 | step: 21.00 4%|▎ | 1780/50750 [4:45:16<80:40:40, 5.93s/it] {'loss': 0.0038, 'learning_rate': 3.999731001711403e-05, 'epoch': 1.75} 4%|▎ | 1780/50750 [4:45:16<80:40:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:28:00,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:28:00,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.49 | bwd_microstep: 3850.51 | bwd_inner_microstep: 3843.00 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 
21:28:00,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3850.53 | bwd_inner: 3843.00 | bwd_allreduce: 7.49 | step: 21.13 4%|▎ | 1781/50750 [4:45:22<80:38:05, 5.93s/it] {'loss': 0.0052, 'learning_rate': 3.999728904314184e-05, 'epoch': 1.75} 4%|▎ | 1781/50750 [4:45:22<80:38:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:28:06,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:28:06,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.98 | bwd_microstep: 3850.68 | bwd_inner_microstep: 3843.17 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.39 [2024-11-13 21:28:06,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.98 | bwd: 3850.69 | bwd_inner: 3843.17 | bwd_allreduce: 7.49 | step: 21.40 4%|▎ | 1782/50750 [4:45:28<80:36:04, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.9997267987724714e-05, 'epoch': 1.76} 4%|▎ | 1782/50750 [4:45:28<80:36:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:28:12,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.54 | optimizer_step: 4.92 [2024-11-13 21:28:12,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.96 | bwd_microstep: 3850.11 | bwd_inner_microstep: 3842.57 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.69 [2024-11-13 21:28:12,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.96 | bwd: 3850.12 | bwd_inner: 3842.57 | bwd_allreduce: 7.51 | step: 22.69 4%|▎ | 1783/50750 [4:45:34<80:36:58, 5.93s/it] {'loss': 0.0166, 'learning_rate': 3.999724685086273e-05, 'epoch': 1.76} 4%|▎ | 1783/50750 [4:45:34<80:36:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:28:18,544] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.92 [2024-11-13 21:28:18,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.18 | bwd_microstep: 3866.12 | bwd_inner_microstep: 3856.93 | bwd_allreduce_microstep: 9.14 | step_microstep: 23.07 [2024-11-13 21:28:18,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.16 | bwd: 3866.14 | bwd_inner: 3856.93 | bwd_allreduce: 9.16 | step: 23.07 4%|▎ | 1784/50750 [4:45:40<80:43:00, 5.93s/it] {'loss': 0.354, 'learning_rate': 3.999722563255597e-05, 'epoch': 1.76} 4%|▎ | 1784/50750 [4:45:40<80:43:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:28:24,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.44 | optimizer_step: 4.93 [2024-11-13 21:28:24,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.88 | bwd_microstep: 3847.73 | bwd_inner_microstep: 3839.98 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.67 [2024-11-13 21:28:24,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.86 | bwd: 3847.74 | bwd_inner: 3839.98 | bwd_allreduce: 7.71 | step: 22.67 4%|▎ | 1785/50750 [4:45:46<80:43:02, 5.93s/it] {'loss': 0.0062, 'learning_rate': 3.9997204332804525e-05, 'epoch': 1.76} 4%|▎ | 1785/50750 [4:45:46<80:43:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:28:30,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:28:30,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.21 | bwd_microstep: 3846.39 | bwd_inner_microstep: 3838.83 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.07 [2024-11-13 21:28:30,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.20 | bwd: 3846.40 
| bwd_inner: 3838.83 | bwd_allreduce: 7.53 | step: 21.07 4%|▎ | 1786/50750 [4:45:52<80:39:52, 5.93s/it] {'loss': 0.7523, 'learning_rate': 3.999718295160848e-05, 'epoch': 1.76} 4%|▎ | 1786/50750 [4:45:52<80:39:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:28:36,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:28:36,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.57 | bwd_microstep: 3851.44 | bwd_inner_microstep: 3843.60 | bwd_allreduce_microstep: 7.80 | step_microstep: 21.79 [2024-11-13 21:28:36,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.57 | bwd: 3851.45 | bwd_inner: 3843.60 | bwd_allreduce: 7.82 | step: 21.80 4%|▎ | 1787/50750 [4:45:58<80:38:07, 5.93s/it] {'loss': 0.2996, 'learning_rate': 3.9997161488967926e-05, 'epoch': 1.76} 4%|▎ | 1787/50750 [4:45:58<80:38:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:28:42,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:28:42,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.85 | bwd_microstep: 3846.32 | bwd_inner_microstep: 3838.81 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.46 [2024-11-13 21:28:42,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.84 | bwd: 3846.33 | bwd_inner: 3838.81 | bwd_allreduce: 7.49 | step: 21.47 4%|▎ | 1788/50750 [4:46:04<80:37:44, 5.93s/it] {'loss': 0.2693, 'learning_rate': 3.999713994488294e-05, 'epoch': 1.76} 4%|▎ | 1788/50750 [4:46:04<80:37:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:28:48,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | 
optimizer_step: 4.93 [2024-11-13 21:28:48,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.56 | bwd_microstep: 3849.79 | bwd_inner_microstep: 3842.27 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.29 [2024-11-13 21:28:48,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.56 | bwd: 3849.80 | bwd_inner: 3842.27 | bwd_allreduce: 7.49 | step: 21.30 4%|▎ | 1789/50750 [4:46:10<80:35:26, 5.93s/it] {'loss': 0.0078, 'learning_rate': 3.999711831935363e-05, 'epoch': 1.76} 4%|▎ | 1789/50750 [4:46:10<80:35:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:28:54,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:28:54,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.26 | bwd_microstep: 3849.32 | bwd_inner_microstep: 3841.78 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.99 [2024-11-13 21:28:54,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.26 | bwd: 3849.34 | bwd_inner: 3841.78 | bwd_allreduce: 7.51 | step: 21.00 4%|▎ | 1790/50750 [4:46:16<80:34:01, 5.92s/it] {'loss': 0.004, 'learning_rate': 3.9997096612380067e-05, 'epoch': 1.76} 4%|▎ | 1790/50750 [4:46:16<80:34:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:29:00,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:29:00,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.70 | bwd_microstep: 3846.46 | bwd_inner_microstep: 3838.75 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.83 [2024-11-13 21:29:00,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.70 | bwd: 3846.48 | bwd_inner: 3838.75 | bwd_allreduce: 7.67 | step: 21.83 4%|▎ | 1791/50750 [4:46:21<80:32:37, 
5.92s/it] {'loss': 0.0033, 'learning_rate': 3.999707482396234e-05, 'epoch': 1.76} 4%|▎ | 1791/50750 [4:46:21<80:32:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:29:05,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:29:05,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.25 | bwd_microstep: 3844.08 | bwd_inner_microstep: 3836.42 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.57 [2024-11-13 21:29:05,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.25 | bwd: 3844.09 | bwd_inner: 3836.42 | bwd_allreduce: 7.63 | step: 21.58 4%|▎ | 1792/50750 [4:46:27<80:31:40, 5.92s/it] {'loss': 0.0259, 'learning_rate': 3.999705295410054e-05, 'epoch': 1.77} 4%|▎ | 1792/50750 [4:46:27<80:31:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:29:11,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:29:11,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.14 | bwd_microstep: 3847.83 | bwd_inner_microstep: 3840.32 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.15 [2024-11-13 21:29:11,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.13 | bwd: 3847.84 | bwd_inner: 3840.32 | bwd_allreduce: 7.48 | step: 21.15 4%|▎ | 1793/50750 [4:46:33<80:32:57, 5.92s/it] {'loss': 0.0078, 'learning_rate': 3.9997031002794756e-05, 'epoch': 1.77} 4%|▎ | 1793/50750 [4:46:33<80:32:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:29:17,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.67 | optimizer_step: 4.93 [2024-11-13 21:29:17,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2022.97 | bwd_microstep: 3848.25 | bwd_inner_microstep: 3840.21 | bwd_allreduce_microstep: 7.97 | step_microstep: 29.42 [2024-11-13 21:29:17,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3848.27 | bwd_inner: 3840.21 | bwd_allreduce: 8.00 | step: 29.42 4%|▎ | 1794/50750 [4:46:39<80:34:14, 5.92s/it] {'loss': 0.7299, 'learning_rate': 3.999700897004509e-05, 'epoch': 1.77} 4%|▎ | 1794/50750 [4:46:39<80:34:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:29:23,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:29:23,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.55 | bwd_microstep: 3849.89 | bwd_inner_microstep: 3842.38 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.08 [2024-11-13 21:29:23,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.55 | bwd: 3849.91 | bwd_inner: 3842.38 | bwd_allreduce: 7.49 | step: 21.08 4%|▎ | 1795/50750 [4:46:45<80:33:02, 5.92s/it] {'loss': 0.0037, 'learning_rate': 3.999698685585161e-05, 'epoch': 1.77} 4%|▎ | 1795/50750 [4:46:45<80:33:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:29:29,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:29:29,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.27 | bwd_microstep: 3850.31 | bwd_inner_microstep: 3842.78 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.02 [2024-11-13 21:29:29,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.27 | bwd: 3850.32 | bwd_inner: 3842.78 | bwd_allreduce: 7.50 | step: 21.03 4%|▎ | 1796/50750 [4:46:51<80:32:03, 5.92s/it] {'loss': 0.0082, 'learning_rate': 3.999696466021442e-05, 'epoch': 1.77} 4%|▎ | 1796/50750 
[4:46:51<80:32:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:29:35,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:29:35,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.91 | bwd_microstep: 3842.55 | bwd_inner_microstep: 3835.04 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.13 [2024-11-13 21:29:35,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.91 | bwd: 3842.57 | bwd_inner: 3835.04 | bwd_allreduce: 7.49 | step: 21.13 4%|▎ | 1797/50750 [4:46:57<80:30:05, 5.92s/it] {'loss': 0.0287, 'learning_rate': 3.999694238313361e-05, 'epoch': 1.77} 4%|▎ | 1797/50750 [4:46:57<80:30:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:29:41,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.94 [2024-11-13 21:29:41,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.48 | bwd_microstep: 3848.23 | bwd_inner_microstep: 3840.73 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.97 [2024-11-13 21:29:41,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.48 | bwd: 3848.24 | bwd_inner: 3840.73 | bwd_allreduce: 7.48 | step: 20.97 4%|▎ | 1798/50750 [4:47:03<80:29:36, 5.92s/it] {'loss': 0.0096, 'learning_rate': 3.999692002460926e-05, 'epoch': 1.77} 4%|▎ | 1798/50750 [4:47:03<80:29:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:29:47,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 21:29:47,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.39 | bwd_microstep: 3846.34 | bwd_inner_microstep: 3838.81 | 
bwd_allreduce_microstep: 7.49 | step_microstep: 21.57 [2024-11-13 21:29:47,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.39 | bwd: 3846.35 | bwd_inner: 3838.81 | bwd_allreduce: 7.50 | step: 21.57 4%|▎ | 1799/50750 [4:47:09<80:29:00, 5.92s/it] {'loss': 0.0053, 'learning_rate': 3.999689758464147e-05, 'epoch': 1.77} 4%|▎ | 1799/50750 [4:47:09<80:29:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:29:53,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:29:53,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.60 | bwd_microstep: 3845.41 | bwd_inner_microstep: 3837.92 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.17 [2024-11-13 21:29:53,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.60 | bwd: 3845.43 | bwd_inner: 3837.92 | bwd_allreduce: 7.47 | step: 21.17 4%|▎ | 1800/50750 [4:47:15<80:28:27, 5.92s/it] {'loss': 0.3036, 'learning_rate': 3.9996875063230333e-05, 'epoch': 1.77} 4%|▎ | 1800/50750 [4:47:15<80:28:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:29:59,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 21:29:59,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.66 | bwd_microstep: 3844.17 | bwd_inner_microstep: 3836.56 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.43 [2024-11-13 21:29:59,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.66 | bwd: 3844.18 | bwd_inner: 3836.56 | bwd_allreduce: 7.58 | step: 21.43 4%|▎ | 1801/50750 [4:47:21<80:27:56, 5.92s/it] {'loss': 0.0051, 'learning_rate': 3.9996852460375944e-05, 'epoch': 1.77} 4%|▎ | 1801/50750 [4:47:21<80:27:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2198 [2024-11-13 21:30:05,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:30:05,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.34 | bwd_microstep: 3852.94 | bwd_inner_microstep: 3845.48 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.98 [2024-11-13 21:30:05,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.33 | bwd: 3852.96 | bwd_inner: 3845.48 | bwd_allreduce: 7.44 | step: 20.98 4%|▎ | 1802/50750 [4:47:27<80:29:48, 5.92s/it] {'loss': 0.3601, 'learning_rate': 3.9996829776078375e-05, 'epoch': 1.78} 4%|▎ | 1802/50750 [4:47:27<80:29:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:30:11,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:30:11,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3843.55 | bwd_inner_microstep: 3836.06 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.91 [2024-11-13 21:30:11,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.44 | bwd: 3843.56 | bwd_inner: 3836.06 | bwd_allreduce: 7.46 | step: 20.91 4%|▎ | 1803/50750 [4:47:33<80:28:11, 5.92s/it] {'loss': 0.0209, 'learning_rate': 3.999680701033774e-05, 'epoch': 1.78} 4%|▎ | 1803/50750 [4:47:33<80:28:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:30:16,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:30:16,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.72 | bwd_microstep: 3847.62 | bwd_inner_microstep: 3840.14 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.82 [2024-11-13 21:30:16,978] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.72 | bwd: 3847.63 | bwd_inner: 3840.14 | bwd_allreduce: 7.45 | step: 20.83 4%|▎ | 1804/50750 [4:47:38<80:27:40, 5.92s/it] {'loss': 0.0325, 'learning_rate': 3.999678416315413e-05, 'epoch': 1.78} 4%|▎ | 1804/50750 [4:47:38<80:27:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:30:22,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:30:22,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.68 | bwd_microstep: 3850.06 | bwd_inner_microstep: 3842.58 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-13 21:30:22,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.68 | bwd: 3850.07 | bwd_inner: 3842.58 | bwd_allreduce: 7.45 | step: 20.98 4%|▎ | 1805/50750 [4:47:44<80:27:52, 5.92s/it] {'loss': 0.0509, 'learning_rate': 3.9996761234527624e-05, 'epoch': 1.78} 4%|▎ | 1805/50750 [4:47:44<80:27:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:30:28,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:30:28,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.55 | bwd_microstep: 3848.65 | bwd_inner_microstep: 3841.17 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.81 [2024-11-13 21:30:28,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.55 | bwd: 3848.66 | bwd_inner: 3841.17 | bwd_allreduce: 7.45 | step: 20.82 4%|▎ | 1806/50750 [4:47:50<80:27:18, 5.92s/it] {'loss': 0.6124, 'learning_rate': 3.9996738224458324e-05, 'epoch': 1.78} 4%|▎ | 1806/50750 [4:47:50<80:27:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:30:34,731] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:30:34,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.23 | bwd_microstep: 3847.09 | bwd_inner_microstep: 3839.62 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.90 [2024-11-13 21:30:34,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.23 | bwd: 3847.11 | bwd_inner: 3839.62 | bwd_allreduce: 7.45 | step: 20.90 4%|▎ | 1807/50750 [4:47:56<80:27:24, 5.92s/it] {'loss': 0.0058, 'learning_rate': 3.999671513294633e-05, 'epoch': 1.78} 4%|▎ | 1807/50750 [4:47:56<80:27:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:30:40,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:30:40,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.30 | bwd_microstep: 3848.64 | bwd_inner_microstep: 3841.16 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.79 [2024-11-13 21:30:40,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.30 | bwd: 3848.65 | bwd_inner: 3841.16 | bwd_allreduce: 7.46 | step: 20.79 4%|▎ | 1808/50750 [4:48:02<80:27:20, 5.92s/it] {'loss': 0.0017, 'learning_rate': 3.999669195999172e-05, 'epoch': 1.78} 4%|▎ | 1808/50750 [4:48:02<80:27:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:30:46,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:30:46,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.68 | bwd_microstep: 3850.70 | bwd_inner_microstep: 3843.20 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.92 [2024-11-13 21:30:46,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.68 | bwd: 3850.71 | bwd_inner: 3843.20 | 
bwd_allreduce: 7.47 | step: 20.92 4%|▎ | 1809/50750 [4:48:08<80:28:06, 5.92s/it] {'loss': 0.0089, 'learning_rate': 3.999666870559461e-05, 'epoch': 1.78} 4%|▎ | 1809/50750 [4:48:08<80:28:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:30:52,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:30:52,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3845.68 | bwd_inner_microstep: 3838.14 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.14 [2024-11-13 21:30:52,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.63 | bwd: 3845.70 | bwd_inner: 3838.14 | bwd_allreduce: 7.51 | step: 21.15 4%|▎ | 1810/50750 [4:48:14<80:27:44, 5.92s/it] {'loss': 0.6234, 'learning_rate': 3.9996645369755075e-05, 'epoch': 1.78} 4%|▎ | 1810/50750 [4:48:14<80:27:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:30:58,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:30:58,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.65 | bwd_microstep: 3845.74 | bwd_inner_microstep: 3838.24 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.96 [2024-11-13 21:30:58,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.65 | bwd: 3845.75 | bwd_inner: 3838.24 | bwd_allreduce: 7.47 | step: 20.96 4%|▎ | 1811/50750 [4:48:20<80:26:59, 5.92s/it] {'loss': 0.0055, 'learning_rate': 3.9996621952473214e-05, 'epoch': 1.78} 4%|▎ | 1811/50750 [4:48:20<80:26:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:31:04,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 
[2024-11-13 21:31:04,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.32 | bwd_microstep: 3851.93 | bwd_inner_microstep: 3844.47 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-13 21:31:04,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.30 | bwd: 3851.94 | bwd_inner: 3844.47 | bwd_allreduce: 7.44 | step: 20.92 4%|▎ | 1812/50750 [4:48:26<80:28:50, 5.92s/it] {'loss': 0.0066, 'learning_rate': 3.999659845374913e-05, 'epoch': 1.79} 4%|▎ | 1812/50750 [4:48:26<80:28:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:31:10,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:31:10,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.90 | bwd_microstep: 3849.00 | bwd_inner_microstep: 3841.53 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.98 [2024-11-13 21:31:10,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.90 | bwd: 3849.01 | bwd_inner: 3841.53 | bwd_allreduce: 7.44 | step: 20.99 4%|▎ | 1813/50750 [4:48:32<80:28:36, 5.92s/it] {'loss': 0.0044, 'learning_rate': 3.999657487358291e-05, 'epoch': 1.79} 4%|▎ | 1813/50750 [4:48:32<80:28:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:31:16,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 21:31:16,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3862.26 | bwd_inner_microstep: 3854.76 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.96 [2024-11-13 21:31:16,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3862.28 | bwd_inner: 3854.76 | bwd_allreduce: 7.47 | step: 20.96 4%|▎ | 1814/50750 [4:48:38<80:31:33, 5.92s/it] {'loss': 0.0134, 
'learning_rate': 3.999655121197466e-05, 'epoch': 1.79} 4%|▎ | 1814/50750 [4:48:38<80:31:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:31:22,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:31:22,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.51 | bwd_microstep: 3850.52 | bwd_inner_microstep: 3843.00 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.18 [2024-11-13 21:31:22,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.51 | bwd: 3850.53 | bwd_inner: 3843.00 | bwd_allreduce: 7.49 | step: 21.18 4%|▎ | 1815/50750 [4:48:44<80:32:01, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.999652746892447e-05, 'epoch': 1.79} 4%|▎ | 1815/50750 [4:48:44<80:32:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:31:28,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:31:28,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.46 | bwd_microstep: 3849.17 | bwd_inner_microstep: 3841.63 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.07 [2024-11-13 21:31:28,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3849.18 | bwd_inner: 3841.63 | bwd_allreduce: 7.51 | step: 21.07 4%|▎ | 1816/50750 [4:48:49<80:30:49, 5.92s/it] {'loss': 0.0046, 'learning_rate': 3.9996503644432435e-05, 'epoch': 1.79} 4%|▎ | 1816/50750 [4:48:49<80:30:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:31:33,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:31:33,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.43 | 
bwd_microstep: 3850.05 | bwd_inner_microstep: 3842.49 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.12 [2024-11-13 21:31:33,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.43 | bwd: 3850.06 | bwd_inner: 3842.49 | bwd_allreduce: 7.53 | step: 21.13 4%|▎ | 1817/50750 [4:48:55<80:30:42, 5.92s/it] {'loss': 0.0027, 'learning_rate': 3.9996479738498657e-05, 'epoch': 1.79} 4%|▎ | 1817/50750 [4:48:55<80:30:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:31:39,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:31:39,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.26 | bwd_microstep: 3853.01 | bwd_inner_microstep: 3845.50 | bwd_allreduce_microstep: 7.46 | step_microstep: 23.37 [2024-11-13 21:31:39,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.26 | bwd: 3853.02 | bwd_inner: 3845.50 | bwd_allreduce: 7.48 | step: 23.38 4%|▎ | 1818/50750 [4:49:01<80:30:59, 5.92s/it] {'loss': 0.4742, 'learning_rate': 3.999645575112323e-05, 'epoch': 1.79} 4%|▎ | 1818/50750 [4:49:01<80:30:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:31:45,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:31:45,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.61 | bwd_microstep: 3850.88 | bwd_inner_microstep: 3843.31 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.83 [2024-11-13 21:31:45,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.60 | bwd: 3850.89 | bwd_inner: 3843.31 | bwd_allreduce: 7.54 | step: 21.83 4%|▎ | 1819/50750 [4:49:07<80:31:00, 5.92s/it] {'loss': 0.014, 'learning_rate': 3.999643168230625e-05, 'epoch': 1.79} 4%|▎ | 1819/50750 [4:49:07<80:31:00, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:31:51,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:31:51,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.12 | bwd_microstep: 3859.89 | bwd_inner_microstep: 3852.38 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.94 [2024-11-13 21:31:51,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.11 | bwd: 3859.90 | bwd_inner: 3852.38 | bwd_allreduce: 7.49 | step: 20.94 4%|▎ | 1820/50750 [4:49:13<80:33:50, 5.93s/it] {'loss': 0.1624, 'learning_rate': 3.9996407532047816e-05, 'epoch': 1.79} 4%|▎ | 1820/50750 [4:49:13<80:33:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:31:57,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 21:31:57,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.76 | bwd_microstep: 3860.69 | bwd_inner_microstep: 3852.88 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.64 [2024-11-13 21:31:57,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.76 | bwd: 3860.70 | bwd_inner: 3852.88 | bwd_allreduce: 7.79 | step: 21.64 4%|▎ | 1821/50750 [4:49:19<80:34:45, 5.93s/it] {'loss': 0.4811, 'learning_rate': 3.999638330034803e-05, 'epoch': 1.79} 4%|▎ | 1821/50750 [4:49:19<80:34:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:32:03,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.92 [2024-11-13 21:32:03,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.82 | bwd_microstep: 3857.11 | bwd_inner_microstep: 3849.60 | bwd_allreduce_microstep: 7.46 | 
step_microstep: 21.37 [2024-11-13 21:32:03,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.81 | bwd: 3857.12 | bwd_inner: 3849.60 | bwd_allreduce: 7.48 | step: 21.37 4%|▎ | 1822/50750 [4:49:25<80:36:01, 5.93s/it] {'loss': 0.0074, 'learning_rate': 3.9996358987206994e-05, 'epoch': 1.8} 4%|▎ | 1822/50750 [4:49:25<80:36:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:32:09,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:32:09,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.24 | bwd_microstep: 3860.46 | bwd_inner_microstep: 3852.97 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.95 [2024-11-13 21:32:09,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.24 | bwd: 3860.48 | bwd_inner: 3852.97 | bwd_allreduce: 7.47 | step: 20.95 4%|▎ | 1823/50750 [4:49:31<80:35:58, 5.93s/it] {'loss': 0.0058, 'learning_rate': 3.99963345926248e-05, 'epoch': 1.8} 4%|▎ | 1823/50750 [4:49:31<80:35:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:32:15,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 5.11 [2024-11-13 21:32:15,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.93 | bwd_microstep: 3857.36 | bwd_inner_microstep: 3849.79 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.69 [2024-11-13 21:32:15,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.93 | bwd: 3857.38 | bwd_inner: 3849.79 | bwd_allreduce: 7.54 | step: 21.69 4%|▎ | 1824/50750 [4:49:37<80:36:20, 5.93s/it] {'loss': 0.0042, 'learning_rate': 3.999631011660155e-05, 'epoch': 1.8} 4%|▎ | 1824/50750 [4:49:37<80:36:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
21:32:21,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:32:21,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.81 | bwd_microstep: 3850.65 | bwd_inner_microstep: 3843.02 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.30 [2024-11-13 21:32:21,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.79 | bwd: 3850.66 | bwd_inner: 3843.02 | bwd_allreduce: 7.60 | step: 21.31 4%|▎ | 1825/50750 [4:49:43<80:34:41, 5.93s/it] {'loss': 0.0143, 'learning_rate': 3.9996285559137346e-05, 'epoch': 1.8} 4%|▎ | 1825/50750 [4:49:43<80:34:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:32:27,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:32:27,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.92 | bwd_microstep: 3850.97 | bwd_inner_microstep: 3843.48 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.01 [2024-11-13 21:32:27,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.91 | bwd: 3850.98 | bwd_inner: 3843.48 | bwd_allreduce: 7.46 | step: 21.01 4%|▎ | 1826/50750 [4:49:49<80:34:55, 5.93s/it] {'loss': 0.1264, 'learning_rate': 3.999626092023228e-05, 'epoch': 1.8} 4%|▎ | 1826/50750 [4:49:49<80:34:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:32:33,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:32:33,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.67 | bwd_microstep: 3849.49 | bwd_inner_microstep: 3841.77 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.62 [2024-11-13 21:32:33,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2025.67 | bwd: 3849.50 | bwd_inner: 3841.77 | bwd_allreduce: 7.68 | step: 21.62 4%|▎ | 1827/50750 [4:49:55<80:33:16, 5.93s/it] {'loss': 0.0072, 'learning_rate': 3.9996236199886454e-05, 'epoch': 1.8} 4%|▎ | 1827/50750 [4:49:55<80:33:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:32:39,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:32:39,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.85 | bwd_microstep: 3850.25 | bwd_inner_microstep: 3842.66 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.19 [2024-11-13 21:32:39,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.85 | bwd: 3850.26 | bwd_inner: 3842.66 | bwd_allreduce: 7.56 | step: 21.19 4%|▎ | 1828/50750 [4:50:01<80:31:22, 5.93s/it] {'loss': 0.0067, 'learning_rate': 3.999621139809997e-05, 'epoch': 1.8} 4%|▎ | 1828/50750 [4:50:01<80:31:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:32:45,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 21:32:45,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3847.79 | bwd_inner_microstep: 3840.10 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.82 [2024-11-13 21:32:45,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.03 | bwd: 3847.80 | bwd_inner: 3840.10 | bwd_allreduce: 7.66 | step: 21.82 4%|▎ | 1829/50750 [4:50:07<80:30:29, 5.92s/it] {'loss': 0.0878, 'learning_rate': 3.999618651487294e-05, 'epoch': 1.8} 4%|▎ | 1829/50750 [4:50:07<80:30:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:32:51,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:32:51,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.00 | bwd_microstep: 3858.76 | bwd_inner_microstep: 3851.07 | bwd_allreduce_microstep: 7.65 | step_microstep: 20.99 [2024-11-13 21:32:51,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.99 | bwd: 3858.78 | bwd_inner: 3851.07 | bwd_allreduce: 7.67 | step: 20.99 4%|▎ | 1830/50750 [4:50:12<80:32:06, 5.93s/it] {'loss': 0.0725, 'learning_rate': 3.999616155020545e-05, 'epoch': 1.8} 4%|▎ | 1830/50750 [4:50:12<80:32:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:32:56,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.44 | optimizer_step: 4.92 [2024-11-13 21:32:56,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 3868.61 | bwd_inner_microstep: 3860.51 | bwd_allreduce_microstep: 8.05 | step_microstep: 22.62 [2024-11-13 21:32:56,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.03 | bwd: 3868.63 | bwd_inner: 3860.51 | bwd_allreduce: 8.07 | step: 22.63 4%|▎ | 1831/50750 [4:50:18<80:35:57, 5.93s/it] {'loss': 0.0275, 'learning_rate': 3.999613650409761e-05, 'epoch': 1.8} 4%|▎ | 1831/50750 [4:50:18<80:35:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:33:02,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.39 | optimizer_step: 4.93 [2024-11-13 21:33:02,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.21 | bwd_microstep: 3847.53 | bwd_inner_microstep: 3839.07 | bwd_allreduce_microstep: 8.41 | step_microstep: 22.76 [2024-11-13 21:33:02,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.20 | bwd: 3847.55 | bwd_inner: 3839.07 | bwd_allreduce: 8.44 | step: 22.77 4%|▎ | 1832/50750 
[4:50:24<80:36:04, 5.93s/it] {'loss': 0.6911, 'learning_rate': 3.999611137654952e-05, 'epoch': 1.8} 4%|▎ | 1832/50750 [4:50:24<80:36:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:33:08,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:33:08,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.95 | bwd_microstep: 3849.01 | bwd_inner_microstep: 3841.52 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.06 [2024-11-13 21:33:08,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.93 | bwd: 3849.02 | bwd_inner: 3841.52 | bwd_allreduce: 7.46 | step: 21.06 4%|▎ | 1833/50750 [4:50:30<80:35:40, 5.93s/it] {'loss': 0.0019, 'learning_rate': 3.999608616756129e-05, 'epoch': 1.81} 4%|▎ | 1833/50750 [4:50:30<80:35:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:33:14,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:33:14,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.29 | bwd_microstep: 3853.14 | bwd_inner_microstep: 3845.46 | bwd_allreduce_microstep: 7.62 | step_microstep: 22.22 [2024-11-13 21:33:14,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.26 | bwd: 3853.16 | bwd_inner: 3845.46 | bwd_allreduce: 7.64 | step: 22.23 4%|▎ | 1834/50750 [4:50:36<80:36:24, 5.93s/it] {'loss': 0.443, 'learning_rate': 3.9996060877133e-05, 'epoch': 1.81} 4%|▎ | 1834/50750 [4:50:36<80:36:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:33:20,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:33:20,681] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2026.73 | bwd_microstep: 3847.30 | bwd_inner_microstep: 3839.80 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.11 [2024-11-13 21:33:20,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.72 | bwd: 3847.32 | bwd_inner: 3839.80 | bwd_allreduce: 7.47 | step: 21.11 4%|▎ | 1835/50750 [4:50:42<80:33:35, 5.93s/it] {'loss': 0.4144, 'learning_rate': 3.999603550526478e-05, 'epoch': 1.81} 4%|▎ | 1835/50750 [4:50:42<80:33:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:33:26,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:33:26,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.23 | bwd_microstep: 3852.58 | bwd_inner_microstep: 3845.11 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.16 [2024-11-13 21:33:26,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.23 | bwd: 3852.59 | bwd_inner: 3845.11 | bwd_allreduce: 7.44 | step: 21.17 4%|▎ | 1836/50750 [4:50:48<80:31:48, 5.93s/it] {'loss': 0.0081, 'learning_rate': 3.999601005195671e-05, 'epoch': 1.81} 4%|▎ | 1836/50750 [4:50:48<80:31:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:33:32,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:33:32,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.80 | bwd_microstep: 3848.21 | bwd_inner_microstep: 3840.76 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.02 [2024-11-13 21:33:32,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.79 | bwd: 3848.22 | bwd_inner: 3840.76 | bwd_allreduce: 7.42 | step: 21.02 4%|▎ | 1837/50750 [4:50:54<80:30:37, 5.93s/it] {'loss': 1.0381, 'learning_rate': 3.999598451720891e-05, 'epoch': 1.81} 4%|▎ | 
1837/50750 [4:50:54<80:30:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:33:38,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 21:33:38,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1991.80 | bwd_microstep: 3792.89 | bwd_inner_microstep: 3785.42 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.93 [2024-11-13 21:33:38,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1991.80 | bwd: 3792.90 | bwd_inner: 3785.42 | bwd_allreduce: 7.44 | step: 20.94 4%|▎ | 1838/50750 [4:51:00<80:07:28, 5.90s/it] {'loss': 0.1038, 'learning_rate': 3.999595890102148e-05, 'epoch': 1.81} 4%|▎ | 1838/50750 [4:51:00<80:07:28, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:33:44,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:33:44,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.78 | bwd_microstep: 3849.50 | bwd_inner_microstep: 3842.00 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-13 21:33:44,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.78 | bwd: 3849.52 | bwd_inner: 3842.00 | bwd_allreduce: 7.48 | step: 20.95 4%|▎ | 1839/50750 [4:51:06<80:11:58, 5.90s/it] {'loss': 0.0033, 'learning_rate': 3.999593320339452e-05, 'epoch': 1.81} 4%|▎ | 1839/50750 [4:51:06<80:11:58, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:33:50,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:33:50,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.51 | bwd_microstep: 3853.30 | bwd_inner_microstep: 3845.83 | 
bwd_allreduce_microstep: 7.43 | step_microstep: 23.35 [2024-11-13 21:33:50,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.51 | bwd: 3853.32 | bwd_inner: 3845.83 | bwd_allreduce: 7.44 | step: 23.35 4%|▎ | 1840/50750 [4:51:12<80:17:22, 5.91s/it] {'loss': 0.477, 'learning_rate': 3.999590742432815e-05, 'epoch': 1.81} 4%|▎ | 1840/50750 [4:51:12<80:17:22, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:33:56,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:33:56,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.87 | bwd_microstep: 3881.26 | bwd_inner_microstep: 3873.80 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.93 [2024-11-13 21:33:56,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.86 | bwd: 3881.27 | bwd_inner: 3873.80 | bwd_allreduce: 7.43 | step: 20.93 4%|▎ | 1841/50750 [4:51:18<80:27:59, 5.92s/it] {'loss': 0.0071, 'learning_rate': 3.9995881563822456e-05, 'epoch': 1.81} 4%|▎ | 1841/50750 [4:51:18<80:27:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:34:02,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 21:34:02,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.32 | bwd_microstep: 3858.19 | bwd_inner_microstep: 3850.71 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.82 [2024-11-13 21:34:02,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.32 | bwd: 3858.20 | bwd_inner: 3850.71 | bwd_allreduce: 7.45 | step: 20.82 4%|▎ | 1842/50750 [4:51:24<80:31:34, 5.93s/it] {'loss': 0.598, 'learning_rate': 3.999585562187754e-05, 'epoch': 1.81} 4%|▎ | 1842/50750 [4:51:24<80:31:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-13 21:34:08,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:34:08,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.79 | bwd_microstep: 3848.63 | bwd_inner_microstep: 3841.16 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.97 [2024-11-13 21:34:08,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.79 | bwd: 3848.64 | bwd_inner: 3841.16 | bwd_allreduce: 7.44 | step: 20.97 4%|▎ | 1843/50750 [4:51:29<80:29:21, 5.92s/it] {'loss': 0.4075, 'learning_rate': 3.999582959849353e-05, 'epoch': 1.82} 4%|▎ | 1843/50750 [4:51:29<80:29:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:34:13,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:34:13,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.93 | bwd_microstep: 3846.18 | bwd_inner_microstep: 3838.71 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.39 [2024-11-13 21:34:13,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.93 | bwd: 3846.19 | bwd_inner: 3838.71 | bwd_allreduce: 7.44 | step: 21.40 4%|▎ | 1844/50750 [4:51:35<80:27:41, 5.92s/it] {'loss': 0.0559, 'learning_rate': 3.999580349367052e-05, 'epoch': 1.82} 4%|▎ | 1844/50750 [4:51:35<80:27:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:34:19,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:34:19,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.86 | bwd_microstep: 3845.10 | bwd_inner_microstep: 3837.63 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.93 [2024-11-13 21:34:19,846] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.86 | bwd: 3845.11 | bwd_inner: 3837.64 | bwd_allreduce: 7.44 | step: 20.94 4%|▎ | 1845/50750 [4:51:41<80:26:41, 5.92s/it] {'loss': 0.3025, 'learning_rate': 3.999577730740861e-05, 'epoch': 1.82} 4%|▎ | 1845/50750 [4:51:41<80:26:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:34:25,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:34:25,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.11 | bwd_microstep: 3856.92 | bwd_inner_microstep: 3849.39 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-13 21:34:25,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.11 | bwd: 3856.93 | bwd_inner: 3849.39 | bwd_allreduce: 7.50 | step: 21.11 4%|▎ | 1846/50750 [4:51:47<80:28:58, 5.92s/it] {'loss': 0.3987, 'learning_rate': 3.9995751039707915e-05, 'epoch': 1.82} 4%|▎ | 1846/50750 [4:51:47<80:28:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:34:31,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:34:31,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.24 | bwd_microstep: 3849.76 | bwd_inner_microstep: 3842.25 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.90 [2024-11-13 21:34:31,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.24 | bwd: 3849.78 | bwd_inner: 3842.25 | bwd_allreduce: 7.49 | step: 21.90 4%|▎ | 1847/50750 [4:51:53<80:28:37, 5.92s/it] {'loss': 0.0112, 'learning_rate': 3.999572469056854e-05, 'epoch': 1.82} 4%|▎ | 1847/50750 [4:51:53<80:28:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:34:37,624] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:34:37,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.15 | bwd_microstep: 3848.41 | bwd_inner_microstep: 3840.95 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.84 [2024-11-13 21:34:37,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.15 | bwd: 3848.42 | bwd_inner: 3840.95 | bwd_allreduce: 7.44 | step: 20.84 4%|▎ | 1848/50750 [4:51:59<80:28:23, 5.92s/it] {'loss': 0.576, 'learning_rate': 3.999569825999059e-05, 'epoch': 1.82} 4%|▎ | 1848/50750 [4:51:59<80:28:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:34:43,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:34:43,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.38 | bwd_microstep: 3854.93 | bwd_inner_microstep: 3847.42 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.09 [2024-11-13 21:34:43,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.38 | bwd: 3854.95 | bwd_inner: 3847.42 | bwd_allreduce: 7.49 | step: 21.09 4%|▎ | 1849/50750 [4:52:05<80:29:53, 5.93s/it] {'loss': 0.0216, 'learning_rate': 3.999567174797417e-05, 'epoch': 1.82} 4%|▎ | 1849/50750 [4:52:05<80:29:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:34:49,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:34:49,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.59 | bwd_microstep: 3848.04 | bwd_inner_microstep: 3840.53 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.14 [2024-11-13 21:34:49,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.59 | bwd: 3848.05 | bwd_inner: 3840.53 | 
bwd_allreduce: 7.48 | step: 21.14 4%|▎ | 1850/50750 [4:52:11<80:27:34, 5.92s/it] {'loss': 0.0774, 'learning_rate': 3.9995645154519406e-05, 'epoch': 1.82} 4%|▎ | 1850/50750 [4:52:11<80:27:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:34:55,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:34:55,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.41 | bwd_microstep: 3848.93 | bwd_inner_microstep: 3841.42 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.92 [2024-11-13 21:34:55,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.41 | bwd: 3848.94 | bwd_inner: 3841.42 | bwd_allreduce: 7.49 | step: 20.93 4%|▎ | 1851/50750 [4:52:17<80:25:52, 5.92s/it] {'loss': 0.2611, 'learning_rate': 3.999561847962639e-05, 'epoch': 1.82} 4%|▎ | 1851/50750 [4:52:17<80:25:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:35:01,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:35:01,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.04 | bwd_microstep: 3853.18 | bwd_inner_microstep: 3845.62 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.75 [2024-11-13 21:35:01,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.04 | bwd: 3853.19 | bwd_inner: 3845.62 | bwd_allreduce: 7.53 | step: 21.75 4%|▎ | 1852/50750 [4:52:23<80:27:11, 5.92s/it] {'loss': 0.0356, 'learning_rate': 3.999559172329523e-05, 'epoch': 1.82} 4%|▎ | 1852/50750 [4:52:23<80:27:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:35:07,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 
[2024-11-13 21:35:07,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.07 | bwd_microstep: 3843.89 | bwd_inner_microstep: 3836.42 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.92 [2024-11-13 21:35:07,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.07 | bwd: 3843.90 | bwd_inner: 3836.42 | bwd_allreduce: 7.45 | step: 20.92 4%|▎ | 1853/50750 [4:52:29<80:24:52, 5.92s/it] {'loss': 0.1264, 'learning_rate': 3.999556488552604e-05, 'epoch': 1.83} 4%|▎ | 1853/50750 [4:52:29<80:24:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:35:13,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:35:13,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.25 | bwd_microstep: 3849.75 | bwd_inner_microstep: 3842.32 | bwd_allreduce_microstep: 7.40 | step_microstep: 20.84 [2024-11-13 21:35:13,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.25 | bwd: 3849.77 | bwd_inner: 3842.32 | bwd_allreduce: 7.41 | step: 20.84 4%|▎ | 1854/50750 [4:52:35<80:24:15, 5.92s/it] {'loss': 0.0172, 'learning_rate': 3.999553796631893e-05, 'epoch': 1.83} 4%|▎ | 1854/50750 [4:52:35<80:24:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:35:19,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 5.11 [2024-11-13 21:35:19,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.69 | bwd_microstep: 3847.73 | bwd_inner_microstep: 3840.20 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.20 [2024-11-13 21:35:19,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.69 | bwd: 3847.75 | bwd_inner: 3840.20 | bwd_allreduce: 7.51 | step: 21.20 4%|▎ | 1855/50750 [4:52:41<80:23:28, 5.92s/it] {'loss': 0.1265, 
'learning_rate': 3.9995510965674e-05, 'epoch': 1.83} 4%|▎ | 1855/50750 [4:52:41<80:23:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:35:25,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 21:35:25,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.59 | bwd_microstep: 3856.08 | bwd_inner_microstep: 3848.33 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.69 [2024-11-13 21:35:25,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.59 | bwd: 3856.10 | bwd_inner: 3848.33 | bwd_allreduce: 7.72 | step: 21.69 4%|▎ | 1856/50750 [4:52:46<80:27:47, 5.92s/it] {'loss': 0.2587, 'learning_rate': 3.999548388359138e-05, 'epoch': 1.83} 4%|▎ | 1856/50750 [4:52:46<80:27:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:35:30,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:35:30,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.37 | bwd_microstep: 3849.21 | bwd_inner_microstep: 3841.72 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.21 [2024-11-13 21:35:30,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.35 | bwd: 3849.22 | bwd_inner: 3841.72 | bwd_allreduce: 7.46 | step: 21.21 4%|▎ | 1857/50750 [4:52:52<80:27:20, 5.92s/it] {'loss': 0.0937, 'learning_rate': 3.999545672007116e-05, 'epoch': 1.83} 4%|▎ | 1857/50750 [4:52:52<80:27:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:35:36,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:35:36,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.07 | 
bwd_microstep: 3845.23 | bwd_inner_microstep: 3837.77 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.06 [2024-11-13 21:35:36,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.07 | bwd: 3845.24 | bwd_inner: 3837.77 | bwd_allreduce: 7.44 | step: 21.07 4%|▎ | 1858/50750 [4:52:58<80:26:03, 5.92s/it] {'loss': 0.0852, 'learning_rate': 3.999542947511345e-05, 'epoch': 1.83} 4%|▎ | 1858/50750 [4:52:58<80:26:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:35:42,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:35:42,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.04 | bwd_microstep: 3846.43 | bwd_inner_microstep: 3838.96 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.08 [2024-11-13 21:35:42,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.04 | bwd: 3846.44 | bwd_inner: 3838.97 | bwd_allreduce: 7.43 | step: 21.09 4%|▎ | 1859/50750 [4:53:04<80:24:47, 5.92s/it] {'loss': 0.2677, 'learning_rate': 3.9995402148718384e-05, 'epoch': 1.83} 4%|▎ | 1859/50750 [4:53:04<80:24:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:35:48,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:35:48,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.47 | bwd_microstep: 3844.23 | bwd_inner_microstep: 3836.75 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.89 [2024-11-13 21:35:48,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.45 | bwd: 3844.24 | bwd_inner: 3836.75 | bwd_allreduce: 7.46 | step: 20.89 4%|▎ | 1860/50750 [4:53:10<80:25:17, 5.92s/it] {'loss': 0.0407, 'learning_rate': 3.9995374740886053e-05, 'epoch': 1.83} 4%|▎ | 1860/50750 [4:53:10<80:25:17, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:35:54,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 21:35:54,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.65 | bwd_microstep: 3857.44 | bwd_inner_microstep: 3849.80 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.37 [2024-11-13 21:35:54,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.65 | bwd: 3857.45 | bwd_inner: 3849.80 | bwd_allreduce: 7.61 | step: 21.38 4%|▎ | 1861/50750 [4:53:16<80:26:00, 5.92s/it] {'loss': 0.0954, 'learning_rate': 3.999534725161657e-05, 'epoch': 1.83} 4%|▎ | 1861/50750 [4:53:16<80:26:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:36:00,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:36:00,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.18 | bwd_microstep: 3850.42 | bwd_inner_microstep: 3842.54 | bwd_allreduce_microstep: 7.83 | step_microstep: 20.93 [2024-11-13 21:36:00,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.18 | bwd: 3850.43 | bwd_inner: 3842.54 | bwd_allreduce: 7.85 | step: 20.94 4%|▎ | 1862/50750 [4:53:22<80:25:32, 5.92s/it] {'loss': 0.4718, 'learning_rate': 3.999531968091006e-05, 'epoch': 1.83} 4%|▎ | 1862/50750 [4:53:22<80:25:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:36:06,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:36:06,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.53 | bwd_microstep: 3853.09 | bwd_inner_microstep: 3845.62 | bwd_allreduce_microstep: 7.43 | 
step_microstep: 21.12 [2024-11-13 21:36:06,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.53 | bwd: 3853.10 | bwd_inner: 3845.62 | bwd_allreduce: 7.45 | step: 21.13 4%|▎ | 1863/50750 [4:53:28<80:27:09, 5.92s/it] {'loss': 0.0295, 'learning_rate': 3.9995292028766626e-05, 'epoch': 1.84} 4%|▎ | 1863/50750 [4:53:28<80:27:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:36:12,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:36:12,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.47 | bwd_microstep: 3848.92 | bwd_inner_microstep: 3841.43 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.05 [2024-11-13 21:36:12,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.47 | bwd: 3848.93 | bwd_inner: 3841.43 | bwd_allreduce: 7.46 | step: 21.05 4%|▎ | 1864/50750 [4:53:34<80:25:37, 5.92s/it] {'loss': 0.031, 'learning_rate': 3.9995264295186376e-05, 'epoch': 1.84} 4%|▎ | 1864/50750 [4:53:34<80:25:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:36:18,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:36:18,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.57 | bwd_microstep: 3844.44 | bwd_inner_microstep: 3836.97 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-13 21:36:18,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.57 | bwd: 3844.45 | bwd_inner: 3836.97 | bwd_allreduce: 7.44 | step: 20.89 4%|▎ | 1865/50750 [4:53:40<80:22:37, 5.92s/it] {'loss': 0.1497, 'learning_rate': 3.999523648016943e-05, 'epoch': 1.84} 4%|▎ | 1865/50750 [4:53:40<80:22:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 
21:36:24,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:36:24,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.20 | bwd_microstep: 3848.68 | bwd_inner_microstep: 3841.20 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.88 [2024-11-13 21:36:24,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.20 | bwd: 3848.70 | bwd_inner: 3841.20 | bwd_allreduce: 7.46 | step: 20.89 4%|▎ | 1866/50750 [4:53:46<80:22:48, 5.92s/it] {'loss': 0.0456, 'learning_rate': 3.99952085837159e-05, 'epoch': 1.84} 4%|▎ | 1866/50750 [4:53:46<80:22:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:36:30,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:36:30,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.48 | bwd_microstep: 3846.29 | bwd_inner_microstep: 3838.81 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.84 [2024-11-13 21:36:30,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.48 | bwd: 3846.30 | bwd_inner: 3838.81 | bwd_allreduce: 7.45 | step: 20.84 4%|▎ | 1867/50750 [4:53:52<80:21:51, 5.92s/it] {'loss': 0.343, 'learning_rate': 3.99951806058259e-05, 'epoch': 1.84} 4%|▎ | 1867/50750 [4:53:52<80:21:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:36:36,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 21:36:36,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.42 | bwd_microstep: 3847.13 | bwd_inner_microstep: 3839.66 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.92 [2024-11-13 21:36:36,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2027.42 | bwd: 3847.14 | bwd_inner: 3839.66 | bwd_allreduce: 7.44 | step: 20.93 4%|▎ | 1868/50750 [4:53:58<80:22:24, 5.92s/it] {'loss': 0.7603, 'learning_rate': 3.999515254649955e-05, 'epoch': 1.84} 4%|▎ | 1868/50750 [4:53:58<80:22:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:36:41,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 21:36:41,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.05 | bwd_microstep: 3850.08 | bwd_inner_microstep: 3842.24 | bwd_allreduce_microstep: 7.79 | step_microstep: 21.74 [2024-11-13 21:36:41,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3850.09 | bwd_inner: 3842.24 | bwd_allreduce: 7.81 | step: 21.75 4%|▎ | 1869/50750 [4:54:03<80:23:45, 5.92s/it] {'loss': 0.0043, 'learning_rate': 3.9995124405736945e-05, 'epoch': 1.84} 4%|▎ | 1869/50750 [4:54:03<80:23:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:36:47,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:36:47,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.53 | bwd_microstep: 3841.03 | bwd_inner_microstep: 3833.39 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.03 [2024-11-13 21:36:47,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.51 | bwd: 3841.05 | bwd_inner: 3833.39 | bwd_allreduce: 7.61 | step: 21.03 4%|▎ | 1870/50750 [4:54:09<80:21:53, 5.92s/it] {'loss': 0.3311, 'learning_rate': 3.999509618353822e-05, 'epoch': 1.84} 4%|▎ | 1870/50750 [4:54:09<80:21:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:36:53,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:36:53,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3849.21 | bwd_inner_microstep: 3841.48 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.61 [2024-11-13 21:36:53,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.03 | bwd: 3849.22 | bwd_inner: 3841.48 | bwd_allreduce: 7.70 | step: 21.62 4%|▎ | 1871/50750 [4:54:15<80:22:20, 5.92s/it] {'loss': 0.0085, 'learning_rate': 3.999506787990347e-05, 'epoch': 1.84} 4%|▎ | 1871/50750 [4:54:15<80:22:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:36:59,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:36:59,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.68 | bwd_microstep: 3865.99 | bwd_inner_microstep: 3858.44 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.16 [2024-11-13 21:36:59,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.68 | bwd: 3866.01 | bwd_inner: 3858.44 | bwd_allreduce: 7.53 | step: 21.17 4%|▎ | 1872/50750 [4:54:21<80:26:53, 5.93s/it] {'loss': 0.5024, 'learning_rate': 3.999503949483283e-05, 'epoch': 1.84} 4%|▎ | 1872/50750 [4:54:21<80:26:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:37:05,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 5.04 [2024-11-13 21:37:05,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.09 | bwd_microstep: 3843.52 | bwd_inner_microstep: 3836.02 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.33 [2024-11-13 21:37:05,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.09 | bwd: 3843.53 | bwd_inner: 3836.02 | bwd_allreduce: 7.47 | step: 21.33 4%|▎ | 
1873/50750 [4:54:27<80:23:13, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.999501102832641e-05, 'epoch': 1.85} 4%|▎ | 1873/50750 [4:54:27<80:23:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:37:11,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.96 [2024-11-13 21:37:11,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.38 | bwd_microstep: 3846.11 | bwd_inner_microstep: 3838.61 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.08 [2024-11-13 21:37:11,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.38 | bwd: 3846.12 | bwd_inner: 3838.61 | bwd_allreduce: 7.47 | step: 22.08 4%|▎ | 1874/50750 [4:54:33<80:21:46, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.9994982480384324e-05, 'epoch': 1.85} 4%|▎ | 1874/50750 [4:54:33<80:21:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:37:17,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:37:17,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.25 | bwd_microstep: 3853.91 | bwd_inner_microstep: 3846.26 | bwd_allreduce_microstep: 7.60 | step_microstep: 22.67 [2024-11-13 21:37:17,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.25 | bwd: 3853.92 | bwd_inner: 3846.26 | bwd_allreduce: 7.62 | step: 22.67 4%|▎ | 1875/50750 [4:54:39<80:24:19, 5.92s/it] {'loss': 0.0085, 'learning_rate': 3.999495385100669e-05, 'epoch': 1.85} 4%|▎ | 1875/50750 [4:54:39<80:24:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:37:23,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:37:23,424] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.84 | bwd_microstep: 3847.35 | bwd_inner_microstep: 3839.69 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.22 [2024-11-13 21:37:23,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.83 | bwd: 3847.36 | bwd_inner: 3839.69 | bwd_allreduce: 7.64 | step: 21.23 4%|▎ | 1876/50750 [4:54:45<80:23:47, 5.92s/it] {'loss': 0.0032, 'learning_rate': 3.9994925140193616e-05, 'epoch': 1.85} 4%|▎ | 1876/50750 [4:54:45<80:23:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:37:29,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:37:29,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.56 | bwd_microstep: 3844.46 | bwd_inner_microstep: 3836.93 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-13 21:37:29,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.56 | bwd: 3844.47 | bwd_inner: 3836.93 | bwd_allreduce: 7.50 | step: 21.07 4%|▎ | 1877/50750 [4:54:51<80:22:47, 5.92s/it] {'loss': 0.0373, 'learning_rate': 3.9994896347945225e-05, 'epoch': 1.85} 4%|▎ | 1877/50750 [4:54:51<80:22:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:37:35,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:37:35,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.23 | bwd_microstep: 3853.32 | bwd_inner_microstep: 3845.81 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98 [2024-11-13 21:37:35,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.22 | bwd: 3853.33 | bwd_inner: 3845.81 | bwd_allreduce: 7.48 | step: 20.98 4%|▎ | 1878/50750 [4:54:57<80:25:41, 5.92s/it] {'loss': 0.0364, 'learning_rate': 
3.9994867474261643e-05, 'epoch': 1.85} 4%|▎ | 1878/50750 [4:54:57<80:25:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:37:41,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.46 | optimizer_step: 4.93 [2024-11-13 21:37:41,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.16 | bwd_microstep: 3847.63 | bwd_inner_microstep: 3839.41 | bwd_allreduce_microstep: 8.16 | step_microstep: 25.15 [2024-11-13 21:37:41,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.16 | bwd: 3847.64 | bwd_inner: 3839.41 | bwd_allreduce: 8.18 | step: 25.15 4%|▎ | 1879/50750 [4:55:03<80:25:28, 5.92s/it] {'loss': 0.4045, 'learning_rate': 3.999483851914297e-05, 'epoch': 1.85} 4%|▎ | 1879/50750 [4:55:03<80:25:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:37:47,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.86 | optimizer_step: 4.92 [2024-11-13 21:37:47,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.28 | bwd_microstep: 3861.26 | bwd_inner_microstep: 3853.45 | bwd_allreduce_microstep: 7.77 | step_microstep: 23.02 [2024-11-13 21:37:47,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.27 | bwd: 3861.28 | bwd_inner: 3853.45 | bwd_allreduce: 7.79 | step: 23.04 4%|▎ | 1880/50750 [4:55:09<80:28:15, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.999480948258934e-05, 'epoch': 1.85} 4%|▎ | 1880/50750 [4:55:09<80:28:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:37:53,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:37:53,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.36 | bwd_microstep: 
3856.38 | bwd_inner_microstep: 3848.78 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.35 [2024-11-13 21:37:53,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.36 | bwd: 3856.39 | bwd_inner: 3848.78 | bwd_allreduce: 7.57 | step: 21.36 4%|▎ | 1881/50750 [4:55:15<80:28:02, 5.93s/it] {'loss': 0.0024, 'learning_rate': 3.9994780364600864e-05, 'epoch': 1.85} 4%|▎ | 1881/50750 [4:55:15<80:28:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:37:58,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.08 [2024-11-13 21:37:58,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.94 | bwd_microstep: 3850.64 | bwd_inner_microstep: 3843.10 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.55 [2024-11-13 21:37:58,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.94 | bwd: 3850.65 | bwd_inner: 3843.10 | bwd_allreduce: 7.51 | step: 21.55 4%|▎ | 1882/50750 [4:55:20<80:26:09, 5.93s/it] {'loss': 0.0213, 'learning_rate': 3.999475116517766e-05, 'epoch': 1.85} 4%|▎ | 1882/50750 [4:55:20<80:26:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:38:04,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:38:04,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.36 | bwd_microstep: 3846.28 | bwd_inner_microstep: 3838.81 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-13 21:38:04,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.36 | bwd: 3846.29 | bwd_inner: 3838.81 | bwd_allreduce: 7.44 | step: 20.88 4%|▎ | 1883/50750 [4:55:26<80:23:45, 5.92s/it] {'loss': 0.0031, 'learning_rate': 3.999472188431985e-05, 'epoch': 1.86} 4%|▎ | 1883/50750 [4:55:26<80:23:45, 5.92s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:38:10,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:38:10,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.30 | bwd_microstep: 3847.23 | bwd_inner_microstep: 3839.75 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.74 [2024-11-13 21:38:10,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.30 | bwd: 3847.24 | bwd_inner: 3839.75 | bwd_allreduce: 7.45 | step: 20.75 4%|▎ | 1884/50750 [4:55:32<80:21:29, 5.92s/it] {'loss': 0.1394, 'learning_rate': 3.999469252202755e-05, 'epoch': 1.86} 4%|▎ | 1884/50750 [4:55:32<80:21:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:38:16,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:38:16,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.02 | bwd_microstep: 3852.84 | bwd_inner_microstep: 3845.26 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.67 [2024-11-13 21:38:16,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.02 | bwd: 3852.85 | bwd_inner: 3845.26 | bwd_allreduce: 7.55 | step: 21.67 4%|▎ | 1885/50750 [4:55:38<80:23:59, 5.92s/it] {'loss': 0.0199, 'learning_rate': 3.999466307830088e-05, 'epoch': 1.86} 4%|▎ | 1885/50750 [4:55:38<80:23:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:38:22,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:38:22,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.56 | bwd_microstep: 3853.34 | bwd_inner_microstep: 3845.63 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.12 [2024-11-13 
21:38:22,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.56 | bwd: 3853.36 | bwd_inner: 3845.63 | bwd_allreduce: 7.68 | step: 21.12 4%|▎ | 1886/50750 [4:55:44<80:24:24, 5.92s/it] {'loss': 0.3861, 'learning_rate': 3.999463355313996e-05, 'epoch': 1.86} 4%|▎ | 1886/50750 [4:55:44<80:24:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:38:28,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:38:28,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.80 | bwd_microstep: 3847.26 | bwd_inner_microstep: 3839.80 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.86 [2024-11-13 21:38:28,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.79 | bwd: 3847.28 | bwd_inner: 3839.80 | bwd_allreduce: 7.44 | step: 20.87 4%|▎ | 1887/50750 [4:55:50<80:23:45, 5.92s/it] {'loss': 0.132, 'learning_rate': 3.999460394654491e-05, 'epoch': 1.86} 4%|▎ | 1887/50750 [4:55:50<80:23:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:38:34,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:38:34,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.05 | bwd_microstep: 3844.70 | bwd_inner_microstep: 3837.25 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.82 [2024-11-13 21:38:34,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.05 | bwd: 3844.71 | bwd_inner: 3837.25 | bwd_allreduce: 7.42 | step: 20.83 4%|▎ | 1888/50750 [4:55:56<80:21:04, 5.92s/it] {'loss': 0.0074, 'learning_rate': 3.999457425851586e-05, 'epoch': 1.86} 4%|▎ | 1888/50750 [4:55:56<80:21:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:38:40,414] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:38:40,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.55 | bwd_microstep: 3841.28 | bwd_inner_microstep: 3833.82 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.92 [2024-11-13 21:38:40,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.55 | bwd: 3841.30 | bwd_inner: 3833.82 | bwd_allreduce: 7.44 | step: 20.92 4%|▎ | 1889/50750 [4:56:02<80:18:32, 5.92s/it] {'loss': 0.4424, 'learning_rate': 3.999454448905291e-05, 'epoch': 1.86} 4%|▎ | 1889/50750 [4:56:02<80:18:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:38:46,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:38:46,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.05 | bwd_microstep: 3853.90 | bwd_inner_microstep: 3846.41 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.15 [2024-11-13 21:38:46,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.05 | bwd: 3853.91 | bwd_inner: 3846.41 | bwd_allreduce: 7.46 | step: 21.16 4%|▎ | 1890/50750 [4:56:08<80:20:31, 5.92s/it] {'loss': 0.3001, 'learning_rate': 3.9994514638156204e-05, 'epoch': 1.86} 4%|▎ | 1890/50750 [4:56:08<80:20:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:38:52,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:38:52,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.96 | bwd_microstep: 3852.60 | bwd_inner_microstep: 3845.13 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.07 [2024-11-13 21:38:52,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.96 | bwd: 
3852.61 | bwd_inner: 3845.13 | bwd_allreduce: 7.44 | step: 21.07 4%|▎ | 1891/50750 [4:56:14<80:23:03, 5.92s/it] {'loss': 0.0599, 'learning_rate': 3.999448470582585e-05, 'epoch': 1.86} 4%|▎ | 1891/50750 [4:56:14<80:23:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:38:58,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:38:58,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.58 | bwd_microstep: 3845.51 | bwd_inner_microstep: 3838.05 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.86 [2024-11-13 21:38:58,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.58 | bwd: 3845.52 | bwd_inner: 3838.05 | bwd_allreduce: 7.43 | step: 20.86 4%|▎ | 1892/50750 [4:56:20<80:20:57, 5.92s/it] {'loss': 0.1076, 'learning_rate': 3.999445469206197e-05, 'epoch': 1.86} 4%|▎ | 1892/50750 [4:56:20<80:20:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:39:04,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 21:39:04,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.72 | bwd_microstep: 3843.84 | bwd_inner_microstep: 3836.39 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.96 [2024-11-13 21:39:04,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.72 | bwd: 3843.85 | bwd_inner: 3836.39 | bwd_allreduce: 7.43 | step: 20.96 4%|▎ | 1893/50750 [4:56:26<80:18:42, 5.92s/it] {'loss': 0.054, 'learning_rate': 3.99944245968647e-05, 'epoch': 1.87} 4%|▎ | 1893/50750 [4:56:26<80:18:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:39:10,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | 
optimizer_step: 4.93 [2024-11-13 21:39:10,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.97 | bwd_microstep: 3842.77 | bwd_inner_microstep: 3834.94 | bwd_allreduce_microstep: 7.76 | step_microstep: 23.46 [2024-11-13 21:39:10,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.97 | bwd: 3842.79 | bwd_inner: 3834.94 | bwd_allreduce: 7.79 | step: 23.45 4%|▎ | 1894/50750 [4:56:31<80:17:21, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.999439442023414e-05, 'epoch': 1.87} 4%|▎ | 1894/50750 [4:56:31<80:17:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:39:15,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 21:39:15,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.45 | bwd_microstep: 3843.41 | bwd_inner_microstep: 3835.95 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.17 [2024-11-13 21:39:15,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.45 | bwd: 3843.42 | bwd_inner: 3835.95 | bwd_allreduce: 7.44 | step: 21.18 4%|▎ | 1895/50750 [4:56:37<80:16:08, 5.91s/it] {'loss': 0.019, 'learning_rate': 3.9994364162170434e-05, 'epoch': 1.87} 4%|▎ | 1895/50750 [4:56:37<80:16:08, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:39:21,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:39:21,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.37 | bwd_microstep: 3850.33 | bwd_inner_microstep: 3842.82 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.04 [2024-11-13 21:39:21,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.37 | bwd: 3850.34 | bwd_inner: 3842.82 | bwd_allreduce: 7.48 | step: 21.04 4%|▎ | 1896/50750 [4:56:43<80:16:54, 
5.92s/it] {'loss': 0.38, 'learning_rate': 3.999433382267369e-05, 'epoch': 1.87} 4%|▎ | 1896/50750 [4:56:43<80:16:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:39:27,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:39:27,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.76 | bwd_microstep: 3845.91 | bwd_inner_microstep: 3838.47 | bwd_allreduce_microstep: 7.40 | step_microstep: 20.91 [2024-11-13 21:39:27,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.76 | bwd: 3845.92 | bwd_inner: 3838.47 | bwd_allreduce: 7.41 | step: 20.91 4%|▎ | 1897/50750 [4:56:49<80:16:10, 5.92s/it] {'loss': 0.1344, 'learning_rate': 3.9994303401744046e-05, 'epoch': 1.87} 4%|▎ | 1897/50750 [4:56:49<80:16:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:39:33,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 21:39:33,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.04 | bwd_microstep: 3845.14 | bwd_inner_microstep: 3837.69 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.81 [2024-11-13 21:39:33,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.04 | bwd: 3845.15 | bwd_inner: 3837.69 | bwd_allreduce: 7.42 | step: 20.81 4%|▎ | 1898/50750 [4:56:55<80:16:07, 5.92s/it] {'loss': 0.0039, 'learning_rate': 3.999427289938161e-05, 'epoch': 1.87} 4%|▎ | 1898/50750 [4:56:55<80:16:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:39:39,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:39:39,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2022.12 | bwd_microstep: 3845.70 | bwd_inner_microstep: 3838.21 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.02 [2024-11-13 21:39:39,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.12 | bwd: 3845.71 | bwd_inner: 3838.20 | bwd_allreduce: 7.46 | step: 21.03 4%|▎ | 1899/50750 [4:57:01<80:15:46, 5.91s/it] {'loss': 0.5189, 'learning_rate': 3.999424231558652e-05, 'epoch': 1.87} 4%|▎ | 1899/50750 [4:57:01<80:15:46, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:39:45,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:39:45,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.40 | bwd_microstep: 3845.99 | bwd_inner_microstep: 3838.46 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.02 [2024-11-13 21:39:45,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.40 | bwd: 3846.01 | bwd_inner: 3838.46 | bwd_allreduce: 7.50 | step: 21.02 4%|▎ | 1900/50750 [4:57:07<80:15:42, 5.91s/it] {'loss': 0.0002, 'learning_rate': 3.9994211650358884e-05, 'epoch': 1.87} 4%|▎ | 1900/50750 [4:57:07<80:15:42, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:39:51,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:39:51,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3848.08 | bwd_inner_microstep: 3840.56 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.15 [2024-11-13 21:39:51,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.22 | bwd: 3848.10 | bwd_inner: 3840.56 | bwd_allreduce: 7.50 | step: 21.15 4%|▎ | 1901/50750 [4:57:13<80:16:49, 5.92s/it] {'loss': 0.0096, 'learning_rate': 3.9994180903698844e-05, 'epoch': 1.87} 4%|▎ | 1901/50750 
[4:57:13<80:16:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:39:57,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.94 [2024-11-13 21:39:57,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.61 | bwd_microstep: 3845.95 | bwd_inner_microstep: 3838.33 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.40 [2024-11-13 21:39:57,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.61 | bwd: 3845.96 | bwd_inner: 3838.33 | bwd_allreduce: 7.59 | step: 21.40 4%|▎ | 1902/50750 [4:57:19<80:16:47, 5.92s/it] {'loss': 0.2758, 'learning_rate': 3.999415007560652e-05, 'epoch': 1.87} 4%|▎ | 1902/50750 [4:57:19<80:16:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:40:03,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:40:03,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.27 | bwd_microstep: 3844.73 | bwd_inner_microstep: 3837.20 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.15 [2024-11-13 21:40:03,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.27 | bwd: 3844.74 | bwd_inner: 3837.20 | bwd_allreduce: 7.50 | step: 21.16 4%|▎ | 1903/50750 [4:57:25<80:16:55, 5.92s/it] {'loss': 0.0024, 'learning_rate': 3.9994119166082035e-05, 'epoch': 1.87} 4%|▎ | 1903/50750 [4:57:25<80:16:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:40:09,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:40:09,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.05 | bwd_microstep: 3848.30 | bwd_inner_microstep: 3840.78 | 
bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-13 21:40:09,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3848.31 | bwd_inner: 3840.78 | bwd_allreduce: 7.49 | step: 21.11 4%|▍ | 1904/50750 [4:57:31<80:17:23, 5.92s/it] {'loss': 0.088, 'learning_rate': 3.999408817512552e-05, 'epoch': 1.88} 4%|▍ | 1904/50750 [4:57:31<80:17:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:40:15,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:40:15,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.73 | bwd_microstep: 3847.63 | bwd_inner_microstep: 3840.12 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.18 [2024-11-13 21:40:15,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.73 | bwd: 3847.64 | bwd_inner: 3840.12 | bwd_allreduce: 7.48 | step: 21.19 4%|▍ | 1905/50750 [4:57:37<80:17:45, 5.92s/it] {'loss': 0.6611, 'learning_rate': 3.999405710273709e-05, 'epoch': 1.88} 4%|▍ | 1905/50750 [4:57:37<80:17:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:40:21,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:40:21,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.29 | bwd_microstep: 3845.95 | bwd_inner_microstep: 3838.43 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.11 [2024-11-13 21:40:21,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3845.96 | bwd_inner: 3838.43 | bwd_allreduce: 7.49 | step: 21.12 4%|▍ | 1906/50750 [4:57:42<80:18:37, 5.92s/it] {'loss': 0.2185, 'learning_rate': 3.999402594891689e-05, 'epoch': 1.88} 4%|▍ | 1906/50750 [4:57:42<80:18:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-13 21:40:26,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:40:26,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.60 | bwd_microstep: 3849.60 | bwd_inner_microstep: 3842.09 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.74 [2024-11-13 21:40:26,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.59 | bwd: 3849.61 | bwd_inner: 3842.09 | bwd_allreduce: 7.48 | step: 21.74 4%|▍ | 1907/50750 [4:57:48<80:19:10, 5.92s/it] {'loss': 0.1016, 'learning_rate': 3.999399471366502e-05, 'epoch': 1.88} 4%|▍ | 1907/50750 [4:57:48<80:19:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:40:32,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 5.03 [2024-11-13 21:40:32,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.59 | bwd_microstep: 3847.68 | bwd_inner_microstep: 3840.15 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.22 [2024-11-13 21:40:32,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.59 | bwd: 3847.69 | bwd_inner: 3840.15 | bwd_allreduce: 7.50 | step: 21.23 4%|▍ | 1908/50750 [4:57:54<80:19:33, 5.92s/it] {'loss': 0.0063, 'learning_rate': 3.999396339698164e-05, 'epoch': 1.88} 4%|▍ | 1908/50750 [4:57:54<80:19:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:40:38,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 21:40:38,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.75 | bwd_microstep: 3846.58 | bwd_inner_microstep: 3839.08 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.01 [2024-11-13 21:40:38,770] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.75 | bwd: 3846.59 | bwd_inner: 3839.08 | bwd_allreduce: 7.47 | step: 21.01 4%|▍ | 1909/50750 [4:58:00<80:18:09, 5.92s/it] {'loss': 0.0045, 'learning_rate': 3.999393199886685e-05, 'epoch': 1.88} 4%|▍ | 1909/50750 [4:58:00<80:18:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:40:44,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:40:44,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3847.11 | bwd_inner_microstep: 3839.60 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.56 [2024-11-13 21:40:44,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3847.12 | bwd_inner: 3839.60 | bwd_allreduce: 7.48 | step: 21.56 4%|▍ | 1910/50750 [4:58:06<80:17:44, 5.92s/it] {'loss': 0.004, 'learning_rate': 3.999390051932078e-05, 'epoch': 1.88} 4%|▍ | 1910/50750 [4:58:06<80:17:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:40:50,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.92 [2024-11-13 21:40:50,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.72 | bwd_microstep: 3849.88 | bwd_inner_microstep: 3841.83 | bwd_allreduce_microstep: 8.00 | step_microstep: 21.91 [2024-11-13 21:40:50,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.72 | bwd: 3849.90 | bwd_inner: 3841.83 | bwd_allreduce: 8.02 | step: 21.91 4%|▍ | 1911/50750 [4:58:12<80:18:52, 5.92s/it] {'loss': 0.2414, 'learning_rate': 3.999386895834358e-05, 'epoch': 1.88} 4%|▍ | 1911/50750 [4:58:12<80:18:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:40:56,525] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 21:40:56,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.75 | bwd_microstep: 3846.12 | bwd_inner_microstep: 3838.44 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.28 [2024-11-13 21:40:56,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.75 | bwd: 3846.13 | bwd_inner: 3838.44 | bwd_allreduce: 7.65 | step: 21.29 4%|▍ | 1912/50750 [4:58:18<80:17:30, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.999383731593537e-05, 'epoch': 1.88} 4%|▍ | 1912/50750 [4:58:18<80:17:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:41:02,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:41:02,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3852.76 | bwd_inner_microstep: 3845.29 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.85 [2024-11-13 21:41:02,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3852.77 | bwd_inner: 3845.29 | bwd_allreduce: 7.45 | step: 20.85 4%|▍ | 1913/50750 [4:58:24<80:18:33, 5.92s/it] {'loss': 0.0025, 'learning_rate': 3.999380559209626e-05, 'epoch': 1.88} 4%|▍ | 1913/50750 [4:58:24<80:18:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:41:08,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 21:41:08,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.25 | bwd_microstep: 3851.96 | bwd_inner_microstep: 3844.49 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.77 [2024-11-13 21:41:08,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.25 | bwd: 3851.97 | bwd_inner: 3844.49 | 
bwd_allreduce: 7.44 | step: 20.78 4%|▍ | 1914/50750 [4:58:30<80:18:39, 5.92s/it] {'loss': 0.0101, 'learning_rate': 3.9993773786826404e-05, 'epoch': 1.89} 4%|▍ | 1914/50750 [4:58:30<80:18:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:41:14,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 21:41:14,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.16 | bwd_microstep: 3848.30 | bwd_inner_microstep: 3840.83 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.14 [2024-11-13 21:41:14,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.16 | bwd: 3848.32 | bwd_inner: 3840.83 | bwd_allreduce: 7.45 | step: 21.14 4%|▍ | 1915/50750 [4:58:36<80:18:21, 5.92s/it] {'loss': 0.0397, 'learning_rate': 3.9993741900125916e-05, 'epoch': 1.89} 4%|▍ | 1915/50750 [4:58:36<80:18:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:41:20,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:41:20,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.89 | bwd_microstep: 3844.62 | bwd_inner_microstep: 3837.14 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.20 [2024-11-13 21:41:20,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.89 | bwd: 3844.63 | bwd_inner: 3837.15 | bwd_allreduce: 7.45 | step: 21.20 4%|▍ | 1916/50750 [4:58:42<80:17:00, 5.92s/it] {'loss': 0.3848, 'learning_rate': 3.9993709931994934e-05, 'epoch': 1.89} 4%|▍ | 1916/50750 [4:58:42<80:17:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:41:26,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 
[2024-11-13 21:41:26,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.75 | bwd_microstep: 3848.20 | bwd_inner_microstep: 3840.72 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.92 [2024-11-13 21:41:26,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.75 | bwd: 3848.22 | bwd_inner: 3840.72 | bwd_allreduce: 7.45 | step: 20.92 4%|▍ | 1917/50750 [4:58:48<80:17:13, 5.92s/it] {'loss': 0.1697, 'learning_rate': 3.999367788243358e-05, 'epoch': 1.89} 4%|▍ | 1917/50750 [4:58:48<80:17:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:41:32,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:41:32,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 3860.49 | bwd_inner_microstep: 3852.51 | bwd_allreduce_microstep: 7.93 | step_microstep: 21.60 [2024-11-13 21:41:32,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.00 | bwd: 3860.50 | bwd_inner: 3852.51 | bwd_allreduce: 7.95 | step: 21.60 4%|▍ | 1918/50750 [4:58:54<80:21:10, 5.92s/it] {'loss': 0.0056, 'learning_rate': 3.999364575144199e-05, 'epoch': 1.89} 4%|▍ | 1918/50750 [4:58:54<80:21:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:41:37,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 21:41:37,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.70 | bwd_microstep: 3855.69 | bwd_inner_microstep: 3848.08 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.68 [2024-11-13 21:41:37,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.70 | bwd: 3855.70 | bwd_inner: 3848.08 | bwd_allreduce: 7.58 | step: 21.69 4%|▍ | 1919/50750 [4:58:59<80:23:24, 5.93s/it] {'loss': 0.1726, 
'learning_rate': 3.99936135390203e-05, 'epoch': 1.89} 4%|▍ | 1919/50750 [4:58:59<80:23:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:41:43,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:41:43,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.54 | bwd_microstep: 3846.90 | bwd_inner_microstep: 3839.39 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-13 21:41:43,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.53 | bwd: 3846.91 | bwd_inner: 3839.39 | bwd_allreduce: 7.48 | step: 21.02 4%|▍ | 1920/50750 [4:59:05<80:23:30, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.999358124516863e-05, 'epoch': 1.89} 4%|▍ | 1920/50750 [4:59:05<80:23:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:41:49,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 5.11 [2024-11-13 21:41:49,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.15 | bwd_microstep: 3853.26 | bwd_inner_microstep: 3845.64 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.95 [2024-11-13 21:41:49,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.15 | bwd: 3853.28 | bwd_inner: 3845.64 | bwd_allreduce: 7.60 | step: 21.96 4%|▍ | 1921/50750 [4:59:11<80:23:35, 5.93s/it] {'loss': 0.8605, 'learning_rate': 3.9993548869887116e-05, 'epoch': 1.89} 4%|▍ | 1921/50750 [4:59:11<80:23:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:41:55,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:41:55,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.13 | 
bwd_microstep: 3848.40 | bwd_inner_microstep: 3840.88 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.62 [2024-11-13 21:41:55,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.11 | bwd: 3848.42 | bwd_inner: 3840.88 | bwd_allreduce: 7.49 | step: 21.62 4%|▍ | 1922/50750 [4:59:17<80:21:45, 5.93s/it] {'loss': 0.3121, 'learning_rate': 3.9993516413175895e-05, 'epoch': 1.89} 4%|▍ | 1922/50750 [4:59:17<80:21:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:42:01,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:42:01,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.95 | bwd_microstep: 3846.18 | bwd_inner_microstep: 3838.68 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.17 [2024-11-13 21:42:01,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.94 | bwd: 3846.19 | bwd_inner: 3838.68 | bwd_allreduce: 7.48 | step: 21.17 4%|▍ | 1923/50750 [4:59:23<80:19:31, 5.92s/it] {'loss': 0.0951, 'learning_rate': 3.9993483875035094e-05, 'epoch': 1.89} 4%|▍ | 1923/50750 [4:59:23<80:19:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:42:07,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.94 [2024-11-13 21:42:07,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.83 | bwd_microstep: 3857.85 | bwd_inner_microstep: 3850.34 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-13 21:42:07,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.83 | bwd: 3857.87 | bwd_inner: 3850.34 | bwd_allreduce: 7.49 | step: 21.13 4%|▍ | 1924/50750 [4:59:29<80:20:54, 5.92s/it] {'loss': 0.3816, 'learning_rate': 3.999345125546484e-05, 'epoch': 1.9} 4%|▍ | 1924/50750 [4:59:29<80:20:54, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:42:13,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:42:13,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.79 | bwd_microstep: 3845.52 | bwd_inner_microstep: 3838.00 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.04 [2024-11-13 21:42:13,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.78 | bwd: 3845.53 | bwd_inner: 3838.00 | bwd_allreduce: 7.49 | step: 21.04 4%|▍ | 1925/50750 [4:59:35<80:18:58, 5.92s/it] {'loss': 0.0062, 'learning_rate': 3.999341855446528e-05, 'epoch': 1.9} 4%|▍ | 1925/50750 [4:59:35<80:18:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:42:19,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:42:19,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.92 | bwd_microstep: 3850.39 | bwd_inner_microstep: 3842.87 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.13 [2024-11-13 21:42:19,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | bwd: 3850.41 | bwd_inner: 3842.87 | bwd_allreduce: 7.50 | step: 21.13 4%|▍ | 1926/50750 [4:59:41<80:19:02, 5.92s/it] {'loss': 0.0191, 'learning_rate': 3.9993385772036536e-05, 'epoch': 1.9} 4%|▍ | 1926/50750 [4:59:41<80:19:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:42:25,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:42:25,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.61 | bwd_microstep: 3847.91 | bwd_inner_microstep: 3840.38 | bwd_allreduce_microstep: 7.49 | 
step_microstep: 21.23 [2024-11-13 21:42:25,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.61 | bwd: 3847.92 | bwd_inner: 3840.38 | bwd_allreduce: 7.51 | step: 21.23 4%|▍ | 1927/50750 [4:59:47<80:18:32, 5.92s/it] {'loss': 0.0088, 'learning_rate': 3.999335290817875e-05, 'epoch': 1.9} 4%|▍ | 1927/50750 [4:59:47<80:18:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:42:31,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:42:31,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.09 | bwd_microstep: 3847.66 | bwd_inner_microstep: 3840.16 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.00 [2024-11-13 21:42:31,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.10 | bwd: 3847.67 | bwd_inner: 3840.16 | bwd_allreduce: 7.47 | step: 21.00 4%|▍ | 1928/50750 [4:59:53<80:18:40, 5.92s/it] {'loss': 0.0111, 'learning_rate': 3.9993319962892046e-05, 'epoch': 1.9} 4%|▍ | 1928/50750 [4:59:53<80:18:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:42:37,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:42:37,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.59 | bwd_microstep: 3847.51 | bwd_inner_microstep: 3839.98 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.14 [2024-11-13 21:42:37,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.59 | bwd: 3847.52 | bwd_inner: 3839.98 | bwd_allreduce: 7.51 | step: 21.14 4%|▍ | 1929/50750 [4:59:59<80:17:50, 5.92s/it] {'loss': 0.5745, 'learning_rate': 3.999328693617656e-05, 'epoch': 1.9} 4%|▍ | 1929/50750 [4:59:59<80:17:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
21:42:43,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:42:43,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.33 | bwd_microstep: 3851.90 | bwd_inner_microstep: 3844.37 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.00 [2024-11-13 21:42:43,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.33 | bwd: 3851.91 | bwd_inner: 3844.37 | bwd_allreduce: 7.50 | step: 21.00 4%|▍ | 1930/50750 [5:00:05<80:18:19, 5.92s/it] {'loss': 0.0065, 'learning_rate': 3.999325382803244e-05, 'epoch': 1.9} 4%|▍ | 1930/50750 [5:00:05<80:18:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:42:49,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 21:42:49,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.14 | bwd_microstep: 3851.28 | bwd_inner_microstep: 3843.77 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.75 [2024-11-13 21:42:49,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.14 | bwd: 3851.29 | bwd_inner: 3843.77 | bwd_allreduce: 7.48 | step: 21.76 4%|▍ | 1931/50750 [5:00:11<80:19:30, 5.92s/it] {'loss': 0.014, 'learning_rate': 3.99932206384598e-05, 'epoch': 1.9} 4%|▍ | 1931/50750 [5:00:11<80:19:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:42:55,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.57 | optimizer_step: 5.12 [2024-11-13 21:42:55,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.84 | bwd_microstep: 3854.32 | bwd_inner_microstep: 3846.33 | bwd_allreduce_microstep: 7.93 | step_microstep: 29.51 [2024-11-13 21:42:55,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2028.84 | bwd: 3854.33 | bwd_inner: 3846.33 | bwd_allreduce: 7.95 | step: 29.51 4%|▍ | 1932/50750 [5:00:16<80:23:39, 5.93s/it] {'loss': 0.0199, 'learning_rate': 3.9993187367458796e-05, 'epoch': 1.9} 4%|▍ | 1932/50750 [5:00:16<80:23:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:43:00,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:43:00,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.99 | bwd_microstep: 3849.16 | bwd_inner_microstep: 3841.65 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.01 [2024-11-13 21:43:00,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.97 | bwd: 3849.18 | bwd_inner: 3841.65 | bwd_allreduce: 7.49 | step: 21.01 4%|▍ | 1933/50750 [5:00:22<80:22:11, 5.93s/it] {'loss': 0.0011, 'learning_rate': 3.9993154015029545e-05, 'epoch': 1.9} 4%|▍ | 1933/50750 [5:00:22<80:22:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:43:06,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 21:43:06,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3852.30 | bwd_inner_microstep: 3844.71 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.50 [2024-11-13 21:43:06,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.18 | bwd: 3852.32 | bwd_inner: 3844.71 | bwd_allreduce: 7.56 | step: 21.50 4%|▍ | 1934/50750 [5:00:28<80:21:51, 5.93s/it] {'loss': 0.337, 'learning_rate': 3.99931205811722e-05, 'epoch': 1.91} 4%|▍ | 1934/50750 [5:00:28<80:21:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:43:12,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:43:12,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.76 | bwd_microstep: 3851.25 | bwd_inner_microstep: 3843.70 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.35 [2024-11-13 21:43:12,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.75 | bwd: 3851.26 | bwd_inner: 3843.70 | bwd_allreduce: 7.53 | step: 21.35 4%|▍ | 1935/50750 [5:00:34<80:23:26, 5.93s/it] {'loss': 0.0054, 'learning_rate': 3.999308706588688e-05, 'epoch': 1.91} 4%|▍ | 1935/50750 [5:00:34<80:23:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:43:18,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:43:18,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.31 | bwd_microstep: 3851.72 | bwd_inner_microstep: 3844.14 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.56 [2024-11-13 21:43:18,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.29 | bwd: 3851.74 | bwd_inner: 3844.14 | bwd_allreduce: 7.56 | step: 21.57 4%|▍ | 1936/50750 [5:00:40<80:24:22, 5.93s/it] {'loss': 0.3338, 'learning_rate': 3.9993053469173724e-05, 'epoch': 1.91} 4%|▍ | 1936/50750 [5:00:40<80:24:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:43:25,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:43:25,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | bwd_microstep: 4221.60 | bwd_inner_microstep: 4214.01 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.54 [2024-11-13 21:43:25,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.92 | bwd: 4221.61 | bwd_inner: 4214.01 | bwd_allreduce: 7.56 | step: 21.54 4%|▍ | 
1937/50750 [5:00:46<81:53:29, 6.04s/it] {'loss': 0.0035, 'learning_rate': 3.999301979103289e-05, 'epoch': 1.91} 4%|▍ | 1937/50750 [5:00:46<81:53:29, 6.04s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:43:30,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:43:30,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.00 | bwd_microstep: 3848.61 | bwd_inner_microstep: 3841.04 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.64 [2024-11-13 21:43:30,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.00 | bwd: 3848.62 | bwd_inner: 3841.04 | bwd_allreduce: 7.53 | step: 21.65 4%|▍ | 1938/50750 [5:00:52<81:24:55, 6.00s/it] {'loss': 0.0154, 'learning_rate': 3.9992986031464485e-05, 'epoch': 1.91} 4%|▍ | 1938/50750 [5:00:52<81:24:55, 6.00s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:43:36,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:43:36,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.43 | bwd_microstep: 3849.03 | bwd_inner_microstep: 3841.45 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.69 [2024-11-13 21:43:36,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.43 | bwd: 3849.04 | bwd_inner: 3841.45 | bwd_allreduce: 7.55 | step: 21.69 4%|▍ | 1939/50750 [5:00:58<81:07:06, 5.98s/it] {'loss': 0.1921, 'learning_rate': 3.999295219046867e-05, 'epoch': 1.91} 4%|▍ | 1939/50750 [5:00:58<81:07:06, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:43:42,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:43:42,810] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.18 | bwd_microstep: 3858.32 | bwd_inner_microstep: 3850.84 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.08 [2024-11-13 21:43:42,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.18 | bwd: 3858.34 | bwd_inner: 3850.84 | bwd_allreduce: 7.46 | step: 21.09 4%|▍ | 1940/50750 [5:01:04<80:56:45, 5.97s/it] {'loss': 0.2201, 'learning_rate': 3.999291826804557e-05, 'epoch': 1.91} 4%|▍ | 1940/50750 [5:01:04<80:56:45, 5.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:43:48,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 21:43:48,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.28 | bwd_microstep: 3848.39 | bwd_inner_microstep: 3840.66 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.13 [2024-11-13 21:43:48,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.25 | bwd: 3848.41 | bwd_inner: 3840.66 | bwd_allreduce: 7.70 | step: 22.13 4%|▍ | 1941/50750 [5:01:10<80:45:03, 5.96s/it] {'loss': 0.0014, 'learning_rate': 3.999288426419533e-05, 'epoch': 1.91} 4%|▍ | 1941/50750 [5:01:10<80:45:03, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-13 21:43:54,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:43:54,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.83 | bwd_microstep: 3850.10 | bwd_inner_microstep: 3842.54 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.39 [2024-11-13 21:43:54,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.81 | bwd: 3850.11 | bwd_inner: 3842.54 | bwd_allreduce: 7.53 | step: 21.39 4%|▍ | 1942/50750 [5:01:16<80:37:53, 5.95s/it] {'loss': 0.0365, 'learning_rate': 
3.999285017891808e-05, 'epoch': 1.91} 4%|▍ | 1942/50750 [5:01:16<80:37:53, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:44:00,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:44:00,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.24 | bwd_microstep: 3853.11 | bwd_inner_microstep: 3845.56 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.46 [2024-11-13 21:44:00,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.24 | bwd: 3853.12 | bwd_inner: 3845.56 | bwd_allreduce: 7.52 | step: 21.46 4%|▍ | 1943/50750 [5:01:22<80:34:30, 5.94s/it] {'loss': 0.0183, 'learning_rate': 3.9992816012213976e-05, 'epoch': 1.91} 4%|▍ | 1943/50750 [5:01:22<80:34:30, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:44:06,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.45 | optimizer_step: 4.92 [2024-11-13 21:44:06,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.01 | bwd_microstep: 3852.38 | bwd_inner_microstep: 3844.58 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.55 [2024-11-13 21:44:06,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.00 | bwd: 3852.39 | bwd_inner: 3844.58 | bwd_allreduce: 7.77 | step: 22.55 4%|▍ | 1944/50750 [5:01:28<80:31:25, 5.94s/it] {'loss': 0.411, 'learning_rate': 3.9992781764083135e-05, 'epoch': 1.92} 4%|▍ | 1944/50750 [5:01:28<80:31:25, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:44:12,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 21:44:12,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.83 | bwd_microstep: 
3850.61 | bwd_inner_microstep: 3842.88 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.18 [2024-11-13 21:44:12,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.82 | bwd: 3850.62 | bwd_inner: 3842.88 | bwd_allreduce: 7.70 | step: 22.18 4%|▍ | 1945/50750 [5:01:34<80:29:03, 5.94s/it] {'loss': 0.0247, 'learning_rate': 3.999274743452571e-05, 'epoch': 1.92} 4%|▍ | 1945/50750 [5:01:34<80:29:03, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:44:18,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 21:44:18,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.36 | bwd_microstep: 3849.30 | bwd_inner_microstep: 3841.53 | bwd_allreduce_microstep: 7.72 | step_microstep: 22.37 [2024-11-13 21:44:18,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.34 | bwd: 3849.31 | bwd_inner: 3841.53 | bwd_allreduce: 7.74 | step: 22.38 4%|▍ | 1946/50750 [5:01:40<80:26:16, 5.93s/it] {'loss': 0.0903, 'learning_rate': 3.999271302354184e-05, 'epoch': 1.92} 4%|▍ | 1946/50750 [5:01:40<80:26:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:44:24,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:44:24,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.70 | bwd_microstep: 3844.70 | bwd_inner_microstep: 3837.13 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.57 [2024-11-13 21:44:24,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.70 | bwd: 3844.71 | bwd_inner: 3837.13 | bwd_allreduce: 7.54 | step: 21.58 4%|▍ | 1947/50750 [5:01:46<80:22:11, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.999267853113166e-05, 'epoch': 1.92} 4%|▍ | 1947/50750 [5:01:46<80:22:11, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:44:30,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:44:30,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.59 | bwd_microstep: 3848.46 | bwd_inner_microstep: 3840.89 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.52 [2024-11-13 21:44:30,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.58 | bwd: 3848.47 | bwd_inner: 3840.89 | bwd_allreduce: 7.54 | step: 21.52 4%|▍ | 1948/50750 [5:01:52<80:20:37, 5.93s/it] {'loss': 0.1858, 'learning_rate': 3.999264395729532e-05, 'epoch': 1.92} 4%|▍ | 1948/50750 [5:01:52<80:20:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:44:36,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:44:36,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.78 | bwd_microstep: 3848.00 | bwd_inner_microstep: 3840.42 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.55 [2024-11-13 21:44:36,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.78 | bwd: 3848.02 | bwd_inner: 3840.42 | bwd_allreduce: 7.56 | step: 21.56 4%|▍ | 1949/50750 [5:01:58<80:20:42, 5.93s/it] {'loss': 0.0475, 'learning_rate': 3.999260930203295e-05, 'epoch': 1.92} 4%|▍ | 1949/50750 [5:01:58<80:20:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:44:42,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 21:44:42,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.29 | bwd_microstep: 3845.19 | bwd_inner_microstep: 3837.60 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.46 [2024-11-13 
21:44:42,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.27 | bwd: 3845.20 | bwd_inner: 3837.60 | bwd_allreduce: 7.56 | step: 21.47 4%|▍ | 1950/50750 [5:02:04<80:19:58, 5.93s/it] {'loss': 0.3388, 'learning_rate': 3.999257456534469e-05, 'epoch': 1.92} 4%|▍ | 1950/50750 [5:02:04<80:19:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:44:47,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:44:47,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.92 | bwd_microstep: 3845.43 | bwd_inner_microstep: 3837.88 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.66 [2024-11-13 21:44:47,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.91 | bwd: 3845.44 | bwd_inner: 3837.88 | bwd_allreduce: 7.51 | step: 21.67 4%|▍ | 1951/50750 [5:02:09<80:18:04, 5.92s/it] {'loss': 0.042, 'learning_rate': 3.9992539747230694e-05, 'epoch': 1.92} 4%|▍ | 1951/50750 [5:02:09<80:18:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:44:53,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 21:44:53,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.80 | bwd_microstep: 3850.49 | bwd_inner_microstep: 3842.50 | bwd_allreduce_microstep: 7.95 | step_microstep: 22.71 [2024-11-13 21:44:53,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.80 | bwd: 3850.50 | bwd_inner: 3842.50 | bwd_allreduce: 7.96 | step: 22.71 4%|▍ | 1952/50750 [5:02:15<80:20:08, 5.93s/it] {'loss': 0.0557, 'learning_rate': 3.9992504847691097e-05, 'epoch': 1.92} 4%|▍ | 1952/50750 [5:02:15<80:20:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:44:59,851] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:44:59,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.94 | bwd_microstep: 3849.61 | bwd_inner_microstep: 3842.01 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.47 [2024-11-13 21:44:59,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.93 | bwd: 3849.62 | bwd_inner: 3842.01 | bwd_allreduce: 7.57 | step: 21.47 4%|▍ | 1953/50750 [5:02:21<80:20:22, 5.93s/it] {'loss': 0.0018, 'learning_rate': 3.999246986672604e-05, 'epoch': 1.92} 4%|▍ | 1953/50750 [5:02:21<80:20:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:45:05,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:45:05,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.23 | bwd_microstep: 3851.21 | bwd_inner_microstep: 3843.62 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.48 [2024-11-13 21:45:05,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.23 | bwd: 3851.23 | bwd_inner: 3843.62 | bwd_allreduce: 7.57 | step: 21.48 4%|▍ | 1954/50750 [5:02:27<80:20:31, 5.93s/it] {'loss': 0.2563, 'learning_rate': 3.999243480433566e-05, 'epoch': 1.93} 4%|▍ | 1954/50750 [5:02:27<80:20:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:45:11,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:45:11,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.33 | bwd_microstep: 3849.32 | bwd_inner_microstep: 3841.76 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.55 [2024-11-13 21:45:11,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.33 | bwd: 3849.34 
| bwd_inner: 3841.76 | bwd_allreduce: 7.54 | step: 21.55 4%|▍ | 1955/50750 [5:02:33<80:19:50, 5.93s/it] {'loss': 0.138, 'learning_rate': 3.999239966052011e-05, 'epoch': 1.93} 4%|▍ | 1955/50750 [5:02:33<80:19:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:45:17,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.62 | optimizer_step: 4.93 [2024-11-13 21:45:17,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.57 | bwd_microstep: 3846.22 | bwd_inner_microstep: 3838.71 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.74 [2024-11-13 21:45:17,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.56 | bwd: 3846.23 | bwd_inner: 3838.71 | bwd_allreduce: 7.48 | step: 22.75 4%|▍ | 1956/50750 [5:02:39<80:19:42, 5.93s/it] {'loss': 0.2743, 'learning_rate': 3.9992364435279534e-05, 'epoch': 1.93} 4%|▍ | 1956/50750 [5:02:39<80:19:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:45:23,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.91 | optimizer_step: 4.93 [2024-11-13 21:45:23,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3846.96 | bwd_inner_microstep: 3839.44 | bwd_allreduce_microstep: 7.47 | step_microstep: 23.34 [2024-11-13 21:45:23,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.78 | bwd: 3846.97 | bwd_inner: 3839.44 | bwd_allreduce: 7.49 | step: 23.36 4%|▍ | 1957/50750 [5:02:45<80:19:50, 5.93s/it] {'loss': 0.0279, 'learning_rate': 3.9992329128614065e-05, 'epoch': 1.93} 4%|▍ | 1957/50750 [5:02:45<80:19:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:45:29,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | 
optimizer_step: 4.92 [2024-11-13 21:45:29,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.64 | bwd_microstep: 3856.00 | bwd_inner_microstep: 3848.45 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.37 [2024-11-13 21:45:29,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.62 | bwd: 3856.01 | bwd_inner: 3848.45 | bwd_allreduce: 7.52 | step: 21.38 4%|▍ | 1958/50750 [5:02:51<80:22:26, 5.93s/it] {'loss': 0.1138, 'learning_rate': 3.9992293740523854e-05, 'epoch': 1.93} 4%|▍ | 1958/50750 [5:02:51<80:22:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:45:35,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:45:35,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.96 | bwd_microstep: 3849.11 | bwd_inner_microstep: 3841.33 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.98 [2024-11-13 21:45:35,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.96 | bwd: 3849.12 | bwd_inner: 3841.33 | bwd_allreduce: 7.75 | step: 21.98 4%|▍ | 1959/50750 [5:02:57<80:20:27, 5.93s/it] {'loss': 0.1513, 'learning_rate': 3.9992258271009044e-05, 'epoch': 1.93} 4%|▍ | 1959/50750 [5:02:57<80:20:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:45:41,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:45:41,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.45 | bwd_microstep: 3847.28 | bwd_inner_microstep: 3839.69 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.41 [2024-11-13 21:45:41,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.44 | bwd: 3847.29 | bwd_inner: 3839.69 | bwd_allreduce: 7.56 | step: 21.42 4%|▍ | 1960/50750 [5:03:03<80:18:48, 
5.93s/it] {'loss': 0.238, 'learning_rate': 3.999222272006979e-05, 'epoch': 1.93} 4%|▍ | 1960/50750 [5:03:03<80:18:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:45:47,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.00 [2024-11-13 21:45:47,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.27 | bwd_microstep: 3849.12 | bwd_inner_microstep: 3841.59 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.71 [2024-11-13 21:45:47,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.27 | bwd: 3849.14 | bwd_inner: 3841.59 | bwd_allreduce: 7.51 | step: 21.72 4%|▍ | 1961/50750 [5:03:09<80:19:08, 5.93s/it] {'loss': 0.0049, 'learning_rate': 3.9992187087706214e-05, 'epoch': 1.93} 4%|▍ | 1961/50750 [5:03:09<80:19:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:45:53,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:45:53,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.09 | bwd_microstep: 3851.17 | bwd_inner_microstep: 3843.60 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.45 [2024-11-13 21:45:53,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.07 | bwd: 3851.19 | bwd_inner: 3843.60 | bwd_allreduce: 7.55 | step: 21.46 4%|▍ | 1962/50750 [5:03:15<80:19:14, 5.93s/it] {'loss': 0.5637, 'learning_rate': 3.999215137391847e-05, 'epoch': 1.93} 4%|▍ | 1962/50750 [5:03:15<80:19:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:45:59,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:45:59,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2026.44 | bwd_microstep: 3844.09 | bwd_inner_microstep: 3836.52 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.41 [2024-11-13 21:45:59,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.44 | bwd: 3844.11 | bwd_inner: 3836.52 | bwd_allreduce: 7.54 | step: 21.41 4%|▍ | 1963/50750 [5:03:21<80:17:15, 5.92s/it] {'loss': 0.0069, 'learning_rate': 3.999211557870671e-05, 'epoch': 1.93} 4%|▍ | 1963/50750 [5:03:21<80:17:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:46:05,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.92 [2024-11-13 21:46:05,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.70 | bwd_microstep: 3850.33 | bwd_inner_microstep: 3842.39 | bwd_allreduce_microstep: 7.90 | step_microstep: 23.38 [2024-11-13 21:46:05,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.70 | bwd: 3850.35 | bwd_inner: 3842.39 | bwd_allreduce: 7.91 | step: 23.38 4%|▍ | 1964/50750 [5:03:27<80:18:43, 5.93s/it] {'loss': 0.5985, 'learning_rate': 3.999207970207108e-05, 'epoch': 1.93} 4%|▍ | 1964/50750 [5:03:27<80:18:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:46:10,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:46:10,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.21 | bwd_microstep: 3850.69 | bwd_inner_microstep: 3843.08 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.47 [2024-11-13 21:46:10,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.19 | bwd: 3850.70 | bwd_inner: 3843.08 | bwd_allreduce: 7.58 | step: 21.48 4%|▍ | 1965/50750 [5:03:32<80:19:08, 5.93s/it] {'loss': 0.0059, 'learning_rate': 3.999204374401172e-05, 'epoch': 1.94} 4%|▍ | 1965/50750 
[5:03:32<80:19:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:46:16,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:46:16,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.99 | bwd_microstep: 3847.14 | bwd_inner_microstep: 3839.53 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.41 [2024-11-13 21:46:16,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3847.15 | bwd_inner: 3839.53 | bwd_allreduce: 7.57 | step: 21.41 4%|▍ | 1966/50750 [5:03:38<80:17:02, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.999200770452878e-05, 'epoch': 1.94} 4%|▍ | 1966/50750 [5:03:38<80:17:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:46:22,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:46:22,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.96 | bwd_microstep: 3847.99 | bwd_inner_microstep: 3840.23 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.46 [2024-11-13 21:46:22,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.96 | bwd: 3848.01 | bwd_inner: 3840.23 | bwd_allreduce: 7.74 | step: 21.47 4%|▍ | 1967/50750 [5:03:44<80:16:04, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.9991971583622404e-05, 'epoch': 1.94} 4%|▍ | 1967/50750 [5:03:44<80:16:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:46:28,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:46:28,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.10 | bwd_microstep: 3845.53 | bwd_inner_microstep: 3837.96 | 
bwd_allreduce_microstep: 7.52 | step_microstep: 21.42 [2024-11-13 21:46:28,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.10 | bwd: 3845.54 | bwd_inner: 3837.96 | bwd_allreduce: 7.54 | step: 21.42 4%|▍ | 1968/50750 [5:03:50<80:16:16, 5.92s/it] {'loss': 0.0068, 'learning_rate': 3.999193538129274e-05, 'epoch': 1.94} 4%|▍ | 1968/50750 [5:03:50<80:16:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:46:34,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:46:34,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.77 | bwd_microstep: 3845.66 | bwd_inner_microstep: 3838.16 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.12 [2024-11-13 21:46:34,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.75 | bwd: 3845.67 | bwd_inner: 3838.16 | bwd_allreduce: 7.47 | step: 21.12 4%|▍ | 1969/50750 [5:03:56<80:14:56, 5.92s/it] {'loss': 0.0112, 'learning_rate': 3.9991899097539947e-05, 'epoch': 1.94} 4%|▍ | 1969/50750 [5:03:56<80:14:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:46:40,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 21:46:40,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.83 | bwd_microstep: 3847.96 | bwd_inner_microstep: 3840.23 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.52 [2024-11-13 21:46:40,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.82 | bwd: 3847.97 | bwd_inner: 3840.23 | bwd_allreduce: 7.70 | step: 21.52 4%|▍ | 1970/50750 [5:04:02<80:14:17, 5.92s/it] {'loss': 0.1668, 'learning_rate': 3.999186273236415e-05, 'epoch': 1.94} 4%|▍ | 1970/50750 [5:04:02<80:14:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-13 21:46:46,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:46:46,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.28 | bwd_microstep: 3847.91 | bwd_inner_microstep: 3840.43 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.12 [2024-11-13 21:46:46,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.26 | bwd: 3847.92 | bwd_inner: 3840.42 | bwd_allreduce: 7.45 | step: 21.12 4%|▍ | 1971/50750 [5:04:08<80:14:03, 5.92s/it] {'loss': 0.2531, 'learning_rate': 3.9991826285765515e-05, 'epoch': 1.94} 4%|▍ | 1971/50750 [5:04:08<80:14:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:46:52,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:46:52,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.31 | bwd_microstep: 3850.08 | bwd_inner_microstep: 3842.46 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.41 [2024-11-13 21:46:52,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.30 | bwd: 3850.09 | bwd_inner: 3842.46 | bwd_allreduce: 7.59 | step: 21.41 4%|▍ | 1972/50750 [5:04:14<80:18:56, 5.93s/it] {'loss': 0.0066, 'learning_rate': 3.999178975774418e-05, 'epoch': 1.94} 4%|▍ | 1972/50750 [5:04:14<80:18:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:46:58,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.74 | optimizer_step: 4.93 [2024-11-13 21:46:58,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.79 | bwd_microstep: 3862.69 | bwd_inner_microstep: 3854.58 | bwd_allreduce_microstep: 8.04 | step_microstep: 27.32 [2024-11-13 21:46:58,393] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.79 | bwd: 3862.71 | bwd_inner: 3854.58 | bwd_allreduce: 8.07 | step: 27.31 4%|▍ | 1973/50750 [5:04:20<80:25:40, 5.94s/it] {'loss': 0.7677, 'learning_rate': 3.99917531483003e-05, 'epoch': 1.94} 4%|▍ | 1973/50750 [5:04:20<80:25:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-13 21:47:04,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.52 | optimizer_step: 4.94 [2024-11-13 21:47:04,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2038.03 | bwd_microstep: 3850.41 | bwd_inner_microstep: 3842.00 | bwd_allreduce_microstep: 8.35 | step_microstep: 28.70 [2024-11-13 21:47:04,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2038.02 | bwd: 3850.44 | bwd_inner: 3842.00 | bwd_allreduce: 8.38 | step: 28.71 4%|▍ | 1974/50750 [5:04:26<80:31:43, 5.94s/it] {'loss': 0.0042, 'learning_rate': 3.999171645743402e-05, 'epoch': 1.94} 4%|▍ | 1974/50750 [5:04:26<80:31:43, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:47:10,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:47:10,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.74 | bwd_microstep: 3854.14 | bwd_inner_microstep: 3846.26 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.07 [2024-11-13 21:47:10,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.73 | bwd: 3854.15 | bwd_inner: 3846.26 | bwd_allreduce: 7.85 | step: 22.08 4%|▍ | 1975/50750 [5:04:32<80:32:26, 5.94s/it] {'loss': 0.0369, 'learning_rate': 3.99916796851455e-05, 'epoch': 1.95} 4%|▍ | 1975/50750 [5:04:32<80:32:26, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:47:16,228] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.94 [2024-11-13 21:47:16,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.97 | bwd_microstep: 3847.71 | bwd_inner_microstep: 3839.91 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.95 [2024-11-13 21:47:16,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.97 | bwd: 3847.72 | bwd_inner: 3839.91 | bwd_allreduce: 7.77 | step: 22.96 4%|▍ | 1976/50750 [5:04:38<80:28:48, 5.94s/it] {'loss': 0.0253, 'learning_rate': 3.999164283143487e-05, 'epoch': 1.95} 4%|▍ | 1976/50750 [5:04:38<80:28:48, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:47:22,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:47:22,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3862.37 | bwd_inner_microstep: 3854.87 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.30 [2024-11-13 21:47:22,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.84 | bwd: 3862.39 | bwd_inner: 3854.87 | bwd_allreduce: 7.47 | step: 21.31 4%|▍ | 1977/50750 [5:04:44<80:30:36, 5.94s/it] {'loss': 0.0056, 'learning_rate': 3.99916058963023e-05, 'epoch': 1.95} 4%|▍ | 1977/50750 [5:04:44<80:30:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:47:28,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 21:47:28,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.94 | bwd_microstep: 3852.53 | bwd_inner_microstep: 3845.02 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.04 [2024-11-13 21:47:28,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.93 | bwd: 3852.54 | bwd_inner: 3845.02 | 
bwd_allreduce: 7.48 | step: 21.05 4%|▍ | 1978/50750 [5:04:50<80:28:15, 5.94s/it] {'loss': 0.5405, 'learning_rate': 3.9991568879747925e-05, 'epoch': 1.95} 4%|▍ | 1978/50750 [5:04:50<80:28:15, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:47:34,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:47:34,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.81 | bwd_microstep: 3859.57 | bwd_inner_microstep: 3852.05 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.04 [2024-11-13 21:47:34,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.81 | bwd: 3859.58 | bwd_inner: 3852.05 | bwd_allreduce: 7.49 | step: 21.05 4%|▍ | 1979/50750 [5:04:56<80:26:58, 5.94s/it] {'loss': 0.0004, 'learning_rate': 3.9991531781771915e-05, 'epoch': 1.95} 4%|▍ | 1979/50750 [5:04:56<80:26:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:47:39,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:47:39,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.36 | bwd_microstep: 3858.30 | bwd_inner_microstep: 3850.59 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.03 [2024-11-13 21:47:39,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.36 | bwd: 3858.31 | bwd_inner: 3850.59 | bwd_allreduce: 7.68 | step: 21.03 4%|▍ | 1980/50750 [5:05:01<80:25:32, 5.94s/it] {'loss': 0.0157, 'learning_rate': 3.9991494602374394e-05, 'epoch': 1.95} 4%|▍ | 1980/50750 [5:05:01<80:25:32, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:47:45,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 
[2024-11-13 21:47:45,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.04 | bwd_microstep: 3854.40 | bwd_inner_microstep: 3846.83 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.25 [2024-11-13 21:47:45,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.04 | bwd: 3854.41 | bwd_inner: 3846.83 | bwd_allreduce: 7.54 | step: 21.25 4%|▍ | 1981/50750 [5:05:07<80:23:37, 5.93s/it] {'loss': 0.005, 'learning_rate': 3.999145734155554e-05, 'epoch': 1.95} 4%|▍ | 1981/50750 [5:05:07<80:23:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:47:51,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 21:47:51,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.19 | bwd_microstep: 3856.90 | bwd_inner_microstep: 3849.34 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.66 [2024-11-13 21:47:51,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.19 | bwd: 3856.92 | bwd_inner: 3849.34 | bwd_allreduce: 7.53 | step: 21.66 4%|▍ | 1982/50750 [5:05:13<80:23:12, 5.93s/it] {'loss': 1.8141, 'learning_rate': 3.9991419999315495e-05, 'epoch': 1.95} 4%|▍ | 1982/50750 [5:05:13<80:23:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:47:57,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:47:57,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.50 | bwd_microstep: 3846.04 | bwd_inner_microstep: 3838.55 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.21 [2024-11-13 21:47:57,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.50 | bwd: 3846.06 | bwd_inner: 3838.55 | bwd_allreduce: 7.47 | step: 21.21 4%|▍ | 1983/50750 [5:05:19<80:20:04, 5.93s/it] {'loss': 0.0337, 
'learning_rate': 3.99913825756544e-05, 'epoch': 1.95} 4%|▍ | 1983/50750 [5:05:19<80:20:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:48:03,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:48:03,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.33 | bwd_microstep: 3846.96 | bwd_inner_microstep: 3839.15 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.34 [2024-11-13 21:48:03,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.33 | bwd: 3846.98 | bwd_inner: 3839.15 | bwd_allreduce: 7.79 | step: 21.35 4%|▍ | 1984/50750 [5:05:25<80:18:21, 5.93s/it] {'loss': 0.7238, 'learning_rate': 3.999134507057243e-05, 'epoch': 1.95} 4%|▍ | 1984/50750 [5:05:25<80:18:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:48:09,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:48:09,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.43 | bwd_microstep: 3852.50 | bwd_inner_microstep: 3845.01 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.99 [2024-11-13 21:48:09,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.41 | bwd: 3852.51 | bwd_inner: 3845.01 | bwd_allreduce: 7.46 | step: 21.02 4%|▍ | 1985/50750 [5:05:31<80:17:43, 5.93s/it] {'loss': 0.0112, 'learning_rate': 3.9991307484069704e-05, 'epoch': 1.96} 4%|▍ | 1985/50750 [5:05:31<80:17:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:48:15,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-13 21:48:15,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.92 | 
bwd_microstep: 3852.75 | bwd_inner_microstep: 3845.29 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.87 [2024-11-13 21:48:15,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | bwd: 3852.76 | bwd_inner: 3845.29 | bwd_allreduce: 7.44 | step: 20.87 4%|▍ | 1986/50750 [5:05:37<80:16:55, 5.93s/it] {'loss': 0.0047, 'learning_rate': 3.999126981614641e-05, 'epoch': 1.96} 4%|▍ | 1986/50750 [5:05:37<80:16:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:48:21,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 21:48:21,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.94 | bwd_microstep: 3867.41 | bwd_inner_microstep: 3859.88 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-13 21:48:21,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.94 | bwd: 3867.42 | bwd_inner: 3859.88 | bwd_allreduce: 7.50 | step: 21.11 4%|▍ | 1987/50750 [5:05:43<80:20:09, 5.93s/it] {'loss': 0.0022, 'learning_rate': 3.999123206680268e-05, 'epoch': 1.96} 4%|▍ | 1987/50750 [5:05:43<80:20:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:48:27,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:48:27,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.55 | bwd_microstep: 3847.02 | bwd_inner_microstep: 3839.49 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.97 [2024-11-13 21:48:27,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.55 | bwd: 3847.03 | bwd_inner: 3839.50 | bwd_allreduce: 7.49 | step: 20.97 4%|▍ | 1988/50750 [5:05:49<80:17:05, 5.93s/it] {'loss': 0.2845, 'learning_rate': 3.999119423603868e-05, 'epoch': 1.96} 4%|▍ | 1988/50750 [5:05:49<80:17:05, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:48:33,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:48:33,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | bwd_microstep: 3845.53 | bwd_inner_microstep: 3838.02 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.17 [2024-11-13 21:48:33,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.91 | bwd: 3845.54 | bwd_inner: 3838.02 | bwd_allreduce: 7.49 | step: 21.17 4%|▍ | 1989/50750 [5:05:55<80:14:53, 5.92s/it] {'loss': 0.0099, 'learning_rate': 3.999115632385456e-05, 'epoch': 1.96} 4%|▍ | 1989/50750 [5:05:55<80:14:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:48:39,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:48:39,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.12 | bwd_microstep: 3848.04 | bwd_inner_microstep: 3840.50 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.01 [2024-11-13 21:48:39,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.09 | bwd: 3848.05 | bwd_inner: 3840.50 | bwd_allreduce: 7.51 | step: 21.01 4%|▍ | 1990/50750 [5:06:01<80:15:57, 5.93s/it] {'loss': 0.0102, 'learning_rate': 3.999111833025047e-05, 'epoch': 1.96} 4%|▍ | 1990/50750 [5:06:01<80:15:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:48:45,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 21:48:45,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.09 | bwd_microstep: 3860.34 | bwd_inner_microstep: 3852.84 | bwd_allreduce_microstep: 7.45 | 
step_microstep: 21.39 [2024-11-13 21:48:45,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.09 | bwd: 3860.35 | bwd_inner: 3852.84 | bwd_allreduce: 7.47 | step: 21.39 4%|▍ | 1991/50750 [5:06:07<80:16:46, 5.93s/it] {'loss': 0.0103, 'learning_rate': 3.999108025522657e-05, 'epoch': 1.96} 4%|▍ | 1991/50750 [5:06:07<80:16:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:48:51,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 21:48:51,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.63 | bwd_microstep: 3845.70 | bwd_inner_microstep: 3838.15 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.07 [2024-11-13 21:48:51,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.63 | bwd: 3845.71 | bwd_inner: 3838.15 | bwd_allreduce: 7.53 | step: 21.07 4%|▍ | 1992/50750 [5:06:13<80:14:38, 5.92s/it] {'loss': 0.1311, 'learning_rate': 3.999104209878301e-05, 'epoch': 1.96} 4%|▍ | 1992/50750 [5:06:13<80:14:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:48:57,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:48:57,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.62 | bwd_microstep: 3851.12 | bwd_inner_microstep: 3843.58 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.19 [2024-11-13 21:48:57,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.62 | bwd: 3851.13 | bwd_inner: 3843.58 | bwd_allreduce: 7.51 | step: 21.20 4%|▍ | 1993/50750 [5:06:18<80:13:30, 5.92s/it] {'loss': 0.0153, 'learning_rate': 3.999100386091995e-05, 'epoch': 1.96} 4%|▍ | 1993/50750 [5:06:18<80:13:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
21:49:02,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:49:02,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.59 | bwd_microstep: 3844.89 | bwd_inner_microstep: 3837.37 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-13 21:49:02,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.58 | bwd: 3844.91 | bwd_inner: 3837.37 | bwd_allreduce: 7.49 | step: 21.11 4%|▍ | 1994/50750 [5:06:24<80:12:38, 5.92s/it] {'loss': 1.1819, 'learning_rate': 3.9990965541637544e-05, 'epoch': 1.96} 4%|▍ | 1994/50750 [5:06:24<80:12:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:49:08,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 5.21 [2024-11-13 21:49:08,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.05 | bwd_microstep: 3846.94 | bwd_inner_microstep: 3839.42 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.44 [2024-11-13 21:49:08,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.05 | bwd: 3846.95 | bwd_inner: 3839.42 | bwd_allreduce: 7.49 | step: 21.44 4%|▍ | 1995/50750 [5:06:30<80:11:48, 5.92s/it] {'loss': 0.0078, 'learning_rate': 3.9990927140935946e-05, 'epoch': 1.97} 4%|▍ | 1995/50750 [5:06:30<80:11:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:49:14,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.43 | optimizer_step: 4.93 [2024-11-13 21:49:14,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.52 | bwd_microstep: 3848.59 | bwd_inner_microstep: 3841.11 | bwd_allreduce_microstep: 7.43 | step_microstep: 27.94 [2024-11-13 21:49:14,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2033.52 | bwd: 3848.60 | bwd_inner: 3841.11 | bwd_allreduce: 7.45 | step: 27.94 4%|▍ | 1996/50750 [5:06:36<80:15:29, 5.93s/it] {'loss': 0.0156, 'learning_rate': 3.999088865881532e-05, 'epoch': 1.97} 4%|▍ | 1996/50750 [5:06:36<80:15:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:49:20,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:49:20,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.08 | bwd_microstep: 3854.38 | bwd_inner_microstep: 3846.90 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.93 [2024-11-13 21:49:20,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.07 | bwd: 3854.39 | bwd_inner: 3846.90 | bwd_allreduce: 7.45 | step: 20.93 4%|▍ | 1997/50750 [5:06:42<80:48:26, 5.97s/it] {'loss': 0.0043, 'learning_rate': 3.999085009527581e-05, 'epoch': 1.97} 4%|▍ | 1997/50750 [5:06:42<80:48:26, 5.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:49:26,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:49:26,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.59 | bwd_microstep: 3848.57 | bwd_inner_microstep: 3840.90 | bwd_allreduce_microstep: 7.63 | step_microstep: 20.87 [2024-11-13 21:49:26,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.59 | bwd: 3848.59 | bwd_inner: 3840.90 | bwd_allreduce: 7.65 | step: 20.88 4%|▍ | 1998/50750 [5:06:48<80:36:14, 5.95s/it] {'loss': 0.0031, 'learning_rate': 3.9990811450317586e-05, 'epoch': 1.97} 4%|▍ | 1998/50750 [5:06:48<80:36:14, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:49:32,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 5.06 [2024-11-13 21:49:32,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.31 | bwd_microstep: 3851.78 | bwd_inner_microstep: 3844.28 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.21 [2024-11-13 21:49:32,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.31 | bwd: 3851.79 | bwd_inner: 3844.28 | bwd_allreduce: 7.47 | step: 21.22 4%|▍ | 1999/50750 [5:06:54<80:28:15, 5.94s/it] {'loss': 0.202, 'learning_rate': 3.99907727239408e-05, 'epoch': 1.97} 4%|▍ | 1999/50750 [5:06:54<80:28:15, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:49:38,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:49:38,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.36 | bwd_microstep: 3845.42 | bwd_inner_microstep: 3837.93 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.80 [2024-11-13 21:49:38,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.36 | bwd: 3845.43 | bwd_inner: 3837.93 | bwd_allreduce: 7.47 | step: 20.80 4%|▍ | 2000/50750 [5:07:00<80:21:26, 5.93s/it] {'loss': 0.0017, 'learning_rate': 3.9990733916145605e-05, 'epoch': 1.97} 4%|▍ | 2000/50750 [5:07:00<80:21:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 21:49:44,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 21:49:44,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.25 | bwd_microstep: 3847.38 | bwd_inner_microstep: 3839.91 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.94 [2024-11-13 21:49:44,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.25 | bwd: 3847.39 | bwd_inner: 3839.91 | bwd_allreduce: 7.44 | step: 20.95 4%|▍ | 2001/50750 
[5:07:06<80:17:01, 5.93s/it] {'loss': 0.0051, 'learning_rate': 3.999069502693217e-05, 'epoch': 1.97} 4%|▍ | 2001/50750 [5:07:06<80:17:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:49:50,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:49:50,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.53 | bwd_microstep: 3843.29 | bwd_inner_microstep: 3835.81 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.83 [2024-11-13 21:49:50,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.53 | bwd: 3843.30 | bwd_inner: 3835.81 | bwd_allreduce: 7.45 | step: 20.84 4%|▍ | 2002/50750 [5:07:12<80:13:38, 5.92s/it] {'loss': 0.0071, 'learning_rate': 3.999065605630064e-05, 'epoch': 1.97} 4%|▍ | 2002/50750 [5:07:12<80:13:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:49:56,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 5.11 [2024-11-13 21:49:56,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.60 | bwd_microstep: 3844.75 | bwd_inner_microstep: 3837.29 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.13 [2024-11-13 21:49:56,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.60 | bwd: 3844.76 | bwd_inner: 3837.29 | bwd_allreduce: 7.44 | step: 21.13 4%|▍ | 2003/50750 [5:07:18<80:11:02, 5.92s/it] {'loss': 0.0028, 'learning_rate': 3.999061700425119e-05, 'epoch': 1.97} 4%|▍ | 2003/50750 [5:07:18<80:11:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:50:02,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 21:50:02,276] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2021.61 | bwd_microstep: 3848.82 | bwd_inner_microstep: 3840.75 | bwd_allreduce_microstep: 8.01 | step_microstep: 22.29 [2024-11-13 21:50:02,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.61 | bwd: 3848.84 | bwd_inner: 3840.75 | bwd_allreduce: 8.04 | step: 22.29 4%|▍ | 2004/50750 [5:07:24<80:10:52, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.999057787078396e-05, 'epoch': 1.97} 4%|▍ | 2004/50750 [5:07:24<80:10:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:50:08,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:50:08,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.17 | bwd_microstep: 3843.17 | bwd_inner_microstep: 3835.59 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.19 [2024-11-13 21:50:08,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.15 | bwd: 3843.18 | bwd_inner: 3835.59 | bwd_allreduce: 7.55 | step: 21.20 4%|▍ | 2005/50750 [5:07:30<80:12:52, 5.92s/it] {'loss': 0.013, 'learning_rate': 3.999053865589912e-05, 'epoch': 1.98} 4%|▍ | 2005/50750 [5:07:30<80:12:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:50:14,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.47 | optimizer_step: 4.93 [2024-11-13 21:50:14,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.23 | bwd_microstep: 3848.01 | bwd_inner_microstep: 3840.45 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.76 [2024-11-13 21:50:14,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.20 | bwd: 3848.02 | bwd_inner: 3840.45 | bwd_allreduce: 7.53 | step: 22.77 4%|▍ | 2006/50750 [5:07:36<80:12:31, 5.92s/it] {'loss': 0.0049, 'learning_rate': 3.999049935959684e-05, 'epoch': 1.98} 4%|▍ 
| 2006/50750 [5:07:36<80:12:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:50:20,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:50:20,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.35 | bwd_microstep: 3850.45 | bwd_inner_microstep: 3842.88 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.26 [2024-11-13 21:50:20,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.34 | bwd: 3850.46 | bwd_inner: 3842.88 | bwd_allreduce: 7.53 | step: 21.26 4%|▍ | 2007/50750 [5:07:42<80:15:30, 5.93s/it] {'loss': 0.0039, 'learning_rate': 3.9990459981877256e-05, 'epoch': 1.98} 4%|▍ | 2007/50750 [5:07:42<80:15:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:50:25,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.92 [2024-11-13 21:50:25,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.92 | bwd_microstep: 3849.34 | bwd_inner_microstep: 3841.57 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.50 [2024-11-13 21:50:25,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.91 | bwd: 3849.35 | bwd_inner: 3841.57 | bwd_allreduce: 7.74 | step: 21.50 4%|▍ | 2008/50750 [5:07:47<80:13:48, 5.93s/it] {'loss': 0.8534, 'learning_rate': 3.9990420522740546e-05, 'epoch': 1.98} 4%|▍ | 2008/50750 [5:07:47<80:13:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:50:31,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:50:31,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.60 | bwd_microstep: 3858.34 | bwd_inner_microstep: 3850.76 | 
bwd_allreduce_microstep: 7.54 | step_microstep: 21.41 [2024-11-13 21:50:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.59 | bwd: 3858.36 | bwd_inner: 3850.76 | bwd_allreduce: 7.56 | step: 21.42 4%|▍ | 2009/50750 [5:07:53<80:17:25, 5.93s/it] {'loss': 0.014, 'learning_rate': 3.999038098218687e-05, 'epoch': 1.98} 4%|▍ | 2009/50750 [5:07:53<80:17:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:50:37,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 21:50:37,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.63 | bwd_microstep: 3847.15 | bwd_inner_microstep: 3839.56 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.43 [2024-11-13 21:50:37,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3847.16 | bwd_inner: 3839.56 | bwd_allreduce: 7.56 | step: 21.43 4%|▍ | 2010/50750 [5:07:59<80:17:38, 5.93s/it] {'loss': 0.1621, 'learning_rate': 3.999034136021638e-05, 'epoch': 1.98} 4%|▍ | 2010/50750 [5:07:59<80:17:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:50:43,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 21:50:43,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.76 | bwd_microstep: 3851.49 | bwd_inner_microstep: 3843.74 | bwd_allreduce_microstep: 7.70 | step_microstep: 22.09 [2024-11-13 21:50:43,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.75 | bwd: 3851.50 | bwd_inner: 3843.74 | bwd_allreduce: 7.72 | step: 22.09 4%|▍ | 2011/50750 [5:08:05<80:17:03, 5.93s/it] {'loss': 0.2425, 'learning_rate': 3.999030165682924e-05, 'epoch': 1.98} 4%|▍ | 2011/50750 [5:08:05<80:17:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2199 [2024-11-13 21:50:49,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:50:49,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.93 | bwd_microstep: 3849.91 | bwd_inner_microstep: 3842.39 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.38 [2024-11-13 21:50:49,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.91 | bwd: 3849.93 | bwd_inner: 3842.39 | bwd_allreduce: 7.49 | step: 21.38 4%|▍ | 2012/50750 [5:08:11<80:17:04, 5.93s/it] {'loss': 0.121, 'learning_rate': 3.999026187202562e-05, 'epoch': 1.98} 4%|▍ | 2012/50750 [5:08:11<80:17:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:50:55,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 21:50:55,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.09 | bwd_microstep: 3847.96 | bwd_inner_microstep: 3840.44 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 21:50:55,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.09 | bwd: 3847.97 | bwd_inner: 3840.44 | bwd_allreduce: 7.49 | step: 20.97 4%|▍ | 2013/50750 [5:08:17<80:17:30, 5.93s/it] {'loss': 0.4947, 'learning_rate': 3.999022200580567e-05, 'epoch': 1.98} 4%|▍ | 2013/50750 [5:08:17<80:17:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 21:51:01,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 21:51:01,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.02 | bwd_microstep: 3848.89 | bwd_inner_microstep: 3841.15 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.90 [2024-11-13 21:51:01,582] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.02 | bwd: 3848.90 | bwd_inner: 3841.15 | bwd_allreduce: 7.72 | step: 21.90 4%|▍ | 2014/50750 [5:08:23<80:20:12, 5.93s/it] {'loss': 0.4967, 'learning_rate': 3.999018205816956e-05, 'epoch': 1.98} 4%|▍ | 2014/50750 [5:08:23<80:20:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:51:07,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 5.05 [2024-11-13 21:51:07,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.96 | bwd_microstep: 3844.69 | bwd_inner_microstep: 3837.04 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.39 [2024-11-13 21:51:07,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.94 | bwd: 3844.70 | bwd_inner: 3837.04 | bwd_allreduce: 7.61 | step: 21.39 4%|▍ | 2015/50750 [5:08:29<80:17:07, 5.93s/it] {'loss': 0.2617, 'learning_rate': 3.999014202911745e-05, 'epoch': 1.99} 4%|▍ | 2015/50750 [5:08:29<80:17:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:51:13,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.83 | optimizer_step: 4.93 [2024-11-13 21:51:13,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.65 | bwd_microstep: 3848.42 | bwd_inner_microstep: 3840.69 | bwd_allreduce_microstep: 7.68 | step_microstep: 23.72 [2024-11-13 21:51:13,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.64 | bwd: 3848.43 | bwd_inner: 3840.69 | bwd_allreduce: 7.69 | step: 23.74 4%|▍ | 2016/50750 [5:08:35<80:15:03, 5.93s/it] {'loss': 0.0058, 'learning_rate': 3.999010191864951e-05, 'epoch': 1.99} 4%|▍ | 2016/50750 [5:08:35<80:15:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:51:19,360] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 21:51:19,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.48 | bwd_microstep: 3848.76 | bwd_inner_microstep: 3840.56 | bwd_allreduce_microstep: 8.15 | step_microstep: 21.96 [2024-11-13 21:51:19,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.48 | bwd: 3848.77 | bwd_inner: 3840.56 | bwd_allreduce: 8.17 | step: 21.96 4%|▍ | 2017/50750 [5:08:41<80:13:56, 5.93s/it] {'loss': 0.303, 'learning_rate': 3.999006172676589e-05, 'epoch': 1.99} 4%|▍ | 2017/50750 [5:08:41<80:13:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:51:25,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.72 | optimizer_step: 4.93 [2024-11-13 21:51:25,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.48 | bwd_microstep: 3858.93 | bwd_inner_microstep: 3850.89 | bwd_allreduce_microstep: 7.98 | step_microstep: 28.96 [2024-11-13 21:51:25,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3858.95 | bwd_inner: 3850.89 | bwd_allreduce: 8.01 | step: 28.97 4%|▍ | 2018/50750 [5:08:47<80:17:53, 5.93s/it] {'loss': 0.4381, 'learning_rate': 3.999002145346677e-05, 'epoch': 1.99} 4%|▍ | 2018/50750 [5:08:47<80:17:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:51:31,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-13 21:51:31,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.03 | bwd_microstep: 3852.94 | bwd_inner_microstep: 3845.15 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.61 [2024-11-13 21:51:31,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.02 | bwd: 3852.95 | bwd_inner: 3845.15 | 
bwd_allreduce: 7.76 | step: 22.62 4%|▍ | 2019/50750 [5:08:53<80:16:34, 5.93s/it] {'loss': 0.5316, 'learning_rate': 3.99899810987523e-05, 'epoch': 1.99} 4%|▍ | 2019/50750 [5:08:53<80:16:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:51:37,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:51:37,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.67 | bwd_microstep: 3847.29 | bwd_inner_microstep: 3839.75 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.10 [2024-11-13 21:51:37,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.66 | bwd: 3847.30 | bwd_inner: 3839.75 | bwd_allreduce: 7.51 | step: 21.11 4%|▍ | 2020/50750 [5:08:59<80:13:37, 5.93s/it] {'loss': 0.1087, 'learning_rate': 3.998994066262266e-05, 'epoch': 1.99} 4%|▍ | 2020/50750 [5:08:59<80:13:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 21:51:43,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 21:51:43,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.98 | bwd_microstep: 3855.46 | bwd_inner_microstep: 3848.00 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.04 [2024-11-13 21:51:43,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3855.47 | bwd_inner: 3848.00 | bwd_allreduce: 7.43 | step: 21.04 4%|▍ | 2021/50750 [5:09:05<80:12:57, 5.93s/it] {'loss': 0.0636, 'learning_rate': 3.998990014507799e-05, 'epoch': 1.99} 4%|▍ | 2021/50750 [5:09:05<80:12:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:51:49,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 
[2024-11-13 21:51:49,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.01 | bwd_microstep: 3852.24 | bwd_inner_microstep: 3844.58 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.64 [2024-11-13 21:51:49,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.01 | bwd: 3852.25 | bwd_inner: 3844.58 | bwd_allreduce: 7.63 | step: 21.65 4%|▍ | 2022/50750 [5:09:10<80:12:53, 5.93s/it] {'loss': 0.0368, 'learning_rate': 3.998985954611848e-05, 'epoch': 1.99} 4%|▍ | 2022/50750 [5:09:10<80:12:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:51:54,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 21:51:54,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.94 | bwd_microstep: 3845.31 | bwd_inner_microstep: 3837.83 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-13 21:51:54,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.93 | bwd: 3845.32 | bwd_inner: 3837.83 | bwd_allreduce: 7.45 | step: 20.89 4%|▍ | 2023/50750 [5:09:16<80:11:22, 5.92s/it] {'loss': 0.0267, 'learning_rate': 3.9989818865744275e-05, 'epoch': 1.99} 4%|▍ | 2023/50750 [5:09:16<80:11:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 21:52:00,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 21:52:00,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.71 | bwd_microstep: 3847.09 | bwd_inner_microstep: 3839.60 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.97 [2024-11-13 21:52:00,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.71 | bwd: 3847.10 | bwd_inner: 3839.60 | bwd_allreduce: 7.46 | step: 20.97 4%|▍ | 2024/50750 [5:09:22<80:09:02, 5.92s/it] {'loss': 0.2326, 
'learning_rate': 3.9989778103955555e-05, 'epoch': 1.99} 4%|▍ | 2024/50750 [5:09:22<80:09:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 21:52:06,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 21:52:06,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.71 | bwd_microstep: 3849.70 | bwd_inner_microstep: 3842.20 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.58 [2024-11-13 21:52:06,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.71 | bwd: 3849.71 | bwd_inner: 3842.20 | bwd_allreduce: 7.47 | step: 21.58 4%|▍ | 2025/50750 [5:09:28<80:08:52, 5.92s/it] {'loss': 0.2601, 'learning_rate': 3.998973726075248e-05, 'epoch': 2.0} 4%|▍ | 2025/50750 [5:09:28<80:08:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:52:12,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 21:52:12,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.53 | bwd_microstep: 3848.30 | bwd_inner_microstep: 3840.83 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-13 21:52:12,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.53 | bwd: 3848.31 | bwd_inner: 3840.83 | bwd_allreduce: 7.45 | step: 20.87 4%|▍ | 2026/50750 [5:09:34<80:08:04, 5.92s/it] {'loss': 0.2353, 'learning_rate': 3.998969633613522e-05, 'epoch': 2.0} 4%|▍ | 2026/50750 [5:09:34<80:08:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 21:52:18,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 21:52:18,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.78 | 
bwd_microstep: 3848.88 | bwd_inner_microstep: 3841.32 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.57 [2024-11-13 21:52:18,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.78 | bwd: 3848.89 | bwd_inner: 3841.32 | bwd_allreduce: 7.53 | step: 21.58 4%|▍ | 2027/50750 [5:09:40<80:07:52, 5.92s/it] {'loss': 0.0075, 'learning_rate': 3.998965533010394e-05, 'epoch': 2.0} 4%|▍ | 2027/50750 [5:09:40<80:07:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 21:52:24,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 21:52:24,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.66 | bwd_microstep: 3848.95 | bwd_inner_microstep: 3841.31 | bwd_allreduce_microstep: 7.60 | step_microstep: 22.08 [2024-11-13 21:52:24,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.65 | bwd: 3848.97 | bwd_inner: 3841.31 | bwd_allreduce: 7.61 | step: 22.08 4%|▍ | 2028/50750 [5:09:46<80:10:33, 5.92s/it] {'loss': 0.8596, 'learning_rate': 3.9989614242658804e-05, 'epoch': 2.0} 4%|▍ | 2028/50750 [5:09:46<80:10:33, 5.92s/it]evaluate! 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B Results saved to qa_abcd_lora.csv Accuracy: 0.9192913385826772 New best accuracy: 0.9192913385826772. Saving model... [INFO|trainer.py:2936] 2024-11-13 22:27:36,124 >> Saving model checkpoint to work_dirs/QA2/qa_abcd_lora [INFO|configuration_utils.py:473] 2024-11-13 22:27:36,126 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/config.json [INFO|configuration_utils.py:594] 2024-11-13 22:27:36,126 >> Configuration saved in work_dirs/QA2/qa_abcd_lora/generation_config.json [INFO|modeling_utils.py:2501] 2024-11-13 22:28:19,512 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. 
You can find where each parameters has been saved in the index located at work_dirs/QA2/qa_abcd_lora/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2433] 2024-11-13 22:28:19,514 >> tokenizer config file saved in work_dirs/QA2/qa_abcd_lora/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2024-11-13 22:28:19,514 >> Special tokens file saved in work_dirs/QA2/qa_abcd_lora/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2024-11-13 22:28:19,514 >> added tokens file saved in work_dirs/QA2/qa_abcd_lora/added_tokens.json 11/13/2024 22:28:21 - INFO - __main__ - Saved LoRA weights to work_dirs/QA2/qa_abcd_lora/lora_weights.pth dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:28:26,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:28:26,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1992.82 | bwd_microstep: 3790.06 | bwd_inner_microstep: 3782.52 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.37 [2024-11-13 22:28:26,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1992.81 | bwd: 3790.08 | bwd_inner: 3782.52 | bwd_allreduce: 7.51 | step: 21.37 4%|▍ | 2029/50750 [5:45:48<8835:18:33, 652.84s/it] {'loss': 0.0227, 'learning_rate': 3.998957307379998e-05, 'epoch': 2.0} 4%|▍ | 2029/50750 [5:45:48<8835:18:33, 652.84s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. 
If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. dynamic ViT batch size: 16, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:28:31,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:28:31,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1002.80 | bwd_microstep: 1912.90 | bwd_inner_microstep: 1905.39 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.24 [2024-11-13 22:28:31,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1002.78 | bwd: 1912.91 | bwd_inner: 1905.39 | bwd_allreduce: 7.48 | step: 21.24 4%|▍ | 2030/50750 [5:45:53<6202:41:39, 458.33s/it] {'loss': 0.0874, 'learning_rate': 3.9989531823527646e-05, 'epoch': 2.0} 4%|▍ | 2030/50750 [5:45:53<6202:41:39, 458.33s/it][2024-11-13 22:28:34,030] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-13 22:28:39,288] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-13 22:28:44,539] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-13 22:28:49,479] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:29:07,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 22:29:07,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2007.31 | bwd_microstep: 3805.39 | bwd_inner_microstep: 3797.52 | bwd_allreduce_microstep: 7.81 | step_microstep: 22.15 [2024-11-13 22:29:07,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2007.29 | bwd: 3805.40 | bwd_inner: 3797.52 | bwd_allreduce: 
7.83 | step: 22.15 4%|▍ | 2031/50750 [5:46:29<4487:34:44, 331.60s/it] {'loss': 0.0986, 'learning_rate': 3.998949049184195e-05, 'epoch': 2.0} 4%|▍ | 2031/50750 [5:46:29<4487:34:44, 331.60s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:29:13,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:29:13,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2009.62 | bwd_microstep: 3838.44 | bwd_inner_microstep: 3830.85 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.08 [2024-11-13 22:29:13,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2009.61 | bwd: 3838.46 | bwd_inner: 3830.85 | bwd_allreduce: 7.57 | step: 22.08 4%|▍ | 2032/50750 [5:46:35<3165:12:37, 233.89s/it] {'loss': 0.0199, 'learning_rate': 3.9989449078743084e-05, 'epoch': 2.0} 4%|▍ | 2032/50750 [5:46:35<3165:12:37, 233.89s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:29:19,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:29:19,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2010.71 | bwd_microstep: 3829.55 | bwd_inner_microstep: 3822.05 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.48 [2024-11-13 22:29:19,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2010.69 | bwd: 3829.57 | bwd_inner: 3822.05 | bwd_allreduce: 7.48 | step: 21.48 4%|▍ | 2033/50750 [5:46:40<2239:31:47, 165.49s/it] {'loss': 0.2613, 'learning_rate': 3.99894075842312e-05, 'epoch': 2.0} 4%|▍ | 2033/50750 [5:46:40<2239:31:47, 165.49s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:29:24,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 
[2024-11-13 22:29:24,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2014.30 | bwd_microstep: 3832.09 | bwd_inner_microstep: 3824.62 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.32 [2024-11-13 22:29:24,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2014.30 | bwd: 3832.11 | bwd_inner: 3824.62 | bwd_allreduce: 7.45 | step: 21.32 4%|▍ | 2034/50750 [5:46:46<1591:34:16, 117.61s/it] {'loss': 0.021, 'learning_rate': 3.9989366008306467e-05, 'epoch': 2.0} 4%|▍ | 2034/50750 [5:46:46<1591:34:16, 117.61s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:29:30,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.04 [2024-11-13 22:29:30,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3845.02 | bwd_inner_microstep: 3837.47 | bwd_allreduce_microstep: 7.51 | step_microstep: 22.02 [2024-11-13 22:29:30,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.17 | bwd: 3845.04 | bwd_inner: 3837.47 | bwd_allreduce: 7.53 | step: 22.02 4%|▍ | 2035/50750 [5:46:52<1138:09:43, 84.11s/it] {'loss': 0.0089, 'learning_rate': 3.9989324350969056e-05, 'epoch': 2.0} 4%|▍ | 2035/50750 [5:46:52<1138:09:43, 84.11s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:29:36,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:29:36,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.75 | bwd_microstep: 3847.81 | bwd_inner_microstep: 3840.27 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.53 [2024-11-13 22:29:36,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.74 | bwd: 3847.82 | bwd_inner: 3840.27 | bwd_allreduce: 7.51 | step: 21.53 4%|▍ | 2036/50750 [5:46:58<820:45:30, 60.65s/it] 
{'loss': 0.036, 'learning_rate': 3.998928261221915e-05, 'epoch': 2.01} 4%|▍ | 2036/50750 [5:46:58<820:45:30, 60.65s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:29:42,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.45 | optimizer_step: 4.92 [2024-11-13 22:29:42,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.50 | bwd_microstep: 3858.05 | bwd_inner_microstep: 3849.97 | bwd_allreduce_microstep: 8.04 | step_microstep: 22.45 [2024-11-13 22:29:42,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.50 | bwd: 3858.06 | bwd_inner: 3849.97 | bwd_allreduce: 8.05 | step: 22.45 4%|▍ | 2037/50750 [5:47:04<598:37:21, 44.24s/it] {'loss': 0.0061, 'learning_rate': 3.99892407920569e-05, 'epoch': 2.01} 4%|▍ | 2037/50750 [5:47:04<598:37:21, 44.24s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:29:48,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:29:48,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.51 | bwd_microstep: 3854.81 | bwd_inner_microstep: 3847.28 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.70 [2024-11-13 22:29:48,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.50 | bwd: 3854.83 | bwd_inner: 3847.28 | bwd_allreduce: 7.51 | step: 21.70 4%|▍ | 2038/50750 [5:47:10<443:07:37, 32.75s/it] {'loss': 0.5558, 'learning_rate': 3.9989198890482486e-05, 'epoch': 2.01} 4%|▍ | 2038/50750 [5:47:10<443:07:37, 32.75s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:29:54,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:29:54,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2026.98 | bwd_microstep: 3856.76 | bwd_inner_microstep: 3849.20 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.65 [2024-11-13 22:29:54,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.96 | bwd: 3856.77 | bwd_inner: 3849.20 | bwd_allreduce: 7.54 | step: 21.65 4%|▍ | 2039/50750 [5:47:16<334:16:05, 24.70s/it] {'loss': 0.0341, 'learning_rate': 3.998915690749608e-05, 'epoch': 2.01} 4%|▍ | 2039/50750 [5:47:16<334:16:05, 24.70s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-13 22:30:00,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:30:00,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.47 | bwd_microstep: 3859.91 | bwd_inner_microstep: 3852.42 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.08 [2024-11-13 22:30:00,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.45 | bwd: 3859.92 | bwd_inner: 3852.42 | bwd_allreduce: 7.46 | step: 21.08 4%|▍ | 2040/50750 [5:47:22<258:05:18, 19.07s/it] {'loss': 0.0095, 'learning_rate': 3.9989114843097854e-05, 'epoch': 2.01} 4%|▍ | 2040/50750 [5:47:22<258:05:18, 19.07s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:30:06,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 22:30:06,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.38 | bwd_microstep: 3859.11 | bwd_inner_microstep: 3851.33 | bwd_allreduce_microstep: 7.73 | step_microstep: 23.88 [2024-11-13 22:30:06,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.38 | bwd: 3859.13 | bwd_inner: 3851.33 | bwd_allreduce: 7.75 | step: 23.88 4%|▍ | 2041/50750 [5:47:28<204:46:55, 15.14s/it] {'loss': 0.021, 'learning_rate': 3.998907269728797e-05, 'epoch': 2.01} 4%|▍ | 
2041/50750 [5:47:28<204:46:55, 15.14s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:30:12,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:30:12,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.11 | bwd_microstep: 3845.09 | bwd_inner_microstep: 3837.56 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.56 [2024-11-13 22:30:12,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.10 | bwd: 3845.10 | bwd_inner: 3837.56 | bwd_allreduce: 7.50 | step: 21.56 4%|▍ | 2042/50750 [5:47:34<167:23:38, 12.37s/it] {'loss': 0.031, 'learning_rate': 3.998903047006661e-05, 'epoch': 2.01} 4%|▍ | 2042/50750 [5:47:34<167:23:38, 12.37s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:30:18,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:30:18,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2016.47 | bwd_microstep: 3837.28 | bwd_inner_microstep: 3829.72 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.14 [2024-11-13 22:30:18,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2016.47 | bwd: 3837.29 | bwd_inner: 3829.72 | bwd_allreduce: 7.53 | step: 21.14 4%|▍ | 2043/50750 [5:47:40<141:07:44, 10.43s/it] {'loss': 0.003, 'learning_rate': 3.9988988161433936e-05, 'epoch': 2.01} 4%|▍ | 2043/50750 [5:47:40<141:07:44, 10.43s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:30:24,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:30:24,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.99 | bwd_microstep: 3842.73 | bwd_inner_microstep: 3835.16 | 
bwd_allreduce_microstep: 7.52 | step_microstep: 21.68 [2024-11-13 22:30:24,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.99 | bwd: 3842.74 | bwd_inner: 3835.16 | bwd_allreduce: 7.54 | step: 21.69 4%|▍ | 2044/50750 [5:47:46<122:47:21, 9.08s/it] {'loss': 0.001, 'learning_rate': 3.998894577139012e-05, 'epoch': 2.01} 4%|▍ | 2044/50750 [5:47:46<122:47:21, 9.08s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:30:30,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:30:30,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.11 | bwd_microstep: 3844.01 | bwd_inner_microstep: 3835.33 | bwd_allreduce_microstep: 8.59 | step_microstep: 21.01 [2024-11-13 22:30:30,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.10 | bwd: 3844.03 | bwd_inner: 3835.33 | bwd_allreduce: 8.63 | step: 21.00 4%|▍ | 2045/50750 [5:47:52<109:56:14, 8.13s/it] {'loss': 0.0002, 'learning_rate': 3.998890329993535e-05, 'epoch': 2.01} 4%|▍ | 2045/50750 [5:47:52<109:56:14, 8.13s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:30:36,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:30:36,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.24 | bwd_microstep: 3844.12 | bwd_inner_microstep: 3836.63 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.66 [2024-11-13 22:30:36,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.24 | bwd: 3844.13 | bwd_inner: 3836.63 | bwd_allreduce: 7.46 | step: 21.67 4%|▍ | 2046/50750 [5:47:57<100:56:33, 7.46s/it] {'loss': 0.0007, 'learning_rate': 3.9988860747069795e-05, 'epoch': 2.02} 4%|▍ | 2046/50750 [5:47:57<100:56:33, 7.46s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2198 [2024-11-13 22:30:41,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 22:30:41,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.83 | bwd_microstep: 3831.60 | bwd_inner_microstep: 3823.74 | bwd_allreduce_microstep: 7.81 | step_microstep: 22.26 [2024-11-13 22:30:41,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.83 | bwd: 3831.62 | bwd_inner: 3823.75 | bwd_allreduce: 7.83 | step: 22.26 4%|▍ | 2047/50750 [5:48:03<94:38:56, 7.00s/it] {'loss': 0.0128, 'learning_rate': 3.998881811279361e-05, 'epoch': 2.02} 4%|▍ | 2047/50750 [5:48:03<94:38:56, 7.00s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:30:47,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:30:47,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.05 | bwd_microstep: 3832.08 | bwd_inner_microstep: 3824.54 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.29 [2024-11-13 22:30:47,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.02 | bwd: 3832.09 | bwd_inner: 3824.54 | bwd_allreduce: 7.52 | step: 21.30 4%|▍ | 2048/50750 [5:48:09<90:14:12, 6.67s/it] {'loss': 0.0006, 'learning_rate': 3.9988775397106987e-05, 'epoch': 2.02} 4%|▍ | 2048/50750 [5:48:09<90:14:12, 6.67s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:30:53,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.11 [2024-11-13 22:30:53,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.52 | bwd_microstep: 3838.75 | bwd_inner_microstep: 3831.20 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.81 [2024-11-13 22:30:53,751] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.53 | bwd: 3838.77 | bwd_inner: 3831.20 | bwd_allreduce: 7.53 | step: 21.81 4%|▍ | 2049/50750 [5:48:15<87:09:49, 6.44s/it] {'loss': 0.0033, 'learning_rate': 3.998873260001009e-05, 'epoch': 2.02} 4%|▍ | 2049/50750 [5:48:15<87:09:49, 6.44s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:30:59,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 22:30:59,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.47 | bwd_microstep: 3839.64 | bwd_inner_microstep: 3832.12 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.31 [2024-11-13 22:30:59,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.47 | bwd: 3839.65 | bwd_inner: 3832.12 | bwd_allreduce: 7.49 | step: 21.31 4%|▍ | 2050/50750 [5:48:21<85:00:00, 6.28s/it] {'loss': 0.0876, 'learning_rate': 3.998868972150311e-05, 'epoch': 2.02} 4%|▍ | 2050/50750 [5:48:21<85:00:00, 6.28s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:31:05,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:31:05,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.09 | bwd_microstep: 3838.29 | bwd_inner_microstep: 3830.75 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.31 [2024-11-13 22:31:05,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.09 | bwd: 3838.31 | bwd_inner: 3830.75 | bwd_allreduce: 7.51 | step: 21.31 4%|▍ | 2051/50750 [5:48:27<83:29:46, 6.17s/it] {'loss': 0.0043, 'learning_rate': 3.998864676158619e-05, 'epoch': 2.02} 4%|▍ | 2051/50750 [5:48:27<83:29:46, 6.17s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:31:11,489] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:31:11,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.61 | bwd_microstep: 3841.55 | bwd_inner_microstep: 3833.99 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.80 [2024-11-13 22:31:11,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.61 | bwd: 3841.57 | bwd_inner: 3833.99 | bwd_allreduce: 7.54 | step: 21.80 4%|▍ | 2052/50750 [5:48:33<82:26:20, 6.09s/it] {'loss': 0.0022, 'learning_rate': 3.998860372025954e-05, 'epoch': 2.02} 4%|▍ | 2052/50750 [5:48:33<82:26:20, 6.09s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:31:17,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 22:31:17,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.14 | bwd_microstep: 3837.59 | bwd_inner_microstep: 3829.77 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.73 [2024-11-13 22:31:17,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.14 | bwd: 3837.60 | bwd_inner: 3829.77 | bwd_allreduce: 7.79 | step: 21.73 4%|▍ | 2053/50750 [5:48:39<81:42:14, 6.04s/it] {'loss': 0.6286, 'learning_rate': 3.9988560597523315e-05, 'epoch': 2.02} 4%|▍ | 2053/50750 [5:48:39<81:42:14, 6.04s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:31:23,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:31:23,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.83 | bwd_microstep: 3845.82 | bwd_inner_microstep: 3837.99 | bwd_allreduce_microstep: 7.77 | step_microstep: 22.65 [2024-11-13 22:31:23,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.82 | bwd: 3845.84 | bwd_inner: 3837.99 | 
bwd_allreduce: 7.79 | step: 22.65 4%|▍ | 2054/50750 [5:48:45<81:15:48, 6.01s/it] {'loss': 0.0006, 'learning_rate': 3.998851739337769e-05, 'epoch': 2.02} 4%|▍ | 2054/50750 [5:48:45<81:15:48, 6.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:31:29,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:31:29,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.33 | bwd_microstep: 3847.72 | bwd_inner_microstep: 3840.19 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.81 [2024-11-13 22:31:29,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.31 | bwd: 3847.73 | bwd_inner: 3840.18 | bwd_allreduce: 7.51 | step: 20.82 4%|▍ | 2055/50750 [5:48:51<80:56:11, 5.98s/it] {'loss': 0.0132, 'learning_rate': 3.9988474107822856e-05, 'epoch': 2.02} 4%|▍ | 2055/50750 [5:48:51<80:56:11, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:31:35,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 22:31:35,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.48 | bwd_microstep: 3839.16 | bwd_inner_microstep: 3831.63 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.96 [2024-11-13 22:31:35,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.46 | bwd: 3839.17 | bwd_inner: 3831.63 | bwd_allreduce: 7.51 | step: 20.97 4%|▍ | 2056/50750 [5:48:57<80:40:16, 5.96s/it] {'loss': 0.0008, 'learning_rate': 3.998843074085897e-05, 'epoch': 2.03} 4%|▍ | 2056/50750 [5:48:57<80:40:16, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:31:41,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 
[2024-11-13 22:31:41,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.20 | bwd_microstep: 3843.98 | bwd_inner_microstep: 3836.42 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.18 [2024-11-13 22:31:41,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.20 | bwd: 3844.00 | bwd_inner: 3836.42 | bwd_allreduce: 7.53 | step: 21.18 4%|▍ | 2057/50750 [5:49:03<80:30:21, 5.95s/it] {'loss': 0.006, 'learning_rate': 3.998838729248622e-05, 'epoch': 2.03} 4%|▍ | 2057/50750 [5:49:03<80:30:21, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:31:47,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 22:31:47,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.51 | bwd_microstep: 3852.64 | bwd_inner_microstep: 3844.95 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.78 [2024-11-13 22:31:47,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.51 | bwd: 3852.65 | bwd_inner: 3844.95 | bwd_allreduce: 7.67 | step: 21.78 4%|▍ | 2058/50750 [5:49:08<80:24:19, 5.94s/it] {'loss': 0.0005, 'learning_rate': 3.998834376270478e-05, 'epoch': 2.03} 4%|▍ | 2058/50750 [5:49:08<80:24:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:31:52,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:31:52,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.47 | bwd_microstep: 3849.85 | bwd_inner_microstep: 3842.20 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.13 [2024-11-13 22:31:52,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.46 | bwd: 3849.87 | bwd_inner: 3842.20 | bwd_allreduce: 7.63 | step: 21.13 4%|▍ | 2059/50750 [5:49:14<80:21:24, 5.94s/it] {'loss': 0.0175, 
'learning_rate': 3.998830015151483e-05, 'epoch': 2.03} 4%|▍ | 2059/50750 [5:49:14<80:21:24, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:31:58,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.97 [2024-11-13 22:31:58,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.62 | bwd_microstep: 3853.36 | bwd_inner_microstep: 3845.39 | bwd_allreduce_microstep: 7.91 | step_microstep: 30.21 [2024-11-13 22:31:58,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.62 | bwd: 3853.38 | bwd_inner: 3845.39 | bwd_allreduce: 7.93 | step: 30.21 4%|▍ | 2060/50750 [5:49:20<80:22:54, 5.94s/it] {'loss': 0.0001, 'learning_rate': 3.998825645891655e-05, 'epoch': 2.03} 4%|▍ | 2060/50750 [5:49:20<80:22:54, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-13 22:32:04,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:32:04,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.60 | bwd_microstep: 3861.40 | bwd_inner_microstep: 3852.25 | bwd_allreduce_microstep: 9.07 | step_microstep: 23.02 [2024-11-13 22:32:04,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.59 | bwd: 3861.43 | bwd_inner: 3852.25 | bwd_allreduce: 9.11 | step: 23.01 4%|▍ | 2061/50750 [5:49:26<80:22:41, 5.94s/it] {'loss': 0.9036, 'learning_rate': 3.9988212684910107e-05, 'epoch': 2.03} 4%|▍ | 2061/50750 [5:49:26<80:22:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:32:10,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:32:10,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.55 | 
bwd_microstep: 3846.58 | bwd_inner_microstep: 3839.09 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.98 [2024-11-13 22:32:10,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.55 | bwd: 3846.59 | bwd_inner: 3839.09 | bwd_allreduce: 7.46 | step: 20.98 4%|▍ | 2062/50750 [5:49:32<80:16:36, 5.94s/it] {'loss': 0.0003, 'learning_rate': 3.998816882949569e-05, 'epoch': 2.03} 4%|▍ | 2062/50750 [5:49:32<80:16:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2194 [2024-11-13 22:32:16,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:32:16,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.47 | bwd_microstep: 3846.20 | bwd_inner_microstep: 3838.63 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.18 [2024-11-13 22:32:16,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.47 | bwd: 3846.21 | bwd_inner: 3838.63 | bwd_allreduce: 7.54 | step: 21.18 4%|▍ | 2063/50750 [5:49:38<80:11:37, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.998812489267348e-05, 'epoch': 2.03} 4%|▍ | 2063/50750 [5:49:38<80:11:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:32:22,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:32:22,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.90 | bwd_microstep: 3847.19 | bwd_inner_microstep: 3839.67 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.08 [2024-11-13 22:32:22,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.90 | bwd: 3847.20 | bwd_inner: 3839.67 | bwd_allreduce: 7.50 | step: 21.08 4%|▍ | 2064/50750 [5:49:44<80:08:34, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.9988080874443634e-05, 'epoch': 2.03} 4%|▍ | 2064/50750 [5:49:44<80:08:34, 5.93s/it]dynamic 
ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:32:28,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 22:32:28,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.10 | bwd_microstep: 3858.18 | bwd_inner_microstep: 3850.64 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.51 [2024-11-13 22:32:28,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.09 | bwd: 3858.19 | bwd_inner: 3850.63 | bwd_allreduce: 7.51 | step: 21.51 4%|▍ | 2065/50750 [5:49:50<80:09:40, 5.93s/it] {'loss': 0.4112, 'learning_rate': 3.998803677480636e-05, 'epoch': 2.03} 4%|▍ | 2065/50750 [5:49:50<80:09:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:32:34,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:32:34,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.65 | bwd_microstep: 3846.14 | bwd_inner_microstep: 3838.61 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.03 [2024-11-13 22:32:34,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.65 | bwd: 3846.15 | bwd_inner: 3838.61 | bwd_allreduce: 7.50 | step: 21.03 4%|▍ | 2066/50750 [5:49:56<80:07:51, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.998799259376182e-05, 'epoch': 2.04} 4%|▍ | 2066/50750 [5:49:56<80:07:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:32:40,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 22:32:40,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.44 | bwd_microstep: 3852.68 | bwd_inner_microstep: 3845.18 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.53 
[2024-11-13 22:32:40,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.44 | bwd: 3852.69 | bwd_inner: 3845.18 | bwd_allreduce: 7.47 | step: 21.54 4%|▍ | 2067/50750 [5:50:02<80:08:04, 5.93s/it] {'loss': 0.2129, 'learning_rate': 3.99879483313102e-05, 'epoch': 2.04} 4%|▍ | 2067/50750 [5:50:02<80:08:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:32:46,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:32:46,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.68 | bwd_microstep: 3856.91 | bwd_inner_microstep: 3849.28 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.30 [2024-11-13 22:32:46,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.67 | bwd: 3856.92 | bwd_inner: 3849.28 | bwd_allreduce: 7.60 | step: 21.32 4%|▍ | 2068/50750 [5:50:08<80:09:02, 5.93s/it] {'loss': 0.0054, 'learning_rate': 3.998790398745167e-05, 'epoch': 2.04} 4%|▍ | 2068/50750 [5:50:08<80:09:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:32:52,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:32:52,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.69 | bwd_microstep: 3850.02 | bwd_inner_microstep: 3842.39 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.44 [2024-11-13 22:32:52,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.69 | bwd: 3850.03 | bwd_inner: 3842.39 | bwd_allreduce: 7.60 | step: 21.44 4%|▍ | 2069/50750 [5:50:14<80:07:56, 5.93s/it] {'loss': 0.0042, 'learning_rate': 3.998785956218643e-05, 'epoch': 2.04} 4%|▍ | 2069/50750 [5:50:14<80:07:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:32:58,167] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:32:58,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.34 | bwd_microstep: 3850.91 | bwd_inner_microstep: 3843.18 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.73 [2024-11-13 22:32:58,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.32 | bwd: 3850.93 | bwd_inner: 3843.18 | bwd_allreduce: 7.70 | step: 22.73 4%|▍ | 2070/50750 [5:50:20<80:08:57, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.9987815055514646e-05, 'epoch': 2.04} 4%|▍ | 2070/50750 [5:50:20<80:08:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:33:04,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:33:04,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.76 | bwd_microstep: 3854.42 | bwd_inner_microstep: 3846.91 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.99 [2024-11-13 22:33:04,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.76 | bwd: 3854.43 | bwd_inner: 3846.91 | bwd_allreduce: 7.48 | step: 21.00 4%|▍ | 2071/50750 [5:50:26<80:08:34, 5.93s/it] {'loss': 0.0016, 'learning_rate': 3.99877704674365e-05, 'epoch': 2.04} 4%|▍ | 2071/50750 [5:50:26<80:08:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:33:10,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:33:10,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.03 | bwd_microstep: 3855.42 | bwd_inner_microstep: 3847.54 | bwd_allreduce_microstep: 7.81 | step_microstep: 22.70 [2024-11-13 22:33:10,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.03 | bwd: 3855.44 
| bwd_inner: 3847.54 | bwd_allreduce: 7.84 | step: 22.70 4%|▍ | 2072/50750 [5:50:31<80:10:03, 5.93s/it] {'loss': 0.0103, 'learning_rate': 3.9987725797952184e-05, 'epoch': 2.04} 4%|▍ | 2072/50750 [5:50:31<80:10:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:33:15,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:33:15,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.60 | bwd_microstep: 3843.61 | bwd_inner_microstep: 3836.15 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.66 [2024-11-13 22:33:15,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.58 | bwd: 3843.63 | bwd_inner: 3836.15 | bwd_allreduce: 7.44 | step: 20.66 4%|▍ | 2073/50750 [5:50:37<80:08:46, 5.93s/it] {'loss': 0.0073, 'learning_rate': 3.998768104706187e-05, 'epoch': 2.04} 4%|▍ | 2073/50750 [5:50:37<80:08:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:33:21,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:33:21,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.43 | bwd_microstep: 3841.45 | bwd_inner_microstep: 3833.90 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.17 [2024-11-13 22:33:21,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.43 | bwd: 3841.46 | bwd_inner: 3833.90 | bwd_allreduce: 7.52 | step: 21.18 4%|▍ | 2074/50750 [5:50:43<80:06:05, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.998763621476574e-05, 'epoch': 2.04} 4%|▍ | 2074/50750 [5:50:43<80:06:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2195 [2024-11-13 22:33:27,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | 
optimizer_step: 4.93 [2024-11-13 22:33:27,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.45 | bwd_microstep: 3847.70 | bwd_inner_microstep: 3840.23 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.21 [2024-11-13 22:33:27,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.45 | bwd: 3847.71 | bwd_inner: 3840.23 | bwd_allreduce: 7.44 | step: 21.21 4%|▍ | 2075/50750 [5:50:49<80:05:41, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.998759130106399e-05, 'epoch': 2.04} 4%|▍ | 2075/50750 [5:50:49<80:05:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:33:33,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:33:33,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.40 | bwd_microstep: 3852.28 | bwd_inner_microstep: 3844.79 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.92 [2024-11-13 22:33:33,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.40 | bwd: 3852.29 | bwd_inner: 3844.79 | bwd_allreduce: 7.46 | step: 20.93 4%|▍ | 2076/50750 [5:50:55<80:07:37, 5.93s/it] {'loss': 0.2435, 'learning_rate': 3.998754630595678e-05, 'epoch': 2.05} 4%|▍ | 2076/50750 [5:50:55<80:07:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:33:39,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 22:33:39,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.69 | bwd_microstep: 3853.77 | bwd_inner_microstep: 3845.89 | bwd_allreduce_microstep: 7.84 | step_microstep: 21.43 [2024-11-13 22:33:39,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.69 | bwd: 3853.79 | bwd_inner: 3845.89 | bwd_allreduce: 7.86 | step: 21.43 4%|▍ | 2077/50750 [5:51:01<80:08:49, 
5.93s/it] {'loss': 0.1696, 'learning_rate': 3.998750122944431e-05, 'epoch': 2.05} 4%|▍ | 2077/50750 [5:51:01<80:08:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:33:45,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:33:45,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.79 | bwd_microstep: 3843.70 | bwd_inner_microstep: 3835.95 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.11 [2024-11-13 22:33:45,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.78 | bwd: 3843.72 | bwd_inner: 3835.95 | bwd_allreduce: 7.73 | step: 21.11 4%|▍ | 2078/50750 [5:51:07<80:07:12, 5.93s/it] {'loss': 0.0055, 'learning_rate': 3.9987456071526765e-05, 'epoch': 2.05} 4%|▍ | 2078/50750 [5:51:07<80:07:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:33:51,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.35 | optimizer_step: 4.93 [2024-11-13 22:33:51,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.89 | bwd_microstep: 3856.11 | bwd_inner_microstep: 3848.27 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.42 [2024-11-13 22:33:51,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.89 | bwd: 3856.12 | bwd_inner: 3848.27 | bwd_allreduce: 7.81 | step: 22.42 4%|▍ | 2079/50750 [5:51:13<80:11:17, 5.93s/it] {'loss': 0.0528, 'learning_rate': 3.998741083220432e-05, 'epoch': 2.05} 4%|▍ | 2079/50750 [5:51:13<80:11:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:33:57,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:33:57,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2033.38 | bwd_microstep: 3840.54 | bwd_inner_microstep: 3833.06 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.69 [2024-11-13 22:33:57,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.37 | bwd: 3840.55 | bwd_inner: 3833.05 | bwd_allreduce: 7.46 | step: 20.70 4%|▍ | 2080/50750 [5:51:19<80:09:54, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9987365511477166e-05, 'epoch': 2.05} 4%|▍ | 2080/50750 [5:51:19<80:09:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:34:03,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:34:03,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.73 | bwd_microstep: 3850.97 | bwd_inner_microstep: 3843.41 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.47 [2024-11-13 22:34:03,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.73 | bwd: 3850.98 | bwd_inner: 3843.41 | bwd_allreduce: 7.53 | step: 21.47 4%|▍ | 2081/50750 [5:51:25<80:10:50, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.998732010934548e-05, 'epoch': 2.05} 4%|▍ | 2081/50750 [5:51:25<80:10:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:34:09,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:34:09,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.84 | bwd_microstep: 3855.64 | bwd_inner_microstep: 3847.74 | bwd_allreduce_microstep: 7.86 | step_microstep: 20.92 [2024-11-13 22:34:09,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.83 | bwd: 3855.65 | bwd_inner: 3847.74 | bwd_allreduce: 7.87 | step: 20.92 4%|▍ | 2082/50750 [5:51:31<80:10:21, 5.93s/it] {'loss': 0.0792, 'learning_rate': 3.9987274625809454e-05, 'epoch': 2.05} 4%|▍ | 2082/50750 
[5:51:31<80:10:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 22:34:15,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:34:15,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.45 | bwd_microstep: 3852.03 | bwd_inner_microstep: 3844.53 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.03 [2024-11-13 22:34:15,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.44 | bwd: 3852.04 | bwd_inner: 3844.53 | bwd_allreduce: 7.47 | step: 21.04 4%|▍ | 2083/50750 [5:51:37<80:11:40, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.9987229060869266e-05, 'epoch': 2.05} 4%|▍ | 2083/50750 [5:51:37<80:11:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:34:21,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:34:21,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.65 | bwd_microstep: 3855.38 | bwd_inner_microstep: 3847.82 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.32 [2024-11-13 22:34:21,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.64 | bwd: 3855.40 | bwd_inner: 3847.82 | bwd_allreduce: 7.54 | step: 21.33 4%|▍ | 2084/50750 [5:51:43<80:13:11, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.998718341452511e-05, 'epoch': 2.05} 4%|▍ | 2084/50750 [5:51:43<80:13:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:34:27,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:34:27,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3845.40 | bwd_inner_microstep: 3837.83 | 
bwd_allreduce_microstep: 7.52 | step_microstep: 21.39 [2024-11-13 22:34:27,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.22 | bwd: 3845.41 | bwd_inner: 3837.83 | bwd_allreduce: 7.54 | step: 21.39 4%|▍ | 2085/50750 [5:51:49<80:10:33, 5.93s/it] {'loss': 0.0174, 'learning_rate': 3.998713768677717e-05, 'epoch': 2.05} 4%|▍ | 2085/50750 [5:51:49<80:10:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:34:33,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:34:33,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3854.38 | bwd_inner_microstep: 3846.70 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.92 [2024-11-13 22:34:33,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.29 | bwd: 3854.39 | bwd_inner: 3846.70 | bwd_allreduce: 7.65 | step: 21.92 4%|▍ | 2086/50750 [5:51:55<80:11:00, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.998709187762563e-05, 'epoch': 2.06} 4%|▍ | 2086/50750 [5:51:55<80:11:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:34:38,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:34:38,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.59 | bwd_microstep: 3856.31 | bwd_inner_microstep: 3848.62 | bwd_allreduce_microstep: 7.64 | step_microstep: 23.99 [2024-11-13 22:34:38,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.58 | bwd: 3856.32 | bwd_inner: 3848.62 | bwd_allreduce: 7.66 | step: 23.99 4%|▍ | 2087/50750 [5:52:00<80:12:48, 5.93s/it] {'loss': 0.0056, 'learning_rate': 3.9987045987070667e-05, 'epoch': 2.06} 4%|▍ | 2087/50750 [5:52:00<80:12:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2199 [2024-11-13 22:34:44,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 22:34:44,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.98 | bwd_microstep: 3839.59 | bwd_inner_microstep: 3831.84 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.31 [2024-11-13 22:34:44,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3839.60 | bwd_inner: 3831.84 | bwd_allreduce: 7.73 | step: 22.32 4%|▍ | 2088/50750 [5:52:06<80:09:44, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9987000015112486e-05, 'epoch': 2.06} 4%|▍ | 2088/50750 [5:52:06<80:09:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:34:50,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 5.00 [2024-11-13 22:34:50,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.62 | bwd_microstep: 3843.53 | bwd_inner_microstep: 3835.29 | bwd_allreduce_microstep: 8.19 | step_microstep: 21.72 [2024-11-13 22:34:50,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.61 | bwd: 3843.54 | bwd_inner: 3835.29 | bwd_allreduce: 8.21 | step: 21.73 4%|▍ | 2089/50750 [5:52:12<80:07:24, 5.93s/it] {'loss': 0.0068, 'learning_rate': 3.998695396175127e-05, 'epoch': 2.06} 4%|▍ | 2089/50750 [5:52:12<80:07:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:34:56,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:34:56,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.52 | bwd_microstep: 3840.24 | bwd_inner_microstep: 3832.43 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.60 [2024-11-13 22:34:56,743] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.50 | bwd: 3840.25 | bwd_inner: 3832.43 | bwd_allreduce: 7.78 | step: 21.61 4%|▍ | 2090/50750 [5:52:18<80:05:11, 5.93s/it] {'loss': 0.0057, 'learning_rate': 3.9986907826987195e-05, 'epoch': 2.06} 4%|▍ | 2090/50750 [5:52:18<80:05:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:35:02,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.93 [2024-11-13 22:35:02,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.07 | bwd_microstep: 3850.80 | bwd_inner_microstep: 3842.92 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.13 [2024-11-13 22:35:02,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.07 | bwd: 3850.81 | bwd_inner: 3842.92 | bwd_allreduce: 7.85 | step: 22.13 4%|▍ | 2091/50750 [5:52:24<80:06:40, 5.93s/it] {'loss': 0.2693, 'learning_rate': 3.998686161082046e-05, 'epoch': 2.06} 4%|▍ | 2091/50750 [5:52:24<80:06:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:35:08,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 22:35:08,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.72 | bwd_microstep: 3842.96 | bwd_inner_microstep: 3835.41 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.48 [2024-11-13 22:35:08,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.69 | bwd: 3842.97 | bwd_inner: 3835.41 | bwd_allreduce: 7.52 | step: 21.49 4%|▍ | 2092/50750 [5:52:30<80:07:35, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.998681531325125e-05, 'epoch': 2.06} 4%|▍ | 2092/50750 [5:52:30<80:07:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:35:14,533] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:35:14,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.87 | bwd_microstep: 3851.65 | bwd_inner_microstep: 3844.14 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.74 [2024-11-13 22:35:14,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.85 | bwd: 3851.66 | bwd_inner: 3844.14 | bwd_allreduce: 7.48 | step: 20.75 4%|▍ | 2093/50750 [5:52:36<80:07:09, 5.93s/it] {'loss': 0.1072, 'learning_rate': 3.9986768934279754e-05, 'epoch': 2.06} 4%|▍ | 2093/50750 [5:52:36<80:07:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:35:20,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:35:20,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.66 | bwd_microstep: 3857.07 | bwd_inner_microstep: 3848.92 | bwd_allreduce_microstep: 8.08 | step_microstep: 24.60 [2024-11-13 22:35:20,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.63 | bwd: 3857.09 | bwd_inner: 3848.92 | bwd_allreduce: 8.11 | step: 24.60 4%|▍ | 2094/50750 [5:52:42<80:10:06, 5.93s/it] {'loss': 0.0112, 'learning_rate': 3.998672247390616e-05, 'epoch': 2.06} 4%|▍ | 2094/50750 [5:52:42<80:10:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:35:26,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.92 [2024-11-13 22:35:26,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.93 | bwd_microstep: 3850.36 | bwd_inner_microstep: 3842.53 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.03 [2024-11-13 22:35:26,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.91 | bwd: 3850.37 | bwd_inner: 3842.53 | 
bwd_allreduce: 7.80 | step: 22.03 4%|▍ | 2095/50750 [5:52:48<80:10:25, 5.93s/it] {'loss': 0.3815, 'learning_rate': 3.9986675932130654e-05, 'epoch': 2.06} 4%|▍ | 2095/50750 [5:52:48<80:10:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:35:32,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 22:35:32,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.91 | bwd_microstep: 3854.58 | bwd_inner_microstep: 3846.81 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.92 [2024-11-13 22:35:32,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.90 | bwd: 3854.59 | bwd_inner: 3846.81 | bwd_allreduce: 7.74 | step: 21.93 4%|▍ | 2096/50750 [5:52:54<80:12:42, 5.94s/it] {'loss': 0.0008, 'learning_rate': 3.998662930895343e-05, 'epoch': 2.07} 4%|▍ | 2096/50750 [5:52:54<80:12:42, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 22:35:38,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 22:35:38,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.29 | bwd_microstep: 3861.35 | bwd_inner_microstep: 3853.79 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.21 [2024-11-13 22:35:38,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.27 | bwd: 3861.36 | bwd_inner: 3853.79 | bwd_allreduce: 7.53 | step: 21.21 4%|▍ | 2097/50750 [5:53:00<80:14:46, 5.94s/it] {'loss': 0.0004, 'learning_rate': 3.998658260437468e-05, 'epoch': 2.07} 4%|▍ | 2097/50750 [5:53:00<80:14:46, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:35:44,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 
[2024-11-13 22:35:44,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.62 | bwd_microstep: 3851.99 | bwd_inner_microstep: 3844.50 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.50 [2024-11-13 22:35:44,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.60 | bwd: 3852.00 | bwd_inner: 3844.50 | bwd_allreduce: 7.47 | step: 21.50 4%|▍ | 2098/50750 [5:53:06<80:13:18, 5.94s/it] {'loss': 0.9695, 'learning_rate': 3.998653581839459e-05, 'epoch': 2.07} 4%|▍ | 2098/50750 [5:53:06<80:13:18, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:35:50,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:35:50,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.94 | bwd_microstep: 3851.46 | bwd_inner_microstep: 3843.95 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.05 [2024-11-13 22:35:50,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.93 | bwd: 3851.47 | bwd_inner: 3843.95 | bwd_allreduce: 7.48 | step: 21.05 4%|▍ | 2099/50750 [5:53:12<80:10:53, 5.93s/it] {'loss': 0.2947, 'learning_rate': 3.998648895101335e-05, 'epoch': 2.07} 4%|▍ | 2099/50750 [5:53:12<80:10:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:35:55,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:35:55,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1993.32 | bwd_microstep: 3781.32 | bwd_inner_microstep: 3773.78 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.75 [2024-11-13 22:35:55,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1993.32 | bwd: 3781.33 | bwd_inner: 3773.78 | bwd_allreduce: 7.52 | step: 21.75 4%|▍ | 2100/50750 [5:53:17<79:43:56, 5.90s/it] {'loss': 0.0046, 
'learning_rate': 3.9986442002231155e-05, 'epoch': 2.07} 4%|▍ | 2100/50750 [5:53:17<79:43:56, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:36:01,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:36:01,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3848.23 | bwd_inner_microstep: 3840.70 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-13 22:36:01,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3848.24 | bwd_inner: 3840.70 | bwd_allreduce: 7.49 | step: 21.07 4%|▍ | 2101/50750 [5:53:23<79:48:32, 5.91s/it] {'loss': 0.1159, 'learning_rate': 3.998639497204819e-05, 'epoch': 2.07} 4%|▍ | 2101/50750 [5:53:23<79:48:32, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:36:07,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:36:07,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.80 | bwd_microstep: 3841.11 | bwd_inner_microstep: 3833.59 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.02 [2024-11-13 22:36:07,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.80 | bwd: 3841.12 | bwd_inner: 3833.59 | bwd_allreduce: 7.49 | step: 21.02 4%|▍ | 2102/50750 [5:53:29<79:50:02, 5.91s/it] {'loss': 0.0003, 'learning_rate': 3.998634786046465e-05, 'epoch': 2.07} 4%|▍ | 2102/50750 [5:53:29<79:50:02, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:36:13,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:36:13,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.32 | 
bwd_microstep: 3845.09 | bwd_inner_microstep: 3837.52 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.28 [2024-11-13 22:36:13,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.32 | bwd: 3845.10 | bwd_inner: 3837.52 | bwd_allreduce: 7.54 | step: 21.29 4%|▍ | 2103/50750 [5:53:35<79:52:42, 5.91s/it] {'loss': 0.0021, 'learning_rate': 3.9986300667480726e-05, 'epoch': 2.07} 4%|▍ | 2103/50750 [5:53:35<79:52:42, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:36:19,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 22:36:19,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.92 | bwd_microstep: 3850.60 | bwd_inner_microstep: 3843.05 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.52 [2024-11-13 22:36:19,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.91 | bwd: 3850.61 | bwd_inner: 3843.05 | bwd_allreduce: 7.52 | step: 21.54 4%|▍ | 2104/50750 [5:53:41<79:57:46, 5.92s/it] {'loss': 0.0025, 'learning_rate': 3.998625339309662e-05, 'epoch': 2.07} 4%|▍ | 2104/50750 [5:53:41<79:57:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:36:25,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 22:36:25,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.89 | bwd_microstep: 3849.66 | bwd_inner_microstep: 3842.04 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.70 [2024-11-13 22:36:25,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.89 | bwd: 3849.68 | bwd_inner: 3842.04 | bwd_allreduce: 7.60 | step: 21.70 4%|▍ | 2105/50750 [5:53:47<79:59:20, 5.92s/it] {'loss': 0.1635, 'learning_rate': 3.998620603731251e-05, 'epoch': 2.07} 4%|▍ | 2105/50750 [5:53:47<79:59:20, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:36:31,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:36:31,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.73 | bwd_microstep: 3850.62 | bwd_inner_microstep: 3843.05 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.71 [2024-11-13 22:36:31,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.72 | bwd: 3850.63 | bwd_inner: 3843.05 | bwd_allreduce: 7.53 | step: 21.71 4%|▍ | 2106/50750 [5:53:53<80:01:52, 5.92s/it] {'loss': 0.3113, 'learning_rate': 3.9986158600128595e-05, 'epoch': 2.07} 4%|▍ | 2106/50750 [5:53:53<80:01:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:36:37,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 22:36:37,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.97 | bwd_microstep: 3851.14 | bwd_inner_microstep: 3843.31 | bwd_allreduce_microstep: 7.79 | step_microstep: 21.68 [2024-11-13 22:36:37,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.97 | bwd: 3851.16 | bwd_inner: 3843.31 | bwd_allreduce: 7.81 | step: 21.69 4%|▍ | 2107/50750 [5:53:59<80:02:36, 5.92s/it] {'loss': 0.0288, 'learning_rate': 3.9986111081545065e-05, 'epoch': 2.08} 4%|▍ | 2107/50750 [5:53:59<80:02:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:36:43,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.33 | optimizer_step: 4.93 [2024-11-13 22:36:43,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.33 | bwd_microstep: 3847.05 | bwd_inner_microstep: 3839.31 | bwd_allreduce_microstep: 7.69 | 
step_microstep: 22.79 [2024-11-13 22:36:43,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.32 | bwd: 3847.07 | bwd_inner: 3839.31 | bwd_allreduce: 7.71 | step: 22.79 4%|▍ | 2108/50750 [5:54:05<80:04:56, 5.93s/it] {'loss': 0.0022, 'learning_rate': 3.9986063481562116e-05, 'epoch': 2.08} 4%|▍ | 2108/50750 [5:54:05<80:04:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:36:49,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:36:49,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.65 | bwd_microstep: 3847.99 | bwd_inner_microstep: 3840.40 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.14 [2024-11-13 22:36:49,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.63 | bwd: 3848.00 | bwd_inner: 3840.41 | bwd_allreduce: 7.56 | step: 22.16 4%|▍ | 2109/50750 [5:54:11<80:04:55, 5.93s/it] {'loss': 0.0022, 'learning_rate': 3.9986015800179944e-05, 'epoch': 2.08} 4%|▍ | 2109/50750 [5:54:11<80:04:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:36:55,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:36:55,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.40 | bwd_microstep: 3850.87 | bwd_inner_microstep: 3843.36 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.03 [2024-11-13 22:36:55,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.40 | bwd: 3850.88 | bwd_inner: 3843.36 | bwd_allreduce: 7.48 | step: 21.04 4%|▍ | 2110/50750 [5:54:17<80:05:28, 5.93s/it] {'loss': 0.024, 'learning_rate': 3.998596803739875e-05, 'epoch': 2.08} 4%|▍ | 2110/50750 [5:54:17<80:05:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
22:37:01,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:37:01,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.05 | bwd_microstep: 3847.34 | bwd_inner_microstep: 3839.83 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.14 [2024-11-13 22:37:01,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.05 | bwd: 3847.35 | bwd_inner: 3839.83 | bwd_allreduce: 7.49 | step: 21.14 4%|▍ | 2111/50750 [5:54:23<80:02:39, 5.92s/it] {'loss': 0.0061, 'learning_rate': 3.998592019321871e-05, 'epoch': 2.08} 4%|▍ | 2111/50750 [5:54:23<80:02:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:37:07,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:37:07,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.00 | bwd_microstep: 3846.93 | bwd_inner_microstep: 3839.40 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.30 [2024-11-13 22:37:07,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.00 | bwd: 3846.95 | bwd_inner: 3839.40 | bwd_allreduce: 7.51 | step: 21.30 4%|▍ | 2112/50750 [5:54:29<80:01:33, 5.92s/it] {'loss': 0.008, 'learning_rate': 3.998587226764003e-05, 'epoch': 2.08} 4%|▍ | 2112/50750 [5:54:29<80:01:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:37:12,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:37:12,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.59 | bwd_microstep: 3850.81 | bwd_inner_microstep: 3843.31 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.55 [2024-11-13 22:37:12,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2028.59 | bwd: 3850.82 | bwd_inner: 3843.31 | bwd_allreduce: 7.47 | step: 21.55 4%|▍ | 2113/50750 [5:54:34<80:02:35, 5.92s/it] {'loss': 0.1054, 'learning_rate': 3.998582426066291e-05, 'epoch': 2.08} 4%|▍ | 2113/50750 [5:54:34<80:02:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:37:18,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:37:18,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.91 | bwd_microstep: 3851.48 | bwd_inner_microstep: 3843.92 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.05 [2024-11-13 22:37:18,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.91 | bwd: 3851.49 | bwd_inner: 3843.92 | bwd_allreduce: 7.53 | step: 21.05 4%|▍ | 2114/50750 [5:54:40<80:02:58, 5.93s/it] {'loss': 0.198, 'learning_rate': 3.998577617228753e-05, 'epoch': 2.08} 4%|▍ | 2114/50750 [5:54:40<80:02:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:37:24,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.55 | optimizer_step: 4.93 [2024-11-13 22:37:24,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.77 | bwd_microstep: 3849.34 | bwd_inner_microstep: 3841.29 | bwd_allreduce_microstep: 7.99 | step_microstep: 28.99 [2024-11-13 22:37:24,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.76 | bwd: 3849.36 | bwd_inner: 3841.29 | bwd_allreduce: 8.01 | step: 29.00 4%|▍ | 2115/50750 [5:54:46<80:06:30, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.9985728002514105e-05, 'epoch': 2.08} 4%|▍ | 2115/50750 [5:54:46<80:06:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:37:30,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:37:30,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.54 | bwd_microstep: 3846.95 | bwd_inner_microstep: 3839.26 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.63 [2024-11-13 22:37:30,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.53 | bwd: 3846.96 | bwd_inner: 3839.26 | bwd_allreduce: 7.66 | step: 21.63 4%|▍ | 2116/50750 [5:54:52<80:06:25, 5.93s/it] {'loss': 0.4769, 'learning_rate': 3.998567975134282e-05, 'epoch': 2.08} 4%|▍ | 2116/50750 [5:54:52<80:06:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:37:36,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-13 22:37:36,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.22 | bwd_microstep: 3841.15 | bwd_inner_microstep: 3833.52 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.59 [2024-11-13 22:37:36,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.20 | bwd: 3841.17 | bwd_inner: 3833.52 | bwd_allreduce: 7.61 | step: 21.60 4%|▍ | 2117/50750 [5:54:58<80:06:22, 5.93s/it] {'loss': 0.0132, 'learning_rate': 3.998563141877387e-05, 'epoch': 2.09} 4%|▍ | 2117/50750 [5:54:58<80:06:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:37:42,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:37:42,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3845.89 | bwd_inner_microstep: 3838.36 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.95 [2024-11-13 22:37:42,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.14 | bwd: 3845.90 | bwd_inner: 3838.36 | bwd_allreduce: 7.50 | step: 20.95 4%|▍ | 
2118/50750 [5:55:04<80:05:04, 5.93s/it] {'loss': 0.0069, 'learning_rate': 3.998558300480745e-05, 'epoch': 2.09} 4%|▍ | 2118/50750 [5:55:04<80:05:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:37:48,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.94 [2024-11-13 22:37:48,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.50 | bwd_microstep: 3858.47 | bwd_inner_microstep: 3850.35 | bwd_allreduce_microstep: 8.05 | step_microstep: 26.46 [2024-11-13 22:37:48,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.50 | bwd: 3858.49 | bwd_inner: 3850.35 | bwd_allreduce: 8.08 | step: 26.46 4%|▍ | 2119/50750 [5:55:10<80:07:23, 5.93s/it] {'loss': 0.1567, 'learning_rate': 3.998553450944378e-05, 'epoch': 2.09} 4%|▍ | 2119/50750 [5:55:10<80:07:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:37:54,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:37:54,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.89 | bwd_microstep: 3860.30 | bwd_inner_microstep: 3852.76 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.63 [2024-11-13 22:37:54,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3860.32 | bwd_inner: 3852.76 | bwd_allreduce: 7.52 | step: 21.63 4%|▍ | 2120/50750 [5:55:16<80:09:32, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.998548593268302e-05, 'epoch': 2.09} 4%|▍ | 2120/50750 [5:55:16<80:09:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:38:00,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:38:00,453] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.56 | bwd_microstep: 3855.17 | bwd_inner_microstep: 3847.63 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.96 [2024-11-13 22:38:00,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.54 | bwd: 3855.18 | bwd_inner: 3847.63 | bwd_allreduce: 7.51 | step: 22.96 4%|▍ | 2121/50750 [5:55:22<80:08:22, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.99854372745254e-05, 'epoch': 2.09} 4%|▍ | 2121/50750 [5:55:22<80:08:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:38:06,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 22:38:06,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.09 | bwd_microstep: 3849.58 | bwd_inner_microstep: 3842.05 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.14 [2024-11-13 22:38:06,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.09 | bwd: 3849.59 | bwd_inner: 3842.05 | bwd_allreduce: 7.50 | step: 21.14 4%|▍ | 2122/50750 [5:55:28<80:05:29, 5.93s/it] {'loss': 0.061, 'learning_rate': 3.9985388534971094e-05, 'epoch': 2.09} 4%|▍ | 2122/50750 [5:55:28<80:05:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:38:12,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:38:12,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.95 | bwd_microstep: 3845.56 | bwd_inner_microstep: 3838.04 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.15 [2024-11-13 22:38:12,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.95 | bwd: 3845.57 | bwd_inner: 3838.04 | bwd_allreduce: 7.49 | step: 21.15 4%|▍ | 2123/50750 [5:55:34<80:02:37, 5.93s/it] {'loss': 0.1286, 'learning_rate': 
3.998533971402032e-05, 'epoch': 2.09} 4%|▍ | 2123/50750 [5:55:34<80:02:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:38:18,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 22:38:18,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.89 | bwd_microstep: 3856.51 | bwd_inner_microstep: 3848.97 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-13 22:38:18,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.89 | bwd: 3856.52 | bwd_inner: 3848.98 | bwd_allreduce: 7.51 | step: 21.11 4%|▍ | 2124/50750 [5:55:40<80:02:39, 5.93s/it] {'loss': 0.0964, 'learning_rate': 3.998529081167327e-05, 'epoch': 2.09} 4%|▍ | 2124/50750 [5:55:40<80:02:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:38:24,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 22:38:24,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.37 | bwd_microstep: 3853.55 | bwd_inner_microstep: 3845.77 | bwd_allreduce_microstep: 7.73 | step_microstep: 22.58 [2024-11-13 22:38:24,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.37 | bwd: 3853.56 | bwd_inner: 3845.77 | bwd_allreduce: 7.75 | step: 22.59 4%|▍ | 2125/50750 [5:55:46<80:02:12, 5.93s/it] {'loss': 0.8408, 'learning_rate': 3.998524182793015e-05, 'epoch': 2.09} 4%|▍ | 2125/50750 [5:55:46<80:02:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:38:30,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:38:30,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.00 | bwd_microstep: 
3848.28 | bwd_inner_microstep: 3840.36 | bwd_allreduce_microstep: 7.86 | step_microstep: 22.57 [2024-11-13 22:38:30,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.98 | bwd: 3848.30 | bwd_inner: 3840.36 | bwd_allreduce: 7.88 | step: 22.57 4%|▍ | 2126/50750 [5:55:52<80:02:45, 5.93s/it] {'loss': 0.0113, 'learning_rate': 3.998519276279114e-05, 'epoch': 2.09} 4%|▍ | 2126/50750 [5:55:52<80:02:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:38:36,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:38:36,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.32 | bwd_microstep: 3853.15 | bwd_inner_microstep: 3845.63 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.37 [2024-11-13 22:38:36,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.31 | bwd: 3853.16 | bwd_inner: 3845.63 | bwd_allreduce: 7.49 | step: 21.38 4%|▍ | 2127/50750 [5:55:57<80:03:20, 5.93s/it] {'loss': 0.0045, 'learning_rate': 3.998514361625645e-05, 'epoch': 2.1} 4%|▍ | 2127/50750 [5:55:57<80:03:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:38:41,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:38:41,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.19 | bwd_microstep: 3847.37 | bwd_inner_microstep: 3839.79 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.72 [2024-11-13 22:38:41,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.19 | bwd: 3847.38 | bwd_inner: 3839.79 | bwd_allreduce: 7.55 | step: 21.72 4%|▍ | 2128/50750 [5:56:03<80:02:14, 5.93s/it] {'loss': 0.1502, 'learning_rate': 3.998509438832629e-05, 'epoch': 2.1} 4%|▍ | 2128/50750 [5:56:03<80:02:14, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:38:47,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:38:47,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.46 | bwd_microstep: 3844.96 | bwd_inner_microstep: 3837.34 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.70 [2024-11-13 22:38:47,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.46 | bwd: 3844.98 | bwd_inner: 3837.34 | bwd_allreduce: 7.60 | step: 21.71 4%|▍ | 2129/50750 [5:56:09<79:59:46, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9985045079000846e-05, 'epoch': 2.1} 4%|▍ | 2129/50750 [5:56:09<79:59:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:38:53,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:38:53,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.88 | bwd_microstep: 3848.30 | bwd_inner_microstep: 3840.79 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.00 [2024-11-13 22:38:53,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.88 | bwd: 3848.31 | bwd_inner: 3840.79 | bwd_allreduce: 7.48 | step: 21.00 4%|▍ | 2130/50750 [5:56:15<79:58:23, 5.92s/it] {'loss': 0.7314, 'learning_rate': 3.998499568828033e-05, 'epoch': 2.1} 4%|▍ | 2130/50750 [5:56:15<79:58:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:38:59,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 22:38:59,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.45 | bwd_microstep: 3845.41 | bwd_inner_microstep: 3837.53 | bwd_allreduce_microstep: 7.84 | step_microstep: 22.58 [2024-11-13 
22:38:59,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.45 | bwd: 3845.42 | bwd_inner: 3837.53 | bwd_allreduce: 7.85 | step: 22.59 4%|▍ | 2131/50750 [5:56:21<79:58:15, 5.92s/it] {'loss': 0.0871, 'learning_rate': 3.9984946216164926e-05, 'epoch': 2.1} 4%|▍ | 2131/50750 [5:56:21<79:58:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:39:05,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:39:05,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.57 | bwd_microstep: 3853.54 | bwd_inner_microstep: 3846.01 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.24 [2024-11-13 22:39:05,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.56 | bwd: 3853.55 | bwd_inner: 3846.01 | bwd_allreduce: 7.49 | step: 21.24 4%|▍ | 2132/50750 [5:56:27<80:01:04, 5.93s/it] {'loss': 0.0185, 'learning_rate': 3.998489666265486e-05, 'epoch': 2.1} 4%|▍ | 2132/50750 [5:56:27<80:01:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:39:11,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:39:11,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.80 | bwd_microstep: 3854.57 | bwd_inner_microstep: 3846.96 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.78 [2024-11-13 22:39:11,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.80 | bwd: 3854.58 | bwd_inner: 3846.96 | bwd_allreduce: 7.58 | step: 21.78 4%|▍ | 2133/50750 [5:56:33<80:02:20, 5.93s/it] {'loss': 0.1528, 'learning_rate': 3.9984847027750316e-05, 'epoch': 2.1} 4%|▍ | 2133/50750 [5:56:33<80:02:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:39:17,462] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 22:39:17,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.33 | bwd_microstep: 3848.45 | bwd_inner_microstep: 3840.91 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.05 [2024-11-13 22:39:17,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.33 | bwd: 3848.46 | bwd_inner: 3840.91 | bwd_allreduce: 7.51 | step: 22.05 4%|▍ | 2134/50750 [5:56:39<80:00:51, 5.93s/it] {'loss': 0.002, 'learning_rate': 3.998479731145151e-05, 'epoch': 2.1} 4%|▍ | 2134/50750 [5:56:39<80:00:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:39:23,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:39:23,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.57 | bwd_microstep: 3854.06 | bwd_inner_microstep: 3846.54 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.22 [2024-11-13 22:39:23,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.50 | bwd: 3854.07 | bwd_inner: 3846.54 | bwd_allreduce: 7.49 | step: 21.22 4%|▍ | 2135/50750 [5:56:45<80:02:19, 5.93s/it] {'loss': 0.4149, 'learning_rate': 3.998474751375863e-05, 'epoch': 2.1} 4%|▍ | 2135/50750 [5:56:45<80:02:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:39:29,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:39:29,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.85 | bwd_microstep: 3846.48 | bwd_inner_microstep: 3838.90 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.71 [2024-11-13 22:39:29,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.85 | bwd: 3846.49 | 
bwd_inner: 3838.90 | bwd_allreduce: 7.55 | step: 21.72 4%|▍ | 2136/50750 [5:56:51<80:00:35, 5.92s/it] {'loss': 0.288, 'learning_rate': 3.998469763467188e-05, 'epoch': 2.1} 4%|▍ | 2136/50750 [5:56:51<80:00:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:39:35,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:39:35,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.54 | bwd_microstep: 3849.40 | bwd_inner_microstep: 3841.67 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.13 [2024-11-13 22:39:35,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.54 | bwd: 3849.42 | bwd_inner: 3841.67 | bwd_allreduce: 7.70 | step: 22.14 4%|▍ | 2137/50750 [5:56:57<80:00:35, 5.93s/it] {'loss': 0.0965, 'learning_rate': 3.998464767419148e-05, 'epoch': 2.11} 4%|▍ | 2137/50750 [5:56:57<80:00:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:39:41,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:39:41,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.23 | bwd_microstep: 3848.20 | bwd_inner_microstep: 3840.68 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.95 [2024-11-13 22:39:41,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.22 | bwd: 3848.22 | bwd_inner: 3840.68 | bwd_allreduce: 7.50 | step: 21.96 4%|▍ | 2138/50750 [5:57:03<80:00:00, 5.92s/it] {'loss': 0.0024, 'learning_rate': 3.998459763231761e-05, 'epoch': 2.11} 4%|▍ | 2138/50750 [5:57:03<80:00:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:39:47,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | 
optimizer_step: 4.93 [2024-11-13 22:39:47,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.75 | bwd_microstep: 3853.70 | bwd_inner_microstep: 3845.93 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.79 [2024-11-13 22:39:47,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.75 | bwd: 3853.72 | bwd_inner: 3845.93 | bwd_allreduce: 7.75 | step: 21.80 4%|▍ | 2139/50750 [5:57:09<80:00:27, 5.93s/it] {'loss': 0.4326, 'learning_rate': 3.9984547509050496e-05, 'epoch': 2.11} 4%|▍ | 2139/50750 [5:57:09<80:00:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:39:53,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:39:53,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.98 | bwd_microstep: 3852.54 | bwd_inner_microstep: 3844.94 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.48 [2024-11-13 22:39:53,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.95 | bwd: 3852.55 | bwd_inner: 3844.95 | bwd_allreduce: 7.57 | step: 21.49 4%|▍ | 2140/50750 [5:57:14<80:00:59, 5.93s/it] {'loss': 0.7141, 'learning_rate': 3.9984497304390324e-05, 'epoch': 2.11} 4%|▍ | 2140/50750 [5:57:14<80:00:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:39:58,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 22:39:58,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3851.11 | bwd_inner_microstep: 3843.57 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.07 [2024-11-13 22:39:58,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.58 | bwd: 3851.12 | bwd_inner: 3843.57 | bwd_allreduce: 7.51 | step: 21.07 4%|▍ | 2141/50750 [5:57:20<79:59:58, 
5.92s/it] {'loss': 0.0066, 'learning_rate': 3.998444701833731e-05, 'epoch': 2.11} 4%|▍ | 2141/50750 [5:57:20<79:59:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:40:04,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:40:04,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.62 | bwd_microstep: 3850.62 | bwd_inner_microstep: 3843.09 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.52 [2024-11-13 22:40:04,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.62 | bwd: 3850.64 | bwd_inner: 3843.09 | bwd_allreduce: 7.51 | step: 21.53 4%|▍ | 2142/50750 [5:57:26<80:00:36, 5.93s/it] {'loss': 0.0694, 'learning_rate': 3.998439665089165e-05, 'epoch': 2.11} 4%|▍ | 2142/50750 [5:57:26<80:00:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:40:10,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:40:10,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.66 | bwd_microstep: 3854.35 | bwd_inner_microstep: 3846.65 | bwd_allreduce_microstep: 7.64 | step_microstep: 22.11 [2024-11-13 22:40:10,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.66 | bwd: 3854.37 | bwd_inner: 3846.65 | bwd_allreduce: 7.66 | step: 22.11 4%|▍ | 2143/50750 [5:57:32<80:00:13, 5.93s/it] {'loss': 0.4744, 'learning_rate': 3.998434620205356e-05, 'epoch': 2.11} 4%|▍ | 2143/50750 [5:57:32<80:00:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:40:16,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:40:16,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2024.63 | bwd_microstep: 3848.48 | bwd_inner_microstep: 3840.97 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.15 [2024-11-13 22:40:16,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.63 | bwd: 3848.49 | bwd_inner: 3840.97 | bwd_allreduce: 7.48 | step: 21.15 4%|▍ | 2144/50750 [5:57:38<79:58:58, 5.92s/it] {'loss': 0.0673, 'learning_rate': 3.9984295671823236e-05, 'epoch': 2.11} 4%|▍ | 2144/50750 [5:57:38<79:58:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:40:22,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:40:22,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.50 | bwd_microstep: 3855.86 | bwd_inner_microstep: 3848.06 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.17 [2024-11-13 22:40:22,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.48 | bwd: 3855.88 | bwd_inner: 3848.06 | bwd_allreduce: 7.77 | step: 22.17 4%|▍ | 2145/50750 [5:57:44<80:01:04, 5.93s/it] {'loss': 0.0063, 'learning_rate': 3.998424506020089e-05, 'epoch': 2.11} 4%|▍ | 2145/50750 [5:57:44<80:01:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:40:28,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:40:28,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3857.46 | bwd_inner_microstep: 3848.26 | bwd_allreduce_microstep: 9.15 | step_microstep: 21.46 [2024-11-13 22:40:28,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3857.47 | bwd_inner: 3848.26 | bwd_allreduce: 9.17 | step: 21.46 4%|▍ | 2146/50750 [5:57:50<80:02:23, 5.93s/it] {'loss': 0.0064, 'learning_rate': 3.998419436718672e-05, 'epoch': 2.11} 4%|▍ | 2146/50750 
[5:57:50<80:02:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:40:34,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-13 22:40:34,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.58 | bwd_microstep: 3859.67 | bwd_inner_microstep: 3852.14 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.34 [2024-11-13 22:40:34,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.57 | bwd: 3859.68 | bwd_inner: 3852.14 | bwd_allreduce: 7.50 | step: 21.34 4%|▍ | 2147/50750 [5:57:56<80:02:53, 5.93s/it] {'loss': 0.3702, 'learning_rate': 3.9984143592780944e-05, 'epoch': 2.12} 4%|▍ | 2147/50750 [5:57:56<80:02:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:40:40,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:40:40,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.90 | bwd_microstep: 3851.15 | bwd_inner_microstep: 3843.60 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.73 [2024-11-13 22:40:40,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.90 | bwd: 3851.16 | bwd_inner: 3843.60 | bwd_allreduce: 7.52 | step: 21.73 4%|▍ | 2148/50750 [5:58:02<80:01:19, 5.93s/it] {'loss': 0.0916, 'learning_rate': 3.998409273698376e-05, 'epoch': 2.12} 4%|▍ | 2148/50750 [5:58:02<80:01:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:40:46,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 22:40:46,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.97 | bwd_microstep: 3852.80 | bwd_inner_microstep: 3845.26 | 
bwd_allreduce_microstep: 7.49 | step_microstep: 21.19 [2024-11-13 22:40:46,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.97 | bwd: 3852.81 | bwd_inner: 3845.26 | bwd_allreduce: 7.51 | step: 21.20 4%|▍ | 2149/50750 [5:58:08<80:00:44, 5.93s/it] {'loss': 0.0077, 'learning_rate': 3.998404179979537e-05, 'epoch': 2.12} 4%|▍ | 2149/50750 [5:58:08<80:00:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:40:52,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 22:40:52,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.50 | bwd_microstep: 3843.97 | bwd_inner_microstep: 3836.48 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.93 [2024-11-13 22:40:52,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.51 | bwd: 3843.99 | bwd_inner: 3836.48 | bwd_allreduce: 7.47 | step: 20.94 4%|▍ | 2150/50750 [5:58:14<79:57:12, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.9983990781216e-05, 'epoch': 2.12} 4%|▍ | 2150/50750 [5:58:14<79:57:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:40:58,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 22:40:58,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.79 | bwd_microstep: 3851.73 | bwd_inner_microstep: 3843.93 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.42 [2024-11-13 22:40:58,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3851.75 | bwd_inner: 3843.93 | bwd_allreduce: 7.77 | step: 22.42 4%|▍ | 2151/50750 [5:58:20<79:58:33, 5.92s/it] {'loss': 0.0747, 'learning_rate': 3.998393968124585e-05, 'epoch': 2.12} 4%|▍ | 2151/50750 [5:58:20<79:58:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-13 22:41:04,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.62 | optimizer_step: 4.93 [2024-11-13 22:41:04,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.38 | bwd_microstep: 3849.05 | bwd_inner_microstep: 3840.96 | bwd_allreduce_microstep: 8.02 | step_microstep: 27.12 [2024-11-13 22:41:04,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.35 | bwd: 3849.07 | bwd_inner: 3840.96 | bwd_allreduce: 8.05 | step: 27.13 4%|▍ | 2152/50750 [5:58:26<80:01:27, 5.93s/it] {'loss': 0.0178, 'learning_rate': 3.9983888499885116e-05, 'epoch': 2.12} 4%|▍ | 2152/50750 [5:58:26<80:01:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:41:10,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:41:10,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.69 | bwd_microstep: 3851.44 | bwd_inner_microstep: 3843.93 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.93 [2024-11-13 22:41:10,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.69 | bwd: 3851.45 | bwd_inner: 3843.93 | bwd_allreduce: 7.48 | step: 20.93 4%|▍ | 2153/50750 [5:58:32<80:00:20, 5.93s/it] {'loss': 0.4767, 'learning_rate': 3.998383723713402e-05, 'epoch': 2.12} 4%|▍ | 2153/50750 [5:58:32<80:00:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:41:15,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:41:15,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.83 | bwd_microstep: 3850.00 | bwd_inner_microstep: 3842.27 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.31 [2024-11-13 22:41:15,990] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.83 | bwd: 3850.02 | bwd_inner: 3842.27 | bwd_allreduce: 7.70 | step: 21.31 4%|▍ | 2154/50750 [5:58:37<80:01:40, 5.93s/it] {'loss': 0.0083, 'learning_rate': 3.998378589299276e-05, 'epoch': 2.12} 4%|▍ | 2154/50750 [5:58:37<80:01:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:41:21,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:41:21,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.36 | bwd_microstep: 3841.72 | bwd_inner_microstep: 3834.13 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.81 [2024-11-13 22:41:21,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.34 | bwd: 3841.74 | bwd_inner: 3834.13 | bwd_allreduce: 7.57 | step: 21.81 4%|▍ | 2155/50750 [5:58:43<80:01:09, 5.93s/it] {'loss': 0.0496, 'learning_rate': 3.998373446746156e-05, 'epoch': 2.12} 4%|▍ | 2155/50750 [5:58:43<80:01:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:41:27,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 22:41:27,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.68 | bwd_microstep: 3843.64 | bwd_inner_microstep: 3836.05 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.06 [2024-11-13 22:41:27,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.67 | bwd: 3843.65 | bwd_inner: 3836.05 | bwd_allreduce: 7.56 | step: 22.06 4%|▍ | 2156/50750 [5:58:49<79:59:48, 5.93s/it] {'loss': 0.2181, 'learning_rate': 3.998368296054062e-05, 'epoch': 2.12} 4%|▍ | 2156/50750 [5:58:49<79:59:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:41:33,767] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:41:33,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.50 | bwd_microstep: 3845.05 | bwd_inner_microstep: 3837.51 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.15 [2024-11-13 22:41:33,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.48 | bwd: 3845.06 | bwd_inner: 3837.51 | bwd_allreduce: 7.51 | step: 21.17 4%|▍ | 2157/50750 [5:58:55<79:59:23, 5.93s/it] {'loss': 0.0023, 'learning_rate': 3.998363137223014e-05, 'epoch': 2.13} 4%|▍ | 2157/50750 [5:58:55<79:59:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:41:39,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 22:41:39,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.59 | bwd_microstep: 3845.58 | bwd_inner_microstep: 3837.79 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.15 [2024-11-13 22:41:39,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.57 | bwd: 3845.59 | bwd_inner: 3837.79 | bwd_allreduce: 7.76 | step: 22.16 4%|▍ | 2158/50750 [5:59:01<80:02:01, 5.93s/it] {'loss': 0.0336, 'learning_rate': 3.998357970253035e-05, 'epoch': 2.13} 4%|▍ | 2158/50750 [5:59:01<80:02:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:41:45,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:41:45,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.47 | bwd_microstep: 3840.71 | bwd_inner_microstep: 3833.19 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.23 [2024-11-13 22:41:45,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3840.72 | bwd_inner: 3833.19 | 
bwd_allreduce: 7.50 | step: 21.23 4%|▍ | 2159/50750 [5:59:07<79:58:11, 5.92s/it] {'loss': 0.1455, 'learning_rate': 3.9983527951441455e-05, 'epoch': 2.13} 4%|▍ | 2159/50750 [5:59:07<79:58:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:41:51,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 22:41:51,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.34 | bwd_microstep: 3838.32 | bwd_inner_microstep: 3830.69 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.22 [2024-11-13 22:41:51,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.34 | bwd: 3838.33 | bwd_inner: 3830.69 | bwd_allreduce: 7.60 | step: 21.22 4%|▍ | 2160/50750 [5:59:13<79:54:55, 5.92s/it] {'loss': 0.2512, 'learning_rate': 3.998347611896365e-05, 'epoch': 2.13} 4%|▍ | 2160/50750 [5:59:13<79:54:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:41:57,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:41:57,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.29 | bwd_microstep: 3846.97 | bwd_inner_microstep: 3839.25 | bwd_allreduce_microstep: 7.67 | step_microstep: 20.85 [2024-11-13 22:41:57,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.29 | bwd: 3846.98 | bwd_inner: 3839.25 | bwd_allreduce: 7.69 | step: 20.85 4%|▍ | 2161/50750 [5:59:19<79:54:22, 5.92s/it] {'loss': 0.1852, 'learning_rate': 3.998342420509717e-05, 'epoch': 2.13} 4%|▍ | 2161/50750 [5:59:19<79:54:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:42:03,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 
[2024-11-13 22:42:03,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.87 | bwd_microstep: 3845.83 | bwd_inner_microstep: 3838.34 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.80 [2024-11-13 22:42:03,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.88 | bwd: 3845.84 | bwd_inner: 3838.34 | bwd_allreduce: 7.46 | step: 20.80 4%|▍ | 2162/50750 [5:59:25<79:53:38, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.9983372209842205e-05, 'epoch': 2.13} 4%|▍ | 2162/50750 [5:59:25<79:53:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:42:09,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 22:42:09,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.83 | bwd_microstep: 3845.95 | bwd_inner_microstep: 3838.28 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.44 [2024-11-13 22:42:09,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.79 | bwd: 3845.96 | bwd_inner: 3838.28 | bwd_allreduce: 7.64 | step: 21.44 4%|▍ | 2163/50750 [5:59:31<79:56:19, 5.92s/it] {'loss': 0.0169, 'learning_rate': 3.998332013319899e-05, 'epoch': 2.13} 4%|▍ | 2163/50750 [5:59:31<79:56:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:42:15,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 22:42:15,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.11 | bwd_microstep: 3844.58 | bwd_inner_microstep: 3836.95 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.45 [2024-11-13 22:42:15,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.10 | bwd: 3844.59 | bwd_inner: 3836.95 | bwd_allreduce: 7.60 | step: 21.45 4%|▍ | 2164/50750 [5:59:37<79:56:56, 5.92s/it] {'loss': 0.0907, 
'learning_rate': 3.998326797516771e-05, 'epoch': 2.13} 4%|▍ | 2164/50750 [5:59:37<79:56:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:42:21,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:42:21,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.35 | bwd_microstep: 3845.64 | bwd_inner_microstep: 3838.12 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.01 [2024-11-13 22:42:21,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.35 | bwd: 3845.65 | bwd_inner: 3838.12 | bwd_allreduce: 7.49 | step: 21.01 4%|▍ | 2165/50750 [5:59:43<79:56:09, 5.92s/it] {'loss': 0.0254, 'learning_rate': 3.9983215735748594e-05, 'epoch': 2.13} 4%|▍ | 2165/50750 [5:59:43<79:56:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:42:27,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 22:42:27,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.85 | bwd_microstep: 3849.33 | bwd_inner_microstep: 3841.83 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.85 [2024-11-13 22:42:27,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.85 | bwd: 3849.35 | bwd_inner: 3841.83 | bwd_allreduce: 7.48 | step: 20.85 4%|▍ | 2166/50750 [5:59:49<79:54:50, 5.92s/it] {'loss': 0.4175, 'learning_rate': 3.998316341494185e-05, 'epoch': 2.13} 4%|▍ | 2166/50750 [5:59:49<79:54:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:42:32,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:42:32,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.20 | 
bwd_microstep: 3843.08 | bwd_inner_microstep: 3835.50 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.43 [2024-11-13 22:42:32,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.20 | bwd: 3843.09 | bwd_inner: 3835.50 | bwd_allreduce: 7.55 | step: 21.44 4%|▍ | 2167/50750 [5:59:54<79:53:48, 5.92s/it] {'loss': 0.09, 'learning_rate': 3.99831110127477e-05, 'epoch': 2.13} 4%|▍ | 2167/50750 [5:59:54<79:53:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:42:38,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:42:38,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.39 | bwd_microstep: 3844.45 | bwd_inner_microstep: 3836.68 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.44 [2024-11-13 22:42:38,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.37 | bwd: 3844.46 | bwd_inner: 3836.68 | bwd_allreduce: 7.74 | step: 21.45 4%|▍ | 2168/50750 [6:00:00<79:55:08, 5.92s/it] {'loss': 0.0045, 'learning_rate': 3.998305852916635e-05, 'epoch': 2.14} 4%|▍ | 2168/50750 [6:00:00<79:55:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:42:44,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:42:44,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.26 | bwd_microstep: 3844.25 | bwd_inner_microstep: 3836.75 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.03 [2024-11-13 22:42:44,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.26 | bwd: 3844.26 | bwd_inner: 3836.75 | bwd_allreduce: 7.47 | step: 21.04 4%|▍ | 2169/50750 [6:00:06<79:54:05, 5.92s/it] {'loss': 0.0997, 'learning_rate': 3.998300596419801e-05, 'epoch': 2.14} 4%|▍ | 2169/50750 [6:00:06<79:54:05, 5.92s/it]dynamic 
ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:42:50,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:42:50,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.01 | bwd_microstep: 3841.20 | bwd_inner_microstep: 3833.66 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.96 [2024-11-13 22:42:50,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.01 | bwd: 3841.21 | bwd_inner: 3833.66 | bwd_allreduce: 7.51 | step: 20.96 4%|▍ | 2170/50750 [6:00:12<79:51:59, 5.92s/it] {'loss': 0.4205, 'learning_rate': 3.998295331784289e-05, 'epoch': 2.14} 4%|▍ | 2170/50750 [6:00:12<79:51:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:42:56,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:42:56,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.44 | bwd_microstep: 3845.00 | bwd_inner_microstep: 3837.47 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.02 [2024-11-13 22:42:56,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.44 | bwd: 3845.01 | bwd_inner: 3837.47 | bwd_allreduce: 7.50 | step: 21.02 4%|▍ | 2171/50750 [6:00:18<79:53:16, 5.92s/it] {'loss': 0.233, 'learning_rate': 3.998290059010122e-05, 'epoch': 2.14} 4%|▍ | 2171/50750 [6:00:18<79:53:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:43:02,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:43:02,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.22 | bwd_microstep: 3839.92 | bwd_inner_microstep: 3832.38 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.84 
[2024-11-13 22:43:02,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.23 | bwd: 3839.93 | bwd_inner: 3832.38 | bwd_allreduce: 7.51 | step: 20.84 4%|▍ | 2172/50750 [6:00:24<79:52:10, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.998284778097321e-05, 'epoch': 2.14} 4%|▍ | 2172/50750 [6:00:24<79:52:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:43:08,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:43:08,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.46 | bwd_microstep: 3842.88 | bwd_inner_microstep: 3835.16 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.06 [2024-11-13 22:43:08,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.45 | bwd: 3842.89 | bwd_inner: 3835.16 | bwd_allreduce: 7.69 | step: 21.06 4%|▍ | 2173/50750 [6:00:30<79:52:09, 5.92s/it] {'loss': 0.3497, 'learning_rate': 3.998279489045907e-05, 'epoch': 2.14} 4%|▍ | 2173/50750 [6:00:30<79:52:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:43:14,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 22:43:14,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.30 | bwd_microstep: 3845.95 | bwd_inner_microstep: 3838.12 | bwd_allreduce_microstep: 7.79 | step_microstep: 21.83 [2024-11-13 22:43:14,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.29 | bwd: 3845.97 | bwd_inner: 3838.12 | bwd_allreduce: 7.81 | step: 21.84 4%|▍ | 2174/50750 [6:00:36<79:53:16, 5.92s/it] {'loss': 0.0063, 'learning_rate': 3.9982741918559016e-05, 'epoch': 2.14} 4%|▍ | 2174/50750 [6:00:36<79:53:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:43:20,347] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 22:43:20,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.91 | bwd_microstep: 3846.25 | bwd_inner_microstep: 3838.56 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.57 [2024-11-13 22:43:20,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.89 | bwd: 3846.26 | bwd_inner: 3838.56 | bwd_allreduce: 7.66 | step: 21.58 4%|▍ | 2175/50750 [6:00:42<79:55:22, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9982688865273266e-05, 'epoch': 2.14} 4%|▍ | 2175/50750 [6:00:42<79:55:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:43:26,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 22:43:26,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.27 | bwd_microstep: 3836.76 | bwd_inner_microstep: 3829.07 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.84 [2024-11-13 22:43:26,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.25 | bwd: 3836.78 | bwd_inner: 3829.07 | bwd_allreduce: 7.66 | step: 21.85 4%|▍ | 2176/50750 [6:00:48<79:55:13, 5.92s/it] {'loss': 0.7014, 'learning_rate': 3.9982635730602036e-05, 'epoch': 2.14} 4%|▍ | 2176/50750 [6:00:48<79:55:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:43:32,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 22:43:32,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.11 | bwd_microstep: 3852.08 | bwd_inner_microstep: 3843.99 | bwd_allreduce_microstep: 8.05 | step_microstep: 21.76 [2024-11-13 22:43:32,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.09 | bwd: 
3852.09 | bwd_inner: 3843.99 | bwd_allreduce: 8.07 | step: 21.76 4%|▍ | 2177/50750 [6:00:54<79:58:04, 5.93s/it] {'loss': 0.7058, 'learning_rate': 3.998258251454554e-05, 'epoch': 2.14} 4%|▍ | 2177/50750 [6:00:54<79:58:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:43:38,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:43:38,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.02 | bwd_microstep: 3843.24 | bwd_inner_microstep: 3835.44 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.20 [2024-11-13 22:43:38,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.01 | bwd: 3843.25 | bwd_inner: 3835.44 | bwd_allreduce: 7.77 | step: 21.20 4%|▍ | 2178/50750 [6:01:00<79:57:52, 5.93s/it] {'loss': 0.2667, 'learning_rate': 3.9982529217103995e-05, 'epoch': 2.15} 4%|▍ | 2178/50750 [6:01:00<79:57:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:43:44,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:43:44,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.65 | bwd_microstep: 3845.36 | bwd_inner_microstep: 3837.78 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.84 [2024-11-13 22:43:44,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.65 | bwd: 3845.38 | bwd_inner: 3837.78 | bwd_allreduce: 7.55 | step: 21.84 4%|▍ | 2179/50750 [6:01:06<79:59:09, 5.93s/it] {'loss': 0.0024, 'learning_rate': 3.9982475838277625e-05, 'epoch': 2.15} 4%|▍ | 2179/50750 [6:01:06<79:59:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:43:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 
| optimizer_step: 4.93 [2024-11-13 22:43:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.00 | bwd_microstep: 3842.81 | bwd_inner_microstep: 3835.32 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.96 [2024-11-13 22:43:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.00 | bwd: 3842.82 | bwd_inner: 3835.32 | bwd_allreduce: 7.46 | step: 20.97 4%|▍ | 2180/50750 [6:01:11<79:56:08, 5.92s/it] {'loss': 0.0788, 'learning_rate': 3.998242237806664e-05, 'epoch': 2.15} 4%|▍ | 2180/50750 [6:01:11<79:56:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:43:55,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:43:55,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.12 | bwd_microstep: 3844.73 | bwd_inner_microstep: 3837.21 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.03 [2024-11-13 22:43:55,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.12 | bwd: 3844.74 | bwd_inner: 3837.21 | bwd_allreduce: 7.49 | step: 21.03 4%|▍ | 2181/50750 [6:01:17<79:53:46, 5.92s/it] {'loss': 0.0028, 'learning_rate': 3.9982368836471264e-05, 'epoch': 2.15} 4%|▍ | 2181/50750 [6:01:17<79:53:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:44:01,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:44:01,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.55 | bwd_microstep: 3844.01 | bwd_inner_microstep: 3836.50 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.12 [2024-11-13 22:44:01,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.55 | bwd: 3844.02 | bwd_inner: 3836.50 | bwd_allreduce: 7.48 | step: 21.12 4%|▍ | 2182/50750 [6:01:23<79:53:23, 
5.92s/it] {'loss': 0.0042, 'learning_rate': 3.9982315213491706e-05, 'epoch': 2.15} 4%|▍ | 2182/50750 [6:01:23<79:53:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:44:07,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:44:07,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3847.44 | bwd_inner_microstep: 3839.85 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.61 [2024-11-13 22:44:07,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.14 | bwd: 3847.46 | bwd_inner: 3839.85 | bwd_allreduce: 7.57 | step: 21.61 4%|▍ | 2183/50750 [6:01:29<79:55:16, 5.92s/it] {'loss': 0.0058, 'learning_rate': 3.998226150912819e-05, 'epoch': 2.15} 4%|▍ | 2183/50750 [6:01:29<79:55:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:44:13,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 22:44:13,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.53 | bwd_microstep: 3840.53 | bwd_inner_microstep: 3832.99 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.64 [2024-11-13 22:44:13,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.53 | bwd: 3840.54 | bwd_inner: 3832.99 | bwd_allreduce: 7.51 | step: 21.64 4%|▍ | 2184/50750 [6:01:35<79:52:14, 5.92s/it] {'loss': 0.3311, 'learning_rate': 3.9982207723380934e-05, 'epoch': 2.15} 4%|▍ | 2184/50750 [6:01:35<79:52:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:44:19,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 22:44:19,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2025.81 | bwd_microstep: 3838.39 | bwd_inner_microstep: 3830.85 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.19 [2024-11-13 22:44:19,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.81 | bwd: 3838.41 | bwd_inner: 3830.85 | bwd_allreduce: 7.51 | step: 21.19 4%|▍ | 2185/50750 [6:01:41<79:49:55, 5.92s/it] {'loss': 0.3828, 'learning_rate': 3.998215385625016e-05, 'epoch': 2.15} 4%|▍ | 2185/50750 [6:01:41<79:49:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:44:25,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 22:44:25,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.95 | bwd_microstep: 3842.35 | bwd_inner_microstep: 3834.66 | bwd_allreduce_microstep: 7.64 | step_microstep: 22.30 [2024-11-13 22:44:25,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.95 | bwd: 3842.37 | bwd_inner: 3834.66 | bwd_allreduce: 7.66 | step: 22.30 4%|▍ | 2186/50750 [6:01:47<79:49:19, 5.92s/it] {'loss': 0.3677, 'learning_rate': 3.9982099907736074e-05, 'epoch': 2.15} 4%|▍ | 2186/50750 [6:01:47<79:49:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:44:31,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 22:44:31,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.83 | bwd_microstep: 3847.47 | bwd_inner_microstep: 3839.71 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.67 [2024-11-13 22:44:31,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.83 | bwd: 3847.49 | bwd_inner: 3839.71 | bwd_allreduce: 7.73 | step: 22.68 4%|▍ | 2187/50750 [6:01:53<79:52:02, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.998204587783892e-05, 'epoch': 2.15} 4%|▍ | 2187/50750 
[6:01:53<79:52:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:44:37,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 22:44:37,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.41 | bwd_microstep: 3846.73 | bwd_inner_microstep: 3839.14 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.76 [2024-11-13 22:44:37,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.40 | bwd: 3846.75 | bwd_inner: 3839.14 | bwd_allreduce: 7.56 | step: 22.76 4%|▍ | 2188/50750 [6:01:59<79:53:18, 5.92s/it] {'loss': 0.0062, 'learning_rate': 3.99819917665589e-05, 'epoch': 2.16} 4%|▍ | 2188/50750 [6:01:59<79:53:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:44:43,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:44:43,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.57 | bwd_microstep: 3838.79 | bwd_inner_microstep: 3831.30 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.99 [2024-11-13 22:44:43,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.57 | bwd: 3838.80 | bwd_inner: 3831.30 | bwd_allreduce: 7.46 | step: 20.99 4%|▍ | 2189/50750 [6:02:05<79:51:40, 5.92s/it] {'loss': 0.0078, 'learning_rate': 3.998193757389623e-05, 'epoch': 2.16} 4%|▍ | 2189/50750 [6:02:05<79:51:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 22:44:49,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:44:49,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.07 | bwd_microstep: 3850.72 | bwd_inner_microstep: 3843.02 | 
bwd_allreduce_microstep: 7.65 | step_microstep: 21.62 [2024-11-13 22:44:49,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.07 | bwd: 3850.73 | bwd_inner: 3843.02 | bwd_allreduce: 7.67 | step: 21.63 4%|▍ | 2190/50750 [6:02:11<79:52:18, 5.92s/it] {'loss': 0.0101, 'learning_rate': 3.998188329985115e-05, 'epoch': 2.16} 4%|▍ | 2190/50750 [6:02:11<79:52:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:44:55,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 22:44:55,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.45 | bwd_microstep: 3839.30 | bwd_inner_microstep: 3831.80 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.87 [2024-11-13 22:44:55,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.45 | bwd: 3839.31 | bwd_inner: 3831.80 | bwd_allreduce: 7.46 | step: 20.88 4%|▍ | 2191/50750 [6:02:17<79:51:16, 5.92s/it] {'loss': 0.0181, 'learning_rate': 3.998182894442386e-05, 'epoch': 2.16} 4%|▍ | 2191/50750 [6:02:17<79:51:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:45:01,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:45:01,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.84 | bwd_microstep: 3847.42 | bwd_inner_microstep: 3839.90 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.61 [2024-11-13 22:45:01,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.84 | bwd: 3847.43 | bwd_inner: 3839.90 | bwd_allreduce: 7.49 | step: 21.62 4%|▍ | 2192/50750 [6:02:22<79:52:09, 5.92s/it] {'loss': 0.016, 'learning_rate': 3.998177450761461e-05, 'epoch': 2.16} 4%|▍ | 2192/50750 [6:02:22<79:52:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-13 22:45:06,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:45:06,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.19 | bwd_microstep: 3840.30 | bwd_inner_microstep: 3832.79 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.13 [2024-11-13 22:45:06,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.19 | bwd: 3840.32 | bwd_inner: 3832.79 | bwd_allreduce: 7.49 | step: 21.13 4%|▍ | 2193/50750 [6:02:28<79:49:54, 5.92s/it] {'loss': 0.0291, 'learning_rate': 3.9981719989423586e-05, 'epoch': 2.16} 4%|▍ | 2193/50750 [6:02:28<79:49:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:45:12,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 22:45:12,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.20 | bwd_microstep: 3843.61 | bwd_inner_microstep: 3836.15 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.90 [2024-11-13 22:45:12,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.20 | bwd: 3843.62 | bwd_inner: 3836.15 | bwd_allreduce: 7.43 | step: 20.90 4%|▍ | 2194/50750 [6:02:34<79:48:43, 5.92s/it] {'loss': 0.0098, 'learning_rate': 3.998166538985103e-05, 'epoch': 2.16} 4%|▍ | 2194/50750 [6:02:34<79:48:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:45:18,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:45:18,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.10 | bwd_microstep: 3850.03 | bwd_inner_microstep: 3842.53 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.89 [2024-11-13 22:45:18,779] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.10 | bwd: 3850.04 | bwd_inner: 3842.53 | bwd_allreduce: 7.48 | step: 20.90 4%|▍ | 2195/50750 [6:02:40<79:50:16, 5.92s/it] {'loss': 0.0025, 'learning_rate': 3.9981610708897165e-05, 'epoch': 2.16} 4%|▍ | 2195/50750 [6:02:40<79:50:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:45:24,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 22:45:24,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.21 | bwd_microstep: 3845.73 | bwd_inner_microstep: 3838.21 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.14 [2024-11-13 22:45:24,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.20 | bwd: 3845.75 | bwd_inner: 3838.21 | bwd_allreduce: 7.50 | step: 21.14 4%|▍ | 2196/50750 [6:02:46<79:50:38, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.998155594656221e-05, 'epoch': 2.16} 4%|▍ | 2196/50750 [6:02:46<79:50:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:45:30,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.92 [2024-11-13 22:45:30,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.05 | bwd_microstep: 3843.97 | bwd_inner_microstep: 3836.09 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.25 [2024-11-13 22:45:30,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.05 | bwd: 3843.98 | bwd_inner: 3836.09 | bwd_allreduce: 7.84 | step: 22.25 4%|▍ | 2197/50750 [6:02:52<79:51:09, 5.92s/it] {'loss': 0.1529, 'learning_rate': 3.998150110284639e-05, 'epoch': 2.16} 4%|▍ | 2197/50750 [6:02:52<79:51:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:45:36,540] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:45:36,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.75 | bwd_microstep: 3839.56 | bwd_inner_microstep: 3832.08 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.17 [2024-11-13 22:45:36,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3839.57 | bwd_inner: 3832.08 | bwd_allreduce: 7.45 | step: 21.18 4%|▍ | 2198/50750 [6:02:58<79:50:24, 5.92s/it] {'loss': 0.3893, 'learning_rate': 3.998144617774992e-05, 'epoch': 2.17} 4%|▍ | 2198/50750 [6:02:58<79:50:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:45:42,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 22:45:42,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.77 | bwd_microstep: 3850.69 | bwd_inner_microstep: 3843.19 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.16 [2024-11-13 22:45:42,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.77 | bwd: 3850.70 | bwd_inner: 3843.19 | bwd_allreduce: 7.47 | step: 21.16 4%|▍ | 2199/50750 [6:03:04<79:51:12, 5.92s/it] {'loss': 0.3267, 'learning_rate': 3.998139117127304e-05, 'epoch': 2.17} 4%|▍ | 2199/50750 [6:03:04<79:51:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:45:48,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:45:48,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 3852.78 | bwd_inner_microstep: 3845.28 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.04 [2024-11-13 22:45:48,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3852.79 | bwd_inner: 3845.28 | 
bwd_allreduce: 7.47 | step: 21.04 4%|▍ | 2200/50750 [6:03:10<79:51:37, 5.92s/it] {'loss': 0.0049, 'learning_rate': 3.998133608341596e-05, 'epoch': 2.17} 4%|▍ | 2200/50750 [6:03:10<79:51:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:45:54,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:45:54,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.82 | bwd_microstep: 3847.46 | bwd_inner_microstep: 3839.85 | bwd_allreduce_microstep: 7.57 | step_microstep: 22.20 [2024-11-13 22:45:54,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.82 | bwd: 3847.48 | bwd_inner: 3839.85 | bwd_allreduce: 7.58 | step: 22.20 4%|▍ | 2201/50750 [6:03:16<79:51:32, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.998128091417891e-05, 'epoch': 2.17} 4%|▍ | 2201/50750 [6:03:16<79:51:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:46:00,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:46:00,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.23 | bwd_microstep: 3838.98 | bwd_inner_microstep: 3831.52 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.05 [2024-11-13 22:46:00,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3838.99 | bwd_inner: 3831.52 | bwd_allreduce: 7.44 | step: 21.06 4%|▍ | 2202/50750 [6:03:22<79:48:28, 5.92s/it] {'loss': 0.0091, 'learning_rate': 3.998122566356212e-05, 'epoch': 2.17} 4%|▍ | 2202/50750 [6:03:22<79:48:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:46:06,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 5.03 
[2024-11-13 22:46:06,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.93 | bwd_microstep: 3845.10 | bwd_inner_microstep: 3836.09 | bwd_allreduce_microstep: 8.97 | step_microstep: 23.05 [2024-11-13 22:46:06,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.92 | bwd: 3845.12 | bwd_inner: 3836.09 | bwd_allreduce: 8.99 | step: 23.05 4%|▍ | 2203/50750 [6:03:28<79:49:15, 5.92s/it] {'loss': 0.1166, 'learning_rate': 3.9981170331565796e-05, 'epoch': 2.17} 4%|▍ | 2203/50750 [6:03:28<79:49:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:46:12,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:46:12,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.99 | bwd_microstep: 3842.66 | bwd_inner_microstep: 3835.18 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.92 [2024-11-13 22:46:12,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.99 | bwd: 3842.67 | bwd_inner: 3835.18 | bwd_allreduce: 7.45 | step: 20.93 4%|▍ | 2204/50750 [6:03:34<79:48:29, 5.92s/it] {'loss': 0.4427, 'learning_rate': 3.998111491819019e-05, 'epoch': 2.17} 4%|▍ | 2204/50750 [6:03:34<79:48:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:46:17,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:46:17,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3844.99 | bwd_inner_microstep: 3837.52 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.01 [2024-11-13 22:46:17,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.64 | bwd: 3845.00 | bwd_inner: 3837.52 | bwd_allreduce: 7.45 | step: 21.01 4%|▍ | 2205/50750 [6:03:39<79:48:04, 5.92s/it] {'loss': 0.0042, 
'learning_rate': 3.99810594234355e-05, 'epoch': 2.17} 4%|▍ | 2205/50750 [6:03:39<79:48:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 22:46:23,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:46:23,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.19 | bwd_microstep: 3852.87 | bwd_inner_microstep: 3845.26 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.35 [2024-11-13 22:46:23,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.20 | bwd: 3852.88 | bwd_inner: 3845.26 | bwd_allreduce: 7.58 | step: 21.36 4%|▍ | 2206/50750 [6:03:45<79:51:52, 5.92s/it] {'loss': 0.0945, 'learning_rate': 3.9981003847301976e-05, 'epoch': 2.17} 4%|▍ | 2206/50750 [6:03:45<79:51:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:46:29,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:46:29,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.80 | bwd_microstep: 3848.18 | bwd_inner_microstep: 3840.18 | bwd_allreduce_microstep: 7.94 | step_microstep: 21.97 [2024-11-13 22:46:29,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.79 | bwd: 3848.20 | bwd_inner: 3840.18 | bwd_allreduce: 7.96 | step: 21.97 4%|▍ | 2207/50750 [6:03:51<79:53:16, 5.92s/it] {'loss': 0.3022, 'learning_rate': 3.9980948189789834e-05, 'epoch': 2.17} 4%|▍ | 2207/50750 [6:03:51<79:53:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:46:35,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:46:35,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.62 | 
bwd_microstep: 3853.06 | bwd_inner_microstep: 3845.58 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-13 22:46:35,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.60 | bwd: 3853.07 | bwd_inner: 3845.58 | bwd_allreduce: 7.45 | step: 20.89 4%|▍ | 2208/50750 [6:03:57<79:54:33, 5.93s/it] {'loss': 0.005, 'learning_rate': 3.9980892450899294e-05, 'epoch': 2.18} 4%|▍ | 2208/50750 [6:03:57<79:54:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:46:41,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:46:41,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.21 | bwd_microstep: 3848.13 | bwd_inner_microstep: 3840.67 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.31 [2024-11-13 22:46:41,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.21 | bwd: 3848.15 | bwd_inner: 3840.67 | bwd_allreduce: 7.44 | step: 21.32 4%|▍ | 2209/50750 [6:04:03<79:52:15, 5.92s/it] {'loss': 0.3538, 'learning_rate': 3.998083663063059e-05, 'epoch': 2.18} 4%|▍ | 2209/50750 [6:04:03<79:52:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:46:47,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:46:47,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 3842.80 | bwd_inner_microstep: 3835.33 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.96 [2024-11-13 22:46:47,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3842.82 | bwd_inner: 3835.33 | bwd_allreduce: 7.45 | step: 20.96 4%|▍ | 2210/50750 [6:04:09<79:49:29, 5.92s/it] {'loss': 0.2298, 'learning_rate': 3.998078072898396e-05, 'epoch': 2.18} 4%|▍ | 2210/50750 [6:04:09<79:49:29, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:46:53,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:46:53,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.36 | bwd_microstep: 3843.96 | bwd_inner_microstep: 3836.47 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.90 [2024-11-13 22:46:53,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.36 | bwd: 3843.97 | bwd_inner: 3836.47 | bwd_allreduce: 7.46 | step: 20.90 4%|▍ | 2211/50750 [6:04:15<79:47:32, 5.92s/it] {'loss': 0.1292, 'learning_rate': 3.998072474595961e-05, 'epoch': 2.18} 4%|▍ | 2211/50750 [6:04:15<79:47:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:46:59,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 22:46:59,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.51 | bwd_microstep: 3844.95 | bwd_inner_microstep: 3837.42 | bwd_allreduce_microstep: 7.49 | step_microstep: 22.84 [2024-11-13 22:46:59,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.51 | bwd: 3844.96 | bwd_inner: 3837.42 | bwd_allreduce: 7.50 | step: 22.85 4%|▍ | 2212/50750 [6:04:21<79:47:56, 5.92s/it] {'loss': 0.1465, 'learning_rate': 3.9980668681557784e-05, 'epoch': 2.18} 4%|▍ | 2212/50750 [6:04:21<79:47:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:47:05,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.94 [2024-11-13 22:47:05,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.74 | bwd_microstep: 3836.54 | bwd_inner_microstep: 3829.01 | bwd_allreduce_microstep: 7.49 | 
step_microstep: 21.19 [2024-11-13 22:47:05,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.72 | bwd: 3836.55 | bwd_inner: 3829.01 | bwd_allreduce: 7.51 | step: 21.20 4%|▍ | 2213/50750 [6:04:27<79:46:35, 5.92s/it] {'loss': 0.0668, 'learning_rate': 3.9980612535778704e-05, 'epoch': 2.18} 4%|▍ | 2213/50750 [6:04:27<79:46:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:47:11,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:47:11,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.18 | bwd_microstep: 3848.22 | bwd_inner_microstep: 3840.74 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.12 [2024-11-13 22:47:11,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.18 | bwd: 3848.23 | bwd_inner: 3840.74 | bwd_allreduce: 7.46 | step: 21.12 4%|▍ | 2214/50750 [6:04:33<79:47:35, 5.92s/it] {'loss': 0.0327, 'learning_rate': 3.99805563086226e-05, 'epoch': 2.18} 4%|▍ | 2214/50750 [6:04:33<79:47:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:47:17,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:47:17,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.87 | bwd_microstep: 3845.95 | bwd_inner_microstep: 3838.15 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.99 [2024-11-13 22:47:17,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3845.97 | bwd_inner: 3838.15 | bwd_allreduce: 7.78 | step: 21.99 4%|▍ | 2215/50750 [6:04:39<79:47:48, 5.92s/it] {'loss': 0.0056, 'learning_rate': 3.99805000000897e-05, 'epoch': 2.18} 4%|▍ | 2215/50750 [6:04:39<79:47:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 
22:47:23,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-13 22:47:23,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3842.12 | bwd_inner_microstep: 3834.40 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.31 [2024-11-13 22:47:23,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.27 | bwd: 3842.14 | bwd_inner: 3834.40 | bwd_allreduce: 7.69 | step: 22.32 4%|▍ | 2216/50750 [6:04:45<79:49:02, 5.92s/it] {'loss': 0.0353, 'learning_rate': 3.998044361018023e-05, 'epoch': 2.18} 4%|▍ | 2216/50750 [6:04:45<79:49:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:47:29,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 22:47:29,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.06 | bwd_microstep: 3844.91 | bwd_inner_microstep: 3837.28 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.29 [2024-11-13 22:47:29,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.04 | bwd: 3844.93 | bwd_inner: 3837.28 | bwd_allreduce: 7.60 | step: 21.30 4%|▍ | 2217/50750 [6:04:50<79:48:09, 5.92s/it] {'loss': 0.0658, 'learning_rate': 3.9980387138894435e-05, 'epoch': 2.18} 4%|▍ | 2217/50750 [6:04:50<79:48:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:47:34,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.61 | optimizer_step: 4.93 [2024-11-13 22:47:34,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.11 | bwd_microstep: 3847.54 | bwd_inner_microstep: 3838.91 | bwd_allreduce_microstep: 8.54 | step_microstep: 23.19 [2024-11-13 22:47:34,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2026.11 | bwd: 3847.56 | bwd_inner: 3838.91 | bwd_allreduce: 8.58 | step: 23.20 4%|▍ | 2218/50750 [6:04:56<79:49:41, 5.92s/it] {'loss': 0.0102, 'learning_rate': 3.998033058623253e-05, 'epoch': 2.19} 4%|▍ | 2218/50750 [6:04:56<79:49:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:47:40,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:47:40,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.97 | bwd_microstep: 3842.32 | bwd_inner_microstep: 3834.74 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.27 [2024-11-13 22:47:40,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.96 | bwd: 3842.33 | bwd_inner: 3834.74 | bwd_allreduce: 7.55 | step: 21.27 4%|▍ | 2219/50750 [6:05:02<79:50:12, 5.92s/it] {'loss': 0.0145, 'learning_rate': 3.9980273952194744e-05, 'epoch': 2.19} 4%|▍ | 2219/50750 [6:05:02<79:50:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:47:46,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:47:46,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.08 | bwd_microstep: 3849.30 | bwd_inner_microstep: 3841.65 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.55 [2024-11-13 22:47:46,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.07 | bwd: 3849.32 | bwd_inner: 3841.65 | bwd_allreduce: 7.63 | step: 21.56 4%|▍ | 2220/50750 [6:05:08<79:53:06, 5.93s/it] {'loss': 0.0402, 'learning_rate': 3.998021723678132e-05, 'epoch': 2.19} 4%|▍ | 2220/50750 [6:05:08<79:53:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 22:47:52,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:47:52,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.95 | bwd_microstep: 3848.30 | bwd_inner_microstep: 3840.70 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.47 [2024-11-13 22:47:52,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.95 | bwd: 3848.32 | bwd_inner: 3840.70 | bwd_allreduce: 7.57 | step: 21.47 4%|▍ | 2221/50750 [6:05:14<79:53:45, 5.93s/it] {'loss': 0.0326, 'learning_rate': 3.9980160439992475e-05, 'epoch': 2.19} 4%|▍ | 2221/50750 [6:05:14<79:53:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:47:58,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:47:58,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.77 | bwd_microstep: 3850.28 | bwd_inner_microstep: 3842.71 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.29 [2024-11-13 22:47:58,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.76 | bwd: 3850.30 | bwd_inner: 3842.71 | bwd_allreduce: 7.54 | step: 21.31 4%|▍ | 2222/50750 [6:05:20<79:54:35, 5.93s/it] {'loss': 0.0138, 'learning_rate': 3.998010356182845e-05, 'epoch': 2.19} 4%|▍ | 2222/50750 [6:05:20<79:54:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:48:04,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:48:04,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.79 | bwd_microstep: 3847.37 | bwd_inner_microstep: 3839.85 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.34 [2024-11-13 22:48:04,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.79 | bwd: 3847.39 | bwd_inner: 3839.85 | bwd_allreduce: 7.49 | step: 21.35 4%|▍ | 
2223/50750 [6:05:26<79:55:16, 5.93s/it] {'loss': 0.0388, 'learning_rate': 3.998004660228947e-05, 'epoch': 2.19} 4%|▍ | 2223/50750 [6:05:26<79:55:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:48:10,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:48:10,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.78 | bwd_microstep: 3844.50 | bwd_inner_microstep: 3836.76 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.42 [2024-11-13 22:48:10,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.77 | bwd: 3844.51 | bwd_inner: 3836.76 | bwd_allreduce: 7.71 | step: 21.43 4%|▍ | 2224/50750 [6:05:32<79:53:20, 5.93s/it] {'loss': 0.0041, 'learning_rate': 3.997998956137578e-05, 'epoch': 2.19} 4%|▍ | 2224/50750 [6:05:32<79:53:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:48:16,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 22:48:16,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.31 | bwd_microstep: 3849.07 | bwd_inner_microstep: 3841.35 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.09 [2024-11-13 22:48:16,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.30 | bwd: 3849.08 | bwd_inner: 3841.35 | bwd_allreduce: 7.69 | step: 22.10 4%|▍ | 2225/50750 [6:05:38<79:55:30, 5.93s/it] {'loss': 0.4036, 'learning_rate': 3.9979932439087596e-05, 'epoch': 2.19} 4%|▍ | 2225/50750 [6:05:38<79:55:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:48:22,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 22:48:22,379] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.04 | bwd_microstep: 3843.61 | bwd_inner_microstep: 3835.70 | bwd_allreduce_microstep: 7.86 | step_microstep: 21.95 [2024-11-13 22:48:22,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.02 | bwd: 3843.62 | bwd_inner: 3835.70 | bwd_allreduce: 7.88 | step: 21.95 4%|▍ | 2226/50750 [6:05:44<79:53:44, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.997987523542516e-05, 'epoch': 2.19} 4%|▍ | 2226/50750 [6:05:44<79:53:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:48:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:48:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.69 | bwd_microstep: 3841.90 | bwd_inner_microstep: 3834.28 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.71 [2024-11-13 22:48:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.67 | bwd: 3841.91 | bwd_inner: 3834.28 | bwd_allreduce: 7.59 | step: 21.72 4%|▍ | 2227/50750 [6:05:50<79:52:46, 5.93s/it] {'loss': 0.8606, 'learning_rate': 3.99798179503887e-05, 'epoch': 2.19} 4%|▍ | 2227/50750 [6:05:50<79:52:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:48:34,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 22:48:34,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3862.79 | bwd_inner_microstep: 3855.12 | bwd_allreduce_microstep: 7.62 | step_microstep: 22.68 [2024-11-13 22:48:34,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.12 | bwd: 3862.81 | bwd_inner: 3855.12 | bwd_allreduce: 7.64 | step: 22.68 4%|▍ | 2228/50750 [6:05:56<79:55:36, 5.93s/it] {'loss': 0.0004, 'learning_rate': 
3.997976058397845e-05, 'epoch': 2.2} 4%|▍ | 2228/50750 [6:05:56<79:55:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:48:40,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.72 | optimizer_step: 4.93 [2024-11-13 22:48:40,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.93 | bwd_microstep: 3841.38 | bwd_inner_microstep: 3833.87 | bwd_allreduce_microstep: 7.47 | step_microstep: 23.75 [2024-11-13 22:48:40,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.92 | bwd: 3841.40 | bwd_inner: 3833.87 | bwd_allreduce: 7.49 | step: 23.76 4%|▍ | 2229/50750 [6:06:02<79:52:32, 5.93s/it] {'loss': 0.0059, 'learning_rate': 3.997970313619466e-05, 'epoch': 2.2} 4%|▍ | 2229/50750 [6:06:02<79:52:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:48:46,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 22:48:46,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.62 | bwd_microstep: 3843.85 | bwd_inner_microstep: 3836.32 | bwd_allreduce_microstep: 7.49 | step_microstep: 22.41 [2024-11-13 22:48:46,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3843.86 | bwd_inner: 3836.32 | bwd_allreduce: 7.50 | step: 22.41 4%|▍ | 2230/50750 [6:06:08<79:50:49, 5.92s/it] {'loss': 0.0219, 'learning_rate': 3.9979645607037535e-05, 'epoch': 2.2} 4%|▍ | 2230/50750 [6:06:08<79:50:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:48:52,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 22:48:52,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.91 | bwd_microstep: 3866.31 
| bwd_inner_microstep: 3858.79 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.17 [2024-11-13 22:48:52,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.91 | bwd: 3866.33 | bwd_inner: 3858.79 | bwd_allreduce: 7.49 | step: 21.18 4%|▍ | 2231/50750 [6:06:13<79:54:07, 5.93s/it] {'loss': 0.0075, 'learning_rate': 3.997958799650733e-05, 'epoch': 2.2} 4%|▍ | 2231/50750 [6:06:13<79:54:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:48:57,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:48:57,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.89 | bwd_microstep: 3842.47 | bwd_inner_microstep: 3834.70 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.71 [2024-11-13 22:48:57,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.89 | bwd: 3842.48 | bwd_inner: 3834.70 | bwd_allreduce: 7.74 | step: 21.71 4%|▍ | 2232/50750 [6:06:19<79:50:29, 5.92s/it] {'loss': 0.0089, 'learning_rate': 3.997953030460427e-05, 'epoch': 2.2} 4%|▍ | 2232/50750 [6:06:19<79:50:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:49:03,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 22:49:03,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.29 | bwd_microstep: 3849.56 | bwd_inner_microstep: 3841.62 | bwd_allreduce_microstep: 7.89 | step_microstep: 21.82 [2024-11-13 22:49:03,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.27 | bwd: 3849.57 | bwd_inner: 3841.62 | bwd_allreduce: 7.91 | step: 21.82 4%|▍ | 2233/50750 [6:06:25<79:51:15, 5.93s/it] {'loss': 0.5431, 'learning_rate': 3.9979472531328605e-05, 'epoch': 2.2} 4%|▍ | 2233/50750 [6:06:25<79:51:15, 5.93s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:49:09,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 22:49:09,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3844.85 | bwd_inner_microstep: 3837.39 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.84 [2024-11-13 22:49:09,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.84 | bwd: 3844.86 | bwd_inner: 3837.39 | bwd_allreduce: 7.43 | step: 20.84 4%|▍ | 2234/50750 [6:06:31<79:49:22, 5.92s/it] {'loss': 0.0083, 'learning_rate': 3.997941467668055e-05, 'epoch': 2.2} 4%|▍ | 2234/50750 [6:06:31<79:49:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:49:15,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.36 | optimizer_step: 4.93 [2024-11-13 22:49:15,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.76 | bwd_microstep: 3851.18 | bwd_inner_microstep: 3843.50 | bwd_allreduce_microstep: 7.64 | step_microstep: 23.43 [2024-11-13 22:49:15,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.76 | bwd: 3851.20 | bwd_inner: 3843.50 | bwd_allreduce: 7.66 | step: 23.44 4%|▍ | 2235/50750 [6:06:37<79:53:18, 5.93s/it] {'loss': 0.0023, 'learning_rate': 3.9979356740660346e-05, 'epoch': 2.2} 4%|▍ | 2235/50750 [6:06:37<79:53:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:49:21,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 22:49:21,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.21 | bwd_microstep: 3849.71 | bwd_inner_microstep: 3842.12 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.37 [2024-11-13 
22:49:21,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.20 | bwd: 3849.72 | bwd_inner: 3842.12 | bwd_allreduce: 7.56 | step: 21.38 4%|▍ | 2236/50750 [6:06:43<79:53:01, 5.93s/it] {'loss': 0.0017, 'learning_rate': 3.997929872326823e-05, 'epoch': 2.2} 4%|▍ | 2236/50750 [6:06:43<79:53:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:49:27,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:49:27,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.00 | bwd_microstep: 3842.58 | bwd_inner_microstep: 3835.12 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.94 [2024-11-13 22:49:27,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.99 | bwd: 3842.59 | bwd_inner: 3835.12 | bwd_allreduce: 7.44 | step: 20.95 4%|▍ | 2237/50750 [6:06:49<79:49:15, 5.92s/it] {'loss': 0.0059, 'learning_rate': 3.997924062450446e-05, 'epoch': 2.2} 4%|▍ | 2237/50750 [6:06:49<79:49:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:49:33,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:49:33,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.60 | bwd_microstep: 3844.33 | bwd_inner_microstep: 3836.85 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.74 [2024-11-13 22:49:33,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.59 | bwd: 3844.34 | bwd_inner: 3836.85 | bwd_allreduce: 7.45 | step: 20.75 4%|▍ | 2238/50750 [6:06:55<79:47:42, 5.92s/it] {'loss': 0.0077, 'learning_rate': 3.9979182444369235e-05, 'epoch': 2.2} 4%|▍ | 2238/50750 [6:06:55<79:47:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:49:39,407] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:49:39,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.54 | bwd_microstep: 3855.87 | bwd_inner_microstep: 3847.30 | bwd_allreduce_microstep: 8.50 | step_microstep: 22.45 [2024-11-13 22:49:39,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.54 | bwd: 3855.89 | bwd_inner: 3847.30 | bwd_allreduce: 8.53 | step: 22.44 4%|▍ | 2239/50750 [6:07:01<79:50:07, 5.92s/it] {'loss': 0.053, 'learning_rate': 3.9979124182862816e-05, 'epoch': 2.21} 4%|▍ | 2239/50750 [6:07:01<79:50:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:49:45,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:49:45,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.92 | bwd_microstep: 3852.67 | bwd_inner_microstep: 3845.09 | bwd_allreduce_microstep: 7.53 | step_microstep: 20.77 [2024-11-13 22:49:45,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.92 | bwd: 3852.68 | bwd_inner: 3845.09 | bwd_allreduce: 7.55 | step: 20.77 4%|▍ | 2240/50750 [6:07:07<79:49:35, 5.92s/it] {'loss': 0.0021, 'learning_rate': 3.997906583998544e-05, 'epoch': 2.21} 4%|▍ | 2240/50750 [6:07:07<79:49:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:49:51,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 5.03 [2024-11-13 22:49:51,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.65 | bwd_microstep: 3850.97 | bwd_inner_microstep: 3843.28 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.49 [2024-11-13 22:49:51,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.65 | bwd: 3850.99 
| bwd_inner: 3843.28 | bwd_allreduce: 7.66 | step: 21.49 4%|▍ | 2241/50750 [6:07:13<79:50:11, 5.92s/it] {'loss': 0.0729, 'learning_rate': 3.997900741573733e-05, 'epoch': 2.21} 4%|▍ | 2241/50750 [6:07:13<79:50:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:49:57,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:49:57,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.34 | bwd_microstep: 3850.84 | bwd_inner_microstep: 3843.34 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-13 22:49:57,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.34 | bwd: 3850.85 | bwd_inner: 3843.34 | bwd_allreduce: 7.48 | step: 21.09 4%|▍ | 2242/50750 [6:07:19<79:48:39, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.997894891011874e-05, 'epoch': 2.21} 4%|▍ | 2242/50750 [6:07:19<79:48:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:50:03,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:50:03,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.93 | bwd_microstep: 3848.98 | bwd_inner_microstep: 3841.46 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.01 [2024-11-13 22:50:03,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.93 | bwd: 3848.99 | bwd_inner: 3841.46 | bwd_allreduce: 7.49 | step: 21.01 4%|▍ | 2243/50750 [6:07:25<79:47:14, 5.92s/it] {'loss': 0.5342, 'learning_rate': 3.99788903231299e-05, 'epoch': 2.21} 4%|▍ | 2243/50750 [6:07:25<79:47:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:50:08,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | 
optimizer_step: 4.93 [2024-11-13 22:50:08,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.12 | bwd_microstep: 3836.95 | bwd_inner_microstep: 3829.43 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.08 [2024-11-13 22:50:08,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.12 | bwd: 3836.96 | bwd_inner: 3829.43 | bwd_allreduce: 7.49 | step: 21.08 4%|▍ | 2244/50750 [6:07:30<79:43:07, 5.92s/it] {'loss': 0.0188, 'learning_rate': 3.9978831654771046e-05, 'epoch': 2.21} 4%|▍ | 2244/50750 [6:07:30<79:43:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:50:14,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 22:50:14,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.79 | bwd_microstep: 3841.42 | bwd_inner_microstep: 3833.92 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.69 [2024-11-13 22:50:14,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.79 | bwd: 3841.43 | bwd_inner: 3833.92 | bwd_allreduce: 7.47 | step: 21.70 4%|▍ | 2245/50750 [6:07:36<79:41:29, 5.91s/it] {'loss': 0.0022, 'learning_rate': 3.997877290504242e-05, 'epoch': 2.21} 4%|▍ | 2245/50750 [6:07:36<79:41:29, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:50:20,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:50:20,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.94 | bwd_microstep: 3840.96 | bwd_inner_microstep: 3833.37 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.30 [2024-11-13 22:50:20,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.93 | bwd: 3840.98 | bwd_inner: 3833.37 | bwd_allreduce: 7.56 | step: 21.30 4%|▍ | 2246/50750 [6:07:42<79:42:25, 
5.92s/it] {'loss': 0.0101, 'learning_rate': 3.997871407394427e-05, 'epoch': 2.21} 4%|▍ | 2246/50750 [6:07:42<79:42:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:50:26,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:50:26,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.90 | bwd_microstep: 3839.98 | bwd_inner_microstep: 3832.24 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.69 [2024-11-13 22:50:26,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.89 | bwd: 3839.99 | bwd_inner: 3832.24 | bwd_allreduce: 7.71 | step: 21.70 4%|▍ | 2247/50750 [6:07:48<79:43:26, 5.92s/it] {'loss': 0.008, 'learning_rate': 3.9978655161476824e-05, 'epoch': 2.21} 4%|▍ | 2247/50750 [6:07:48<79:43:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:50:32,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 22:50:32,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.07 | bwd_microstep: 3838.94 | bwd_inner_microstep: 3831.35 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.27 [2024-11-13 22:50:32,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.06 | bwd: 3838.95 | bwd_inner: 3831.35 | bwd_allreduce: 7.56 | step: 21.27 4%|▍ | 2248/50750 [6:07:54<79:43:39, 5.92s/it] {'loss': 0.0129, 'learning_rate': 3.997859616764033e-05, 'epoch': 2.21} 4%|▍ | 2248/50750 [6:07:54<79:43:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 22:50:38,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.47 | optimizer_step: 4.93 [2024-11-13 22:50:38,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2025.24 | bwd_microstep: 3851.22 | bwd_inner_microstep: 3843.64 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.74 [2024-11-13 22:50:38,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.23 | bwd: 3851.23 | bwd_inner: 3843.64 | bwd_allreduce: 7.55 | step: 22.75 4%|▍ | 2249/50750 [6:08:00<79:45:54, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.997853709243502e-05, 'epoch': 2.22} 4%|▍ | 2249/50750 [6:08:00<79:45:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:50:44,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.87 | optimizer_step: 4.93 [2024-11-13 22:50:44,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.08 | bwd_microstep: 3842.54 | bwd_inner_microstep: 3834.96 | bwd_allreduce_microstep: 7.53 | step_microstep: 23.76 [2024-11-13 22:50:44,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.08 | bwd: 3842.55 | bwd_inner: 3834.96 | bwd_allreduce: 7.55 | step: 23.78 4%|▍ | 2250/50750 [6:08:06<79:45:59, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.997847793586114e-05, 'epoch': 2.22} 4%|▍ | 2250/50750 [6:08:06<79:45:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:50:50,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:50:50,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.61 | bwd_microstep: 3839.57 | bwd_inner_microstep: 3832.00 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.35 [2024-11-13 22:50:50,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.61 | bwd: 3839.58 | bwd_inner: 3832.00 | bwd_allreduce: 7.54 | step: 21.35 4%|▍ | 2251/50750 [6:08:12<79:43:02, 5.92s/it] {'loss': 0.7828, 'learning_rate': 3.9978418697918935e-05, 'epoch': 2.22} 4%|▍ | 2251/50750 
[6:08:12<79:43:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:50:56,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:50:56,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.05 | bwd_microstep: 3838.71 | bwd_inner_microstep: 3831.20 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.46 [2024-11-13 22:50:56,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.05 | bwd: 3838.72 | bwd_inner: 3831.20 | bwd_allreduce: 7.49 | step: 21.46 4%|▍ | 2252/50750 [6:08:18<79:40:20, 5.91s/it] {'loss': 0.3545, 'learning_rate': 3.997835937860863e-05, 'epoch': 2.22} 4%|▍ | 2252/50750 [6:08:18<79:40:20, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:51:02,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:51:02,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.93 | bwd_microstep: 3838.60 | bwd_inner_microstep: 3831.15 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.34 [2024-11-13 22:51:02,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.93 | bwd: 3838.61 | bwd_inner: 3831.15 | bwd_allreduce: 7.43 | step: 21.34 4%|▍ | 2253/50750 [6:08:24<79:38:59, 5.91s/it] {'loss': 0.0115, 'learning_rate': 3.9978299977930485e-05, 'epoch': 2.22} 4%|▍ | 2253/50750 [6:08:24<79:38:59, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:51:08,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:51:08,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.32 | bwd_microstep: 3853.47 | bwd_inner_microstep: 3846.00 | 
bwd_allreduce_microstep: 7.43 | step_microstep: 21.20 [2024-11-13 22:51:08,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.32 | bwd: 3853.48 | bwd_inner: 3846.00 | bwd_allreduce: 7.44 | step: 21.20 4%|▍ | 2254/50750 [6:08:30<79:41:33, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.997824049588473e-05, 'epoch': 2.22} 4%|▍ | 2254/50750 [6:08:30<79:41:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:51:14,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:51:14,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.55 | bwd_microstep: 3837.56 | bwd_inner_microstep: 3830.08 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.90 [2024-11-13 22:51:14,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.55 | bwd: 3837.58 | bwd_inner: 3830.08 | bwd_allreduce: 7.44 | step: 20.91 4%|▍ | 2255/50750 [6:08:36<79:39:40, 5.91s/it] {'loss': 0.673, 'learning_rate': 3.997818093247162e-05, 'epoch': 2.22} 4%|▍ | 2255/50750 [6:08:36<79:39:40, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:51:19,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:51:19,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.51 | bwd_microstep: 3845.60 | bwd_inner_microstep: 3838.00 | bwd_allreduce_microstep: 7.56 | step_microstep: 22.09 [2024-11-13 22:51:19,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.51 | bwd: 3845.62 | bwd_inner: 3838.00 | bwd_allreduce: 7.58 | step: 22.09 4%|▍ | 2256/50750 [6:08:41<79:40:49, 5.92s/it] {'loss': 0.0101, 'learning_rate': 3.997812128769137e-05, 'epoch': 2.22} 4%|▍ | 2256/50750 [6:08:41<79:40:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2199 [2024-11-13 22:51:25,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:51:25,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.16 | bwd_microstep: 3849.65 | bwd_inner_microstep: 3842.10 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.11 [2024-11-13 22:51:25,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 3849.66 | bwd_inner: 3842.10 | bwd_allreduce: 7.52 | step: 21.12 4%|▍ | 2257/50750 [6:08:47<79:42:49, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.997806156154426e-05, 'epoch': 2.22} 4%|▍ | 2257/50750 [6:08:47<79:42:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:51:31,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:51:31,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3847.20 | bwd_inner_microstep: 3839.49 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.81 [2024-11-13 22:51:31,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.22 | bwd: 3847.21 | bwd_inner: 3839.49 | bwd_allreduce: 7.67 | step: 22.80 4%|▍ | 2258/50750 [6:08:53<79:43:40, 5.92s/it] {'loss': 0.662, 'learning_rate': 3.997800175403051e-05, 'epoch': 2.22} 4%|▍ | 2258/50750 [6:08:53<79:43:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:51:37,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:51:37,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.41 | bwd_microstep: 3848.31 | bwd_inner_microstep: 3840.74 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.19 [2024-11-13 22:51:37,758] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.41 | bwd: 3848.32 | bwd_inner: 3840.74 | bwd_allreduce: 7.54 | step: 21.19 4%|▍ | 2259/50750 [6:08:59<79:44:00, 5.92s/it] {'loss': 0.0025, 'learning_rate': 3.997794186515037e-05, 'epoch': 2.23} 4%|▍ | 2259/50750 [6:08:59<79:44:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:51:43,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:51:43,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.41 | bwd_microstep: 3847.40 | bwd_inner_microstep: 3839.87 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-13 22:51:43,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.41 | bwd: 3847.41 | bwd_inner: 3839.87 | bwd_allreduce: 7.50 | step: 21.11 4%|▍ | 2260/50750 [6:09:05<79:44:08, 5.92s/it] {'loss': 0.0025, 'learning_rate': 3.997788189490409e-05, 'epoch': 2.23} 4%|▍ | 2260/50750 [6:09:05<79:44:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:51:49,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:51:49,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.76 | bwd_microstep: 3853.28 | bwd_inner_microstep: 3845.73 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.65 [2024-11-13 22:51:49,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.76 | bwd: 3853.30 | bwd_inner: 3845.73 | bwd_allreduce: 7.52 | step: 21.65 4%|▍ | 2261/50750 [6:09:11<79:45:34, 5.92s/it] {'loss': 0.0117, 'learning_rate': 3.99778218432919e-05, 'epoch': 2.23} 4%|▍ | 2261/50750 [6:09:11<79:45:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:51:55,520] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:51:55,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3844.21 | bwd_inner_microstep: 3836.69 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.23 [2024-11-13 22:51:55,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.64 | bwd: 3844.23 | bwd_inner: 3836.69 | bwd_allreduce: 7.50 | step: 21.23 4%|▍ | 2262/50750 [6:09:17<79:44:15, 5.92s/it] {'loss': 0.324, 'learning_rate': 3.997776171031405e-05, 'epoch': 2.23} 4%|▍ | 2262/50750 [6:09:17<79:44:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:52:01,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 22:52:01,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.73 | bwd_microstep: 3838.53 | bwd_inner_microstep: 3830.97 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.35 [2024-11-13 22:52:01,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.73 | bwd: 3838.55 | bwd_inner: 3830.97 | bwd_allreduce: 7.53 | step: 21.35 4%|▍ | 2263/50750 [6:09:23<79:42:09, 5.92s/it] {'loss': 0.0422, 'learning_rate': 3.997770149597079e-05, 'epoch': 2.23} 4%|▍ | 2263/50750 [6:09:23<79:42:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:52:07,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:52:07,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.05 | bwd_microstep: 3851.24 | bwd_inner_microstep: 3841.59 | bwd_allreduce_microstep: 9.60 | step_microstep: 21.82 [2024-11-13 22:52:07,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3851.26 | bwd_inner: 3841.60 | 
bwd_allreduce: 9.62 | step: 21.83 4%|▍ | 2264/50750 [6:09:29<79:43:47, 5.92s/it] {'loss': 0.0037, 'learning_rate': 3.997764120026237e-05, 'epoch': 2.23} 4%|▍ | 2264/50750 [6:09:29<79:43:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:52:13,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-13 22:52:13,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.84 | bwd_microstep: 3858.74 | bwd_inner_microstep: 3850.43 | bwd_allreduce_microstep: 8.25 | step_microstep: 27.08 [2024-11-13 22:52:13,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.81 | bwd: 3858.76 | bwd_inner: 3850.43 | bwd_allreduce: 8.27 | step: 27.08 4%|▍ | 2265/50750 [6:09:35<79:49:25, 5.93s/it] {'loss': 0.0143, 'learning_rate': 3.997758082318902e-05, 'epoch': 2.23} 4%|▍ | 2265/50750 [6:09:35<79:49:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:52:19,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:52:19,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.01 | bwd_microstep: 3844.58 | bwd_inner_microstep: 3837.03 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.22 [2024-11-13 22:52:19,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.99 | bwd: 3844.59 | bwd_inner: 3837.03 | bwd_allreduce: 7.52 | step: 21.22 4%|▍ | 2266/50750 [6:09:41<79:47:45, 5.92s/it] {'loss': 0.0017, 'learning_rate': 3.9977520364750984e-05, 'epoch': 2.23} 4%|▍ | 2266/50750 [6:09:41<79:47:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:52:25,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 5.02 
[2024-11-13 22:52:25,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.62 | bwd_microstep: 3844.88 | bwd_inner_microstep: 3837.03 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.43 [2024-11-13 22:52:25,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.62 | bwd: 3844.90 | bwd_inner: 3837.03 | bwd_allreduce: 7.81 | step: 22.43 4%|▍ | 2267/50750 [6:09:47<79:46:25, 5.92s/it] {'loss': 0.4755, 'learning_rate': 3.997745982494853e-05, 'epoch': 2.23} 4%|▍ | 2267/50750 [6:09:47<79:46:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:52:31,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 22:52:31,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.73 | bwd_microstep: 3850.18 | bwd_inner_microstep: 3842.71 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.27 [2024-11-13 22:52:31,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.71 | bwd: 3850.20 | bwd_inner: 3842.71 | bwd_allreduce: 7.45 | step: 21.28 4%|▍ | 2268/50750 [6:09:53<79:46:02, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.997739920378188e-05, 'epoch': 2.23} 4%|▍ | 2268/50750 [6:09:53<79:46:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:52:36,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 22:52:36,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.77 | bwd_microstep: 3846.74 | bwd_inner_microstep: 3839.03 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.20 [2024-11-13 22:52:36,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.77 | bwd: 3846.76 | bwd_inner: 3839.03 | bwd_allreduce: 7.69 | step: 21.20 4%|▍ | 2269/50750 [6:09:58<79:45:07, 5.92s/it] {'loss': 0.0761, 
'learning_rate': 3.997733850125131e-05, 'epoch': 2.24} 4%|▍ | 2269/50750 [6:09:58<79:45:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:52:42,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:52:42,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.85 | bwd_microstep: 3847.73 | bwd_inner_microstep: 3840.19 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.05 [2024-11-13 22:52:42,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.85 | bwd: 3847.74 | bwd_inner: 3840.19 | bwd_allreduce: 7.50 | step: 21.06 4%|▍ | 2270/50750 [6:10:04<79:44:02, 5.92s/it] {'loss': 0.0045, 'learning_rate': 3.997727771735704e-05, 'epoch': 2.24} 4%|▍ | 2270/50750 [6:10:04<79:44:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:52:48,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:52:48,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.89 | bwd_microstep: 3847.83 | bwd_inner_microstep: 3839.99 | bwd_allreduce_microstep: 7.78 | step_microstep: 24.89 [2024-11-13 22:52:48,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.89 | bwd: 3847.86 | bwd_inner: 3839.99 | bwd_allreduce: 7.80 | step: 24.89 4%|▍ | 2271/50750 [6:10:10<79:45:12, 5.92s/it] {'loss': 0.2398, 'learning_rate': 3.997721685209933e-05, 'epoch': 2.24} 4%|▍ | 2271/50750 [6:10:10<79:45:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:52:54,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:52:54,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.45 | 
bwd_microstep: 3841.99 | bwd_inner_microstep: 3834.49 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.00 [2024-11-13 22:52:54,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.45 | bwd: 3842.00 | bwd_inner: 3834.49 | bwd_allreduce: 7.48 | step: 21.01 4%|▍ | 2272/50750 [6:10:16<79:41:53, 5.92s/it] {'loss': 0.0142, 'learning_rate': 3.997715590547842e-05, 'epoch': 2.24} 4%|▍ | 2272/50750 [6:10:16<79:41:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:53:00,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:53:00,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3850.16 | bwd_inner_microstep: 3842.27 | bwd_allreduce_microstep: 7.81 | step_microstep: 24.19 [2024-11-13 22:53:00,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3850.18 | bwd_inner: 3842.27 | bwd_allreduce: 7.84 | step: 24.18 4%|▍ | 2273/50750 [6:10:22<79:43:03, 5.92s/it] {'loss': 0.0019, 'learning_rate': 3.9977094877494574e-05, 'epoch': 2.24} 4%|▍ | 2273/50750 [6:10:22<79:43:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:53:06,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.93 [2024-11-13 22:53:06,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.81 | bwd_microstep: 3854.62 | bwd_inner_microstep: 3846.99 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.63 [2024-11-13 22:53:06,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.81 | bwd: 3854.63 | bwd_inner: 3846.99 | bwd_allreduce: 7.60 | step: 21.63 4%|▍ | 2274/50750 [6:10:28<79:45:16, 5.92s/it] {'loss': 0.1429, 'learning_rate': 3.997703376814803e-05, 'epoch': 2.24} 4%|▍ | 2274/50750 [6:10:28<79:45:16, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:53:12,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 22:53:12,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.54 | bwd_microstep: 3840.33 | bwd_inner_microstep: 3832.59 | bwd_allreduce_microstep: 7.68 | step_microstep: 25.04 [2024-11-13 22:53:12,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.54 | bwd: 3840.35 | bwd_inner: 3832.59 | bwd_allreduce: 7.70 | step: 25.04 4%|▍ | 2275/50750 [6:10:34<79:43:53, 5.92s/it] {'loss': 0.1097, 'learning_rate': 3.9976972577439033e-05, 'epoch': 2.24} 4%|▍ | 2275/50750 [6:10:34<79:43:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:53:18,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:53:18,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.96 | bwd_microstep: 3838.99 | bwd_inner_microstep: 3831.47 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.09 [2024-11-13 22:53:18,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.96 | bwd: 3839.00 | bwd_inner: 3831.47 | bwd_allreduce: 7.49 | step: 21.09 4%|▍ | 2276/50750 [6:10:40<79:42:01, 5.92s/it] {'loss': 0.0625, 'learning_rate': 3.997691130536784e-05, 'epoch': 2.24} 4%|▍ | 2276/50750 [6:10:40<79:42:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:53:24,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 22:53:24,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.28 | bwd_microstep: 3846.15 | bwd_inner_microstep: 3838.45 | bwd_allreduce_microstep: 7.64 | 
step_microstep: 22.14 [2024-11-13 22:53:24,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.28 | bwd: 3846.16 | bwd_inner: 3838.45 | bwd_allreduce: 7.66 | step: 22.15 4%|▍ | 2277/50750 [6:10:46<79:42:14, 5.92s/it] {'loss': 0.0036, 'learning_rate': 3.997684995193469e-05, 'epoch': 2.24} 4%|▍ | 2277/50750 [6:10:46<79:42:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:53:30,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 22:53:30,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.45 | bwd_microstep: 3859.28 | bwd_inner_microstep: 3851.76 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.18 [2024-11-13 22:53:30,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.44 | bwd: 3859.30 | bwd_inner: 3851.76 | bwd_allreduce: 7.49 | step: 21.19 4%|▍ | 2278/50750 [6:10:52<79:46:02, 5.92s/it] {'loss': 0.1264, 'learning_rate': 3.9976788517139846e-05, 'epoch': 2.24} 4%|▍ | 2278/50750 [6:10:52<79:46:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:53:36,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.92 [2024-11-13 22:53:36,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1994.29 | bwd_microstep: 3784.02 | bwd_inner_microstep: 3776.24 | bwd_allreduce_microstep: 7.73 | step_microstep: 24.37 [2024-11-13 22:53:36,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1994.29 | bwd: 3784.04 | bwd_inner: 3776.24 | bwd_allreduce: 7.75 | step: 24.37 4%|▍ | 2279/50750 [6:10:58<79:22:58, 5.90s/it] {'loss': 0.0142, 'learning_rate': 3.9976727000983554e-05, 'epoch': 2.25} 4%|▍ | 2279/50750 [6:10:58<79:22:58, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 
22:53:42,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 22:53:42,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.05 | bwd_microstep: 3839.81 | bwd_inner_microstep: 3832.26 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.76 [2024-11-13 22:53:42,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.05 | bwd: 3839.82 | bwd_inner: 3832.26 | bwd_allreduce: 7.52 | step: 21.77 4%|▍ | 2280/50750 [6:11:03<79:27:26, 5.90s/it] {'loss': 0.2136, 'learning_rate': 3.997666540346606e-05, 'epoch': 2.25} 4%|▍ | 2280/50750 [6:11:03<79:27:26, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:53:47,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.96 [2024-11-13 22:53:47,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.59 | bwd_microstep: 3847.35 | bwd_inner_microstep: 3839.87 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-13 22:53:47,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.59 | bwd: 3847.37 | bwd_inner: 3839.87 | bwd_allreduce: 7.46 | step: 20.87 4%|▍ | 2281/50750 [6:11:09<79:31:08, 5.91s/it] {'loss': 0.039, 'learning_rate': 3.997660372458762e-05, 'epoch': 2.25} 4%|▍ | 2281/50750 [6:11:09<79:31:08, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:53:53,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.92 [2024-11-13 22:53:53,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3869.59 | bwd_inner_microstep: 3861.86 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.80 [2024-11-13 22:53:53,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.58 | bwd: 3869.61 | bwd_inner: 3861.87 | bwd_allreduce: 7.70 | step: 21.81 4%|▍ | 2282/50750 [6:11:15<79:39:56, 5.92s/it] {'loss': 0.3223, 'learning_rate': 3.997654196434848e-05, 'epoch': 2.25} 4%|▍ | 2282/50750 [6:11:15<79:39:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:53:59,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 22:53:59,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.03 | bwd_microstep: 3849.80 | bwd_inner_microstep: 3842.36 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.87 [2024-11-13 22:53:59,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.03 | bwd: 3849.81 | bwd_inner: 3842.36 | bwd_allreduce: 7.42 | step: 20.87 4%|▍ | 2283/50750 [6:11:21<79:41:32, 5.92s/it] {'loss': 0.0053, 'learning_rate': 3.9976480122748895e-05, 'epoch': 2.25} 4%|▍ | 2283/50750 [6:11:21<79:41:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:54:05,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:54:05,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.42 | bwd_microstep: 3844.01 | bwd_inner_microstep: 3836.56 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.36 [2024-11-13 22:54:05,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.41 | bwd: 3844.02 | bwd_inner: 3836.56 | bwd_allreduce: 7.43 | step: 21.36 5%|▍ | 2284/50750 [6:11:27<79:40:25, 5.92s/it] {'loss': 0.0224, 'learning_rate': 3.997641819978912e-05, 'epoch': 2.25} 5%|▍ | 2284/50750 [6:11:27<79:40:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:54:11,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:54:11,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.66 | bwd_microstep: 3846.12 | bwd_inner_microstep: 3838.65 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.91 [2024-11-13 22:54:11,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.66 | bwd: 3846.13 | bwd_inner: 3838.65 | bwd_allreduce: 7.44 | step: 20.92 5%|▍ | 2285/50750 [6:11:33<79:40:28, 5.92s/it] {'loss': 0.1308, 'learning_rate': 3.99763561954694e-05, 'epoch': 2.25} 5%|▍ | 2285/50750 [6:11:33<79:40:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:54:17,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 22:54:17,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.34 | bwd_microstep: 3841.59 | bwd_inner_microstep: 3834.12 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.48 [2024-11-13 22:54:17,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.34 | bwd: 3841.60 | bwd_inner: 3834.12 | bwd_allreduce: 7.44 | step: 21.48 5%|▍ | 2286/50750 [6:11:39<79:39:40, 5.92s/it] {'loss': 0.3006, 'learning_rate': 3.9976294109789994e-05, 'epoch': 2.25} 5%|▍ | 2286/50750 [6:11:39<79:39:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:54:23,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:54:23,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.49 | bwd_microstep: 3843.35 | bwd_inner_microstep: 3835.84 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.27 [2024-11-13 22:54:23,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.47 | bwd: 3843.36 | bwd_inner: 3835.84 | bwd_allreduce: 7.48 | step: 21.28 5%|▍ | 
2287/50750 [6:11:45<79:40:07, 5.92s/it] {'loss': 0.0291, 'learning_rate': 3.997623194275115e-05, 'epoch': 2.25} 5%|▍ | 2287/50750 [6:11:45<79:40:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:54:29,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:54:29,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.28 | bwd_microstep: 3848.64 | bwd_inner_microstep: 3841.12 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.06 [2024-11-13 22:54:29,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.28 | bwd: 3848.65 | bwd_inner: 3841.13 | bwd_allreduce: 7.48 | step: 21.07 5%|▍ | 2288/50750 [6:11:51<79:41:03, 5.92s/it] {'loss': 0.0097, 'learning_rate': 3.9976169694353124e-05, 'epoch': 2.25} 5%|▍ | 2288/50750 [6:11:51<79:41:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:54:35,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:54:35,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.25 | bwd_microstep: 3846.74 | bwd_inner_microstep: 3838.96 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.80 [2024-11-13 22:54:35,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.25 | bwd: 3846.75 | bwd_inner: 3838.96 | bwd_allreduce: 7.75 | step: 21.81 5%|▍ | 2289/50750 [6:11:57<79:40:35, 5.92s/it] {'loss': 0.3388, 'learning_rate': 3.997610736459617e-05, 'epoch': 2.26} 5%|▍ | 2289/50750 [6:11:57<79:40:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:54:41,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.96 [2024-11-13 22:54:41,229] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.83 | bwd_microstep: 3842.29 | bwd_inner_microstep: 3834.76 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.23 [2024-11-13 22:54:41,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.83 | bwd: 3842.30 | bwd_inner: 3834.76 | bwd_allreduce: 7.50 | step: 21.23 5%|▍ | 2290/50750 [6:12:03<79:39:17, 5.92s/it] {'loss': 0.5091, 'learning_rate': 3.997604495348054e-05, 'epoch': 2.26} 5%|▍ | 2290/50750 [6:12:03<79:39:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:54:47,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:54:47,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.95 | bwd_microstep: 3845.25 | bwd_inner_microstep: 3837.19 | bwd_allreduce_microstep: 8.02 | step_microstep: 21.61 [2024-11-13 22:54:47,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.95 | bwd: 3845.26 | bwd_inner: 3837.19 | bwd_allreduce: 8.03 | step: 21.62 5%|▍ | 2291/50750 [6:12:09<79:40:20, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.9975982461006486e-05, 'epoch': 2.26} 5%|▍ | 2291/50750 [6:12:09<79:40:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:54:53,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:54:53,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.49 | bwd_microstep: 3847.09 | bwd_inner_microstep: 3839.58 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.89 [2024-11-13 22:54:53,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.49 | bwd: 3847.10 | bwd_inner: 3839.58 | bwd_allreduce: 7.48 | step: 20.89 5%|▍ | 2292/50750 [6:12:15<79:41:00, 5.92s/it] {'loss': 0.0205, 'learning_rate': 
3.997591988717426e-05, 'epoch': 2.26} 5%|▍ | 2292/50750 [6:12:15<79:41:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:54:58,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:54:58,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.95 | bwd_microstep: 3843.65 | bwd_inner_microstep: 3836.17 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.19 [2024-11-13 22:54:58,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.95 | bwd: 3843.66 | bwd_inner: 3836.17 | bwd_allreduce: 7.45 | step: 21.20 5%|▍ | 2293/50750 [6:12:20<79:41:00, 5.92s/it] {'loss': 0.2926, 'learning_rate': 3.997585723198414e-05, 'epoch': 2.26} 5%|▍ | 2293/50750 [6:12:20<79:41:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:55:04,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 22:55:04,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.66 | bwd_microstep: 3846.59 | bwd_inner_microstep: 3839.10 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.90 [2024-11-13 22:55:04,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.66 | bwd: 3846.60 | bwd_inner: 3839.10 | bwd_allreduce: 7.45 | step: 20.90 5%|▍ | 2294/50750 [6:12:26<79:41:22, 5.92s/it] {'loss': 0.0581, 'learning_rate': 3.997579449543635e-05, 'epoch': 2.26} 5%|▍ | 2294/50750 [6:12:26<79:41:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:55:10,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:55:10,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 
3841.13 | bwd_inner_microstep: 3833.66 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.89 [2024-11-13 22:55:10,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.03 | bwd: 3841.14 | bwd_inner: 3833.67 | bwd_allreduce: 7.44 | step: 20.90 5%|▍ | 2295/50750 [6:12:32<79:38:48, 5.92s/it] {'loss': 0.0069, 'learning_rate': 3.997573167753116e-05, 'epoch': 2.26} 5%|▍ | 2295/50750 [6:12:32<79:38:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:55:16,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:55:16,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.14 | bwd_microstep: 3846.43 | bwd_inner_microstep: 3838.92 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.42 [2024-11-13 22:55:16,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.14 | bwd: 3846.45 | bwd_inner: 3838.92 | bwd_allreduce: 7.49 | step: 21.43 5%|▍ | 2296/50750 [6:12:38<79:38:28, 5.92s/it] {'loss': 0.0148, 'learning_rate': 3.9975668778268824e-05, 'epoch': 2.26} 5%|▍ | 2296/50750 [6:12:38<79:38:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:55:22,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.60 | optimizer_step: 4.93 [2024-11-13 22:55:22,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.79 | bwd_microstep: 3852.33 | bwd_inner_microstep: 3844.23 | bwd_allreduce_microstep: 8.03 | step_microstep: 28.80 [2024-11-13 22:55:22,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.79 | bwd: 3852.35 | bwd_inner: 3844.23 | bwd_allreduce: 8.06 | step: 28.81 5%|▍ | 2297/50750 [6:12:44<79:44:07, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.99756057976496e-05, 'epoch': 2.26} 5%|▍ | 2297/50750 [6:12:44<79:44:07, 5.92s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:55:28,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:55:28,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.63 | bwd_microstep: 3849.40 | bwd_inner_microstep: 3841.90 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.97 [2024-11-13 22:55:28,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.63 | bwd: 3849.41 | bwd_inner: 3841.90 | bwd_allreduce: 7.47 | step: 20.98 5%|▍ | 2298/50750 [6:12:50<79:43:39, 5.92s/it] {'loss': 0.0101, 'learning_rate': 3.997554273567375e-05, 'epoch': 2.26} 5%|▍ | 2298/50750 [6:12:50<79:43:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:55:34,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:55:34,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.23 | bwd_microstep: 3849.34 | bwd_inner_microstep: 3841.79 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.76 [2024-11-13 22:55:34,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.21 | bwd: 3849.35 | bwd_inner: 3841.79 | bwd_allreduce: 7.52 | step: 21.76 5%|▍ | 2299/50750 [6:12:56<79:43:40, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.997547959234151e-05, 'epoch': 2.27} 5%|▍ | 2299/50750 [6:12:56<79:43:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:55:40,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 22:55:40,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.60 | bwd_microstep: 3853.26 | bwd_inner_microstep: 3845.73 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.26 [2024-11-13 
22:55:40,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3853.27 | bwd_inner: 3845.73 | bwd_allreduce: 7.50 | step: 21.27 5%|▍ | 2300/50750 [6:13:02<79:45:21, 5.93s/it] {'loss': 0.1736, 'learning_rate': 3.997541636765316e-05, 'epoch': 2.27} 5%|▍ | 2300/50750 [6:13:02<79:45:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:55:46,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.41 | optimizer_step: 4.93 [2024-11-13 22:55:46,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3856.72 | bwd_inner_microstep: 3848.63 | bwd_allreduce_microstep: 8.02 | step_microstep: 27.78 [2024-11-13 22:55:46,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3856.74 | bwd_inner: 3848.63 | bwd_allreduce: 8.05 | step: 27.80 5%|▍ | 2301/50750 [6:13:08<79:50:07, 5.93s/it] {'loss': 0.5082, 'learning_rate': 3.997535306160895e-05, 'epoch': 2.27} 5%|▍ | 2301/50750 [6:13:08<79:50:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:55:52,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:55:52,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.61 | bwd_microstep: 3861.53 | bwd_inner_microstep: 3854.03 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.01 [2024-11-13 22:55:52,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.59 | bwd: 3861.54 | bwd_inner: 3854.03 | bwd_allreduce: 7.46 | step: 21.02 5%|▍ | 2302/50750 [6:13:14<79:51:08, 5.93s/it] {'loss': 0.004, 'learning_rate': 3.9975289674209135e-05, 'epoch': 2.27} 5%|▍ | 2302/50750 [6:13:14<79:51:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:55:58,262] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 22:55:58,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.43 | bwd_microstep: 3847.52 | bwd_inner_microstep: 3839.78 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.21 [2024-11-13 22:55:58,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.43 | bwd: 3847.53 | bwd_inner: 3839.78 | bwd_allreduce: 7.70 | step: 21.21 5%|▍ | 2303/50750 [6:13:20<79:47:22, 5.93s/it] {'loss': 0.4291, 'learning_rate': 3.9975226205453975e-05, 'epoch': 2.27} 5%|▍ | 2303/50750 [6:13:20<79:47:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:56:04,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:56:04,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.76 | bwd_microstep: 3845.34 | bwd_inner_microstep: 3837.75 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.69 [2024-11-13 22:56:04,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.76 | bwd: 3845.35 | bwd_inner: 3837.75 | bwd_allreduce: 7.56 | step: 21.69 5%|▍ | 2304/50750 [6:13:26<79:44:46, 5.93s/it] {'loss': 0.0045, 'learning_rate': 3.9975162655343725e-05, 'epoch': 2.27} 5%|▍ | 2304/50750 [6:13:26<79:44:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:56:10,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 22:56:10,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3844.90 | bwd_inner_microstep: 3837.42 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.89 [2024-11-13 22:56:10,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.31 | bwd: 
3844.91 | bwd_inner: 3837.42 | bwd_allreduce: 7.45 | step: 20.89 5%|▍ | 2305/50750 [6:13:32<79:43:11, 5.92s/it] {'loss': 0.1359, 'learning_rate': 3.997509902387865e-05, 'epoch': 2.27} 5%|▍ | 2305/50750 [6:13:32<79:43:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:56:16,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:56:16,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.65 | bwd_microstep: 3845.35 | bwd_inner_microstep: 3837.85 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.11 [2024-11-13 22:56:16,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.65 | bwd: 3845.37 | bwd_inner: 3837.85 | bwd_allreduce: 7.48 | step: 21.11 5%|▍ | 2306/50750 [6:13:37<79:41:54, 5.92s/it] {'loss': 0.016, 'learning_rate': 3.997503531105901e-05, 'epoch': 2.27} 5%|▍ | 2306/50750 [6:13:37<79:41:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:56:21,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:56:21,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.23 | bwd_microstep: 3839.78 | bwd_inner_microstep: 3832.09 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.10 [2024-11-13 22:56:21,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3839.79 | bwd_inner: 3832.10 | bwd_allreduce: 7.66 | step: 21.10 5%|▍ | 2307/50750 [6:13:43<79:39:08, 5.92s/it] {'loss': 0.011, 'learning_rate': 3.9974971516885054e-05, 'epoch': 2.27} 5%|▍ | 2307/50750 [6:13:43<79:39:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:56:27,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | 
optimizer_step: 4.93 [2024-11-13 22:56:27,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.05 | bwd_microstep: 3851.88 | bwd_inner_microstep: 3844.43 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.96 [2024-11-13 22:56:27,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.04 | bwd: 3851.89 | bwd_inner: 3844.43 | bwd_allreduce: 7.43 | step: 20.96 5%|▍ | 2308/50750 [6:13:49<79:41:55, 5.92s/it] {'loss': 0.0186, 'learning_rate': 3.997490764135706e-05, 'epoch': 2.27} 5%|▍ | 2308/50750 [6:13:49<79:41:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:56:33,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 22:56:33,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.89 | bwd_microstep: 3842.59 | bwd_inner_microstep: 3834.88 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.73 [2024-11-13 22:56:33,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.89 | bwd: 3842.61 | bwd_inner: 3834.88 | bwd_allreduce: 7.69 | step: 21.73 5%|▍ | 2309/50750 [6:13:55<79:39:55, 5.92s/it] {'loss': 0.0029, 'learning_rate': 3.997484368447527e-05, 'epoch': 2.27} 5%|▍ | 2309/50750 [6:13:55<79:39:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:56:39,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:56:39,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3844.91 | bwd_inner_microstep: 3837.44 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-13 22:56:39,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.30 | bwd: 3844.92 | bwd_inner: 3837.44 | bwd_allreduce: 7.44 | step: 20.89 5%|▍ | 2310/50750 [6:14:01<79:39:32, 
5.92s/it] {'loss': 0.0517, 'learning_rate': 3.997477964623995e-05, 'epoch': 2.28} 5%|▍ | 2310/50750 [6:14:01<79:39:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:56:45,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 22:56:45,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.46 | bwd_microstep: 3845.28 | bwd_inner_microstep: 3837.72 | bwd_allreduce_microstep: 7.51 | step_microstep: 22.25 [2024-11-13 22:56:45,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.46 | bwd: 3845.29 | bwd_inner: 3837.72 | bwd_allreduce: 7.53 | step: 22.25 5%|▍ | 2311/50750 [6:14:07<79:39:07, 5.92s/it] {'loss': 1.2056, 'learning_rate': 3.9974715526651364e-05, 'epoch': 2.28} 5%|▍ | 2311/50750 [6:14:07<79:39:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:56:51,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 22:56:51,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.54 | bwd_microstep: 3841.53 | bwd_inner_microstep: 3834.02 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.67 [2024-11-13 22:56:51,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.51 | bwd: 3841.54 | bwd_inner: 3834.02 | bwd_allreduce: 7.48 | step: 21.67 5%|▍ | 2312/50750 [6:14:13<79:39:15, 5.92s/it] {'loss': 0.267, 'learning_rate': 3.997465132570977e-05, 'epoch': 2.28} 5%|▍ | 2312/50750 [6:14:13<79:39:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:56:57,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 5.10 [2024-11-13 22:56:57,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2022.95 | bwd_microstep: 3850.78 | bwd_inner_microstep: 3843.24 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.55 [2024-11-13 22:56:57,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.95 | bwd: 3850.79 | bwd_inner: 3843.24 | bwd_allreduce: 7.51 | step: 21.55 5%|▍ | 2313/50750 [6:14:19<79:39:40, 5.92s/it] {'loss': 0.052, 'learning_rate': 3.9974587043415425e-05, 'epoch': 2.28} 5%|▍ | 2313/50750 [6:14:19<79:39:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 22:57:03,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 22:57:03,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.96 | bwd_microstep: 3851.54 | bwd_inner_microstep: 3843.77 | bwd_allreduce_microstep: 7.71 | step_microstep: 24.34 [2024-11-13 22:57:03,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.96 | bwd: 3851.56 | bwd_inner: 3843.77 | bwd_allreduce: 7.74 | step: 24.34 5%|▍ | 2314/50750 [6:14:25<79:43:59, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.997452267976861e-05, 'epoch': 2.28} 5%|▍ | 2314/50750 [6:14:25<79:43:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:57:09,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 22:57:09,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.89 | bwd_microstep: 3846.55 | bwd_inner_microstep: 3838.83 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.21 [2024-11-13 22:57:09,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.89 | bwd: 3846.57 | bwd_inner: 3838.83 | bwd_allreduce: 7.69 | step: 22.22 5%|▍ | 2315/50750 [6:14:31<79:42:55, 5.92s/it] {'loss': 0.0201, 'learning_rate': 3.997445823476956e-05, 'epoch': 2.28} 5%|▍ | 2315/50750 
[6:14:31<79:42:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:57:15,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:57:15,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.08 | bwd_microstep: 3843.14 | bwd_inner_microstep: 3835.25 | bwd_allreduce_microstep: 7.84 | step_microstep: 21.73 [2024-11-13 22:57:15,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3843.15 | bwd_inner: 3835.25 | bwd_allreduce: 7.86 | step: 21.74 5%|▍ | 2316/50750 [6:14:37<79:41:44, 5.92s/it] {'loss': 0.0381, 'learning_rate': 3.9974393708418556e-05, 'epoch': 2.28} 5%|▍ | 2316/50750 [6:14:37<79:41:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:57:21,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:57:21,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.12 | bwd_microstep: 3855.11 | bwd_inner_microstep: 3847.58 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.94 [2024-11-13 22:57:21,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.11 | bwd: 3855.12 | bwd_inner: 3847.58 | bwd_allreduce: 7.50 | step: 20.94 5%|▍ | 2317/50750 [6:14:43<79:43:32, 5.93s/it] {'loss': 0.0353, 'learning_rate': 3.997432910071586e-05, 'epoch': 2.28} 5%|▍ | 2317/50750 [6:14:43<79:43:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:57:27,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 22:57:27,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.17 | bwd_microstep: 3850.01 | bwd_inner_microstep: 3842.47 | 
bwd_allreduce_microstep: 7.50 | step_microstep: 21.16 [2024-11-13 22:57:27,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.17 | bwd: 3850.02 | bwd_inner: 3842.47 | bwd_allreduce: 7.51 | step: 21.17 5%|▍ | 2318/50750 [6:14:49<79:42:01, 5.92s/it] {'loss': 0.0534, 'learning_rate': 3.997426441166173e-05, 'epoch': 2.28} 5%|▍ | 2318/50750 [6:14:49<79:42:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:57:33,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 22:57:33,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.33 | bwd_microstep: 3844.13 | bwd_inner_microstep: 3836.61 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.05 [2024-11-13 22:57:33,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.33 | bwd: 3844.14 | bwd_inner: 3836.61 | bwd_allreduce: 7.49 | step: 21.06 5%|▍ | 2319/50750 [6:14:54<79:39:05, 5.92s/it] {'loss': 0.0024, 'learning_rate': 3.997419964125643e-05, 'epoch': 2.28} 5%|▍ | 2319/50750 [6:14:54<79:39:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:57:38,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 22:57:38,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.06 | bwd_microstep: 3846.25 | bwd_inner_microstep: 3838.74 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.26 [2024-11-13 22:57:38,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3846.26 | bwd_inner: 3838.74 | bwd_allreduce: 7.49 | step: 21.26 5%|▍ | 2320/50750 [6:15:00<79:38:22, 5.92s/it] {'loss': 0.0423, 'learning_rate': 3.997413478950022e-05, 'epoch': 2.29} 5%|▍ | 2320/50750 [6:15:00<79:38:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-13 22:57:44,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:57:44,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.09 | bwd_microstep: 3845.38 | bwd_inner_microstep: 3837.88 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.31 [2024-11-13 22:57:44,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.09 | bwd: 3845.39 | bwd_inner: 3837.88 | bwd_allreduce: 7.47 | step: 21.32 5%|▍ | 2321/50750 [6:15:06<79:37:28, 5.92s/it] {'loss': 0.0898, 'learning_rate': 3.9974069856393375e-05, 'epoch': 2.29} 5%|▍ | 2321/50750 [6:15:06<79:37:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:57:50,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 22:57:50,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.49 | bwd_microstep: 3846.68 | bwd_inner_microstep: 3839.16 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.37 [2024-11-13 22:57:50,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.49 | bwd: 3846.69 | bwd_inner: 3839.16 | bwd_allreduce: 7.49 | step: 21.38 5%|▍ | 2322/50750 [6:15:12<79:37:55, 5.92s/it] {'loss': 0.311, 'learning_rate': 3.9974004841936145e-05, 'epoch': 2.29} 5%|▍ | 2322/50750 [6:15:12<79:37:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:57:56,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:57:56,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.83 | bwd_microstep: 3857.82 | bwd_inner_microstep: 3850.30 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-13 22:57:56,690] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.83 | bwd: 3857.83 | bwd_inner: 3850.30 | bwd_allreduce: 7.49 | step: 21.07 5%|▍ | 2323/50750 [6:15:18<79:40:33, 5.92s/it] {'loss': 0.0017, 'learning_rate': 3.997393974612881e-05, 'epoch': 2.29} 5%|▍ | 2323/50750 [6:15:18<79:40:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:58:02,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 5.05 [2024-11-13 22:58:02,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.86 | bwd_microstep: 3841.72 | bwd_inner_microstep: 3833.97 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.91 [2024-11-13 22:58:02,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.85 | bwd: 3841.73 | bwd_inner: 3833.97 | bwd_allreduce: 7.72 | step: 21.92 5%|▍ | 2324/50750 [6:15:24<79:40:50, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.997387456897162e-05, 'epoch': 2.29} 5%|▍ | 2324/50750 [6:15:24<79:40:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:58:08,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 22:58:08,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.73 | bwd_microstep: 3844.40 | bwd_inner_microstep: 3836.75 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.58 [2024-11-13 22:58:08,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.71 | bwd: 3844.41 | bwd_inner: 3836.75 | bwd_allreduce: 7.61 | step: 21.59 5%|▍ | 2325/50750 [6:15:30<79:40:28, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9973809310464854e-05, 'epoch': 2.29} 5%|▍ | 2325/50750 [6:15:30<79:40:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:58:14,450] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:58:14,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.83 | bwd_microstep: 3837.05 | bwd_inner_microstep: 3829.53 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.04 [2024-11-13 22:58:14,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.81 | bwd: 3837.06 | bwd_inner: 3829.53 | bwd_allreduce: 7.49 | step: 21.05 5%|▍ | 2326/50750 [6:15:36<79:37:47, 5.92s/it] {'loss': 0.1433, 'learning_rate': 3.997374397060878e-05, 'epoch': 2.29} 5%|▍ | 2326/50750 [6:15:36<79:37:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:58:20,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 22:58:20,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.78 | bwd_microstep: 3843.56 | bwd_inner_microstep: 3836.04 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.16 [2024-11-13 22:58:20,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.78 | bwd: 3843.58 | bwd_inner: 3836.04 | bwd_allreduce: 7.50 | step: 21.17 5%|▍ | 2327/50750 [6:15:42<79:36:26, 5.92s/it] {'loss': 0.0038, 'learning_rate': 3.997367854940364e-05, 'epoch': 2.29} 5%|▍ | 2327/50750 [6:15:42<79:36:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:58:26,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:58:26,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.15 | bwd_microstep: 3838.84 | bwd_inner_microstep: 3831.08 | bwd_allreduce_microstep: 7.71 | step_microstep: 23.54 [2024-11-13 22:58:26,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.13 | bwd: 3838.86 | bwd_inner: 3831.08 | 
bwd_allreduce: 7.73 | step: 23.54 5%|▍ | 2328/50750 [6:15:48<79:37:22, 5.92s/it] {'loss': 0.0016, 'learning_rate': 3.9973613046849726e-05, 'epoch': 2.29} 5%|▍ | 2328/50750 [6:15:48<79:37:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:58:32,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:58:32,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.92 | bwd_microstep: 3843.03 | bwd_inner_microstep: 3835.51 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.05 [2024-11-13 22:58:32,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.91 | bwd: 3843.04 | bwd_inner: 3835.51 | bwd_allreduce: 7.49 | step: 21.05 5%|▍ | 2329/50750 [6:15:54<79:36:30, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.9973547462947295e-05, 'epoch': 2.29} 5%|▍ | 2329/50750 [6:15:54<79:36:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:58:38,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 22:58:38,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.44 | bwd_microstep: 3856.67 | bwd_inner_microstep: 3849.17 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-13 22:58:38,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.44 | bwd: 3856.68 | bwd_inner: 3849.17 | bwd_allreduce: 7.48 | step: 21.02 5%|▍ | 2330/50750 [6:16:00<79:37:55, 5.92s/it] {'loss': 0.0398, 'learning_rate': 3.997348179769661e-05, 'epoch': 2.3} 5%|▍ | 2330/50750 [6:16:00<79:37:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:58:44,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 
[2024-11-13 22:58:44,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.87 | bwd_microstep: 3847.52 | bwd_inner_microstep: 3840.02 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-13 22:58:44,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3847.54 | bwd_inner: 3840.02 | bwd_allreduce: 7.48 | step: 21.11 5%|▍ | 2331/50750 [6:16:06<79:37:54, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.997341605109795e-05, 'epoch': 2.3} 5%|▍ | 2331/50750 [6:16:06<79:37:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:58:49,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 22:58:49,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.18 | bwd_microstep: 3846.89 | bwd_inner_microstep: 3839.19 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.45 [2024-11-13 22:58:49,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.18 | bwd: 3846.90 | bwd_inner: 3839.19 | bwd_allreduce: 7.67 | step: 21.45 5%|▍ | 2332/50750 [6:16:11<79:37:00, 5.92s/it] {'loss': 0.6759, 'learning_rate': 3.997335022315157e-05, 'epoch': 2.3} 5%|▍ | 2332/50750 [6:16:11<79:37:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:58:55,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:58:55,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.60 | bwd_microstep: 3850.35 | bwd_inner_microstep: 3842.85 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.92 [2024-11-13 22:58:55,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.58 | bwd: 3850.36 | bwd_inner: 3842.85 | bwd_allreduce: 7.47 | step: 20.92 5%|▍ | 2333/50750 [6:16:17<79:37:29, 5.92s/it] {'loss': 0.0089, 
'learning_rate': 3.997328431385775e-05, 'epoch': 2.3} 5%|▍ | 2333/50750 [6:16:17<79:37:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:59:01,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 22:59:01,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.76 | bwd_microstep: 3845.47 | bwd_inner_microstep: 3837.95 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.96 [2024-11-13 22:59:01,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.76 | bwd: 3845.48 | bwd_inner: 3837.95 | bwd_allreduce: 7.49 | step: 21.97 5%|▍ | 2334/50750 [6:16:23<79:36:50, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.997321832321674e-05, 'epoch': 2.3} 5%|▍ | 2334/50750 [6:16:23<79:36:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:59:07,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 22:59:07,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.04 | bwd_microstep: 3858.01 | bwd_inner_microstep: 3850.50 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.55 [2024-11-13 22:59:07,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.04 | bwd: 3858.02 | bwd_inner: 3850.50 | bwd_allreduce: 7.48 | step: 21.36 5%|▍ | 2335/50750 [6:16:29<79:39:19, 5.92s/it] {'loss': 0.3087, 'learning_rate': 3.9973152251228835e-05, 'epoch': 2.3} 5%|▍ | 2335/50750 [6:16:29<79:39:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 22:59:13,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 22:59:13,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.24 | 
bwd_microstep: 3840.36 | bwd_inner_microstep: 3832.76 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.51 [2024-11-13 22:59:13,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.24 | bwd: 3840.37 | bwd_inner: 3832.76 | bwd_allreduce: 7.57 | step: 21.52 5%|▍ | 2336/50750 [6:16:35<79:36:20, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.997308609789429e-05, 'epoch': 2.3} 5%|▍ | 2336/50750 [6:16:35<79:36:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:59:19,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 22:59:19,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.41 | bwd_microstep: 3841.31 | bwd_inner_microstep: 3833.80 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.93 [2024-11-13 22:59:19,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.39 | bwd: 3841.32 | bwd_inner: 3833.80 | bwd_allreduce: 7.48 | step: 20.94 5%|▍ | 2337/50750 [6:16:41<79:36:22, 5.92s/it] {'loss': 0.0264, 'learning_rate': 3.997301986321336e-05, 'epoch': 2.3} 5%|▍ | 2337/50750 [6:16:41<79:36:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:59:25,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:59:25,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.91 | bwd_microstep: 3844.57 | bwd_inner_microstep: 3837.07 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.05 [2024-11-13 22:59:25,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.91 | bwd: 3844.58 | bwd_inner: 3837.07 | bwd_allreduce: 7.47 | step: 21.05 5%|▍ | 2338/50750 [6:16:47<79:34:16, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.997295354718634e-05, 'epoch': 2.3} 5%|▍ | 2338/50750 [6:16:47<79:34:16, 5.92s/it]dynamic 
ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 22:59:31,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-13 22:59:31,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.27 | bwd_microstep: 3842.86 | bwd_inner_microstep: 3835.19 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.44 [2024-11-13 22:59:31,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.27 | bwd: 3842.87 | bwd_inner: 3835.19 | bwd_allreduce: 7.64 | step: 21.45 5%|▍ | 2339/50750 [6:16:53<79:33:08, 5.92s/it] {'loss': 0.7312, 'learning_rate': 3.9972887149813493e-05, 'epoch': 2.3} 5%|▍ | 2339/50750 [6:16:53<79:33:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 22:59:37,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 22:59:37,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.44 | bwd_microstep: 3842.40 | bwd_inner_microstep: 3834.79 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.66 [2024-11-13 22:59:37,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.43 | bwd: 3842.41 | bwd_inner: 3834.79 | bwd_allreduce: 7.58 | step: 21.66 5%|▍ | 2340/50750 [6:16:59<79:33:32, 5.92s/it] {'loss': 0.3913, 'learning_rate': 3.9972820671095076e-05, 'epoch': 2.31} 5%|▍ | 2340/50750 [6:16:59<79:33:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:59:43,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 22:59:43,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.14 | bwd_microstep: 3844.28 | bwd_inner_microstep: 3836.78 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.01 
[2024-11-13 22:59:43,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.12 | bwd: 3844.29 | bwd_inner: 3836.78 | bwd_allreduce: 7.47 | step: 21.01 5%|▍ | 2341/50750 [6:17:05<79:36:20, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9972754111031375e-05, 'epoch': 2.31} 5%|▍ | 2341/50750 [6:17:05<79:36:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 22:59:49,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 22:59:49,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.16 | bwd_microstep: 3839.67 | bwd_inner_microstep: 3832.15 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.28 [2024-11-13 22:59:49,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.16 | bwd: 3839.68 | bwd_inner: 3832.15 | bwd_allreduce: 7.49 | step: 21.28 5%|▍ | 2342/50750 [6:17:11<79:33:57, 5.92s/it] {'loss': 0.5871, 'learning_rate': 3.9972687469622654e-05, 'epoch': 2.31} 5%|▍ | 2342/50750 [6:17:11<79:33:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 22:59:55,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 22:59:55,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.49 | bwd_microstep: 3843.60 | bwd_inner_microstep: 3836.14 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-13 22:59:55,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.49 | bwd: 3843.61 | bwd_inner: 3836.14 | bwd_allreduce: 7.44 | step: 20.89 5%|▍ | 2343/50750 [6:17:17<79:34:04, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.997262074686919e-05, 'epoch': 2.31} 5%|▍ | 2343/50750 [6:17:17<79:34:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:00:00,981] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:00:00,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.93 | bwd_microstep: 3842.39 | bwd_inner_microstep: 3834.84 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.83 [2024-11-13 23:00:00,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.93 | bwd: 3842.41 | bwd_inner: 3834.85 | bwd_allreduce: 7.52 | step: 21.84 5%|▍ | 2344/50750 [6:17:22<79:33:23, 5.92s/it] {'loss': 0.0335, 'learning_rate': 3.9972553942771254e-05, 'epoch': 2.31} 5%|▍ | 2344/50750 [6:17:22<79:33:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:00:06,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:00:06,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.08 | bwd_microstep: 3847.22 | bwd_inner_microstep: 3839.76 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.97 [2024-11-13 23:00:06,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.08 | bwd: 3847.23 | bwd_inner: 3839.76 | bwd_allreduce: 7.44 | step: 20.98 5%|▍ | 2345/50750 [6:17:28<79:35:41, 5.92s/it] {'loss': 0.1652, 'learning_rate': 3.99724870573291e-05, 'epoch': 2.31} 5%|▍ | 2345/50750 [6:17:28<79:35:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:00:12,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:00:12,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.39 | bwd_microstep: 3850.20 | bwd_inner_microstep: 3842.70 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.87 [2024-11-13 23:00:12,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.39 | bwd: 3850.21 
| bwd_inner: 3842.70 | bwd_allreduce: 7.48 | step: 20.87 5%|▍ | 2346/50750 [6:17:34<79:36:41, 5.92s/it] {'loss': 1.1146, 'learning_rate': 3.997242009054303e-05, 'epoch': 2.31} 5%|▍ | 2346/50750 [6:17:34<79:36:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:00:18,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:00:18,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.63 | bwd_microstep: 3848.39 | bwd_inner_microstep: 3840.81 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.02 [2024-11-13 23:00:18,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.63 | bwd: 3848.40 | bwd_inner: 3840.81 | bwd_allreduce: 7.55 | step: 22.02 5%|▍ | 2347/50750 [6:17:40<79:37:07, 5.92s/it] {'loss': 0.0083, 'learning_rate': 3.99723530424133e-05, 'epoch': 2.31} 5%|▍ | 2347/50750 [6:17:40<79:37:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:00:24,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 23:00:24,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.09 | bwd_microstep: 3848.25 | bwd_inner_microstep: 3840.75 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.39 [2024-11-13 23:00:24,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.08 | bwd: 3848.26 | bwd_inner: 3840.75 | bwd_allreduce: 7.47 | step: 21.40 5%|▍ | 2348/50750 [6:17:46<79:38:05, 5.92s/it] {'loss': 0.0924, 'learning_rate': 3.997228591294018e-05, 'epoch': 2.31} 5%|▍ | 2348/50750 [6:17:46<79:38:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:00:30,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | 
optimizer_step: 4.93 [2024-11-13 23:00:30,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.05 | bwd_microstep: 3857.15 | bwd_inner_microstep: 3849.62 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.00 [2024-11-13 23:00:30,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.05 | bwd: 3857.16 | bwd_inner: 3849.62 | bwd_allreduce: 7.50 | step: 21.01 5%|▍ | 2349/50750 [6:17:52<79:39:19, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.9972218702123946e-05, 'epoch': 2.31} 5%|▍ | 2349/50750 [6:17:52<79:39:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:00:36,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:00:36,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.85 | bwd_microstep: 3851.31 | bwd_inner_microstep: 3842.56 | bwd_allreduce_microstep: 8.71 | step_microstep: 21.58 [2024-11-13 23:00:36,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.84 | bwd: 3851.33 | bwd_inner: 3842.56 | bwd_allreduce: 8.72 | step: 21.58 5%|▍ | 2350/50750 [6:17:58<79:38:50, 5.92s/it] {'loss': 0.0412, 'learning_rate': 3.9972151409964884e-05, 'epoch': 2.32} 5%|▍ | 2350/50750 [6:17:58<79:38:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:00:42,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:00:42,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.68 | bwd_microstep: 3853.96 | bwd_inner_microstep: 3846.46 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.43 [2024-11-13 23:00:42,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.65 | bwd: 3853.97 | bwd_inner: 3846.46 | bwd_allreduce: 7.46 | step: 21.44 5%|▍ | 2351/50750 [6:18:04<79:39:01, 
5.92s/it] {'loss': 0.0016, 'learning_rate': 3.997208403646325e-05, 'epoch': 2.32} 5%|▍ | 2351/50750 [6:18:04<79:39:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:00:48,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:00:48,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.96 | bwd_microstep: 3855.86 | bwd_inner_microstep: 3848.34 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-13 23:00:48,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.96 | bwd: 3855.87 | bwd_inner: 3848.34 | bwd_allreduce: 7.49 | step: 21.08 5%|▍ | 2352/50750 [6:18:10<79:38:38, 5.92s/it] {'loss': 0.0145, 'learning_rate': 3.997201658161933e-05, 'epoch': 2.32} 5%|▍ | 2352/50750 [6:18:10<79:38:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:00:54,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:00:54,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.60 | bwd_microstep: 3850.09 | bwd_inner_microstep: 3842.59 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.94 [2024-11-13 23:00:54,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.60 | bwd: 3850.10 | bwd_inner: 3842.59 | bwd_allreduce: 7.47 | step: 20.94 5%|▍ | 2353/50750 [6:18:16<79:38:25, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.9971949045433394e-05, 'epoch': 2.32} 5%|▍ | 2353/50750 [6:18:16<79:38:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:01:00,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:01:00,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2022.27 | bwd_microstep: 3849.78 | bwd_inner_microstep: 3842.26 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.24 [2024-11-13 23:01:00,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.27 | bwd: 3849.80 | bwd_inner: 3842.26 | bwd_allreduce: 7.49 | step: 21.25 5%|▍ | 2354/50750 [6:18:22<79:37:27, 5.92s/it] {'loss': 0.5972, 'learning_rate': 3.997188142790572e-05, 'epoch': 2.32} 5%|▍ | 2354/50750 [6:18:22<79:37:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:01:06,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 23:01:06,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.06 | bwd_microstep: 3845.60 | bwd_inner_microstep: 3838.02 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.16 [2024-11-13 23:01:06,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.06 | bwd: 3845.61 | bwd_inner: 3838.02 | bwd_allreduce: 7.56 | step: 21.18 5%|▍ | 2355/50750 [6:18:28<79:35:30, 5.92s/it] {'loss': 0.1444, 'learning_rate': 3.997181372903659e-05, 'epoch': 2.32} 5%|▍ | 2355/50750 [6:18:28<79:35:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:01:12,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:01:12,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | bwd_microstep: 3858.76 | bwd_inner_microstep: 3850.74 | bwd_allreduce_microstep: 7.97 | step_microstep: 21.11 [2024-11-13 23:01:12,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.22 | bwd: 3858.78 | bwd_inner: 3850.74 | bwd_allreduce: 7.99 | step: 21.11 5%|▍ | 2356/50750 [6:18:34<79:38:58, 5.93s/it] {'loss': 0.0116, 'learning_rate': 3.997174594882626e-05, 'epoch': 2.32} 5%|▍ | 2356/50750 
[6:18:34<79:38:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:01:18,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-13 23:01:18,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.88 | bwd_microstep: 3847.65 | bwd_inner_microstep: 3839.42 | bwd_allreduce_microstep: 8.16 | step_microstep: 24.36 [2024-11-13 23:01:18,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.88 | bwd: 3847.67 | bwd_inner: 3839.42 | bwd_allreduce: 8.19 | step: 24.36 5%|▍ | 2357/50750 [6:18:39<79:38:59, 5.93s/it] {'loss': 0.1884, 'learning_rate': 3.9971678087275024e-05, 'epoch': 2.32} 5%|▍ | 2357/50750 [6:18:39<79:38:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-13 23:01:23,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.84 | optimizer_step: 4.93 [2024-11-13 23:01:23,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.74 | bwd_microstep: 3854.50 | bwd_inner_microstep: 3846.92 | bwd_allreduce_microstep: 7.53 | step_microstep: 24.03 [2024-11-13 23:01:23,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.74 | bwd: 3854.51 | bwd_inner: 3846.92 | bwd_allreduce: 7.55 | step: 24.05 5%|▍ | 2358/50750 [6:18:45<79:40:30, 5.93s/it] {'loss': 0.0077, 'learning_rate': 3.9971610144383145e-05, 'epoch': 2.32} 5%|▍ | 2358/50750 [6:18:45<79:40:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:01:29,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 23:01:29,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.92 | bwd_microstep: 3852.60 | bwd_inner_microstep: 3844.21 | 
bwd_allreduce_microstep: 8.33 | step_microstep: 22.81 [2024-11-13 23:01:29,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.92 | bwd: 3852.61 | bwd_inner: 3844.21 | bwd_allreduce: 8.36 | step: 22.82 5%|▍ | 2359/50750 [6:18:51<79:40:02, 5.93s/it] {'loss': 0.0017, 'learning_rate': 3.9971542120150915e-05, 'epoch': 2.32} 5%|▍ | 2359/50750 [6:18:51<79:40:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:01:35,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 23:01:35,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.19 | bwd_microstep: 3854.38 | bwd_inner_microstep: 3846.43 | bwd_allreduce_microstep: 7.90 | step_microstep: 21.90 [2024-11-13 23:01:35,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.17 | bwd: 3854.39 | bwd_inner: 3846.43 | bwd_allreduce: 7.92 | step: 21.90 5%|▍ | 2360/50750 [6:18:57<79:41:40, 5.93s/it] {'loss': 0.0059, 'learning_rate': 3.99714740145786e-05, 'epoch': 2.33} 5%|▍ | 2360/50750 [6:18:57<79:41:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 23:01:41,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:01:41,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.77 | bwd_microstep: 3854.51 | bwd_inner_microstep: 3846.88 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.35 [2024-11-13 23:01:41,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.75 | bwd: 3854.52 | bwd_inner: 3846.88 | bwd_allreduce: 7.59 | step: 21.35 5%|▍ | 2361/50750 [6:19:03<79:41:39, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.9971405827666476e-05, 'epoch': 2.33} 5%|▍ | 2361/50750 [6:19:03<79:41:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2199 [2024-11-13 23:01:47,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-13 23:01:47,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3846.22 | bwd_inner_microstep: 3838.54 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.84 [2024-11-13 23:01:47,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.41 | bwd: 3846.24 | bwd_inner: 3838.54 | bwd_allreduce: 7.66 | step: 21.84 5%|▍ | 2362/50750 [6:19:09<79:39:48, 5.93s/it] {'loss': 0.0036, 'learning_rate': 3.997133755941483e-05, 'epoch': 2.33} 5%|▍ | 2362/50750 [6:19:09<79:39:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:01:53,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:01:53,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.34 | bwd_microstep: 3854.58 | bwd_inner_microstep: 3847.08 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.92 [2024-11-13 23:01:53,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.32 | bwd: 3854.59 | bwd_inner: 3847.08 | bwd_allreduce: 7.47 | step: 20.93 5%|▍ | 2363/50750 [6:19:15<79:40:10, 5.93s/it] {'loss': 0.1203, 'learning_rate': 3.9971269209823935e-05, 'epoch': 2.33} 5%|▍ | 2363/50750 [6:19:15<79:40:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:01:59,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-13 23:01:59,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.18 | bwd_microstep: 3843.10 | bwd_inner_microstep: 3835.29 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.66 [2024-11-13 23:01:59,488] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.18 | bwd: 3843.12 | bwd_inner: 3835.29 | bwd_allreduce: 7.79 | step: 21.67 5%|▍ | 2364/50750 [6:19:21<79:36:51, 5.92s/it] {'loss': 0.1655, 'learning_rate': 3.997120077889407e-05, 'epoch': 2.33} 5%|▍ | 2364/50750 [6:19:21<79:36:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 23:02:05,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:02:05,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3853.88 | bwd_inner_microstep: 3846.34 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.30 [2024-11-13 23:02:05,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.58 | bwd: 3853.90 | bwd_inner: 3846.34 | bwd_allreduce: 7.51 | step: 21.30 5%|▍ | 2365/50750 [6:19:27<79:38:51, 5.93s/it] {'loss': 0.0069, 'learning_rate': 3.997113226662551e-05, 'epoch': 2.33} 5%|▍ | 2365/50750 [6:19:27<79:38:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:02:11,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 23:02:11,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.46 | bwd_microstep: 3844.49 | bwd_inner_microstep: 3837.02 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.03 [2024-11-13 23:02:11,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3844.51 | bwd_inner: 3837.02 | bwd_allreduce: 7.44 | step: 21.03 5%|▍ | 2366/50750 [6:19:33<79:35:53, 5.92s/it] {'loss': 0.3706, 'learning_rate': 3.997106367301854e-05, 'epoch': 2.33} 5%|▍ | 2366/50750 [6:19:33<79:35:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 23:02:17,270] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.36 | optimizer_step: 5.04 [2024-11-13 23:02:17,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.73 | bwd_microstep: 3850.44 | bwd_inner_microstep: 3842.33 | bwd_allreduce_microstep: 8.06 | step_microstep: 28.41 [2024-11-13 23:02:17,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.73 | bwd: 3850.46 | bwd_inner: 3842.33 | bwd_allreduce: 8.08 | step: 28.41 5%|▍ | 2367/50750 [6:19:39<79:38:48, 5.93s/it] {'loss': 0.4771, 'learning_rate': 3.997099499807343e-05, 'epoch': 2.33} 5%|▍ | 2367/50750 [6:19:39<79:38:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:02:23,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:02:23,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.82 | bwd_microstep: 3846.79 | bwd_inner_microstep: 3838.78 | bwd_allreduce_microstep: 7.94 | step_microstep: 25.68 [2024-11-13 23:02:23,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.82 | bwd: 3846.81 | bwd_inner: 3838.78 | bwd_allreduce: 7.97 | step: 25.68 5%|▍ | 2368/50750 [6:19:45<79:40:23, 5.93s/it] {'loss': 0.0067, 'learning_rate': 3.997092624179047e-05, 'epoch': 2.33} 5%|▍ | 2368/50750 [6:19:45<79:40:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:02:29,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-13 23:02:29,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.04 | bwd_microstep: 3850.90 | bwd_inner_microstep: 3843.12 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.51 [2024-11-13 23:02:29,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.03 | bwd: 3850.91 | bwd_inner: 3843.12 | 
bwd_allreduce: 7.75 | step: 21.52 5%|▍ | 2369/50750 [6:19:51<79:40:44, 5.93s/it] {'loss': 0.0346, 'learning_rate': 3.997085740416994e-05, 'epoch': 2.33} 5%|▍ | 2369/50750 [6:19:51<79:40:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:02:35,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 23:02:35,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.38 | bwd_microstep: 3850.28 | bwd_inner_microstep: 3842.81 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.96 [2024-11-13 23:02:35,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.37 | bwd: 3850.29 | bwd_inner: 3842.81 | bwd_allreduce: 7.45 | step: 20.97 5%|▍ | 2370/50750 [6:19:57<79:39:53, 5.93s/it] {'loss': 0.0113, 'learning_rate': 3.997078848521212e-05, 'epoch': 2.33} 5%|▍ | 2370/50750 [6:19:57<79:39:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:02:40,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:02:40,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.85 | bwd_microstep: 3851.11 | bwd_inner_microstep: 3843.49 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.01 [2024-11-13 23:02:40,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.83 | bwd: 3851.12 | bwd_inner: 3843.49 | bwd_allreduce: 7.59 | step: 21.01 5%|▍ | 2371/50750 [6:20:02<79:39:27, 5.93s/it] {'loss': 0.0023, 'learning_rate': 3.9970719484917274e-05, 'epoch': 2.34} 5%|▍ | 2371/50750 [6:20:02<79:39:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:02:46,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 
[2024-11-13 23:02:46,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.70 | bwd_microstep: 3851.51 | bwd_inner_microstep: 3844.03 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.70 [2024-11-13 23:02:46,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.70 | bwd: 3851.53 | bwd_inner: 3844.03 | bwd_allreduce: 7.46 | step: 20.70 5%|▍ | 2372/50750 [6:20:08<79:38:41, 5.93s/it] {'loss': 0.0136, 'learning_rate': 3.9970650403285705e-05, 'epoch': 2.34} 5%|▍ | 2372/50750 [6:20:08<79:38:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:02:52,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.82 | optimizer_step: 4.93 [2024-11-13 23:02:52,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.73 | bwd_microstep: 3848.33 | bwd_inner_microstep: 3840.74 | bwd_allreduce_microstep: 7.55 | step_microstep: 23.21 [2024-11-13 23:02:52,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.73 | bwd: 3848.34 | bwd_inner: 3840.74 | bwd_allreduce: 7.57 | step: 23.22 5%|▍ | 2373/50750 [6:20:14<79:38:09, 5.93s/it] {'loss': 0.0034, 'learning_rate': 3.997058124031768e-05, 'epoch': 2.34} 5%|▍ | 2373/50750 [6:20:14<79:38:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:02:58,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:02:58,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.56 | bwd_microstep: 3848.93 | bwd_inner_microstep: 3841.43 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.92 [2024-11-13 23:02:58,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.55 | bwd: 3848.94 | bwd_inner: 3841.43 | bwd_allreduce: 7.47 | step: 20.92 5%|▍ | 2374/50750 [6:20:20<79:36:04, 5.92s/it] {'loss': 0.6711, 
'learning_rate': 3.9970511996013496e-05, 'epoch': 2.34} 5%|▍ | 2374/50750 [6:20:20<79:36:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:03:04,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 23:03:04,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3854.02 | bwd_inner_microstep: 3846.50 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.98 [2024-11-13 23:03:04,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3854.03 | bwd_inner: 3846.50 | bwd_allreduce: 7.49 | step: 20.99 5%|▍ | 2375/50750 [6:20:26<79:36:21, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.997044267037342e-05, 'epoch': 2.34} 5%|▍ | 2375/50750 [6:20:26<79:36:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:03:10,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 23:03:10,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.50 | bwd_microstep: 3853.52 | bwd_inner_microstep: 3845.87 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.73 [2024-11-13 23:03:10,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.50 | bwd: 3853.53 | bwd_inner: 3845.87 | bwd_allreduce: 7.63 | step: 21.73 5%|▍ | 2376/50750 [6:20:32<79:36:14, 5.92s/it] {'loss': 0.0057, 'learning_rate': 3.997037326339774e-05, 'epoch': 2.34} 5%|▍ | 2376/50750 [6:20:32<79:36:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:03:16,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:03:16,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.48 | 
bwd_microstep: 3859.76 | bwd_inner_microstep: 3852.20 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.55 [2024-11-13 23:03:16,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.47 | bwd: 3859.78 | bwd_inner: 3852.20 | bwd_allreduce: 7.53 | step: 21.55 5%|▍ | 2377/50750 [6:20:38<79:40:29, 5.93s/it] {'loss': 0.0055, 'learning_rate': 3.997030377508674e-05, 'epoch': 2.34} 5%|▍ | 2377/50750 [6:20:38<79:40:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:03:22,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:03:22,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.84 | bwd_microstep: 3850.47 | bwd_inner_microstep: 3842.85 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.65 [2024-11-13 23:03:22,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.85 | bwd: 3850.48 | bwd_inner: 3842.85 | bwd_allreduce: 7.59 | step: 21.65 5%|▍ | 2378/50750 [6:20:44<79:39:46, 5.93s/it] {'loss': 0.0077, 'learning_rate': 3.99702342054407e-05, 'epoch': 2.34} 5%|▍ | 2378/50750 [6:20:44<79:39:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:03:28,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:03:28,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.40 | bwd_microstep: 3854.69 | bwd_inner_microstep: 3847.17 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.31 [2024-11-13 23:03:28,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.40 | bwd: 3854.70 | bwd_inner: 3847.17 | bwd_allreduce: 7.49 | step: 21.32 5%|▍ | 2379/50750 [6:20:50<79:40:10, 5.93s/it] {'loss': 0.1727, 'learning_rate': 3.997016455445991e-05, 'epoch': 2.34} 5%|▍ | 2379/50750 [6:20:50<79:40:10, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:03:34,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.36 | optimizer_step: 4.93 [2024-11-13 23:03:34,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3846.45 | bwd_inner_microstep: 3838.53 | bwd_allreduce_microstep: 7.86 | step_microstep: 22.30 [2024-11-13 23:03:34,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.03 | bwd: 3846.47 | bwd_inner: 3838.53 | bwd_allreduce: 7.89 | step: 22.30 5%|▍ | 2380/50750 [6:20:56<79:38:02, 5.93s/it] {'loss': 0.0214, 'learning_rate': 3.9970094822144635e-05, 'epoch': 2.34} 5%|▍ | 2380/50750 [6:20:56<79:38:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:03:40,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 23:03:40,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.28 | bwd_microstep: 3845.88 | bwd_inner_microstep: 3838.04 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.60 [2024-11-13 23:03:40,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.27 | bwd: 3845.89 | bwd_inner: 3838.04 | bwd_allreduce: 7.81 | step: 22.60 5%|▍ | 2381/50750 [6:21:02<79:37:46, 5.93s/it] {'loss': 0.3815, 'learning_rate': 3.9970025008495185e-05, 'epoch': 2.35} 5%|▍ | 2381/50750 [6:21:02<79:37:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:03:46,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.93 [2024-11-13 23:03:46,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.16 | bwd_microstep: 3842.04 | bwd_inner_microstep: 3834.32 | bwd_allreduce_microstep: 7.67 | 
step_microstep: 21.48 [2024-11-13 23:03:46,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.14 | bwd: 3842.05 | bwd_inner: 3834.32 | bwd_allreduce: 7.69 | step: 21.48 5%|▍ | 2382/50750 [6:21:08<79:36:50, 5.93s/it] {'loss': 0.3535, 'learning_rate': 3.996995511351182e-05, 'epoch': 2.35} 5%|▍ | 2382/50750 [6:21:08<79:36:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:03:52,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:03:52,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.75 | bwd_microstep: 3841.52 | bwd_inner_microstep: 3833.97 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.48 [2024-11-13 23:03:52,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.75 | bwd: 3841.53 | bwd_inner: 3833.97 | bwd_allreduce: 7.51 | step: 21.49 5%|▍ | 2383/50750 [6:21:14<79:33:23, 5.92s/it] {'loss': 0.0066, 'learning_rate': 3.996988513719485e-05, 'epoch': 2.35} 5%|▍ | 2383/50750 [6:21:14<79:33:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:03:57,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 23:03:57,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.40 | bwd_microstep: 3841.33 | bwd_inner_microstep: 3833.81 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.79 [2024-11-13 23:03:57,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.40 | bwd: 3841.34 | bwd_inner: 3833.81 | bwd_allreduce: 7.49 | step: 21.79 5%|▍ | 2384/50750 [6:21:19<79:31:15, 5.92s/it] {'loss': 0.0031, 'learning_rate': 3.996981507954454e-05, 'epoch': 2.35} 5%|▍ | 2384/50750 [6:21:19<79:31:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
23:04:03,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:04:03,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.48 | bwd_microstep: 3854.41 | bwd_inner_microstep: 3846.81 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.79 [2024-11-13 23:04:03,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.48 | bwd: 3854.42 | bwd_inner: 3846.81 | bwd_allreduce: 7.57 | step: 21.79 5%|▍ | 2385/50750 [6:21:25<79:33:11, 5.92s/it] {'loss': 0.003, 'learning_rate': 3.996974494056118e-05, 'epoch': 2.35} 5%|▍ | 2385/50750 [6:21:25<79:33:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:04:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 23:04:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.43 | bwd_microstep: 3840.41 | bwd_inner_microstep: 3832.88 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.17 [2024-11-13 23:04:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.43 | bwd: 3840.42 | bwd_inner: 3832.88 | bwd_allreduce: 7.50 | step: 21.17 5%|▍ | 2386/50750 [6:21:31<79:31:24, 5.92s/it] {'loss': 0.0058, 'learning_rate': 3.9969674720245065e-05, 'epoch': 2.35} 5%|▍ | 2386/50750 [6:21:31<79:31:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:04:15,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.92 [2024-11-13 23:04:15,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.55 | bwd_microstep: 3844.71 | bwd_inner_microstep: 3837.17 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.16 [2024-11-13 23:04:15,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2024.55 | bwd: 3844.72 | bwd_inner: 3837.17 | bwd_allreduce: 7.52 | step: 21.17 5%|▍ | 2387/50750 [6:21:37<79:30:41, 5.92s/it] {'loss': 0.7721, 'learning_rate': 3.996960441859647e-05, 'epoch': 2.35} 5%|▍ | 2387/50750 [6:21:37<79:30:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:04:21,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 23:04:21,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.84 | bwd_microstep: 3849.15 | bwd_inner_microstep: 3841.46 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.38 [2024-11-13 23:04:21,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.84 | bwd: 3849.16 | bwd_inner: 3841.46 | bwd_allreduce: 7.67 | step: 21.39 5%|▍ | 2388/50750 [6:21:43<79:31:58, 5.92s/it] {'loss': 0.2171, 'learning_rate': 3.996953403561567e-05, 'epoch': 2.35} 5%|▍ | 2388/50750 [6:21:43<79:31:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:04:27,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:04:27,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.93 | bwd_microstep: 3851.42 | bwd_inner_microstep: 3843.82 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.66 [2024-11-13 23:04:27,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.92 | bwd: 3851.43 | bwd_inner: 3843.82 | bwd_allreduce: 7.57 | step: 21.66 5%|▍ | 2389/50750 [6:21:49<79:35:27, 5.92s/it] {'loss': 0.7663, 'learning_rate': 3.9969463571302984e-05, 'epoch': 2.35} 5%|▍ | 2389/50750 [6:21:49<79:35:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:04:33,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:04:33,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.26 | bwd_microstep: 3843.05 | bwd_inner_microstep: 3835.50 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.17 [2024-11-13 23:04:33,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.26 | bwd: 3843.06 | bwd_inner: 3835.50 | bwd_allreduce: 7.52 | step: 21.17 5%|▍ | 2390/50750 [6:21:55<79:33:37, 5.92s/it] {'loss': 0.0931, 'learning_rate': 3.996939302565868e-05, 'epoch': 2.35} 5%|▍ | 2390/50750 [6:21:55<79:33:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:04:39,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.94 [2024-11-13 23:04:39,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.50 | bwd_microstep: 3843.71 | bwd_inner_microstep: 3836.14 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.45 [2024-11-13 23:04:39,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.50 | bwd: 3843.73 | bwd_inner: 3836.14 | bwd_allreduce: 7.54 | step: 21.46 5%|▍ | 2391/50750 [6:22:01<79:32:20, 5.92s/it] {'loss': 0.0045, 'learning_rate': 3.9969322398683044e-05, 'epoch': 2.36} 5%|▍ | 2391/50750 [6:22:01<79:32:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:04:45,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:04:45,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.15 | bwd_microstep: 3837.59 | bwd_inner_microstep: 3830.07 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.29 [2024-11-13 23:04:45,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.15 | bwd: 3837.61 | bwd_inner: 3830.07 | bwd_allreduce: 7.50 | step: 21.29 5%|▍ | 
2392/50750 [6:22:07<79:29:55, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.996925169037637e-05, 'epoch': 2.36} 5%|▍ | 2392/50750 [6:22:07<79:29:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:04:51,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 23:04:51,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.86 | bwd_microstep: 3844.00 | bwd_inner_microstep: 3836.11 | bwd_allreduce_microstep: 7.86 | step_microstep: 22.08 [2024-11-13 23:04:51,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.86 | bwd: 3844.02 | bwd_inner: 3836.11 | bwd_allreduce: 7.87 | step: 22.08 5%|▍ | 2393/50750 [6:22:13<79:29:14, 5.92s/it] {'loss': 0.0064, 'learning_rate': 3.996918090073894e-05, 'epoch': 2.36} 5%|▍ | 2393/50750 [6:22:13<79:29:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:04:57,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 23:04:57,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.06 | bwd_microstep: 3841.88 | bwd_inner_microstep: 3834.34 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.28 [2024-11-13 23:04:57,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.06 | bwd: 3841.89 | bwd_inner: 3834.34 | bwd_allreduce: 7.51 | step: 21.29 5%|▍ | 2394/50750 [6:22:19<79:30:17, 5.92s/it] {'loss': 0.0049, 'learning_rate': 3.996911002977103e-05, 'epoch': 2.36} 5%|▍ | 2394/50750 [6:22:19<79:30:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:05:03,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.93 [2024-11-13 23:05:03,127] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.14 | bwd_microstep: 3851.76 | bwd_inner_microstep: 3843.76 | bwd_allreduce_microstep: 7.93 | step_microstep: 22.88 [2024-11-13 23:05:03,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.14 | bwd: 3851.78 | bwd_inner: 3843.76 | bwd_allreduce: 7.96 | step: 22.88 5%|▍ | 2395/50750 [6:22:25<79:32:36, 5.92s/it] {'loss': 0.5387, 'learning_rate': 3.996903907747296e-05, 'epoch': 2.36} 5%|▍ | 2395/50750 [6:22:25<79:32:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:05:09,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:05:09,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.25 | bwd_microstep: 3850.32 | bwd_inner_microstep: 3842.58 | bwd_allreduce_microstep: 7.69 | step_microstep: 23.95 [2024-11-13 23:05:09,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.24 | bwd: 3850.34 | bwd_inner: 3842.58 | bwd_allreduce: 7.72 | step: 23.95 5%|▍ | 2396/50750 [6:22:31<79:34:29, 5.92s/it] {'loss': 0.082, 'learning_rate': 3.9968968043845e-05, 'epoch': 2.36} 5%|▍ | 2396/50750 [6:22:31<79:34:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:05:14,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 23:05:14,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.44 | bwd_microstep: 3856.25 | bwd_inner_microstep: 3848.67 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.46 [2024-11-13 23:05:14,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.44 | bwd: 3856.26 | bwd_inner: 3848.67 | bwd_allreduce: 7.55 | step: 21.47 5%|▍ | 2397/50750 [6:22:36<79:35:46, 5.93s/it] {'loss': 0.0124, 'learning_rate': 
3.996889692888744e-05, 'epoch': 2.36} 5%|▍ | 2397/50750 [6:22:36<79:35:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 23:05:20,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:05:20,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.15 | bwd_microstep: 3847.36 | bwd_inner_microstep: 3839.65 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.51 [2024-11-13 23:05:20,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.14 | bwd: 3847.37 | bwd_inner: 3839.65 | bwd_allreduce: 7.68 | step: 21.52 5%|▍ | 2398/50750 [6:22:42<79:37:04, 5.93s/it] {'loss': 0.0061, 'learning_rate': 3.996882573260057e-05, 'epoch': 2.36} 5%|▍ | 2398/50750 [6:22:42<79:37:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:05:26,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.93 [2024-11-13 23:05:26,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.22 | bwd_microstep: 3844.85 | bwd_inner_microstep: 3837.10 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.42 [2024-11-13 23:05:26,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.21 | bwd: 3844.86 | bwd_inner: 3837.10 | bwd_allreduce: 7.73 | step: 22.42 5%|▍ | 2399/50750 [6:22:48<79:36:59, 5.93s/it] {'loss': 0.0036, 'learning_rate': 3.9968754454984686e-05, 'epoch': 2.36} 5%|▍ | 2399/50750 [6:22:48<79:36:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:05:32,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.40 | optimizer_step: 4.93 [2024-11-13 23:05:32,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.69 | bwd_microstep: 
3844.02 | bwd_inner_microstep: 3836.02 | bwd_allreduce_microstep: 7.95 | step_microstep: 22.93 [2024-11-13 23:05:32,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.67 | bwd: 3844.03 | bwd_inner: 3836.02 | bwd_allreduce: 7.96 | step: 22.93 5%|▍ | 2400/50750 [6:22:54<79:38:50, 5.93s/it] {'loss': 0.5681, 'learning_rate': 3.996868309604007e-05, 'epoch': 2.36} 5%|▍ | 2400/50750 [6:22:54<79:38:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:05:38,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.38 | optimizer_step: 4.93 [2024-11-13 23:05:38,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.75 | bwd_microstep: 3849.81 | bwd_inner_microstep: 3841.73 | bwd_allreduce_microstep: 8.03 | step_microstep: 23.11 [2024-11-13 23:05:38,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.73 | bwd: 3849.83 | bwd_inner: 3841.73 | bwd_allreduce: 8.05 | step: 23.12 5%|▍ | 2401/50750 [6:23:00<79:41:08, 5.93s/it] {'loss': 0.1713, 'learning_rate': 3.996861165576701e-05, 'epoch': 2.37} 5%|▍ | 2401/50750 [6:23:00<79:41:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:05:44,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:05:44,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.88 | bwd_microstep: 3845.51 | bwd_inner_microstep: 3837.87 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.06 [2024-11-13 23:05:44,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.87 | bwd: 3845.52 | bwd_inner: 3837.87 | bwd_allreduce: 7.61 | step: 21.07 5%|▍ | 2402/50750 [6:23:06<79:38:41, 5.93s/it] {'loss': 0.0072, 'learning_rate': 3.9968540134165816e-05, 'epoch': 2.37} 5%|▍ | 2402/50750 [6:23:06<79:38:41, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:05:50,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:05:50,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.60 | bwd_microstep: 3845.87 | bwd_inner_microstep: 3838.24 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.34 [2024-11-13 23:05:50,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3845.88 | bwd_inner: 3838.24 | bwd_allreduce: 7.60 | step: 21.35 5%|▍ | 2403/50750 [6:23:12<79:35:59, 5.93s/it] {'loss': 0.0059, 'learning_rate': 3.996846853123676e-05, 'epoch': 2.37} 5%|▍ | 2403/50750 [6:23:12<79:35:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:05:56,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-13 23:05:56,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.06 | bwd_microstep: 3855.12 | bwd_inner_microstep: 3847.09 | bwd_allreduce_microstep: 7.98 | step_microstep: 21.57 [2024-11-13 23:05:56,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.04 | bwd: 3855.13 | bwd_inner: 3847.09 | bwd_allreduce: 8.00 | step: 21.60 5%|▍ | 2404/50750 [6:23:18<79:38:36, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.996839684698014e-05, 'epoch': 2.37} 5%|▍ | 2404/50750 [6:23:18<79:38:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:06:02,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:06:02,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.21 | bwd_microstep: 3843.66 | bwd_inner_microstep: 3836.12 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.25 [2024-11-13 
23:06:02,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.20 | bwd: 3843.68 | bwd_inner: 3836.12 | bwd_allreduce: 7.52 | step: 21.26 5%|▍ | 2405/50750 [6:23:24<79:35:14, 5.93s/it] {'loss': 0.0022, 'learning_rate': 3.9968325081396244e-05, 'epoch': 2.37} 5%|▍ | 2405/50750 [6:23:24<79:35:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 23:06:08,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 23:06:08,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.82 | bwd_microstep: 3845.65 | bwd_inner_microstep: 3838.14 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.29 [2024-11-13 23:06:08,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.82 | bwd: 3845.66 | bwd_inner: 3838.14 | bwd_allreduce: 7.48 | step: 21.29 5%|▍ | 2406/50750 [6:23:30<79:33:22, 5.92s/it] {'loss': 0.5253, 'learning_rate': 3.996825323448537e-05, 'epoch': 2.37} 5%|▍ | 2406/50750 [6:23:30<79:33:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:06:14,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:06:14,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.32 | bwd_microstep: 3840.45 | bwd_inner_microstep: 3832.93 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.06 [2024-11-13 23:06:14,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.32 | bwd: 3840.46 | bwd_inner: 3832.93 | bwd_allreduce: 7.49 | step: 21.07 5%|▍ | 2407/50750 [6:23:36<79:30:13, 5.92s/it] {'loss': 0.0361, 'learning_rate': 3.9968181306247814e-05, 'epoch': 2.37} 5%|▍ | 2407/50750 [6:23:36<79:30:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:06:20,182] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:06:20,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.30 | bwd_microstep: 3856.59 | bwd_inner_microstep: 3849.07 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.05 [2024-11-13 23:06:20,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.30 | bwd: 3856.60 | bwd_inner: 3849.07 | bwd_allreduce: 7.49 | step: 21.05 5%|▍ | 2408/50750 [6:23:42<79:31:22, 5.92s/it] {'loss': 0.0044, 'learning_rate': 3.996810929668386e-05, 'epoch': 2.37} 5%|▍ | 2408/50750 [6:23:42<79:31:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:06:26,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:06:26,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3851.77 | bwd_inner_microstep: 3844.26 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-13 23:06:26,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.88 | bwd: 3851.78 | bwd_inner: 3844.26 | bwd_allreduce: 7.48 | step: 21.10 5%|▍ | 2409/50750 [6:23:48<79:31:34, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.9968037205793806e-05, 'epoch': 2.37} 5%|▍ | 2409/50750 [6:23:48<79:31:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:06:32,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:06:32,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.18 | bwd_microstep: 3845.12 | bwd_inner_microstep: 3837.64 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.08 [2024-11-13 23:06:32,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.16 | bwd: 
3845.13 | bwd_inner: 3837.64 | bwd_allreduce: 7.45 | step: 21.08 5%|▍ | 2410/50750 [6:23:53<79:30:35, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.996796503357795e-05, 'epoch': 2.37} 5%|▍ | 2410/50750 [6:23:53<79:30:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:06:37,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:06:37,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.13 | bwd_microstep: 3842.06 | bwd_inner_microstep: 3834.60 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.83 [2024-11-13 23:06:37,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.13 | bwd: 3842.07 | bwd_inner: 3834.60 | bwd_allreduce: 7.43 | step: 20.84 5%|▍ | 2411/50750 [6:23:59<79:29:27, 5.92s/it] {'loss': 0.0077, 'learning_rate': 3.996789278003657e-05, 'epoch': 2.38} 5%|▍ | 2411/50750 [6:23:59<79:29:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:06:43,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:06:43,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.33 | bwd_microstep: 3848.82 | bwd_inner_microstep: 3839.60 | bwd_allreduce_microstep: 9.18 | step_microstep: 21.74 [2024-11-13 23:06:43,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.33 | bwd: 3848.84 | bwd_inner: 3839.60 | bwd_allreduce: 9.20 | step: 21.75 5%|▍ | 2412/50750 [6:24:05<79:30:10, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.996782044516998e-05, 'epoch': 2.38} 5%|▍ | 2412/50750 [6:24:05<79:30:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:06:49,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | 
optimizer_step: 4.93 [2024-11-13 23:06:49,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.08 | bwd_microstep: 3857.37 | bwd_inner_microstep: 3849.87 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.97 [2024-11-13 23:06:49,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.06 | bwd: 3857.39 | bwd_inner: 3849.87 | bwd_allreduce: 7.47 | step: 20.97 5%|▍ | 2413/50750 [6:24:11<79:36:40, 5.93s/it] {'loss': 0.2119, 'learning_rate': 3.996774802897846e-05, 'epoch': 2.38} 5%|▍ | 2413/50750 [6:24:11<79:36:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:06:55,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.00 [2024-11-13 23:06:55,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.54 | bwd_microstep: 3862.32 | bwd_inner_microstep: 3854.81 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.02 [2024-11-13 23:06:55,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.54 | bwd: 3862.33 | bwd_inner: 3854.81 | bwd_allreduce: 7.48 | step: 21.02 5%|▍ | 2414/50750 [6:24:17<79:37:18, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.9967675531462303e-05, 'epoch': 2.38} 5%|▍ | 2414/50750 [6:24:17<79:37:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:07:01,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:07:01,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.83 | bwd_microstep: 3850.93 | bwd_inner_microstep: 3843.28 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.31 [2024-11-13 23:07:01,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.83 | bwd: 3850.94 | bwd_inner: 3843.28 | bwd_allreduce: 7.62 | step: 21.31 5%|▍ | 2415/50750 [6:24:23<79:36:02, 
5.93s/it] {'loss': 0.192, 'learning_rate': 3.9967602952621824e-05, 'epoch': 2.38} 5%|▍ | 2415/50750 [6:24:23<79:36:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:07:07,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:07:07,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.98 | bwd_microstep: 3850.33 | bwd_inner_microstep: 3842.74 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.06 [2024-11-13 23:07:07,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3850.34 | bwd_inner: 3842.74 | bwd_allreduce: 7.57 | step: 21.07 5%|▍ | 2416/50750 [6:24:29<79:34:03, 5.93s/it] {'loss': 0.2397, 'learning_rate': 3.9967530292457304e-05, 'epoch': 2.38} 5%|▍ | 2416/50750 [6:24:29<79:34:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:07:13,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-13 23:07:13,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.13 | bwd_microstep: 3856.61 | bwd_inner_microstep: 3849.03 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.33 [2024-11-13 23:07:13,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.11 | bwd: 3856.62 | bwd_inner: 3849.03 | bwd_allreduce: 7.55 | step: 21.34 5%|▍ | 2417/50750 [6:24:35<79:35:32, 5.93s/it] {'loss': 0.0016, 'learning_rate': 3.996745755096904e-05, 'epoch': 2.38} 5%|▍ | 2417/50750 [6:24:35<79:35:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:07:19,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:07:19,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2024.17 | bwd_microstep: 3853.24 | bwd_inner_microstep: 3845.74 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.93 [2024-11-13 23:07:19,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 3853.26 | bwd_inner: 3845.74 | bwd_allreduce: 7.48 | step: 20.94 5%|▍ | 2418/50750 [6:24:41<79:34:42, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.996738472815733e-05, 'epoch': 2.38} 5%|▍ | 2418/50750 [6:24:41<79:34:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:07:25,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:07:25,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.53 | bwd_microstep: 3859.85 | bwd_inner_microstep: 3852.26 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.70 [2024-11-13 23:07:25,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.53 | bwd: 3859.86 | bwd_inner: 3852.26 | bwd_allreduce: 7.56 | step: 21.70 5%|▍ | 2419/50750 [6:24:47<79:37:27, 5.93s/it] {'loss': 0.0018, 'learning_rate': 3.996731182402247e-05, 'epoch': 2.38} 5%|▍ | 2419/50750 [6:24:47<79:37:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:07:31,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:07:31,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.53 | bwd_microstep: 3850.21 | bwd_inner_microstep: 3842.71 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.06 [2024-11-13 23:07:31,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.53 | bwd: 3850.22 | bwd_inner: 3842.71 | bwd_allreduce: 7.47 | step: 21.06 5%|▍ | 2420/50750 [6:24:53<79:36:51, 5.93s/it] {'loss': 0.009, 'learning_rate': 3.9967238838564755e-05, 'epoch': 2.38} 5%|▍ | 2420/50750 
[6:24:53<79:36:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:07:37,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-13 23:07:37,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.10 | bwd_microstep: 3849.59 | bwd_inner_microstep: 3842.12 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.43 [2024-11-13 23:07:37,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.09 | bwd: 3849.60 | bwd_inner: 3842.12 | bwd_allreduce: 7.44 | step: 21.44 5%|▍ | 2421/50750 [6:24:59<79:37:56, 5.93s/it] {'loss': 0.0515, 'learning_rate': 3.996716577178449e-05, 'epoch': 2.39} 5%|▍ | 2421/50750 [6:24:59<79:37:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:07:43,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:07:43,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.63 | bwd_microstep: 3850.11 | bwd_inner_microstep: 3842.52 | bwd_allreduce_microstep: 7.55 | step_microstep: 20.85 [2024-11-13 23:07:43,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.63 | bwd: 3850.13 | bwd_inner: 3842.52 | bwd_allreduce: 7.57 | step: 20.85 5%|▍ | 2422/50750 [6:25:05<79:37:35, 5.93s/it] {'loss': 0.0031, 'learning_rate': 3.9967092623681964e-05, 'epoch': 2.39} 5%|▍ | 2422/50750 [6:25:05<79:37:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:07:49,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.97 [2024-11-13 23:07:49,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.70 | bwd_microstep: 3850.55 | bwd_inner_microstep: 3842.91 | 
bwd_allreduce_microstep: 7.59 | step_microstep: 21.59 [2024-11-13 23:07:49,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.70 | bwd: 3850.56 | bwd_inner: 3842.91 | bwd_allreduce: 7.61 | step: 21.61 5%|▍ | 2423/50750 [6:25:11<79:36:04, 5.93s/it] {'loss': 0.3856, 'learning_rate': 3.996701939425748e-05, 'epoch': 2.39} 5%|▍ | 2423/50750 [6:25:11<79:36:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:07:55,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:07:55,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.01 | bwd_microstep: 3848.96 | bwd_inner_microstep: 3841.42 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.10 [2024-11-13 23:07:55,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.00 | bwd: 3848.97 | bwd_inner: 3841.42 | bwd_allreduce: 7.51 | step: 21.10 5%|▍ | 2424/50750 [6:25:16<79:34:35, 5.93s/it] {'loss': 0.0934, 'learning_rate': 3.9966946083511334e-05, 'epoch': 2.39} 5%|▍ | 2424/50750 [6:25:16<79:34:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:08:00,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:08:00,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3847.58 | bwd_inner_microstep: 3840.02 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.70 [2024-11-13 23:08:00,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3847.59 | bwd_inner: 3840.02 | bwd_allreduce: 7.53 | step: 21.71 5%|▍ | 2425/50750 [6:25:22<79:32:51, 5.93s/it] {'loss': 0.0732, 'learning_rate': 3.996687269144382e-05, 'epoch': 2.39} 5%|▍ | 2425/50750 [6:25:22<79:32:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-13 23:08:06,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 23:08:06,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3849.69 | bwd_inner_microstep: 3842.10 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.30 [2024-11-13 23:08:06,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3849.70 | bwd_inner: 3842.10 | bwd_allreduce: 7.56 | step: 21.30 5%|▍ | 2426/50750 [6:25:28<79:33:09, 5.93s/it] {'loss': 0.0031, 'learning_rate': 3.996679921805524e-05, 'epoch': 2.39} 5%|▍ | 2426/50750 [6:25:28<79:33:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:08:12,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 23:08:12,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.71 | bwd_microstep: 3844.75 | bwd_inner_microstep: 3836.96 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.66 [2024-11-13 23:08:12,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.71 | bwd: 3844.77 | bwd_inner: 3836.96 | bwd_allreduce: 7.77 | step: 21.67 5%|▍ | 2427/50750 [6:25:34<79:32:13, 5.93s/it] {'loss': 0.0032, 'learning_rate': 3.99667256633459e-05, 'epoch': 2.39} 5%|▍ | 2427/50750 [6:25:34<79:32:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:08:18,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:08:18,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.05 | bwd_microstep: 3842.94 | bwd_inner_microstep: 3835.41 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.06 [2024-11-13 23:08:18,720] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.05 | bwd: 3842.96 | bwd_inner: 3835.41 | bwd_allreduce: 7.51 | step: 21.06 5%|▍ | 2428/50750 [6:25:40<79:29:39, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.9966652027316094e-05, 'epoch': 2.39} 5%|▍ | 2428/50750 [6:25:40<79:29:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:08:24,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:08:24,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.77 | bwd_microstep: 3851.75 | bwd_inner_microstep: 3844.02 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.10 [2024-11-13 23:08:24,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.77 | bwd: 3851.77 | bwd_inner: 3844.02 | bwd_allreduce: 7.71 | step: 22.10 5%|▍ | 2429/50750 [6:25:46<79:30:18, 5.92s/it] {'loss': 0.1661, 'learning_rate': 3.996657830996612e-05, 'epoch': 2.39} 5%|▍ | 2429/50750 [6:25:46<79:30:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:08:30,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.66 | optimizer_step: 4.92 [2024-11-13 23:08:30,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.36 | bwd_microstep: 3846.45 | bwd_inner_microstep: 3838.32 | bwd_allreduce_microstep: 8.06 | step_microstep: 28.80 [2024-11-13 23:08:30,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.36 | bwd: 3846.47 | bwd_inner: 3838.32 | bwd_allreduce: 8.09 | step: 28.81 5%|▍ | 2430/50750 [6:25:52<79:32:55, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.996650451129628e-05, 'epoch': 2.39} 5%|▍ | 2430/50750 [6:25:52<79:32:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:08:36,494] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:08:36,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.74 | bwd_microstep: 3840.22 | bwd_inner_microstep: 3832.66 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.33 [2024-11-13 23:08:36,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.74 | bwd: 3840.23 | bwd_inner: 3832.66 | bwd_allreduce: 7.54 | step: 21.33 5%|▍ | 2431/50750 [6:25:58<79:30:23, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.9966430631306876e-05, 'epoch': 2.4} 5%|▍ | 2431/50750 [6:25:58<79:30:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:08:42,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:08:42,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.15 | bwd_microstep: 3844.30 | bwd_inner_microstep: 3836.77 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.08 [2024-11-13 23:08:42,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.15 | bwd: 3844.32 | bwd_inner: 3836.77 | bwd_allreduce: 7.50 | step: 21.08 5%|▍ | 2432/50750 [6:26:04<79:28:59, 5.92s/it] {'loss': 0.0105, 'learning_rate': 3.9966356669998206e-05, 'epoch': 2.4} 5%|▍ | 2432/50750 [6:26:04<79:28:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:08:48,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 23:08:48,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.39 | bwd_microstep: 3845.31 | bwd_inner_microstep: 3837.77 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.56 [2024-11-13 23:08:48,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.39 | bwd: 3845.32 | bwd_inner: 3837.77 | 
bwd_allreduce: 7.51 | step: 21.56 5%|▍ | 2433/50750 [6:26:10<79:28:43, 5.92s/it] {'loss': 0.1027, 'learning_rate': 3.996628262737058e-05, 'epoch': 2.4} 5%|▍ | 2433/50750 [6:26:10<79:28:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:08:54,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:08:54,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.95 | bwd_microstep: 3845.89 | bwd_inner_microstep: 3838.36 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.27 [2024-11-13 23:08:54,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.95 | bwd: 3845.90 | bwd_inner: 3838.36 | bwd_allreduce: 7.50 | step: 21.27 5%|▍ | 2434/50750 [6:26:16<79:29:26, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.996620850342429e-05, 'epoch': 2.4} 5%|▍ | 2434/50750 [6:26:16<79:29:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:09:00,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:09:00,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.59 | bwd_microstep: 3842.93 | bwd_inner_microstep: 3835.38 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.19 [2024-11-13 23:09:00,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.58 | bwd: 3842.94 | bwd_inner: 3835.38 | bwd_allreduce: 7.52 | step: 21.20 5%|▍ | 2435/50750 [6:26:22<79:27:04, 5.92s/it] {'loss': 0.6002, 'learning_rate': 3.996613429815963e-05, 'epoch': 2.4} 5%|▍ | 2435/50750 [6:26:22<79:27:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:09:06,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 
23:09:06,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.14 | bwd_microstep: 3847.71 | bwd_inner_microstep: 3840.21 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.29 [2024-11-13 23:09:06,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3847.73 | bwd_inner: 3840.21 | bwd_allreduce: 7.48 | step: 21.30 5%|▍ | 2436/50750 [6:26:28<79:26:37, 5.92s/it] {'loss': 0.0091, 'learning_rate': 3.996606001157693e-05, 'epoch': 2.4} 5%|▍ | 2436/50750 [6:26:28<79:26:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:09:12,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:09:12,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.67 | bwd_microstep: 3844.28 | bwd_inner_microstep: 3836.72 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.51 [2024-11-13 23:09:12,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.67 | bwd: 3844.29 | bwd_inner: 3836.72 | bwd_allreduce: 7.53 | step: 21.52 5%|▍ | 2437/50750 [6:26:33<79:25:58, 5.92s/it] {'loss': 0.0089, 'learning_rate': 3.9965985643676466e-05, 'epoch': 2.4} 5%|▍ | 2437/50750 [6:26:33<79:25:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:09:17,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:09:17,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.58 | bwd_microstep: 3846.88 | bwd_inner_microstep: 3839.43 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-13 23:09:17,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.56 | bwd: 3846.90 | bwd_inner: 3839.42 | bwd_allreduce: 7.43 | step: 20.89 5%|▍ | 2438/50750 [6:26:39<79:27:20, 5.92s/it] {'loss': 0.0027, 
'learning_rate': 3.996591119445855e-05, 'epoch': 2.4} 5%|▍ | 2438/50750 [6:26:39<79:27:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:09:23,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:09:23,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.30 | bwd_microstep: 3835.31 | bwd_inner_microstep: 3827.76 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.19 [2024-11-13 23:09:23,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3835.32 | bwd_inner: 3827.76 | bwd_allreduce: 7.52 | step: 21.20 5%|▍ | 2439/50750 [6:26:45<79:23:47, 5.92s/it] {'loss': 0.0034, 'learning_rate': 3.996583666392349e-05, 'epoch': 2.4} 5%|▍ | 2439/50750 [6:26:45<79:23:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:09:29,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:09:29,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.40 | bwd_microstep: 3842.60 | bwd_inner_microstep: 3835.06 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.05 [2024-11-13 23:09:29,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.40 | bwd: 3842.61 | bwd_inner: 3835.06 | bwd_allreduce: 7.51 | step: 21.05 5%|▍ | 2440/50750 [6:26:51<79:23:44, 5.92s/it] {'loss': 0.0042, 'learning_rate': 3.9965762052071586e-05, 'epoch': 2.4} 5%|▍ | 2440/50750 [6:26:51<79:23:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 23:09:35,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 5.01 [2024-11-13 23:09:35,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.11 | 
bwd_microstep: 3845.23 | bwd_inner_microstep: 3837.67 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.70 [2024-11-13 23:09:35,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.10 | bwd: 3845.24 | bwd_inner: 3837.67 | bwd_allreduce: 7.53 | step: 21.70 5%|▍ | 2441/50750 [6:26:57<79:24:03, 5.92s/it] {'loss': 0.0092, 'learning_rate': 3.9965687358903134e-05, 'epoch': 2.4} 5%|▍ | 2441/50750 [6:26:57<79:24:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:09:41,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:09:41,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.83 | bwd_microstep: 3842.38 | bwd_inner_microstep: 3834.90 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.08 [2024-11-13 23:09:41,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.83 | bwd: 3842.40 | bwd_inner: 3834.90 | bwd_allreduce: 7.46 | step: 21.08 5%|▍ | 2442/50750 [6:27:03<79:23:31, 5.92s/it] {'loss': 0.0058, 'learning_rate': 3.996561258441845e-05, 'epoch': 2.41} 5%|▍ | 2442/50750 [6:27:03<79:23:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:09:47,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.01 [2024-11-13 23:09:47,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.47 | bwd_microstep: 3843.53 | bwd_inner_microstep: 3836.04 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.12 [2024-11-13 23:09:47,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.47 | bwd: 3843.54 | bwd_inner: 3836.04 | bwd_allreduce: 7.46 | step: 21.12 5%|▍ | 2443/50750 [6:27:09<79:24:18, 5.92s/it] {'loss': 0.0046, 'learning_rate': 3.996553772861783e-05, 'epoch': 2.41} 5%|▍ | 2443/50750 [6:27:09<79:24:18, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:09:53,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:09:53,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.85 | bwd_microstep: 3851.02 | bwd_inner_microstep: 3843.54 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.91 [2024-11-13 23:09:53,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.85 | bwd: 3851.03 | bwd_inner: 3843.54 | bwd_allreduce: 7.45 | step: 20.91 5%|▍ | 2444/50750 [6:27:15<79:25:53, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.996546279150159e-05, 'epoch': 2.41} 5%|▍ | 2444/50750 [6:27:15<79:25:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:09:59,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:09:59,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.28 | bwd_microstep: 3850.49 | bwd_inner_microstep: 3842.96 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.08 [2024-11-13 23:09:59,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.28 | bwd: 3850.51 | bwd_inner: 3842.96 | bwd_allreduce: 7.50 | step: 21.08 5%|▍ | 2445/50750 [6:27:21<79:26:00, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.9965387773070016e-05, 'epoch': 2.41} 5%|▍ | 2445/50750 [6:27:21<79:26:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:10:05,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:10:05,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.53 | bwd_microstep: 3846.76 | bwd_inner_microstep: 3839.22 | bwd_allreduce_microstep: 7.50 | 
step_microstep: 21.58 [2024-11-13 23:10:05,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.53 | bwd: 3846.78 | bwd_inner: 3839.22 | bwd_allreduce: 7.52 | step: 21.58 5%|▍ | 2446/50750 [6:27:27<79:25:31, 5.92s/it] {'loss': 0.0181, 'learning_rate': 3.996531267332343e-05, 'epoch': 2.41} 5%|▍ | 2446/50750 [6:27:27<79:25:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:10:11,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 23:10:11,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.22 | bwd_microstep: 3854.71 | bwd_inner_microstep: 3847.18 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.52 [2024-11-13 23:10:11,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.21 | bwd: 3854.73 | bwd_inner: 3847.18 | bwd_allreduce: 7.50 | step: 21.52 5%|▍ | 2447/50750 [6:27:33<79:28:57, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.996523749226214e-05, 'epoch': 2.41} 5%|▍ | 2447/50750 [6:27:33<79:28:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:10:17,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-13 23:10:17,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.80 | bwd_microstep: 3846.45 | bwd_inner_microstep: 3838.82 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.74 [2024-11-13 23:10:17,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.80 | bwd: 3846.46 | bwd_inner: 3838.82 | bwd_allreduce: 7.60 | step: 21.75 5%|▍ | 2448/50750 [6:27:39<79:27:58, 5.92s/it] {'loss': 0.2667, 'learning_rate': 3.996516222988644e-05, 'epoch': 2.41} 5%|▍ | 2448/50750 [6:27:39<79:27:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 
23:10:23,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:10:23,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.48 | bwd_microstep: 3847.23 | bwd_inner_microstep: 3839.66 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.37 [2024-11-13 23:10:23,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.46 | bwd: 3847.24 | bwd_inner: 3839.66 | bwd_allreduce: 7.54 | step: 21.37 5%|▍ | 2449/50750 [6:27:45<79:29:53, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.9965086886196643e-05, 'epoch': 2.41} 5%|▍ | 2449/50750 [6:27:45<79:29:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:10:28,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 23:10:28,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.80 | bwd_microstep: 3847.09 | bwd_inner_microstep: 3839.59 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.14 [2024-11-13 23:10:28,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.80 | bwd: 3847.11 | bwd_inner: 3839.59 | bwd_allreduce: 7.48 | step: 21.14 5%|▍ | 2450/50750 [6:27:50<79:29:22, 5.92s/it] {'loss': 0.3916, 'learning_rate': 3.996501146119305e-05, 'epoch': 2.41} 5%|▍ | 2450/50750 [6:27:50<79:29:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:10:34,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:10:34,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.16 | bwd_microstep: 3838.65 | bwd_inner_microstep: 3831.19 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.94 [2024-11-13 23:10:34,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.16 | bwd: 3838.66 | bwd_inner: 3831.19 | bwd_allreduce: 7.43 | step: 20.94 5%|▍ | 2451/50750 [6:27:56<79:25:25, 5.92s/it] {'loss': 1.6363, 'learning_rate': 3.9964935954875986e-05, 'epoch': 2.41} 5%|▍ | 2451/50750 [6:27:56<79:25:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:10:40,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:10:40,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.14 | bwd_microstep: 3849.27 | bwd_inner_microstep: 3841.76 | bwd_allreduce_microstep: 7.47 | step_microstep: 22.01 [2024-11-13 23:10:40,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3849.28 | bwd_inner: 3841.76 | bwd_allreduce: 7.48 | step: 22.02 5%|▍ | 2452/50750 [6:28:02<79:25:42, 5.92s/it] {'loss': 0.9321, 'learning_rate': 3.996486036724573e-05, 'epoch': 2.42} 5%|▍ | 2452/50750 [6:28:02<79:25:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:10:46,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 23:10:46,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.26 | bwd_microstep: 3843.13 | bwd_inner_microstep: 3835.60 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.19 [2024-11-13 23:10:46,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.26 | bwd: 3843.14 | bwd_inner: 3835.60 | bwd_allreduce: 7.51 | step: 21.19 5%|▍ | 2453/50750 [6:28:08<79:26:02, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.996478469830261e-05, 'epoch': 2.42} 5%|▍ | 2453/50750 [6:28:08<79:26:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:10:52,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-13 23:10:52,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.77 | bwd_microstep: 3850.93 | bwd_inner_microstep: 3843.41 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.14 [2024-11-13 23:10:52,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.77 | bwd: 3850.94 | bwd_inner: 3843.41 | bwd_allreduce: 7.49 | step: 21.14 5%|▍ | 2454/50750 [6:28:14<79:26:25, 5.92s/it] {'loss': 0.3096, 'learning_rate': 3.9964708948046934e-05, 'epoch': 2.42} 5%|▍ | 2454/50750 [6:28:14<79:26:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:10:58,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:10:58,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.40 | bwd_microstep: 3842.34 | bwd_inner_microstep: 3834.81 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.29 [2024-11-13 23:10:58,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.39 | bwd: 3842.35 | bwd_inner: 3834.81 | bwd_allreduce: 7.50 | step: 21.29 5%|▍ | 2455/50750 [6:28:20<79:24:44, 5.92s/it] {'loss': 0.1226, 'learning_rate': 3.996463311647901e-05, 'epoch': 2.42} 5%|▍ | 2455/50750 [6:28:20<79:24:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:11:04,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-13 23:11:04,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.62 | bwd_microstep: 3849.68 | bwd_inner_microstep: 3841.70 | bwd_allreduce_microstep: 7.93 | step_microstep: 22.61 [2024-11-13 23:11:04,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.62 | bwd: 3849.70 | bwd_inner: 3841.70 | bwd_allreduce: 7.96 | step: 22.62 5%|▍ | 
2456/50750 [6:28:26<79:26:06, 5.92s/it] {'loss': 0.0031, 'learning_rate': 3.996455720359914e-05, 'epoch': 2.42} 5%|▍ | 2456/50750 [6:28:26<79:26:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:11:10,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-13 23:11:10,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.77 | bwd_microstep: 3838.98 | bwd_inner_microstep: 3831.04 | bwd_allreduce_microstep: 7.88 | step_microstep: 25.58 [2024-11-13 23:11:10,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.75 | bwd: 3839.00 | bwd_inner: 3831.04 | bwd_allreduce: 7.91 | step: 25.58 5%|▍ | 2457/50750 [6:28:32<79:27:23, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.996448120940763e-05, 'epoch': 2.42} 5%|▍ | 2457/50750 [6:28:32<79:27:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:11:16,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 23:11:16,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.46 | bwd_microstep: 3840.14 | bwd_inner_microstep: 3832.37 | bwd_allreduce_microstep: 7.72 | step_microstep: 26.30 [2024-11-13 23:11:16,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.44 | bwd: 3840.16 | bwd_inner: 3832.37 | bwd_allreduce: 7.74 | step: 26.30 5%|▍ | 2458/50750 [6:28:38<79:26:33, 5.92s/it] {'loss': 0.2408, 'learning_rate': 3.9964405133904796e-05, 'epoch': 2.42} 5%|▍ | 2458/50750 [6:28:38<79:26:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:11:22,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.76 | optimizer_step: 4.93 [2024-11-13 23:11:22,259] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.82 | bwd_microstep: 3838.07 | bwd_inner_microstep: 3830.60 | bwd_allreduce_microstep: 7.43 | step_microstep: 22.90 [2024-11-13 23:11:22,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3838.08 | bwd_inner: 3830.60 | bwd_allreduce: 7.44 | step: 22.92 5%|▍ | 2459/50750 [6:28:44<79:24:08, 5.92s/it] {'loss': 0.5133, 'learning_rate': 3.996432897709094e-05, 'epoch': 2.42} 5%|▍ | 2459/50750 [6:28:44<79:24:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:11:28,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.92 | optimizer_step: 4.93 [2024-11-13 23:11:28,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.71 | bwd_microstep: 3839.53 | bwd_inner_microstep: 3831.58 | bwd_allreduce_microstep: 7.89 | step_microstep: 31.83 [2024-11-13 23:11:28,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.71 | bwd: 3839.55 | bwd_inner: 3831.58 | bwd_allreduce: 7.92 | step: 31.83 5%|▍ | 2460/50750 [6:28:50<79:27:34, 5.92s/it] {'loss': 0.0173, 'learning_rate': 3.996425273896639e-05, 'epoch': 2.42} 5%|▍ | 2460/50750 [6:28:50<79:27:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:11:34,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 23:11:34,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.42 | bwd_microstep: 3840.73 | bwd_inner_microstep: 3832.39 | bwd_allreduce_microstep: 8.29 | step_microstep: 21.21 [2024-11-13 23:11:34,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.41 | bwd: 3840.74 | bwd_inner: 3832.39 | bwd_allreduce: 8.31 | step: 21.21 5%|▍ | 2461/50750 [6:28:56<79:24:49, 5.92s/it] {'loss': 0.3534, 'learning_rate': 
3.9964176419531455e-05, 'epoch': 2.42} 5%|▍ | 2461/50750 [6:28:56<79:24:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:11:40,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 23:11:40,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.41 | bwd_microstep: 3839.07 | bwd_inner_microstep: 3831.54 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.22 [2024-11-13 23:11:40,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.41 | bwd: 3839.08 | bwd_inner: 3831.54 | bwd_allreduce: 7.50 | step: 21.22 5%|▍ | 2462/50750 [6:29:01<79:23:10, 5.92s/it] {'loss': 0.4594, 'learning_rate': 3.9964100018786424e-05, 'epoch': 2.43} 5%|▍ | 2462/50750 [6:29:01<79:23:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:11:45,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.07 [2024-11-13 23:11:45,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.46 | bwd_microstep: 3838.55 | bwd_inner_microstep: 3830.85 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.41 [2024-11-13 23:11:45,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.44 | bwd: 3838.56 | bwd_inner: 3830.85 | bwd_allreduce: 7.67 | step: 21.41 5%|▍ | 2463/50750 [6:29:07<79:21:09, 5.92s/it] {'loss': 0.413, 'learning_rate': 3.9964023536731624e-05, 'epoch': 2.43} 5%|▍ | 2463/50750 [6:29:07<79:21:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:11:51,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:11:51,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.76 | bwd_microstep: 
3840.51 | bwd_inner_microstep: 3832.98 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.67 [2024-11-13 23:11:51,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.76 | bwd: 3840.52 | bwd_inner: 3832.98 | bwd_allreduce: 7.50 | step: 21.67 5%|▍ | 2464/50750 [6:29:13<79:19:49, 5.91s/it] {'loss': 0.001, 'learning_rate': 3.996394697336737e-05, 'epoch': 2.43} 5%|▍ | 2464/50750 [6:29:13<79:19:49, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:11:57,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:11:57,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.67 | bwd_microstep: 3839.10 | bwd_inner_microstep: 3831.44 | bwd_allreduce_microstep: 7.62 | step_microstep: 20.99 [2024-11-13 23:11:57,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.67 | bwd: 3839.11 | bwd_inner: 3831.44 | bwd_allreduce: 7.64 | step: 20.99 5%|▍ | 2465/50750 [6:29:19<79:17:59, 5.91s/it] {'loss': 0.001, 'learning_rate': 3.996387032869396e-05, 'epoch': 2.43} 5%|▍ | 2465/50750 [6:29:19<79:17:59, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:12:03,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:12:03,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.77 | bwd_microstep: 3841.23 | bwd_inner_microstep: 3833.76 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.23 [2024-11-13 23:12:03,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.77 | bwd: 3841.24 | bwd_inner: 3833.76 | bwd_allreduce: 7.44 | step: 21.24 5%|▍ | 2466/50750 [6:29:25<79:17:03, 5.91s/it] {'loss': 0.7681, 'learning_rate': 3.9963793602711714e-05, 'epoch': 2.43} 5%|▍ | 2466/50750 [6:29:25<79:17:03, 5.91s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:12:09,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:12:09,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.93 | bwd_microstep: 3837.18 | bwd_inner_microstep: 3829.53 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.42 [2024-11-13 23:12:09,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.93 | bwd: 3837.19 | bwd_inner: 3829.53 | bwd_allreduce: 7.62 | step: 21.42 5%|▍ | 2467/50750 [6:29:31<79:15:57, 5.91s/it] {'loss': 0.0045, 'learning_rate': 3.996371679542094e-05, 'epoch': 2.43} 5%|▍ | 2467/50750 [6:29:31<79:15:57, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:12:15,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.92 [2024-11-13 23:12:15,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.18 | bwd_microstep: 3842.32 | bwd_inner_microstep: 3834.51 | bwd_allreduce_microstep: 7.76 | step_microstep: 22.42 [2024-11-13 23:12:15,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.18 | bwd: 3842.33 | bwd_inner: 3834.51 | bwd_allreduce: 7.78 | step: 22.42 5%|▍ | 2468/50750 [6:29:37<79:17:58, 5.91s/it] {'loss': 0.1802, 'learning_rate': 3.996363990682196e-05, 'epoch': 2.43} 5%|▍ | 2468/50750 [6:29:37<79:17:58, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:12:21,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:12:21,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.13 | bwd_microstep: 3853.13 | bwd_inner_microstep: 3843.55 | bwd_allreduce_microstep: 9.53 | step_microstep: 21.29 [2024-11-13 
23:12:21,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.11 | bwd: 3853.14 | bwd_inner: 3843.55 | bwd_allreduce: 9.55 | step: 21.29 5%|▍ | 2469/50750 [6:29:43<79:22:18, 5.92s/it] {'loss': 0.014, 'learning_rate': 3.9963562936915083e-05, 'epoch': 2.43} 5%|▍ | 2469/50750 [6:29:43<79:22:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:12:27,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:12:27,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.44 | bwd_microstep: 3839.09 | bwd_inner_microstep: 3831.58 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.15 [2024-11-13 23:12:27,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.43 | bwd: 3839.10 | bwd_inner: 3831.58 | bwd_allreduce: 7.48 | step: 21.15 5%|▍ | 2470/50750 [6:29:49<79:22:31, 5.92s/it] {'loss': 0.0034, 'learning_rate': 3.996348588570062e-05, 'epoch': 2.43} 5%|▍ | 2470/50750 [6:29:49<79:22:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:12:33,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:12:33,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.32 | bwd_microstep: 3847.29 | bwd_inner_microstep: 3839.77 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.32 [2024-11-13 23:12:33,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.32 | bwd: 3847.31 | bwd_inner: 3839.77 | bwd_allreduce: 7.49 | step: 21.32 5%|▍ | 2471/50750 [6:29:55<79:23:17, 5.92s/it] {'loss': 0.0069, 'learning_rate': 3.9963408753178884e-05, 'epoch': 2.43} 5%|▍ | 2471/50750 [6:29:55<79:23:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:12:39,173] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:12:39,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.31 | bwd_microstep: 3846.41 | bwd_inner_microstep: 3838.94 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.83 [2024-11-13 23:12:39,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.31 | bwd: 3846.42 | bwd_inner: 3838.94 | bwd_allreduce: 7.44 | step: 20.83 5%|▍ | 2472/50750 [6:30:01<79:22:13, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.9963331539350196e-05, 'epoch': 2.44} 5%|▍ | 2472/50750 [6:30:01<79:22:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:12:45,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.56 | optimizer_step: 4.97 [2024-11-13 23:12:45,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.54 | bwd_microstep: 3839.16 | bwd_inner_microstep: 3831.58 | bwd_allreduce_microstep: 7.53 | step_microstep: 22.90 [2024-11-13 23:12:45,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.54 | bwd: 3839.17 | bwd_inner: 3831.58 | bwd_allreduce: 7.55 | step: 22.90 5%|▍ | 2473/50750 [6:30:07<79:20:24, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9963254244214864e-05, 'epoch': 2.44} 5%|▍ | 2473/50750 [6:30:07<79:20:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 23:12:51,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:12:51,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.21 | bwd_microstep: 3842.93 | bwd_inner_microstep: 3834.71 | bwd_allreduce_microstep: 8.17 | step_microstep: 21.83 [2024-11-13 23:12:51,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.20 | bwd: 
3842.94 | bwd_inner: 3834.71 | bwd_allreduce: 8.19 | step: 21.84 5%|▍ | 2474/50750 [6:30:12<79:22:20, 5.92s/it] {'loss': 0.0016, 'learning_rate': 3.996317686777321e-05, 'epoch': 2.44} 5%|▍ | 2474/50750 [6:30:12<79:22:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:12:56,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:12:56,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.71 | bwd_microstep: 3838.77 | bwd_inner_microstep: 3831.20 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.26 [2024-11-13 23:12:56,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.70 | bwd: 3838.79 | bwd_inner: 3831.20 | bwd_allreduce: 7.54 | step: 21.26 5%|▍ | 2475/50750 [6:30:18<79:19:52, 5.92s/it] {'loss': 0.2917, 'learning_rate': 3.996309941002554e-05, 'epoch': 2.44} 5%|▍ | 2475/50750 [6:30:18<79:19:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:13:02,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-13 23:13:02,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.54 | bwd_microstep: 3850.72 | bwd_inner_microstep: 3843.02 | bwd_allreduce_microstep: 7.66 | step_microstep: 22.11 [2024-11-13 23:13:02,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.54 | bwd: 3850.74 | bwd_inner: 3843.02 | bwd_allreduce: 7.67 | step: 22.12 5%|▍ | 2476/50750 [6:30:24<79:22:37, 5.92s/it] {'loss': 0.0441, 'learning_rate': 3.9963021870972166e-05, 'epoch': 2.44} 5%|▍ | 2476/50750 [6:30:24<79:22:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:13:08,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 
| optimizer_step: 4.93 [2024-11-13 23:13:08,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2038.75 | bwd_microstep: 3843.74 | bwd_inner_microstep: 3836.04 | bwd_allreduce_microstep: 7.65 | step_microstep: 23.35 [2024-11-13 23:13:08,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2038.73 | bwd: 3843.75 | bwd_inner: 3836.05 | bwd_allreduce: 7.66 | step: 23.36 5%|▍ | 2477/50750 [6:30:30<79:27:40, 5.93s/it] {'loss': 0.0041, 'learning_rate': 3.996294425061342e-05, 'epoch': 2.44} 5%|▍ | 2477/50750 [6:30:30<79:27:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:13:14,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 23:13:14,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.10 | bwd_microstep: 3841.23 | bwd_inner_microstep: 3833.70 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.23 [2024-11-13 23:13:14,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.09 | bwd: 3841.24 | bwd_inner: 3833.70 | bwd_allreduce: 7.50 | step: 21.23 5%|▍ | 2478/50750 [6:30:36<79:27:09, 5.93s/it] {'loss': 0.1452, 'learning_rate': 3.99628665489496e-05, 'epoch': 2.44} 5%|▍ | 2478/50750 [6:30:36<79:27:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:13:20,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:13:20,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.79 | bwd_microstep: 3836.74 | bwd_inner_microstep: 3829.10 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.19 [2024-11-13 23:13:20,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.78 | bwd: 3836.76 | bwd_inner: 3829.10 | bwd_allreduce: 7.61 | step: 21.20 5%|▍ | 2479/50750 [6:30:42<79:23:03, 
5.92s/it] {'loss': 0.0099, 'learning_rate': 3.9962788765981046e-05, 'epoch': 2.44} 5%|▍ | 2479/50750 [6:30:42<79:23:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:13:26,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 23:13:26,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.55 | bwd_microstep: 3850.75 | bwd_inner_microstep: 3843.29 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.17 [2024-11-13 23:13:26,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.52 | bwd: 3850.76 | bwd_inner: 3843.29 | bwd_allreduce: 7.43 | step: 21.20 5%|▍ | 2480/50750 [6:30:48<79:23:02, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.996271090170805e-05, 'epoch': 2.44} 5%|▍ | 2480/50750 [6:30:48<79:23:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:13:32,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:13:32,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.53 | bwd_microstep: 3856.01 | bwd_inner_microstep: 3848.53 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.85 [2024-11-13 23:13:32,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.53 | bwd: 3856.02 | bwd_inner: 3848.53 | bwd_allreduce: 7.45 | step: 20.85 5%|▍ | 2481/50750 [6:30:54<79:23:54, 5.92s/it] {'loss': 0.0229, 'learning_rate': 3.9962632956130943e-05, 'epoch': 2.44} 5%|▍ | 2481/50750 [6:30:54<79:23:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:13:38,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:13:38,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2024.63 | bwd_microstep: 3842.22 | bwd_inner_microstep: 3834.70 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-13 23:13:38,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.63 | bwd: 3842.24 | bwd_inner: 3834.70 | bwd_allreduce: 7.50 | step: 21.10 5%|▍ | 2482/50750 [6:31:00<79:22:05, 5.92s/it] {'loss': 0.0021, 'learning_rate': 3.996255492925004e-05, 'epoch': 2.45} 5%|▍ | 2482/50750 [6:31:00<79:22:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:13:44,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:13:44,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.40 | bwd_microstep: 3838.00 | bwd_inner_microstep: 3830.48 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.27 [2024-11-13 23:13:44,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.38 | bwd: 3838.02 | bwd_inner: 3830.48 | bwd_allreduce: 7.49 | step: 21.28 5%|▍ | 2483/50750 [6:31:06<79:20:28, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.996247682106566e-05, 'epoch': 2.45} 5%|▍ | 2483/50750 [6:31:06<79:20:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:13:50,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:13:50,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.34 | bwd_microstep: 3842.08 | bwd_inner_microstep: 3834.59 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.90 [2024-11-13 23:13:50,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.34 | bwd: 3842.09 | bwd_inner: 3834.59 | bwd_allreduce: 7.46 | step: 20.90 5%|▍ | 2484/50750 [6:31:12<79:18:44, 5.92s/it] {'loss': 0.0231, 'learning_rate': 3.996239863157811e-05, 'epoch': 2.45} 5%|▍ | 2484/50750 
[6:31:12<79:18:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:13:56,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:13:56,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.65 | bwd_microstep: 3840.62 | bwd_inner_microstep: 3833.13 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.10 [2024-11-13 23:13:56,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.65 | bwd: 3840.63 | bwd_inner: 3833.13 | bwd_allreduce: 7.46 | step: 21.10 5%|▍ | 2485/50750 [6:31:18<79:17:06, 5.91s/it] {'loss': 0.1916, 'learning_rate': 3.9962320360787724e-05, 'epoch': 2.45} 5%|▍ | 2485/50750 [6:31:18<79:17:06, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:14:02,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:14:02,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.63 | bwd_microstep: 3846.43 | bwd_inner_microstep: 3838.96 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.09 [2024-11-13 23:14:02,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.64 | bwd: 3846.44 | bwd_inner: 3838.96 | bwd_allreduce: 7.44 | step: 21.09 5%|▍ | 2486/50750 [6:31:23<79:18:01, 5.91s/it] {'loss': 0.0369, 'learning_rate': 3.996224200869482e-05, 'epoch': 2.45} 5%|▍ | 2486/50750 [6:31:23<79:18:01, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:14:07,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 23:14:07,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.71 | bwd_microstep: 3843.83 | bwd_inner_microstep: 3836.38 | 
bwd_allreduce_microstep: 7.42 | step_microstep: 20.83 [2024-11-13 23:14:07,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.71 | bwd: 3843.85 | bwd_inner: 3836.38 | bwd_allreduce: 7.43 | step: 20.84 5%|▍ | 2487/50750 [6:31:29<79:17:25, 5.91s/it] {'loss': 0.0358, 'learning_rate': 3.99621635752997e-05, 'epoch': 2.45} 5%|▍ | 2487/50750 [6:31:29<79:17:25, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:14:13,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:14:13,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3845.87 | bwd_inner_microstep: 3838.42 | bwd_allreduce_microstep: 7.40 | step_microstep: 20.88 [2024-11-13 23:14:13,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3845.88 | bwd_inner: 3838.42 | bwd_allreduce: 7.42 | step: 20.89 5%|▍ | 2488/50750 [6:31:35<79:17:48, 5.91s/it] {'loss': 0.068, 'learning_rate': 3.99620850606027e-05, 'epoch': 2.45} 5%|▍ | 2488/50750 [6:31:35<79:17:48, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:14:19,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:14:19,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3841.22 | bwd_inner_microstep: 3833.76 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.82 [2024-11-13 23:14:19,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.85 | bwd: 3841.23 | bwd_inner: 3833.76 | bwd_allreduce: 7.44 | step: 20.82 5%|▍ | 2489/50750 [6:31:41<79:17:07, 5.91s/it] {'loss': 0.032, 'learning_rate': 3.9962006464604135e-05, 'epoch': 2.45} 5%|▍ | 2489/50750 [6:31:41<79:17:07, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-13 23:14:25,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:14:25,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.86 | bwd_microstep: 3840.18 | bwd_inner_microstep: 3832.70 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.91 [2024-11-13 23:14:25,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.86 | bwd: 3840.19 | bwd_inner: 3832.70 | bwd_allreduce: 7.45 | step: 20.91 5%|▍ | 2490/50750 [6:31:47<79:16:05, 5.91s/it] {'loss': 0.0218, 'learning_rate': 3.9961927787304326e-05, 'epoch': 2.45} 5%|▍ | 2490/50750 [6:31:47<79:16:05, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:14:31,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:14:31,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.73 | bwd_microstep: 3845.64 | bwd_inner_microstep: 3838.18 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.92 [2024-11-13 23:14:31,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.73 | bwd: 3845.66 | bwd_inner: 3838.18 | bwd_allreduce: 7.44 | step: 20.93 5%|▍ | 2491/50750 [6:31:53<79:16:47, 5.91s/it] {'loss': 0.0011, 'learning_rate': 3.996184902870359e-05, 'epoch': 2.45} 5%|▍ | 2491/50750 [6:31:53<79:16:47, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:14:37,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:14:37,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.79 | bwd_microstep: 3844.21 | bwd_inner_microstep: 3836.74 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.87 [2024-11-13 23:14:37,513] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.79 | bwd: 3844.23 | bwd_inner: 3836.74 | bwd_allreduce: 7.44 | step: 20.88 5%|▍ | 2492/50750 [6:31:59<79:16:40, 5.91s/it] {'loss': 0.4894, 'learning_rate': 3.9961770188802254e-05, 'epoch': 2.46} 5%|▍ | 2492/50750 [6:31:59<79:16:40, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 23:14:43,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:14:43,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.92 | bwd_microstep: 3836.39 | bwd_inner_microstep: 3828.91 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.89 [2024-11-13 23:14:43,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.92 | bwd: 3836.40 | bwd_inner: 3828.91 | bwd_allreduce: 7.46 | step: 20.89 5%|▍ | 2493/50750 [6:32:05<79:14:33, 5.91s/it] {'loss': 0.0057, 'learning_rate': 3.996169126760063e-05, 'epoch': 2.46} 5%|▍ | 2493/50750 [6:32:05<79:14:33, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:14:49,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:14:49,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.00 | bwd_microstep: 3848.72 | bwd_inner_microstep: 3841.26 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-13 23:14:49,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.01 | bwd: 3848.73 | bwd_inner: 3841.26 | bwd_allreduce: 7.44 | step: 20.88 5%|▍ | 2494/50750 [6:32:11<79:15:54, 5.91s/it] {'loss': 0.0006, 'learning_rate': 3.9961612265099046e-05, 'epoch': 2.46} 5%|▍ | 2494/50750 [6:32:11<79:15:54, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:14:55,256] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:14:55,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.25 | bwd_microstep: 3848.64 | bwd_inner_microstep: 3841.17 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.84 [2024-11-13 23:14:55,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.25 | bwd: 3848.65 | bwd_inner: 3841.17 | bwd_allreduce: 7.44 | step: 20.85 5%|▍ | 2495/50750 [6:32:17<79:17:26, 5.92s/it] {'loss': 0.0026, 'learning_rate': 3.996153318129782e-05, 'epoch': 2.46} 5%|▍ | 2495/50750 [6:32:17<79:17:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:15:01,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 23:15:01,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.65 | bwd_microstep: 3845.14 | bwd_inner_microstep: 3837.63 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.01 [2024-11-13 23:15:01,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.65 | bwd: 3845.16 | bwd_inner: 3837.63 | bwd_allreduce: 7.48 | step: 21.01 5%|▍ | 2496/50750 [6:32:23<79:16:52, 5.91s/it] {'loss': 0.0013, 'learning_rate': 3.996145401619728e-05, 'epoch': 2.46} 5%|▍ | 2496/50750 [6:32:23<79:16:52, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:15:07,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:15:07,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.09 | bwd_microstep: 3843.83 | bwd_inner_microstep: 3836.35 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.93 [2024-11-13 23:15:07,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.09 | bwd: 3843.85 | bwd_inner: 3836.36 | 
bwd_allreduce: 7.45 | step: 20.94 5%|▍ | 2497/50750 [6:32:29<79:15:55, 5.91s/it] {'loss': 0.0015, 'learning_rate': 3.996137476979774e-05, 'epoch': 2.46} 5%|▍ | 2497/50750 [6:32:29<79:15:55, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:15:12,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:15:12,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.11 | bwd_microstep: 3844.20 | bwd_inner_microstep: 3836.72 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.03 [2024-11-13 23:15:12,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.11 | bwd: 3844.21 | bwd_inner: 3836.72 | bwd_allreduce: 7.46 | step: 21.03 5%|▍ | 2498/50750 [6:32:34<79:15:22, 5.91s/it] {'loss': 0.0103, 'learning_rate': 3.9961295442099536e-05, 'epoch': 2.46} 5%|▍ | 2498/50750 [6:32:34<79:15:22, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:15:18,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:15:18,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.59 | bwd_microstep: 3838.23 | bwd_inner_microstep: 3830.72 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.00 [2024-11-13 23:15:18,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.59 | bwd: 3838.25 | bwd_inner: 3830.73 | bwd_allreduce: 7.48 | step: 21.01 5%|▍ | 2499/50750 [6:32:40<79:14:27, 5.91s/it] {'loss': 0.0017, 'learning_rate': 3.996121603310298e-05, 'epoch': 2.46} 5%|▍ | 2499/50750 [6:32:40<79:14:27, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:15:24,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 
[2024-11-13 23:15:24,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.39 | bwd_microstep: 3850.01 | bwd_inner_microstep: 3842.52 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.90 [2024-11-13 23:15:24,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.39 | bwd: 3850.02 | bwd_inner: 3842.52 | bwd_allreduce: 7.46 | step: 20.90 5%|▍ | 2500/50750 [6:32:46<79:16:03, 5.91s/it] {'loss': 0.0023, 'learning_rate': 3.996113654280839e-05, 'epoch': 2.46} 5%|▍ | 2500/50750 [6:32:46<79:16:03, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:15:30,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-13 23:15:30,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.49 | bwd_microstep: 3850.13 | bwd_inner_microstep: 3842.51 | bwd_allreduce_microstep: 7.58 | step_microstep: 22.02 [2024-11-13 23:15:30,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.49 | bwd: 3850.14 | bwd_inner: 3842.51 | bwd_allreduce: 7.60 | step: 22.02 5%|▍ | 2501/50750 [6:32:52<79:19:20, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.99610569712161e-05, 'epoch': 2.46} 5%|▍ | 2501/50750 [6:32:52<79:19:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:15:36,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 23:15:36,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.94 | bwd_microstep: 3839.80 | bwd_inner_microstep: 3832.26 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.57 [2024-11-13 23:15:36,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.92 | bwd: 3839.82 | bwd_inner: 3832.26 | bwd_allreduce: 7.52 | step: 21.57 5%|▍ | 2502/50750 [6:32:58<79:18:51, 5.92s/it] {'loss': 0.0621, 
'learning_rate': 3.9960977318326436e-05, 'epoch': 2.47} 5%|▍ | 2502/50750 [6:32:58<79:18:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:15:42,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:15:42,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.09 | bwd_microstep: 3857.52 | bwd_inner_microstep: 3850.02 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-13 23:15:42,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.09 | bwd: 3857.53 | bwd_inner: 3850.02 | bwd_allreduce: 7.47 | step: 21.02 5%|▍ | 2503/50750 [6:33:04<79:21:14, 5.92s/it] {'loss': 0.4157, 'learning_rate': 3.9960897584139715e-05, 'epoch': 2.47} 5%|▍ | 2503/50750 [6:33:04<79:21:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:15:48,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.51 | optimizer_step: 4.93 [2024-11-13 23:15:48,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.59 | bwd_microstep: 3856.84 | bwd_inner_microstep: 3849.30 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.75 [2024-11-13 23:15:48,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.59 | bwd: 3856.86 | bwd_inner: 3849.30 | bwd_allreduce: 7.52 | step: 22.75 5%|▍ | 2504/50750 [6:33:10<79:23:20, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.996081776865627e-05, 'epoch': 2.47} 5%|▍ | 2504/50750 [6:33:10<79:23:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:15:54,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 23:15:54,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.41 | 
bwd_microstep: 3852.03 | bwd_inner_microstep: 3844.55 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.00 [2024-11-13 23:15:54,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.41 | bwd: 3852.04 | bwd_inner: 3844.55 | bwd_allreduce: 7.45 | step: 21.01 5%|▍ | 2505/50750 [6:33:16<79:24:53, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.9960737871876416e-05, 'epoch': 2.47} 5%|▍ | 2505/50750 [6:33:16<79:24:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:16:00,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:16:00,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.36 | bwd_microstep: 3850.92 | bwd_inner_microstep: 3843.44 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.78 [2024-11-13 23:16:00,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.36 | bwd: 3850.93 | bwd_inner: 3843.44 | bwd_allreduce: 7.45 | step: 20.79 5%|▍ | 2506/50750 [6:33:22<79:24:36, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.996065789380048e-05, 'epoch': 2.47} 5%|▍ | 2506/50750 [6:33:22<79:24:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:16:06,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.95 [2024-11-13 23:16:06,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.63 | bwd_microstep: 3853.86 | bwd_inner_microstep: 3846.04 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.54 [2024-11-13 23:16:06,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.63 | bwd: 3853.87 | bwd_inner: 3846.04 | bwd_allreduce: 7.80 | step: 22.55 5%|▍ | 2507/50750 [6:33:28<79:27:29, 5.93s/it] {'loss': 0.2553, 'learning_rate': 3.99605778344288e-05, 'epoch': 2.47} 5%|▍ | 2507/50750 [6:33:28<79:27:29, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:16:12,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.41 | optimizer_step: 4.92 [2024-11-13 23:16:12,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.21 | bwd_microstep: 3849.69 | bwd_inner_microstep: 3841.85 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.45 [2024-11-13 23:16:12,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.20 | bwd: 3849.71 | bwd_inner: 3841.85 | bwd_allreduce: 7.81 | step: 22.46 5%|▍ | 2508/50750 [6:33:34<79:28:26, 5.93s/it] {'loss': 0.0263, 'learning_rate': 3.9960497693761695e-05, 'epoch': 2.47} 5%|▍ | 2508/50750 [6:33:34<79:28:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:16:18,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-13 23:16:18,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.24 | bwd_microstep: 3853.42 | bwd_inner_microstep: 3845.92 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.12 [2024-11-13 23:16:18,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.22 | bwd: 3853.44 | bwd_inner: 3845.92 | bwd_allreduce: 7.48 | step: 21.12 5%|▍ | 2509/50750 [6:33:40<79:29:50, 5.93s/it] {'loss': 0.0144, 'learning_rate': 3.9960417471799484e-05, 'epoch': 2.47} 5%|▍ | 2509/50750 [6:33:40<79:29:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:16:24,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.92 [2024-11-13 23:16:24,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.42 | bwd_microstep: 3838.24 | bwd_inner_microstep: 3830.77 | bwd_allreduce_microstep: 7.43 | 
step_microstep: 21.99 [2024-11-13 23:16:24,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.42 | bwd: 3838.25 | bwd_inner: 3830.77 | bwd_allreduce: 7.44 | step: 21.99 5%|▍ | 2510/50750 [6:33:46<79:24:08, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.99603371685425e-05, 'epoch': 2.47} 5%|▍ | 2510/50750 [6:33:46<79:24:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:16:30,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.92 [2024-11-13 23:16:30,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.81 | bwd_microstep: 3840.13 | bwd_inner_microstep: 3832.36 | bwd_allreduce_microstep: 7.72 | step_microstep: 23.01 [2024-11-13 23:16:30,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.80 | bwd: 3840.15 | bwd_inner: 3832.36 | bwd_allreduce: 7.74 | step: 23.02 5%|▍ | 2511/50750 [6:33:51<79:25:13, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.996025678399107e-05, 'epoch': 2.47} 5%|▍ | 2511/50750 [6:33:51<79:25:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:16:35,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.94 [2024-11-13 23:16:35,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.12 | bwd_microstep: 3847.23 | bwd_inner_microstep: 3839.26 | bwd_allreduce_microstep: 7.92 | step_microstep: 22.11 [2024-11-13 23:16:35,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.11 | bwd: 3847.24 | bwd_inner: 3839.26 | bwd_allreduce: 7.94 | step: 22.11 5%|▍ | 2512/50750 [6:33:57<79:27:53, 5.93s/it] {'loss': 0.0031, 'learning_rate': 3.996017631814552e-05, 'epoch': 2.47} 5%|▍ | 2512/50750 [6:33:57<79:27:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 
23:16:41,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:16:41,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.84 | bwd_microstep: 3868.17 | bwd_inner_microstep: 3860.70 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.99 [2024-11-13 23:16:41,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.82 | bwd: 3868.18 | bwd_inner: 3860.70 | bwd_allreduce: 7.44 | step: 20.99 5%|▍ | 2513/50750 [6:34:03<79:32:02, 5.94s/it] {'loss': 0.002, 'learning_rate': 3.9960095771006174e-05, 'epoch': 2.48} 5%|▍ | 2513/50750 [6:34:03<79:32:02, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:16:47,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-13 23:16:47,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.03 | bwd_microstep: 3850.03 | bwd_inner_microstep: 3842.55 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.93 [2024-11-13 23:16:47,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.02 | bwd: 3850.04 | bwd_inner: 3842.55 | bwd_allreduce: 7.45 | step: 20.93 5%|▍ | 2514/50750 [6:34:09<79:28:46, 5.93s/it] {'loss': 0.2668, 'learning_rate': 3.996001514257337e-05, 'epoch': 2.48} 5%|▍ | 2514/50750 [6:34:09<79:28:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:16:53,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:16:53,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.07 | bwd_microstep: 3836.60 | bwd_inner_microstep: 3828.83 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.88 [2024-11-13 23:16:53,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2022.07 | bwd: 3836.61 | bwd_inner: 3828.83 | bwd_allreduce: 7.74 | step: 21.88 5%|▍ | 2515/50750 [6:34:15<79:23:18, 5.93s/it] {'loss': 0.0013, 'learning_rate': 3.995993443284743e-05, 'epoch': 2.48} 5%|▍ | 2515/50750 [6:34:15<79:23:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:16:59,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:16:59,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.50 | bwd_microstep: 3830.03 | bwd_inner_microstep: 3822.52 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.09 [2024-11-13 23:16:59,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3830.05 | bwd_inner: 3822.52 | bwd_allreduce: 7.48 | step: 21.10 5%|▍ | 2516/50750 [6:34:21<79:17:47, 5.92s/it] {'loss': 0.5377, 'learning_rate': 3.995985364182868e-05, 'epoch': 2.48} 5%|▍ | 2516/50750 [6:34:21<79:17:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:17:05,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 23:17:05,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.61 | bwd_microstep: 3849.44 | bwd_inner_microstep: 3841.68 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.27 [2024-11-13 23:17:05,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.61 | bwd: 3849.45 | bwd_inner: 3841.68 | bwd_allreduce: 7.74 | step: 21.28 5%|▍ | 2517/50750 [6:34:27<79:17:55, 5.92s/it] {'loss': 0.0256, 'learning_rate': 3.995977276951746e-05, 'epoch': 2.48} 5%|▍ | 2517/50750 [6:34:27<79:17:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:17:11,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:17:11,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.83 | bwd_microstep: 3835.27 | bwd_inner_microstep: 3827.81 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.93 [2024-11-13 23:17:11,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.82 | bwd: 3835.28 | bwd_inner: 3827.81 | bwd_allreduce: 7.44 | step: 20.93 5%|▍ | 2518/50750 [6:34:33<79:16:43, 5.92s/it] {'loss': 0.0071, 'learning_rate': 3.995969181591409e-05, 'epoch': 2.48} 5%|▍ | 2518/50750 [6:34:33<79:16:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:17:17,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:17:17,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.07 | bwd_microstep: 3840.05 | bwd_inner_microstep: 3832.52 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.35 [2024-11-13 23:17:17,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.07 | bwd: 3840.07 | bwd_inner: 3832.52 | bwd_allreduce: 7.50 | step: 21.36 5%|▍ | 2519/50750 [6:34:39<79:16:26, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.99596107810189e-05, 'epoch': 2.48} 5%|▍ | 2519/50750 [6:34:39<79:16:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:17:23,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:17:23,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.22 | bwd_microstep: 3840.48 | bwd_inner_microstep: 3832.92 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.56 [2024-11-13 23:17:23,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.22 | bwd: 3840.49 | bwd_inner: 3832.92 | bwd_allreduce: 7.53 | step: 21.57 5%|▍ | 2520/50750 
[6:34:45<79:15:12, 5.92s/it] {'loss': 0.165, 'learning_rate': 3.9959529664832225e-05, 'epoch': 2.48} 5%|▍ | 2520/50750 [6:34:45<79:15:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:17:29,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:17:29,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.31 | bwd_microstep: 3844.88 | bwd_inner_microstep: 3837.07 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.48 [2024-11-13 23:17:29,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.31 | bwd: 3844.90 | bwd_inner: 3837.07 | bwd_allreduce: 7.79 | step: 21.49 5%|▍ | 2521/50750 [6:34:51<79:15:41, 5.92s/it] {'loss': 0.6267, 'learning_rate': 3.9959448467354386e-05, 'epoch': 2.48} 5%|▍ | 2521/50750 [6:34:51<79:15:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:17:35,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.37 | optimizer_step: 4.93 [2024-11-13 23:17:35,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.57 | bwd_microstep: 3835.52 | bwd_inner_microstep: 3827.44 | bwd_allreduce_microstep: 8.01 | step_microstep: 26.00 [2024-11-13 23:17:35,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.58 | bwd: 3835.54 | bwd_inner: 3827.44 | bwd_allreduce: 8.04 | step: 25.99 5%|▍ | 2522/50750 [6:34:57<79:14:15, 5.91s/it] {'loss': 0.1628, 'learning_rate': 3.9959367188585725e-05, 'epoch': 2.48} 5%|▍ | 2522/50750 [6:34:57<79:14:15, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:17:41,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-13 23:17:41,065] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2027.83 | bwd_microstep: 3846.22 | bwd_inner_microstep: 3837.68 | bwd_allreduce_microstep: 8.49 | step_microstep: 22.30 [2024-11-13 23:17:41,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.82 | bwd: 3846.23 | bwd_inner: 3837.68 | bwd_allreduce: 8.51 | step: 22.30 5%|▍ | 2523/50750 [6:35:03<79:17:21, 5.92s/it] {'loss': 0.2946, 'learning_rate': 3.9959285828526566e-05, 'epoch': 2.49} 5%|▍ | 2523/50750 [6:35:03<79:17:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:17:46,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-13 23:17:46,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.59 | bwd_microstep: 3848.77 | bwd_inner_microstep: 3841.26 | bwd_allreduce_microstep: 7.47 | step_microstep: 22.17 [2024-11-13 23:17:46,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.56 | bwd: 3848.78 | bwd_inner: 3841.26 | bwd_allreduce: 7.48 | step: 22.17 5%|▍ | 2524/50750 [6:35:08<79:19:40, 5.92s/it] {'loss': 0.0213, 'learning_rate': 3.995920438717725e-05, 'epoch': 2.49} 5%|▍ | 2524/50750 [6:35:08<79:19:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:17:52,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:17:52,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3838.49 | bwd_inner_microstep: 3830.98 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.55 [2024-11-13 23:17:52,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.29 | bwd: 3838.50 | bwd_inner: 3830.98 | bwd_allreduce: 7.48 | step: 21.55 5%|▍ | 2525/50750 [6:35:14<79:17:39, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9959122864538084e-05, 'epoch': 2.49} 
5%|▍ | 2525/50750 [6:35:14<79:17:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:17:58,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-13 23:17:58,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.32 | bwd_microstep: 3844.86 | bwd_inner_microstep: 3837.24 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.47 [2024-11-13 23:17:58,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.32 | bwd: 3844.87 | bwd_inner: 3837.24 | bwd_allreduce: 7.59 | step: 21.47 5%|▍ | 2526/50750 [6:35:20<79:16:35, 5.92s/it] {'loss': 0.0034, 'learning_rate': 3.9959041260609426e-05, 'epoch': 2.49} 5%|▍ | 2526/50750 [6:35:20<79:16:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:18:04,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:18:04,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.35 | bwd_microstep: 3840.69 | bwd_inner_microstep: 3832.05 | bwd_allreduce_microstep: 8.60 | step_microstep: 21.94 [2024-11-13 23:18:04,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.34 | bwd: 3840.70 | bwd_inner: 3832.05 | bwd_allreduce: 8.61 | step: 21.94 5%|▍ | 2527/50750 [6:35:26<79:15:55, 5.92s/it] {'loss': 0.0678, 'learning_rate': 3.9958959575391605e-05, 'epoch': 2.49} 5%|▍ | 2527/50750 [6:35:26<79:15:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:18:10,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 23:18:10,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.97 | bwd_microstep: 3843.44 | bwd_inner_microstep: 3835.65 | 
bwd_allreduce_microstep: 7.74 | step_microstep: 22.02 [2024-11-13 23:18:10,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.97 | bwd: 3843.46 | bwd_inner: 3835.65 | bwd_allreduce: 7.76 | step: 22.03 5%|▍ | 2528/50750 [6:35:32<79:16:45, 5.92s/it] {'loss': 0.0149, 'learning_rate': 3.9958877808884947e-05, 'epoch': 2.49} 5%|▍ | 2528/50750 [6:35:32<79:16:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:18:16,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:18:16,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.76 | bwd_microstep: 3840.29 | bwd_inner_microstep: 3832.76 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.16 [2024-11-13 23:18:16,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.74 | bwd: 3840.30 | bwd_inner: 3832.76 | bwd_allreduce: 7.50 | step: 21.17 5%|▍ | 2529/50750 [6:35:38<79:15:48, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.995879596108978e-05, 'epoch': 2.49} 5%|▍ | 2529/50750 [6:35:38<79:15:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:18:22,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:18:22,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.16 | bwd_microstep: 3841.46 | bwd_inner_microstep: 3833.92 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.09 [2024-11-13 23:18:22,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.16 | bwd: 3841.47 | bwd_inner: 3833.92 | bwd_allreduce: 7.51 | step: 21.09 5%|▍ | 2530/50750 [6:35:44<79:14:32, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.995871403200645e-05, 'epoch': 2.49} 5%|▍ | 2530/50750 [6:35:44<79:14:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-13 23:18:28,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:18:28,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.35 | bwd_microstep: 3836.76 | bwd_inner_microstep: 3828.98 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.63 [2024-11-13 23:18:28,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.35 | bwd: 3836.77 | bwd_inner: 3828.98 | bwd_allreduce: 7.74 | step: 21.64 5%|▍ | 2531/50750 [6:35:50<79:14:51, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.9958632021635276e-05, 'epoch': 2.49} 5%|▍ | 2531/50750 [6:35:50<79:14:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:18:34,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-13 23:18:34,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.75 | bwd_microstep: 3849.90 | bwd_inner_microstep: 3842.09 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.75 [2024-11-13 23:18:34,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.75 | bwd: 3849.92 | bwd_inner: 3842.09 | bwd_allreduce: 7.77 | step: 21.74 5%|▍ | 2532/50750 [6:35:56<79:16:32, 5.92s/it] {'loss': 0.0844, 'learning_rate': 3.99585499299766e-05, 'epoch': 2.49} 5%|▍ | 2532/50750 [6:35:56<79:16:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 23:18:40,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:18:40,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.37 | bwd_microstep: 3850.04 | bwd_inner_microstep: 3842.36 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.15 [2024-11-13 23:18:40,256] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.37 | bwd: 3850.06 | bwd_inner: 3842.36 | bwd_allreduce: 7.66 | step: 21.15 5%|▍ | 2533/50750 [6:36:02<79:17:07, 5.92s/it] {'loss': 0.4144, 'learning_rate': 3.995846775703077e-05, 'epoch': 2.5} 5%|▍ | 2533/50750 [6:36:02<79:17:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:18:46,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:18:46,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.71 | bwd_microstep: 3858.16 | bwd_inner_microstep: 3850.65 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-13 23:18:46,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.71 | bwd: 3858.17 | bwd_inner: 3850.65 | bwd_allreduce: 7.48 | step: 20.95 5%|▍ | 2534/50750 [6:36:08<79:20:16, 5.92s/it] {'loss': 0.0049, 'learning_rate': 3.995838550279809e-05, 'epoch': 2.5} 5%|▍ | 2534/50750 [6:36:08<79:20:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:18:52,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 23:18:52,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.07 | bwd_microstep: 3853.33 | bwd_inner_microstep: 3845.54 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.44 [2024-11-13 23:18:52,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.07 | bwd: 3853.36 | bwd_inner: 3845.54 | bwd_allreduce: 7.76 | step: 22.44 5%|▍ | 2535/50750 [6:36:14<79:20:29, 5.92s/it] {'loss': 0.0634, 'learning_rate': 3.995830316727892e-05, 'epoch': 2.5} 5%|▍ | 2535/50750 [6:36:14<79:20:29, 5.92s/it]evaluate! 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B Results saved to qa_abcd_lora.csv Accuracy: 0.9005905511811023 dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:54:22,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.93 [2024-11-13 23:54:22,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.91 | bwd_microstep: 3836.99 | bwd_inner_microstep: 3829.36 | bwd_allreduce_microstep: 7.58 | step_microstep: 22.29 [2024-11-13 23:54:22,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.90 | bwd: 3837.00 | bwd_inner: 3829.36 | bwd_allreduce: 7.60 | step: 22.29 5%|▍ | 2536/50750 [7:11:44<8616:53:08, 
643.40s/it] {'loss': 0.0015, 'learning_rate': 3.995822075047359e-05, 'epoch': 2.5} 5%|▍ | 2536/50750 [7:11:44<8616:53:08, 643.40s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:54:28,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:54:28,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.86 | bwd_microstep: 3832.43 | bwd_inner_microstep: 3824.47 | bwd_allreduce_microstep: 7.92 | step_microstep: 21.24 [2024-11-13 23:54:28,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.85 | bwd: 3832.44 | bwd_inner: 3824.47 | bwd_allreduce: 7.93 | step: 21.24 5%|▍ | 2537/50750 [7:11:50<6055:25:51, 452.15s/it] {'loss': 0.003, 'learning_rate': 3.995813825238243e-05, 'epoch': 2.5} 5%|▍ | 2537/50750 [7:11:50<6055:25:51, 452.15s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:54:34,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:54:34,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.35 | bwd_microstep: 3852.15 | bwd_inner_microstep: 3844.54 | bwd_allreduce_microstep: 7.56 | step_microstep: 20.91 [2024-11-13 23:54:34,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.35 | bwd: 3852.16 | bwd_inner: 3844.54 | bwd_allreduce: 7.58 | step: 20.91 5%|▌ | 2538/50750 [7:11:56<4262:30:26, 318.28s/it] {'loss': 0.0038, 'learning_rate': 3.995805567300578e-05, 'epoch': 2.5} 5%|▌ | 2538/50750 [7:11:56<4262:30:26, 318.28s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:54:40,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-13 23:54:40,716] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2030.32 | bwd_microstep: 3852.48 | bwd_inner_microstep: 3844.68 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.85 [2024-11-13 23:54:40,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.30 | bwd: 3852.49 | bwd_inner: 3844.68 | bwd_allreduce: 7.77 | step: 22.86 5%|▌ | 2539/50750 [7:12:02<3007:33:58, 224.58s/it] {'loss': 1.3029, 'learning_rate': 3.995797301234397e-05, 'epoch': 2.5} 5%|▌ | 2539/50750 [7:12:02<3007:33:58, 224.58s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 23:54:46,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-13 23:54:46,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.19 | bwd_microstep: 3851.41 | bwd_inner_microstep: 3843.58 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.10 [2024-11-13 23:54:46,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.18 | bwd: 3851.42 | bwd_inner: 3843.58 | bwd_allreduce: 7.80 | step: 22.10 5%|▌ | 2540/50750 [7:12:08<2129:08:06, 158.99s/it] {'loss': 0.0204, 'learning_rate': 3.995789027039736e-05, 'epoch': 2.5} 5%|▌ | 2540/50750 [7:12:08<2129:08:06, 158.99s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-13 23:54:52,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 23:54:52,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2038.33 | bwd_microstep: 3859.54 | bwd_inner_microstep: 3851.79 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.61 [2024-11-13 23:54:52,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2038.33 | bwd: 3859.55 | bwd_inner: 3851.79 | bwd_allreduce: 7.73 | step: 21.62 5%|▌ | 2541/50750 [7:12:14<1514:15:38, 113.08s/it] {'loss': 0.751, 'learning_rate': 3.995780744716625e-05, 
'epoch': 2.5} 5%|▌ | 2541/50750 [7:12:14<1514:15:38, 113.08s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:54:58,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-13 23:54:58,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.79 | bwd_microstep: 3836.10 | bwd_inner_microstep: 3828.58 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.46 [2024-11-13 23:54:58,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.78 | bwd: 3836.11 | bwd_inner: 3828.58 | bwd_allreduce: 7.49 | step: 21.47 5%|▌ | 2542/50750 [7:12:20<1083:41:44, 80.93s/it] {'loss': 0.0463, 'learning_rate': 3.9957724542651005e-05, 'epoch': 2.5} 5%|▌ | 2542/50750 [7:12:20<1083:41:44, 80.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:55:04,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-13 23:55:04,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.14 | bwd_microstep: 3832.20 | bwd_inner_microstep: 3824.69 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.23 [2024-11-13 23:55:04,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.14 | bwd: 3832.22 | bwd_inner: 3824.69 | bwd_allreduce: 7.49 | step: 21.24 5%|▌ | 2543/50750 [7:12:26<782:15:11, 58.42s/it] {'loss': 0.6225, 'learning_rate': 3.995764155685195e-05, 'epoch': 2.51} 5%|▌ | 2543/50750 [7:12:26<782:15:11, 58.42s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:55:10,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 23:55:10,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.17 | bwd_microstep: 3837.08 | 
bwd_inner_microstep: 3829.45 | bwd_allreduce_microstep: 7.58 | step_microstep: 22.16 [2024-11-13 23:55:10,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.17 | bwd: 3837.09 | bwd_inner: 3829.45 | bwd_allreduce: 7.60 | step: 22.16 5%|▌ | 2544/50750 [7:12:32<571:18:18, 42.66s/it] {'loss': 0.4729, 'learning_rate': 3.995755848976943e-05, 'epoch': 2.51} 5%|▌ | 2544/50750 [7:12:32<571:18:18, 42.66s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:55:16,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:55:16,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.17 | bwd_microstep: 3838.38 | bwd_inner_microstep: 3830.85 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.30 [2024-11-13 23:55:16,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.16 | bwd: 3838.39 | bwd_inner: 3830.85 | bwd_allreduce: 7.49 | step: 21.30 5%|▌ | 2545/50750 [7:12:38<423:39:20, 31.64s/it] {'loss': 0.0005, 'learning_rate': 3.995747534140378e-05, 'epoch': 2.51} 5%|▌ | 2545/50750 [7:12:38<423:39:20, 31.64s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:55:22,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:55:22,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.09 | bwd_microstep: 3836.00 | bwd_inner_microstep: 3828.45 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.21 [2024-11-13 23:55:22,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.08 | bwd: 3836.02 | bwd_inner: 3828.45 | bwd_allreduce: 7.53 | step: 21.21 5%|▌ | 2546/50750 [7:12:44<320:16:12, 23.92s/it] {'loss': 0.0037, 'learning_rate': 3.9957392111755334e-05, 'epoch': 2.51} 5%|▌ | 2546/50750 [7:12:44<320:16:12, 23.92s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:55:28,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:55:28,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.89 | bwd_microstep: 3837.89 | bwd_inner_microstep: 3829.88 | bwd_allreduce_microstep: 7.96 | step_microstep: 21.98 [2024-11-13 23:55:28,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.89 | bwd: 3837.90 | bwd_inner: 3829.88 | bwd_allreduce: 7.98 | step: 21.98 5%|▌ | 2547/50750 [7:12:50<247:55:59, 18.52s/it] {'loss': 0.0293, 'learning_rate': 3.995730880082445e-05, 'epoch': 2.51} 5%|▌ | 2547/50750 [7:12:50<247:55:59, 18.52s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:55:33,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 5.08 [2024-11-13 23:55:33,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.22 | bwd_microstep: 3845.69 | bwd_inner_microstep: 3837.96 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.40 [2024-11-13 23:55:33,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.20 | bwd: 3845.70 | bwd_inner: 3837.96 | bwd_allreduce: 7.70 | step: 21.40 5%|▌ | 2548/50750 [7:12:55<197:18:52, 14.74s/it] {'loss': 0.509, 'learning_rate': 3.995722540861144e-05, 'epoch': 2.51} 5%|▌ | 2548/50750 [7:12:55<197:18:52, 14.74s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:55:39,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:55:39,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.06 | bwd_microstep: 3830.84 | bwd_inner_microstep: 3823.12 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.63 
[2024-11-13 23:55:39,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.06 | bwd: 3830.86 | bwd_inner: 3823.12 | bwd_allreduce: 7.70 | step: 21.64 5%|▌ | 2549/50750 [7:13:01<161:48:33, 12.09s/it] {'loss': 0.0284, 'learning_rate': 3.995714193511666e-05, 'epoch': 2.51} 5%|▌ | 2549/50750 [7:13:01<161:48:33, 12.09s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:55:45,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-13 23:55:45,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.05 | bwd_microstep: 3843.28 | bwd_inner_microstep: 3835.58 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.89 [2024-11-13 23:55:45,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.04 | bwd: 3843.29 | bwd_inner: 3835.58 | bwd_allreduce: 7.67 | step: 21.90 5%|▌ | 2550/50750 [7:13:07<137:01:14, 10.23s/it] {'loss': 0.0108, 'learning_rate': 3.995705838034045e-05, 'epoch': 2.51} 5%|▌ | 2550/50750 [7:13:07<137:01:14, 10.23s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:55:51,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.92 [2024-11-13 23:55:51,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.43 | bwd_microstep: 3836.17 | bwd_inner_microstep: 3828.48 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.87 [2024-11-13 23:55:51,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.41 | bwd: 3836.19 | bwd_inner: 3828.48 | bwd_allreduce: 7.67 | step: 22.87 5%|▌ | 2551/50750 [7:13:13<119:38:43, 8.94s/it] {'loss': 0.058, 'learning_rate': 3.995697474428315e-05, 'epoch': 2.51} 5%|▌ | 2551/50750 [7:13:13<119:38:43, 8.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:55:57,609] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:55:57,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.82 | bwd_microstep: 3838.33 | bwd_inner_microstep: 3830.78 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.49 [2024-11-13 23:55:57,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.82 | bwd: 3838.35 | bwd_inner: 3830.78 | bwd_allreduce: 7.52 | step: 21.50 5%|▌ | 2552/50750 [7:13:19<107:30:40, 8.03s/it] {'loss': 0.0045, 'learning_rate': 3.9956891026945086e-05, 'epoch': 2.51} 5%|▌ | 2552/50750 [7:13:19<107:30:40, 8.03s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:56:03,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:56:03,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.94 | bwd_microstep: 3840.23 | bwd_inner_microstep: 3832.72 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.00 [2024-11-13 23:56:03,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.94 | bwd: 3840.24 | bwd_inner: 3832.72 | bwd_allreduce: 7.48 | step: 21.01 5%|▌ | 2553/50750 [7:13:25<98:59:35, 7.39s/it] {'loss': 0.0019, 'learning_rate': 3.9956807228326624e-05, 'epoch': 2.52} 5%|▌ | 2553/50750 [7:13:25<98:59:35, 7.39s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:56:09,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.38 | optimizer_step: 4.93 [2024-11-13 23:56:09,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.76 | bwd_microstep: 3846.76 | bwd_inner_microstep: 3839.20 | bwd_allreduce_microstep: 7.52 | step_microstep: 23.52 [2024-11-13 23:56:09,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.76 | 
bwd: 3846.77 | bwd_inner: 3839.20 | bwd_allreduce: 7.53 | step: 23.53 5%|▌ | 2554/50750 [7:13:31<93:04:16, 6.95s/it] {'loss': 0.0039, 'learning_rate': 3.9956723348428086e-05, 'epoch': 2.52} 5%|▌ | 2554/50750 [7:13:31<93:04:16, 6.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:56:15,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.99 [2024-11-13 23:56:15,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.76 | bwd_microstep: 3832.35 | bwd_inner_microstep: 3824.81 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.87 [2024-11-13 23:56:15,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.75 | bwd: 3832.36 | bwd_inner: 3824.81 | bwd_allreduce: 7.51 | step: 21.87 5%|▌ | 2555/50750 [7:13:37<88:51:28, 6.64s/it] {'loss': 0.0001, 'learning_rate': 3.995663938724982e-05, 'epoch': 2.52} 5%|▌ | 2555/50750 [7:13:37<88:51:28, 6.64s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:56:21,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:56:21,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.30 | bwd_microstep: 3838.59 | bwd_inner_microstep: 3831.11 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.07 [2024-11-13 23:56:21,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.30 | bwd: 3838.61 | bwd_inner: 3831.11 | bwd_allreduce: 7.46 | step: 21.07 5%|▌ | 2556/50750 [7:13:43<85:55:16, 6.42s/it] {'loss': 0.018, 'learning_rate': 3.995655534479217e-05, 'epoch': 2.52} 5%|▌ | 2556/50750 [7:13:43<85:55:16, 6.42s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:56:27,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
3.11 | optimizer_step: 4.93 [2024-11-13 23:56:27,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.32 | bwd_microstep: 3839.62 | bwd_inner_microstep: 3832.07 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.20 [2024-11-13 23:56:27,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.32 | bwd: 3839.64 | bwd_inner: 3832.07 | bwd_allreduce: 7.52 | step: 21.20 5%|▌ | 2557/50750 [7:13:49<83:51:47, 6.26s/it] {'loss': 0.5398, 'learning_rate': 3.995647122105547e-05, 'epoch': 2.52} 5%|▌ | 2557/50750 [7:13:49<83:51:47, 6.26s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:56:33,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:56:33,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.51 | bwd_microstep: 3849.36 | bwd_inner_microstep: 3841.86 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.00 [2024-11-13 23:56:33,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.51 | bwd: 3849.37 | bwd_inner: 3841.86 | bwd_allreduce: 7.48 | step: 22.01 5%|▌ | 2558/50750 [7:13:55<82:28:17, 6.16s/it] {'loss': 0.0156, 'learning_rate': 3.9956387016040073e-05, 'epoch': 2.52} 5%|▌ | 2558/50750 [7:13:55<82:28:17, 6.16s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:56:38,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:56:38,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.99 | bwd_microstep: 3839.87 | bwd_inner_microstep: 3832.38 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.46 [2024-11-13 23:56:38,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.99 | bwd: 3839.88 | bwd_inner: 3832.38 | bwd_allreduce: 7.46 | step: 21.48 5%|▌ | 2559/50750 [7:14:00<81:26:54, 
6.08s/it] {'loss': 0.1912, 'learning_rate': 3.995630272974632e-05, 'epoch': 2.52} 5%|▌ | 2559/50750 [7:14:00<81:26:54, 6.08s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:56:44,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:56:44,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.57 | bwd_microstep: 3845.59 | bwd_inner_microstep: 3838.06 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 23:56:44,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.57 | bwd: 3845.60 | bwd_inner: 3838.06 | bwd_allreduce: 7.50 | step: 20.97 5%|▌ | 2560/50750 [7:14:06<80:45:23, 6.03s/it] {'loss': 0.0025, 'learning_rate': 3.995621836217455e-05, 'epoch': 2.52} 5%|▌ | 2560/50750 [7:14:06<80:45:23, 6.03s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:56:50,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-13 23:56:50,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.34 | bwd_microstep: 3836.20 | bwd_inner_microstep: 3828.71 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.25 [2024-11-13 23:56:50,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.34 | bwd: 3836.21 | bwd_inner: 3828.71 | bwd_allreduce: 7.46 | step: 21.27 5%|▌ | 2561/50750 [7:14:12<80:15:07, 6.00s/it] {'loss': 0.0106, 'learning_rate': 3.9956133913325106e-05, 'epoch': 2.52} 5%|▌ | 2561/50750 [7:14:12<80:15:07, 6.00s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:56:56,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:56:56,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2018.20 | bwd_microstep: 3842.47 | bwd_inner_microstep: 3834.96 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.11 [2024-11-13 23:56:56,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.20 | bwd: 3842.48 | bwd_inner: 3834.96 | bwd_allreduce: 7.48 | step: 21.11 5%|▌ | 2562/50750 [7:14:18<79:54:06, 5.97s/it] {'loss': 0.0009, 'learning_rate': 3.995604938319833e-05, 'epoch': 2.52} 5%|▌ | 2562/50750 [7:14:18<79:54:06, 5.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:57:02,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-13 23:57:02,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.37 | bwd_microstep: 3840.78 | bwd_inner_microstep: 3833.27 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.96 [2024-11-13 23:57:02,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.37 | bwd: 3840.80 | bwd_inner: 3833.27 | bwd_allreduce: 7.49 | step: 20.96 5%|▌ | 2563/50750 [7:14:24<79:40:51, 5.95s/it] {'loss': 0.0016, 'learning_rate': 3.995596477179458e-05, 'epoch': 2.53} 5%|▌ | 2563/50750 [7:14:24<79:40:51, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:57:08,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:57:08,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.87 | bwd_microstep: 3838.30 | bwd_inner_microstep: 3830.76 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.29 [2024-11-13 23:57:08,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.87 | bwd: 3838.31 | bwd_inner: 3830.76 | bwd_allreduce: 7.51 | step: 21.29 5%|▌ | 2564/50750 [7:14:30<79:29:49, 5.94s/it] {'loss': 0.0524, 'learning_rate': 3.995588007911419e-05, 'epoch': 2.53} 5%|▌ | 2564/50750 
[7:14:30<79:29:49, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:57:14,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.66 | optimizer_step: 4.93 [2024-11-13 23:57:14,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.51 | bwd_microstep: 3855.16 | bwd_inner_microstep: 3847.18 | bwd_allreduce_microstep: 7.91 | step_microstep: 28.55 [2024-11-13 23:57:14,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.51 | bwd: 3855.18 | bwd_inner: 3847.18 | bwd_allreduce: 7.94 | step: 28.55 5%|▌ | 2565/50750 [7:14:36<79:28:34, 5.94s/it] {'loss': 0.1907, 'learning_rate': 3.9955795305157505e-05, 'epoch': 2.53} 5%|▌ | 2565/50750 [7:14:36<79:28:34, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:57:20,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:57:20,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.41 | bwd_microstep: 3849.62 | bwd_inner_microstep: 3842.11 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.33 [2024-11-13 23:57:20,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.39 | bwd: 3849.63 | bwd_inner: 3842.11 | bwd_allreduce: 7.48 | step: 21.34 5%|▌ | 2566/50750 [7:14:42<79:27:06, 5.94s/it] {'loss': 0.0035, 'learning_rate': 3.9955710449924876e-05, 'epoch': 2.53} 5%|▌ | 2566/50750 [7:14:42<79:27:06, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:57:26,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-13 23:57:26,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.51 | bwd_microstep: 3851.88 | bwd_inner_microstep: 3844.25 | 
bwd_allreduce_microstep: 7.58 | step_microstep: 21.43 [2024-11-13 23:57:26,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.51 | bwd: 3851.89 | bwd_inner: 3844.25 | bwd_allreduce: 7.60 | step: 21.43 5%|▌ | 2567/50750 [7:14:48<79:25:04, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.9955625513416637e-05, 'epoch': 2.53} 5%|▌ | 2567/50750 [7:14:48<79:25:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:57:32,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:57:32,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.63 | bwd_microstep: 3853.99 | bwd_inner_microstep: 3846.49 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.01 [2024-11-13 23:57:32,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.61 | bwd: 3854.00 | bwd_inner: 3846.49 | bwd_allreduce: 7.48 | step: 21.01 5%|▌ | 2568/50750 [7:14:54<79:25:58, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.995554049563314e-05, 'epoch': 2.53} 5%|▌ | 2568/50750 [7:14:54<79:25:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:57:38,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-13 23:57:38,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.59 | bwd_microstep: 3853.86 | bwd_inner_microstep: 3846.32 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.79 [2024-11-13 23:57:38,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.60 | bwd: 3853.88 | bwd_inner: 3846.32 | bwd_allreduce: 7.52 | step: 21.79 5%|▌ | 2569/50750 [7:15:00<79:24:31, 5.93s/it] {'loss': 0.0103, 'learning_rate': 3.995545539657474e-05, 'epoch': 2.53} 5%|▌ | 2569/50750 [7:15:00<79:24:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-13 23:57:44,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-13 23:57:44,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.24 | bwd_microstep: 3851.95 | bwd_inner_microstep: 3844.41 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.16 [2024-11-13 23:57:44,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.24 | bwd: 3851.96 | bwd_inner: 3844.41 | bwd_allreduce: 7.51 | step: 21.16 5%|▌ | 2570/50750 [7:15:06<79:22:33, 5.93s/it] {'loss': 0.0103, 'learning_rate': 3.9955370216241763e-05, 'epoch': 2.53} 5%|▌ | 2570/50750 [7:15:06<79:22:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:57:50,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:57:50,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.24 | bwd_microstep: 3859.68 | bwd_inner_microstep: 3852.16 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.57 [2024-11-13 23:57:50,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.24 | bwd: 3859.70 | bwd_inner: 3852.16 | bwd_allreduce: 7.49 | step: 21.58 5%|▌ | 2571/50750 [7:15:12<79:22:58, 5.93s/it] {'loss': 0.1374, 'learning_rate': 3.995528495463458e-05, 'epoch': 2.53} 5%|▌ | 2571/50750 [7:15:12<79:22:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:57:55,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.93 [2024-11-13 23:57:55,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.52 | bwd_microstep: 3848.10 | bwd_inner_microstep: 3840.52 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.84 [2024-11-13 23:57:55,985] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.52 | bwd: 3848.11 | bwd_inner: 3840.52 | bwd_allreduce: 7.55 | step: 21.85 5%|▌ | 2572/50750 [7:15:17<79:22:44, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.995519961175353e-05, 'epoch': 2.53} 5%|▌ | 2572/50750 [7:15:17<79:22:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:58:01,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:58:01,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.98 | bwd_microstep: 3845.10 | bwd_inner_microstep: 3837.55 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.11 [2024-11-13 23:58:01,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.96 | bwd: 3845.12 | bwd_inner: 3837.55 | bwd_allreduce: 7.52 | step: 21.11 5%|▌ | 2573/50750 [7:15:23<79:21:42, 5.93s/it] {'loss': 0.0028, 'learning_rate': 3.995511418759895e-05, 'epoch': 2.53} 5%|▌ | 2573/50750 [7:15:23<79:21:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-13 23:58:07,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-13 23:58:07,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.55 | bwd_microstep: 3853.78 | bwd_inner_microstep: 3846.08 | bwd_allreduce_microstep: 7.65 | step_microstep: 23.38 [2024-11-13 23:58:07,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.55 | bwd: 3853.80 | bwd_inner: 3846.08 | bwd_allreduce: 7.67 | step: 23.37 5%|▌ | 2574/50750 [7:15:29<79:20:29, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.9955028682171194e-05, 'epoch': 2.54} 5%|▌ | 2574/50750 [7:15:29<79:20:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:58:13,766] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:58:13,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 3856.63 | bwd_inner_microstep: 3849.11 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.99 [2024-11-13 23:58:13,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3856.64 | bwd_inner: 3849.11 | bwd_allreduce: 7.49 | step: 20.99 5%|▌ | 2575/50750 [7:15:35<79:20:08, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.995494309547062e-05, 'epoch': 2.54} 5%|▌ | 2575/50750 [7:15:35<79:20:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:58:19,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.80 | optimizer_step: 4.93 [2024-11-13 23:58:19,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.77 | bwd_microstep: 3856.78 | bwd_inner_microstep: 3849.24 | bwd_allreduce_microstep: 7.49 | step_microstep: 23.25 [2024-11-13 23:58:19,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.77 | bwd: 3856.79 | bwd_inner: 3849.24 | bwd_allreduce: 7.51 | step: 23.25 5%|▌ | 2576/50750 [7:15:41<79:20:50, 5.93s/it] {'loss': 0.0028, 'learning_rate': 3.9954857427497564e-05, 'epoch': 2.54} 5%|▌ | 2576/50750 [7:15:41<79:20:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:58:25,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-13 23:58:25,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.89 | bwd_microstep: 3849.14 | bwd_inner_microstep: 3841.55 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.18 [2024-11-13 23:58:25,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.89 | bwd: 3849.15 | bwd_inner: 3841.55 | 
bwd_allreduce: 7.56 | step: 21.19 5%|▌ | 2577/50750 [7:15:47<79:19:58, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.9954771678252376e-05, 'epoch': 2.54} 5%|▌ | 2577/50750 [7:15:47<79:19:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:58:31,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:58:31,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.04 | bwd_microstep: 3847.40 | bwd_inner_microstep: 3839.83 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.27 [2024-11-13 23:58:31,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.04 | bwd: 3847.41 | bwd_inner: 3839.83 | bwd_allreduce: 7.54 | step: 21.28 5%|▌ | 2578/50750 [7:15:53<79:17:56, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.995468584773541e-05, 'epoch': 2.54} 5%|▌ | 2578/50750 [7:15:53<79:17:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:58:37,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:58:37,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.16 | bwd_microstep: 3848.28 | bwd_inner_microstep: 3840.71 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.55 [2024-11-13 23:58:37,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.16 | bwd: 3848.29 | bwd_inner: 3840.71 | bwd_allreduce: 7.54 | step: 21.55 5%|▌ | 2579/50750 [7:15:59<79:18:05, 5.93s/it] {'loss': 0.0033, 'learning_rate': 3.995459993594702e-05, 'epoch': 2.54} 5%|▌ | 2579/50750 [7:15:59<79:18:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-13 23:58:43,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 
[2024-11-13 23:58:43,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.84 | bwd_microstep: 3851.49 | bwd_inner_microstep: 3843.62 | bwd_allreduce_microstep: 7.81 | step_microstep: 23.96 [2024-11-13 23:58:43,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.84 | bwd: 3851.51 | bwd_inner: 3843.62 | bwd_allreduce: 7.84 | step: 23.95 5%|▌ | 2580/50750 [7:16:05<79:19:10, 5.93s/it] {'loss': 0.3526, 'learning_rate': 3.995451394288754e-05, 'epoch': 2.54} 5%|▌ | 2580/50750 [7:16:05<79:19:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:58:49,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-13 23:58:49,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3847.77 | bwd_inner_microstep: 3840.26 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.60 [2024-11-13 23:58:49,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.44 | bwd: 3847.78 | bwd_inner: 3840.26 | bwd_allreduce: 7.49 | step: 21.61 5%|▌ | 2581/50750 [7:16:11<79:18:02, 5.93s/it] {'loss': 0.0134, 'learning_rate': 3.995442786855734e-05, 'epoch': 2.54} 5%|▌ | 2581/50750 [7:16:11<79:18:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:58:55,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-13 23:58:55,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.51 | bwd_microstep: 3852.48 | bwd_inner_microstep: 3844.99 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.94 [2024-11-13 23:58:55,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.50 | bwd: 3852.49 | bwd_inner: 3844.99 | bwd_allreduce: 7.46 | step: 20.94 5%|▌ | 2582/50750 [7:16:17<79:17:23, 5.93s/it] {'loss': 0.0107, 
'learning_rate': 3.9954341712956756e-05, 'epoch': 2.54} 5%|▌ | 2582/50750 [7:16:17<79:17:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:59:01,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-13 23:59:01,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.62 | bwd_microstep: 3856.14 | bwd_inner_microstep: 3848.65 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.19 [2024-11-13 23:59:01,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.62 | bwd: 3856.15 | bwd_inner: 3848.65 | bwd_allreduce: 7.46 | step: 21.19 5%|▌ | 2583/50750 [7:16:23<79:18:25, 5.93s/it] {'loss': 0.3904, 'learning_rate': 3.9954255476086145e-05, 'epoch': 2.54} 5%|▌ | 2583/50750 [7:16:23<79:18:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:59:07,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:59:07,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.46 | bwd_microstep: 3858.93 | bwd_inner_microstep: 3851.42 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.16 [2024-11-13 23:59:07,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.45 | bwd: 3858.94 | bwd_inner: 3851.42 | bwd_allreduce: 7.48 | step: 21.17 5%|▌ | 2584/50750 [7:16:29<79:19:14, 5.93s/it] {'loss': 0.0423, 'learning_rate': 3.995416915794586e-05, 'epoch': 2.55} 5%|▌ | 2584/50750 [7:16:29<79:19:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:59:13,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:59:13,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.07 | 
bwd_microstep: 3851.65 | bwd_inner_microstep: 3844.18 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-13 23:59:13,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.07 | bwd: 3851.66 | bwd_inner: 3844.18 | bwd_allreduce: 7.44 | step: 20.90 5%|▌ | 2585/50750 [7:16:35<79:18:19, 5.93s/it] {'loss': 0.1037, 'learning_rate': 3.995408275853624e-05, 'epoch': 2.55} 5%|▌ | 2585/50750 [7:16:35<79:18:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-13 23:59:18,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-13 23:59:18,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.76 | bwd_microstep: 3851.73 | bwd_inner_microstep: 3844.24 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.87 [2024-11-13 23:59:18,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.76 | bwd: 3851.74 | bwd_inner: 3844.24 | bwd_allreduce: 7.46 | step: 21.88 5%|▌ | 2586/50750 [7:16:40<79:17:00, 5.93s/it] {'loss': 0.0029, 'learning_rate': 3.995399627785766e-05, 'epoch': 2.55} 5%|▌ | 2586/50750 [7:16:40<79:17:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:59:24,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-13 23:59:24,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.44 | bwd_microstep: 3849.63 | bwd_inner_microstep: 3842.15 | bwd_allreduce_microstep: 7.44 | step_microstep: 22.47 [2024-11-13 23:59:24,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.44 | bwd: 3849.64 | bwd_inner: 3842.15 | bwd_allreduce: 7.46 | step: 22.47 5%|▌ | 2587/50750 [7:16:46<79:16:58, 5.93s/it] {'loss': 0.0016, 'learning_rate': 3.9953909715910446e-05, 'epoch': 2.55} 5%|▌ | 2587/50750 [7:16:46<79:16:58, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:59:30,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-13 23:59:30,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.30 | bwd_microstep: 3844.92 | bwd_inner_microstep: 3837.45 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.92 [2024-11-13 23:59:30,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.30 | bwd: 3844.94 | bwd_inner: 3837.45 | bwd_allreduce: 7.45 | step: 20.93 5%|▌ | 2588/50750 [7:16:52<79:14:47, 5.92s/it] {'loss': 0.589, 'learning_rate': 3.995382307269497e-05, 'epoch': 2.55} 5%|▌ | 2588/50750 [7:16:52<79:14:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:59:36,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-13 23:59:36,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.35 | bwd_microstep: 3852.42 | bwd_inner_microstep: 3844.97 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.91 [2024-11-13 23:59:36,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.34 | bwd: 3852.43 | bwd_inner: 3844.97 | bwd_allreduce: 7.42 | step: 20.91 5%|▌ | 2589/50750 [7:16:58<79:14:31, 5.92s/it] {'loss': 0.1018, 'learning_rate': 3.9953736348211574e-05, 'epoch': 2.55} 5%|▌ | 2589/50750 [7:16:58<79:14:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-13 23:59:42,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-13 23:59:42,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.07 | bwd_microstep: 3850.29 | bwd_inner_microstep: 3842.82 | bwd_allreduce_microstep: 7.43 | 
step_microstep: 21.91 [2024-11-13 23:59:42,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.07 | bwd: 3850.30 | bwd_inner: 3842.82 | bwd_allreduce: 7.45 | step: 21.91 5%|▌ | 2590/50750 [7:17:04<79:14:27, 5.92s/it] {'loss': 0.0366, 'learning_rate': 3.995364954246062e-05, 'epoch': 2.55} 5%|▌ | 2590/50750 [7:17:04<79:14:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-13 23:59:48,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-13 23:59:48,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.64 | bwd_microstep: 3850.72 | bwd_inner_microstep: 3843.20 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.32 [2024-11-13 23:59:48,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.64 | bwd: 3850.73 | bwd_inner: 3843.20 | bwd_allreduce: 7.49 | step: 21.33 5%|▌ | 2591/50750 [7:17:10<79:14:01, 5.92s/it] {'loss': 0.2678, 'learning_rate': 3.9953562655442456e-05, 'epoch': 2.55} 5%|▌ | 2591/50750 [7:17:10<79:14:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-13 23:59:54,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-13 23:59:54,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.20 | bwd_microstep: 3860.15 | bwd_inner_microstep: 3852.70 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.86 [2024-11-13 23:59:54,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.20 | bwd: 3860.16 | bwd_inner: 3852.70 | bwd_allreduce: 7.42 | step: 20.86 5%|▌ | 2592/50750 [7:17:16<79:16:35, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.995347568715743e-05, 'epoch': 2.55} 5%|▌ | 2592/50750 [7:17:16<79:16:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 
00:00:00,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:00:00,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3843.48 | bwd_inner_microstep: 3836.02 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.02 [2024-11-14 00:00:00,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3843.49 | bwd_inner: 3836.02 | bwd_allreduce: 7.43 | step: 21.02 5%|▌ | 2593/50750 [7:17:22<79:13:50, 5.92s/it] {'loss': 0.0045, 'learning_rate': 3.995338863760591e-05, 'epoch': 2.55} 5%|▌ | 2593/50750 [7:17:22<79:13:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:00:06,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:00:06,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.28 | bwd_microstep: 3851.74 | bwd_inner_microstep: 3844.19 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.17 [2024-11-14 00:00:06,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.28 | bwd: 3851.75 | bwd_inner: 3844.19 | bwd_allreduce: 7.52 | step: 21.18 5%|▌ | 2594/50750 [7:17:28<79:13:45, 5.92s/it] {'loss': 0.06, 'learning_rate': 3.995330150678824e-05, 'epoch': 2.56} 5%|▌ | 2594/50750 [7:17:28<79:13:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:00:12,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-14 00:00:12,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.26 | bwd_microstep: 3852.11 | bwd_inner_microstep: 3844.58 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.12 [2024-11-14 00:00:12,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2026.26 | bwd: 3852.13 | bwd_inner: 3844.58 | bwd_allreduce: 7.51 | step: 21.12 5%|▌ | 2595/50750 [7:17:34<79:14:25, 5.92s/it] {'loss': 0.0163, 'learning_rate': 3.995321429470478e-05, 'epoch': 2.56} 5%|▌ | 2595/50750 [7:17:34<79:14:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:00:18,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:00:18,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.14 | bwd_microstep: 3849.79 | bwd_inner_microstep: 3842.31 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-14 00:00:18,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3849.80 | bwd_inner: 3842.31 | bwd_allreduce: 7.45 | step: 20.87 5%|▌ | 2596/50750 [7:17:40<79:13:20, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.9953127001355886e-05, 'epoch': 2.56} 5%|▌ | 2596/50750 [7:17:40<79:13:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:00:24,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 00:00:24,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.03 | bwd_microstep: 3846.33 | bwd_inner_microstep: 3838.70 | bwd_allreduce_microstep: 7.58 | step_microstep: 20.82 [2024-11-14 00:00:24,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.03 | bwd: 3846.34 | bwd_inner: 3838.70 | bwd_allreduce: 7.60 | step: 20.82 5%|▌ | 2597/50750 [7:17:46<79:12:38, 5.92s/it] {'loss': 0.0512, 'learning_rate': 3.9953039626741906e-05, 'epoch': 2.56} 5%|▌ | 2597/50750 [7:17:46<79:12:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:00:30,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:00:30,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.15 | bwd_microstep: 3861.95 | bwd_inner_microstep: 3854.39 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.80 [2024-11-14 00:00:30,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.15 | bwd: 3861.96 | bwd_inner: 3854.39 | bwd_allreduce: 7.54 | step: 21.80 5%|▌ | 2598/50750 [7:17:52<79:16:55, 5.93s/it] {'loss': 0.0107, 'learning_rate': 3.9952952170863205e-05, 'epoch': 2.56} 5%|▌ | 2598/50750 [7:17:52<79:16:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:00:35,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:00:35,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3861.97 | bwd_inner_microstep: 3854.45 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.09 [2024-11-14 00:00:35,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.22 | bwd: 3861.98 | bwd_inner: 3854.45 | bwd_allreduce: 7.49 | step: 21.10 5%|▌ | 2599/50750 [7:17:57<79:18:43, 5.93s/it] {'loss': 0.0067, 'learning_rate': 3.995286463372013e-05, 'epoch': 2.56} 5%|▌ | 2599/50750 [7:17:57<79:18:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:00:41,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:00:41,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.13 | bwd_microstep: 3858.00 | bwd_inner_microstep: 3850.50 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.03 [2024-11-14 00:00:41,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.12 | bwd: 3858.01 | bwd_inner: 3850.50 | bwd_allreduce: 7.48 | step: 21.04 5%|▌ | 
2600/50750 [7:18:03<79:19:15, 5.93s/it] {'loss': 0.5338, 'learning_rate': 3.995277701531304e-05, 'epoch': 2.56} 5%|▌ | 2600/50750 [7:18:03<79:19:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:00:47,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 00:00:47,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.62 | bwd_microstep: 3862.63 | bwd_inner_microstep: 3854.74 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.03 [2024-11-14 00:00:47,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.62 | bwd: 3862.64 | bwd_inner: 3854.74 | bwd_allreduce: 7.86 | step: 22.03 5%|▌ | 2601/50750 [7:18:09<79:21:03, 5.93s/it] {'loss': 0.0133, 'learning_rate': 3.9952689315642306e-05, 'epoch': 2.56} 5%|▌ | 2601/50750 [7:18:09<79:21:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:00:53,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:00:53,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.50 | bwd_microstep: 3852.06 | bwd_inner_microstep: 3844.53 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.09 [2024-11-14 00:00:53,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.48 | bwd: 3852.07 | bwd_inner: 3844.53 | bwd_allreduce: 7.50 | step: 21.10 5%|▌ | 2602/50750 [7:18:15<79:21:47, 5.93s/it] {'loss': 0.4151, 'learning_rate': 3.995260153470827e-05, 'epoch': 2.56} 5%|▌ | 2602/50750 [7:18:15<79:21:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:00:59,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:00:59,728] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.25 | bwd_microstep: 3857.31 | bwd_inner_microstep: 3849.81 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.16 [2024-11-14 00:00:59,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.25 | bwd: 3857.32 | bwd_inner: 3849.81 | bwd_allreduce: 7.47 | step: 21.17 5%|▌ | 2603/50750 [7:18:21<79:21:26, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.9952513672511285e-05, 'epoch': 2.56} 5%|▌ | 2603/50750 [7:18:21<79:21:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:01:05,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:01:05,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.56 | bwd_microstep: 3858.39 | bwd_inner_microstep: 3850.88 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.21 [2024-11-14 00:01:05,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.56 | bwd: 3858.40 | bwd_inner: 3850.88 | bwd_allreduce: 7.47 | step: 21.22 5%|▌ | 2604/50750 [7:18:27<79:19:54, 5.93s/it] {'loss': 0.0032, 'learning_rate': 3.995242572905173e-05, 'epoch': 2.57} 5%|▌ | 2604/50750 [7:18:27<79:19:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:01:11,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 00:01:11,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.45 | bwd_microstep: 3859.75 | bwd_inner_microstep: 3852.22 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.62 [2024-11-14 00:01:11,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.45 | bwd: 3859.76 | bwd_inner: 3852.22 | bwd_allreduce: 7.50 | step: 21.62 5%|▌ | 2605/50750 [7:18:33<79:22:08, 5.93s/it] {'loss': 0.0034, 'learning_rate': 
3.995233770432994e-05, 'epoch': 2.57} 5%|▌ | 2605/50750 [7:18:33<79:22:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:01:17,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:01:17,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.46 | bwd_microstep: 3854.59 | bwd_inner_microstep: 3847.07 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-14 00:01:17,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.44 | bwd: 3854.60 | bwd_inner: 3847.07 | bwd_allreduce: 7.49 | step: 21.11 5%|▌ | 2606/50750 [7:18:39<79:20:40, 5.93s/it] {'loss': 0.0051, 'learning_rate': 3.9952249598346284e-05, 'epoch': 2.57} 5%|▌ | 2606/50750 [7:18:39<79:20:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:01:23,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:01:23,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.90 | bwd_microstep: 3845.83 | bwd_inner_microstep: 3838.31 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.36 [2024-11-14 00:01:23,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.89 | bwd: 3845.84 | bwd_inner: 3838.31 | bwd_allreduce: 7.49 | step: 21.36 5%|▌ | 2607/50750 [7:18:45<79:19:40, 5.93s/it] {'loss': 0.5702, 'learning_rate': 3.9952161411101126e-05, 'epoch': 2.57} 5%|▌ | 2607/50750 [7:18:45<79:19:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:01:29,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:01:29,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 
3856.04 | bwd_inner_microstep: 3848.52 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.30 [2024-11-14 00:01:29,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.18 | bwd: 3856.05 | bwd_inner: 3848.52 | bwd_allreduce: 7.50 | step: 21.30 5%|▌ | 2608/50750 [7:18:51<79:18:46, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.995207314259481e-05, 'epoch': 2.57} 5%|▌ | 2608/50750 [7:18:51<79:18:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:01:35,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:01:35,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.00 | bwd_microstep: 3853.34 | bwd_inner_microstep: 3845.81 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.20 [2024-11-14 00:01:35,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.00 | bwd: 3853.36 | bwd_inner: 3845.81 | bwd_allreduce: 7.51 | step: 21.20 5%|▌ | 2609/50750 [7:18:57<79:17:39, 5.93s/it] {'loss': 0.003, 'learning_rate': 3.995198479282771e-05, 'epoch': 2.57} 5%|▌ | 2609/50750 [7:18:57<79:17:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:01:41,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 00:01:41,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3855.86 | bwd_inner_microstep: 3848.32 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-14 00:01:41,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3855.87 | bwd_inner: 3848.32 | bwd_allreduce: 7.51 | step: 21.21 5%|▌ | 2610/50750 [7:19:03<79:17:20, 5.93s/it] {'loss': 0.0038, 'learning_rate': 3.9951896361800185e-05, 'epoch': 2.57} 5%|▌ | 2610/50750 [7:19:03<79:17:20, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:01:47,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 00:01:47,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.21 | bwd_microstep: 3850.68 | bwd_inner_microstep: 3843.04 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.34 [2024-11-14 00:01:47,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.21 | bwd: 3850.69 | bwd_inner: 3843.04 | bwd_allreduce: 7.61 | step: 21.35 5%|▌ | 2611/50750 [7:19:09<79:16:21, 5.93s/it] {'loss': 0.2693, 'learning_rate': 3.995180784951259e-05, 'epoch': 2.57} 5%|▌ | 2611/50750 [7:19:09<79:16:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:01:53,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:01:53,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.56 | bwd_microstep: 3850.59 | bwd_inner_microstep: 3843.06 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.09 [2024-11-14 00:01:53,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.55 | bwd: 3850.61 | bwd_inner: 3843.06 | bwd_allreduce: 7.50 | step: 21.10 5%|▌ | 2612/50750 [7:19:15<79:15:27, 5.93s/it] {'loss': 0.0202, 'learning_rate': 3.9951719255965284e-05, 'epoch': 2.57} 5%|▌ | 2612/50750 [7:19:15<79:15:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:01:59,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:01:59,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.85 | bwd_microstep: 3857.75 | bwd_inner_microstep: 3850.22 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.52 [2024-11-14 
00:01:59,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.85 | bwd: 3857.76 | bwd_inner: 3850.22 | bwd_allreduce: 7.51 | step: 21.53 5%|▌ | 2613/50750 [7:19:20<79:16:40, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.995163058115864e-05, 'epoch': 2.57} 5%|▌ | 2613/50750 [7:19:20<79:16:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:02:04,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 00:02:04,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.90 | bwd_microstep: 3854.32 | bwd_inner_microstep: 3846.80 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.95 [2024-11-14 00:02:04,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.90 | bwd: 3854.34 | bwd_inner: 3846.80 | bwd_allreduce: 7.50 | step: 20.96 5%|▌ | 2614/50750 [7:19:26<79:16:10, 5.93s/it] {'loss': 0.0024, 'learning_rate': 3.9951541825092996e-05, 'epoch': 2.58} 5%|▌ | 2614/50750 [7:19:26<79:16:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:02:10,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:02:10,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.28 | bwd_microstep: 3860.76 | bwd_inner_microstep: 3853.26 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.20 [2024-11-14 00:02:10,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.28 | bwd: 3860.78 | bwd_inner: 3853.26 | bwd_allreduce: 7.48 | step: 21.20 5%|▌ | 2615/50750 [7:19:32<79:17:46, 5.93s/it] {'loss': 0.0064, 'learning_rate': 3.9951452987768726e-05, 'epoch': 2.58} 5%|▌ | 2615/50750 [7:19:32<79:17:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:02:16,823] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.49 | optimizer_step: 4.93 [2024-11-14 00:02:16,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.15 | bwd_microstep: 3852.11 | bwd_inner_microstep: 3844.38 | bwd_allreduce_microstep: 7.67 | step_microstep: 27.54 [2024-11-14 00:02:16,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.15 | bwd: 3852.13 | bwd_inner: 3844.38 | bwd_allreduce: 7.69 | step: 27.54 5%|▌ | 2616/50750 [7:19:38<79:19:25, 5.93s/it] {'loss': 0.0107, 'learning_rate': 3.995136406918621e-05, 'epoch': 2.58} 5%|▌ | 2616/50750 [7:19:38<79:19:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:02:22,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 00:02:22,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.93 | bwd_microstep: 3859.24 | bwd_inner_microstep: 3851.49 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.82 [2024-11-14 00:02:22,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.93 | bwd: 3859.26 | bwd_inner: 3851.49 | bwd_allreduce: 7.72 | step: 21.83 5%|▌ | 2617/50750 [7:19:44<79:19:35, 5.93s/it] {'loss': 0.0011, 'learning_rate': 3.995127506934578e-05, 'epoch': 2.58} 5%|▌ | 2617/50750 [7:19:44<79:19:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:02:28,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:02:28,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.94 | bwd_microstep: 3861.94 | bwd_inner_microstep: 3854.42 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.16 [2024-11-14 00:02:28,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.93 | bwd: 3861.96 
| bwd_inner: 3854.42 | bwd_allreduce: 7.49 | step: 21.16 5%|▌ | 2618/50750 [7:19:50<79:20:39, 5.93s/it] {'loss': 0.0027, 'learning_rate': 3.995118598824781e-05, 'epoch': 2.58} 5%|▌ | 2618/50750 [7:19:50<79:20:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:02:34,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:02:34,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.59 | bwd_microstep: 3856.28 | bwd_inner_microstep: 3848.75 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.90 [2024-11-14 00:02:34,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.59 | bwd: 3856.29 | bwd_inner: 3848.75 | bwd_allreduce: 7.50 | step: 20.90 5%|▌ | 2619/50750 [7:19:56<79:19:22, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.9951096825892665e-05, 'epoch': 2.58} 5%|▌ | 2619/50750 [7:19:56<79:19:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:02:40,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:02:40,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.47 | bwd_microstep: 3855.25 | bwd_inner_microstep: 3847.71 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-14 00:02:40,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.45 | bwd: 3855.26 | bwd_inner: 3847.71 | bwd_allreduce: 7.51 | step: 21.21 5%|▌ | 2620/50750 [7:20:02<79:20:52, 5.94s/it] {'loss': 0.3584, 'learning_rate': 3.9951007582280714e-05, 'epoch': 2.58} 5%|▌ | 2620/50750 [7:20:02<79:20:52, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:02:46,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | 
optimizer_step: 4.93 [2024-11-14 00:02:46,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.37 | bwd_microstep: 3857.68 | bwd_inner_microstep: 3850.06 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.03 [2024-11-14 00:02:46,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.37 | bwd: 3857.69 | bwd_inner: 3850.06 | bwd_allreduce: 7.59 | step: 21.04 5%|▌ | 2621/50750 [7:20:08<79:18:56, 5.93s/it] {'loss': 0.0032, 'learning_rate': 3.995091825741231e-05, 'epoch': 2.58} 5%|▌ | 2621/50750 [7:20:08<79:18:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:02:52,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:02:52,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.12 | bwd_microstep: 3854.42 | bwd_inner_microstep: 3846.87 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.07 [2024-11-14 00:02:52,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.12 | bwd: 3854.43 | bwd_inner: 3846.87 | bwd_allreduce: 7.52 | step: 21.07 5%|▌ | 2622/50750 [7:20:14<79:16:57, 5.93s/it] {'loss': 0.3716, 'learning_rate': 3.9950828851287824e-05, 'epoch': 2.58} 5%|▌ | 2622/50750 [7:20:14<79:16:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:02:58,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:02:58,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.55 | bwd_microstep: 3851.06 | bwd_inner_microstep: 3843.33 | bwd_allreduce_microstep: 7.68 | step_microstep: 20.92 [2024-11-14 00:02:58,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.55 | bwd: 3851.07 | bwd_inner: 3843.34 | bwd_allreduce: 7.69 | step: 20.92 5%|▌ | 2623/50750 [7:20:20<79:15:34, 
5.93s/it] {'loss': 0.0012, 'learning_rate': 3.995073936390761e-05, 'epoch': 2.58} 5%|▌ | 2623/50750 [7:20:20<79:15:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:03:04,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:03:04,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.80 | bwd_microstep: 3851.16 | bwd_inner_microstep: 3843.47 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.49 [2024-11-14 00:03:04,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.80 | bwd: 3851.17 | bwd_inner: 3843.47 | bwd_allreduce: 7.66 | step: 21.49 5%|▌ | 2624/50750 [7:20:26<79:15:18, 5.93s/it] {'loss': 0.0048, 'learning_rate': 3.995064979527205e-05, 'epoch': 2.59} 5%|▌ | 2624/50750 [7:20:26<79:15:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:03:10,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:03:10,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.90 | bwd_microstep: 3856.44 | bwd_inner_microstep: 3848.96 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.07 [2024-11-14 00:03:10,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.90 | bwd: 3856.45 | bwd_inner: 3848.96 | bwd_allreduce: 7.46 | step: 21.07 5%|▌ | 2625/50750 [7:20:32<79:15:04, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.995056014538149e-05, 'epoch': 2.59} 5%|▌ | 2625/50750 [7:20:32<79:15:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:03:16,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 00:03:16,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2031.27 | bwd_microstep: 3856.73 | bwd_inner_microstep: 3849.15 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.56 [2024-11-14 00:03:16,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.27 | bwd: 3856.74 | bwd_inner: 3849.15 | bwd_allreduce: 7.55 | step: 21.56 5%|▌ | 2626/50750 [7:20:38<79:17:53, 5.93s/it] {'loss': 0.1898, 'learning_rate': 3.995047041423631e-05, 'epoch': 2.59} 5%|▌ | 2626/50750 [7:20:38<79:17:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:03:22,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:03:22,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.56 | bwd_microstep: 3853.22 | bwd_inner_microstep: 3845.66 | bwd_allreduce_microstep: 7.52 | step_microstep: 20.91 [2024-11-14 00:03:22,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.54 | bwd: 3853.23 | bwd_inner: 3845.66 | bwd_allreduce: 7.53 | step: 20.92 5%|▌ | 2627/50750 [7:20:44<79:19:03, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.995038060183686e-05, 'epoch': 2.59} 5%|▌ | 2627/50750 [7:20:44<79:19:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:03:28,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:03:28,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.20 | bwd_microstep: 3860.75 | bwd_inner_microstep: 3853.20 | bwd_allreduce_microstep: 7.51 | step_microstep: 20.96 [2024-11-14 00:03:28,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.20 | bwd: 3860.76 | bwd_inner: 3853.20 | bwd_allreduce: 7.52 | step: 20.97 5%|▌ | 2628/50750 [7:20:49<79:19:21, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.9950290708183514e-05, 'epoch': 2.59} 5%|▌ | 2628/50750 
[7:20:49<79:19:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:03:33,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:03:33,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.04 | bwd_microstep: 3848.02 | bwd_inner_microstep: 3840.32 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.28 [2024-11-14 00:03:33,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.04 | bwd: 3848.04 | bwd_inner: 3840.32 | bwd_allreduce: 7.67 | step: 21.28 5%|▌ | 2629/50750 [7:20:55<79:17:25, 5.93s/it] {'loss': 0.0087, 'learning_rate': 3.995020073327664e-05, 'epoch': 2.59} 5%|▌ | 2629/50750 [7:20:55<79:17:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:03:39,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-14 00:03:39,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.67 | bwd_microstep: 3848.54 | bwd_inner_microstep: 3841.07 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.82 [2024-11-14 00:03:39,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.67 | bwd: 3848.55 | bwd_inner: 3841.07 | bwd_allreduce: 7.44 | step: 21.82 5%|▌ | 2630/50750 [7:21:01<79:15:09, 5.93s/it] {'loss': 0.954, 'learning_rate': 3.9950110677116616e-05, 'epoch': 2.59} 5%|▌ | 2630/50750 [7:21:01<79:15:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:03:45,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 00:03:45,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.73 | bwd_microstep: 3854.83 | bwd_inner_microstep: 3847.10 | 
bwd_allreduce_microstep: 7.69 | step_microstep: 22.26 [2024-11-14 00:03:45,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.72 | bwd: 3854.85 | bwd_inner: 3847.10 | bwd_allreduce: 7.71 | step: 22.26 5%|▌ | 2631/50750 [7:21:07<79:17:49, 5.93s/it] {'loss': 0.02, 'learning_rate': 3.995002053970378e-05, 'epoch': 2.59} 5%|▌ | 2631/50750 [7:21:07<79:17:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:03:51,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:03:51,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.18 | bwd_microstep: 3856.69 | bwd_inner_microstep: 3849.20 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.02 [2024-11-14 00:03:51,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.18 | bwd: 3856.71 | bwd_inner: 3849.20 | bwd_allreduce: 7.46 | step: 21.03 5%|▌ | 2632/50750 [7:21:13<79:17:26, 5.93s/it] {'loss': 0.008, 'learning_rate': 3.994993032103852e-05, 'epoch': 2.59} 5%|▌ | 2632/50750 [7:21:13<79:17:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:03:57,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-14 00:03:57,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.33 | bwd_microstep: 3851.43 | bwd_inner_microstep: 3843.43 | bwd_allreduce_microstep: 7.95 | step_microstep: 22.38 [2024-11-14 00:03:57,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.33 | bwd: 3851.45 | bwd_inner: 3843.43 | bwd_allreduce: 7.97 | step: 22.38 5%|▌ | 2633/50750 [7:21:19<79:16:14, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.9949840021121205e-05, 'epoch': 2.59} 5%|▌ | 2633/50750 [7:21:19<79:16:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2202 [2024-11-14 00:04:03,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-14 00:04:03,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.80 | bwd_microstep: 3855.19 | bwd_inner_microstep: 3847.65 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.67 [2024-11-14 00:04:03,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.78 | bwd: 3855.20 | bwd_inner: 3847.65 | bwd_allreduce: 7.51 | step: 21.67 5%|▌ | 2634/50750 [7:21:25<79:17:12, 5.93s/it] {'loss': 0.043, 'learning_rate': 3.994974963995219e-05, 'epoch': 2.6} 5%|▌ | 2634/50750 [7:21:25<79:17:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:04:09,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 [2024-11-14 00:04:09,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.06 | bwd_microstep: 3858.71 | bwd_inner_microstep: 3851.12 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.21 [2024-11-14 00:04:09,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.06 | bwd: 3858.72 | bwd_inner: 3851.12 | bwd_allreduce: 7.56 | step: 22.21 5%|▌ | 2635/50750 [7:21:31<79:18:05, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.994965917753185e-05, 'epoch': 2.6} 5%|▌ | 2635/50750 [7:21:31<79:18:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:04:15,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 00:04:15,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.28 | bwd_microstep: 3847.85 | bwd_inner_microstep: 3840.18 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.59 [2024-11-14 00:04:15,455] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd: 2028.28 | bwd: 3847.87 | bwd_inner: 3840.18 | bwd_allreduce: 7.65 | step: 21.60 5%|▌ | 2636/50750 [7:21:37<79:16:01, 5.93s/it] {'loss': 0.0013, 'learning_rate': 3.994956863386055e-05, 'epoch': 2.6} 5%|▌ | 2636/50750 [7:21:37<79:16:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:04:21,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:04:21,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.57 | bwd_microstep: 3861.68 | bwd_inner_microstep: 3854.22 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.84 [2024-11-14 00:04:21,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.56 | bwd: 3861.69 | bwd_inner: 3854.22 | bwd_allreduce: 7.43 | step: 20.84 5%|▌ | 2637/50750 [7:21:43<79:18:34, 5.93s/it] {'loss': 0.3682, 'learning_rate': 3.9949478008938665e-05, 'epoch': 2.6} 5%|▌ | 2637/50750 [7:21:43<79:18:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:04:27,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-14 00:04:27,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.05 | bwd_microstep: 3858.57 | bwd_inner_microstep: 3850.74 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.09 [2024-11-14 00:04:27,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.05 | bwd: 3858.59 | bwd_inner: 3850.74 | bwd_allreduce: 7.80 | step: 22.09 5%|▌ | 2638/50750 [7:21:49<79:18:36, 5.93s/it] {'loss': 0.1915, 'learning_rate': 3.994938730276656e-05, 'epoch': 2.6} 5%|▌ | 2638/50750 [7:21:49<79:18:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:04:33,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-14 00:04:33,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.50 | bwd_microstep: 3859.62 | bwd_inner_microstep: 3852.05 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.08 [2024-11-14 00:04:33,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.48 | bwd: 3859.63 | bwd_inner: 3852.05 | bwd_allreduce: 7.54 | step: 21.08 5%|▌ | 2639/50750 [7:21:55<79:18:59, 5.94s/it] {'loss': 0.1211, 'learning_rate': 3.9949296515344606e-05, 'epoch': 2.6} 5%|▌ | 2639/50750 [7:21:55<79:18:59, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:04:39,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:04:39,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.44 | bwd_microstep: 3861.41 | bwd_inner_microstep: 3853.94 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.94 [2024-11-14 00:04:39,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.44 | bwd: 3861.42 | bwd_inner: 3853.94 | bwd_allreduce: 7.44 | step: 20.95 5%|▌ | 2640/50750 [7:22:01<79:20:08, 5.94s/it] {'loss': 0.0651, 'learning_rate': 3.994920564667317e-05, 'epoch': 2.6} 5%|▌ | 2640/50750 [7:22:01<79:20:08, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:04:45,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:04:45,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.10 | bwd_microstep: 3854.88 | bwd_inner_microstep: 3847.11 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.77 [2024-11-14 00:04:45,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.10 | bwd: 3854.89 | bwd_inner: 3847.11 | bwd_allreduce: 7.73 | 
step: 21.77 5%|▌ | 2641/50750 [7:22:07<79:18:58, 5.94s/it] {'loss': 0.0331, 'learning_rate': 3.9949114696752624e-05, 'epoch': 2.6} 5%|▌ | 2641/50750 [7:22:07<79:18:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:04:51,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:04:51,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.58 | bwd_microstep: 3858.69 | bwd_inner_microstep: 3851.19 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.22 [2024-11-14 00:04:51,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.58 | bwd: 3858.70 | bwd_inner: 3851.19 | bwd_allreduce: 7.47 | step: 21.23 5%|▌ | 2642/50750 [7:22:13<79:18:01, 5.93s/it] {'loss': 0.2667, 'learning_rate': 3.994902366558334e-05, 'epoch': 2.6} 5%|▌ | 2642/50750 [7:22:13<79:18:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:04:56,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:04:56,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3849.44 | bwd_inner_microstep: 3841.72 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.69 [2024-11-14 00:04:56,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3849.45 | bwd_inner: 3841.72 | bwd_allreduce: 7.69 | step: 22.71 5%|▌ | 2643/50750 [7:22:18<79:16:06, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.994893255316568e-05, 'epoch': 2.6} 5%|▌ | 2643/50750 [7:22:18<79:16:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:05:02,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.58 | optimizer_step: 4.93 [2024-11-14 00:05:02,933] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.01 | bwd_microstep: 3851.06 | bwd_inner_microstep: 3843.49 | bwd_allreduce_microstep: 7.53 | step_microstep: 23.15 [2024-11-14 00:05:02,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.01 | bwd: 3851.08 | bwd_inner: 3843.49 | bwd_allreduce: 7.54 | step: 23.16 5%|▌ | 2644/50750 [7:22:24<79:16:16, 5.93s/it] {'loss': 0.0236, 'learning_rate': 3.994884135950003e-05, 'epoch': 2.6} 5%|▌ | 2644/50750 [7:22:24<79:16:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:05:08,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.66 | optimizer_step: 5.02 [2024-11-14 00:05:08,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.29 | bwd_microstep: 3852.91 | bwd_inner_microstep: 3844.70 | bwd_allreduce_microstep: 8.14 | step_microstep: 28.44 [2024-11-14 00:05:08,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.29 | bwd: 3852.93 | bwd_inner: 3844.70 | bwd_allreduce: 8.17 | step: 28.44 5%|▌ | 2645/50750 [7:22:30<79:19:36, 5.94s/it] {'loss': 0.0033, 'learning_rate': 3.994875008458675e-05, 'epoch': 2.61} 5%|▌ | 2645/50750 [7:22:30<79:19:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:05:14,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:05:14,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.72 | bwd_microstep: 3853.38 | bwd_inner_microstep: 3845.89 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.15 [2024-11-14 00:05:14,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.71 | bwd: 3853.39 | bwd_inner: 3845.89 | bwd_allreduce: 7.46 | step: 21.15 5%|▌ | 2646/50750 [7:22:36<79:18:56, 5.94s/it] {'loss': 0.0019, 'learning_rate': 
3.994865872842622e-05, 'epoch': 2.61} 5%|▌ | 2646/50750 [7:22:36<79:18:56, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:05:20,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:05:20,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.80 | bwd_microstep: 3850.07 | bwd_inner_microstep: 3842.59 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.12 [2024-11-14 00:05:20,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.80 | bwd: 3850.08 | bwd_inner: 3842.59 | bwd_allreduce: 7.45 | step: 21.13 5%|▌ | 2647/50750 [7:22:42<79:15:45, 5.93s/it] {'loss': 0.0106, 'learning_rate': 3.994856729101881e-05, 'epoch': 2.61} 5%|▌ | 2647/50750 [7:22:42<79:15:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:05:26,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:05:26,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.49 | bwd_microstep: 3862.30 | bwd_inner_microstep: 3854.81 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.91 [2024-11-14 00:05:26,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.49 | bwd: 3862.31 | bwd_inner: 3854.81 | bwd_allreduce: 7.46 | step: 20.91 5%|▌ | 2648/50750 [7:22:48<79:16:59, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.994847577236488e-05, 'epoch': 2.61} 5%|▌ | 2648/50750 [7:22:48<79:16:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:05:32,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 [2024-11-14 00:05:32,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.46 | bwd_microstep: 
3861.24 | bwd_inner_microstep: 3853.50 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.84 [2024-11-14 00:05:32,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.45 | bwd: 3861.25 | bwd_inner: 3853.50 | bwd_allreduce: 7.71 | step: 21.85 5%|▌ | 2649/50750 [7:22:54<79:18:10, 5.94s/it] {'loss': 0.0026, 'learning_rate': 3.994838417246482e-05, 'epoch': 2.61} 5%|▌ | 2649/50750 [7:22:54<79:18:10, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:05:38,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:05:38,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.63 | bwd_microstep: 3853.24 | bwd_inner_microstep: 3845.74 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.90 [2024-11-14 00:05:38,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.62 | bwd: 3853.25 | bwd_inner: 3845.74 | bwd_allreduce: 7.48 | step: 20.91 5%|▌ | 2650/50750 [7:23:00<79:18:11, 5.94s/it] {'loss': 0.3556, 'learning_rate': 3.9948292491318984e-05, 'epoch': 2.61} 5%|▌ | 2650/50750 [7:23:00<79:18:11, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:05:44,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:05:44,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | bwd_microstep: 3851.46 | bwd_inner_microstep: 3844.01 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.84 [2024-11-14 00:05:44,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3851.47 | bwd_inner: 3844.01 | bwd_allreduce: 7.43 | step: 20.85 5%|▌ | 2651/50750 [7:23:06<79:15:01, 5.93s/it] {'loss': 0.2666, 'learning_rate': 3.9948200728927774e-05, 'epoch': 2.61} 5%|▌ | 2651/50750 [7:23:06<79:15:01, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:05:50,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:05:50,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.78 | bwd_microstep: 3851.91 | bwd_inner_microstep: 3844.42 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-14 00:05:50,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.78 | bwd: 3851.92 | bwd_inner: 3844.42 | bwd_allreduce: 7.46 | step: 20.88 5%|▌ | 2652/50750 [7:23:12<79:13:24, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.9948108885291535e-05, 'epoch': 2.61} 5%|▌ | 2652/50750 [7:23:12<79:13:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:05:56,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-14 00:05:56,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.54 | bwd_microstep: 3856.60 | bwd_inner_microstep: 3849.07 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.10 [2024-11-14 00:05:56,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.53 | bwd: 3856.62 | bwd_inner: 3849.07 | bwd_allreduce: 7.51 | step: 21.10 5%|▌ | 2653/50750 [7:23:18<79:14:02, 5.93s/it] {'loss': 0.0034, 'learning_rate': 3.994801696041065e-05, 'epoch': 2.61} 5%|▌ | 2653/50750 [7:23:18<79:14:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:06:02,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 00:06:02,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.57 | bwd_microstep: 3849.75 | bwd_inner_microstep: 3842.28 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.98 [2024-11-14 
00:06:02,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.57 | bwd: 3849.76 | bwd_inner: 3842.28 | bwd_allreduce: 7.44 | step: 20.99 5%|▌ | 2654/50750 [7:23:24<79:11:38, 5.93s/it] {'loss': 0.3573, 'learning_rate': 3.99479249542855e-05, 'epoch': 2.61} 5%|▌ | 2654/50750 [7:23:24<79:11:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:06:08,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 00:06:08,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.53 | bwd_microstep: 3852.89 | bwd_inner_microstep: 3845.20 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.71 [2024-11-14 00:06:08,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.53 | bwd: 3852.90 | bwd_inner: 3845.20 | bwd_allreduce: 7.66 | step: 21.72 5%|▌ | 2655/50750 [7:23:30<79:11:30, 5.93s/it] {'loss': 0.1559, 'learning_rate': 3.994783286691647e-05, 'epoch': 2.62} 5%|▌ | 2655/50750 [7:23:30<79:11:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:06:14,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:06:14,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.63 | bwd_microstep: 3855.12 | bwd_inner_microstep: 3847.59 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.27 [2024-11-14 00:06:14,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.61 | bwd: 3855.13 | bwd_inner: 3847.59 | bwd_allreduce: 7.49 | step: 21.27 5%|▌ | 2656/50750 [7:23:36<79:12:18, 5.93s/it] {'loss': 0.0038, 'learning_rate': 3.99477406983039e-05, 'epoch': 2.62} 5%|▌ | 2656/50750 [7:23:36<79:12:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:06:20,040] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:06:20,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.46 | bwd_microstep: 3854.05 | bwd_inner_microstep: 3846.58 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.04 [2024-11-14 00:06:20,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.46 | bwd: 3854.06 | bwd_inner: 3846.58 | bwd_allreduce: 7.44 | step: 21.05 5%|▌ | 2657/50750 [7:23:42<79:12:44, 5.93s/it] {'loss': 0.322, 'learning_rate': 3.99476484484482e-05, 'epoch': 2.62} 5%|▌ | 2657/50750 [7:23:42<79:12:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:06:25,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:06:25,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.17 | bwd_microstep: 3850.43 | bwd_inner_microstep: 3842.91 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-14 00:06:25,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.17 | bwd: 3850.44 | bwd_inner: 3842.91 | bwd_allreduce: 7.49 | step: 21.07 5%|▌ | 2658/50750 [7:23:47<79:11:24, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.994755611734972e-05, 'epoch': 2.62} 5%|▌ | 2658/50750 [7:23:47<79:11:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:06:31,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-14 00:06:31,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.27 | bwd_microstep: 3854.22 | bwd_inner_microstep: 3846.43 | bwd_allreduce_microstep: 7.74 | step_microstep: 22.07 [2024-11-14 00:06:31,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.27 | bwd: 3854.23 | 
bwd_inner: 3846.43 | bwd_allreduce: 7.75 | step: 22.08 5%|▌ | 2659/50750 [7:23:53<79:13:23, 5.93s/it] {'loss': 0.0023, 'learning_rate': 3.994746370500886e-05, 'epoch': 2.62} 5%|▌ | 2659/50750 [7:23:53<79:13:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:06:37,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:06:37,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.01 | bwd_microstep: 3853.80 | bwd_inner_microstep: 3846.28 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.53 [2024-11-14 00:06:37,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.00 | bwd: 3853.81 | bwd_inner: 3846.28 | bwd_allreduce: 7.49 | step: 21.53 5%|▌ | 2660/50750 [7:23:59<79:14:39, 5.93s/it] {'loss': 0.0272, 'learning_rate': 3.994737121142598e-05, 'epoch': 2.62} 5%|▌ | 2660/50750 [7:23:59<79:14:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:06:43,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:06:43,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.15 | bwd_microstep: 3860.20 | bwd_inner_microstep: 3852.73 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.26 [2024-11-14 00:06:43,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.15 | bwd: 3860.21 | bwd_inner: 3852.73 | bwd_allreduce: 7.44 | step: 21.26 5%|▌ | 2661/50750 [7:24:05<79:15:19, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.994727863660146e-05, 'epoch': 2.62} 5%|▌ | 2661/50750 [7:24:05<79:15:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:06:49,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | 
optimizer_step: 4.93 [2024-11-14 00:06:49,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.18 | bwd_microstep: 3849.93 | bwd_inner_microstep: 3842.05 | bwd_allreduce_microstep: 7.83 | step_microstep: 24.08 [2024-11-14 00:06:49,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.18 | bwd: 3849.95 | bwd_inner: 3842.05 | bwd_allreduce: 7.85 | step: 24.09 5%|▌ | 2662/50750 [7:24:11<79:13:46, 5.93s/it] {'loss': 0.7483, 'learning_rate': 3.9947185980535675e-05, 'epoch': 2.62} 5%|▌ | 2662/50750 [7:24:11<79:13:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:06:55,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:06:55,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.33 | bwd_microstep: 3852.60 | bwd_inner_microstep: 3845.09 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.03 [2024-11-14 00:06:55,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.33 | bwd: 3852.61 | bwd_inner: 3845.09 | bwd_allreduce: 7.48 | step: 21.04 5%|▌ | 2663/50750 [7:24:17<79:12:47, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.994709324322902e-05, 'epoch': 2.62} 5%|▌ | 2663/50750 [7:24:17<79:12:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:07:01,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 00:07:01,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.79 | bwd_microstep: 3853.16 | bwd_inner_microstep: 3845.51 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.44 [2024-11-14 00:07:01,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.78 | bwd: 3853.17 | bwd_inner: 3845.51 | bwd_allreduce: 7.62 | step: 21.44 5%|▌ | 2664/50750 [7:24:23<79:45:58, 
5.97s/it] {'loss': 0.0015, 'learning_rate': 3.994700042468184e-05, 'epoch': 2.62} 5%|▌ | 2664/50750 [7:24:23<79:45:58, 5.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:07:07,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-14 00:07:07,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.59 | bwd_microstep: 3853.96 | bwd_inner_microstep: 3846.47 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.98 [2024-11-14 00:07:07,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.57 | bwd: 3853.98 | bwd_inner: 3846.47 | bwd_allreduce: 7.46 | step: 20.98 5%|▌ | 2665/50750 [7:24:29<79:38:27, 5.96s/it] {'loss': 0.9853, 'learning_rate': 3.994690752489454e-05, 'epoch': 2.63} 5%|▌ | 2665/50750 [7:24:29<79:38:27, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:07:13,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:07:13,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.12 | bwd_microstep: 3855.91 | bwd_inner_microstep: 3848.42 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.05 [2024-11-14 00:07:13,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.12 | bwd: 3855.92 | bwd_inner: 3848.42 | bwd_allreduce: 7.46 | step: 21.05 5%|▌ | 2666/50750 [7:24:35<79:30:22, 5.95s/it] {'loss': 0.1093, 'learning_rate': 3.994681454386749e-05, 'epoch': 2.63} 5%|▌ | 2666/50750 [7:24:35<79:30:22, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:07:19,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:07:19,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2023.45 | bwd_microstep: 3857.48 | bwd_inner_microstep: 3850.01 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.10 [2024-11-14 00:07:19,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3857.50 | bwd_inner: 3850.01 | bwd_allreduce: 7.45 | step: 21.11 5%|▌ | 2667/50750 [7:24:41<79:24:30, 5.95s/it] {'loss': 0.7351, 'learning_rate': 3.9946721481601066e-05, 'epoch': 2.63} 5%|▌ | 2667/50750 [7:24:41<79:24:30, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:07:25,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:07:25,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.57 | bwd_microstep: 3855.07 | bwd_inner_microstep: 3847.61 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.14 [2024-11-14 00:07:25,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.57 | bwd: 3855.08 | bwd_inner: 3847.61 | bwd_allreduce: 7.44 | step: 21.14 5%|▌ | 2668/50750 [7:24:47<79:20:11, 5.94s/it] {'loss': 0.0025, 'learning_rate': 3.9946628338095645e-05, 'epoch': 2.63} 5%|▌ | 2668/50750 [7:24:47<79:20:11, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:07:31,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:07:31,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.93 | bwd_microstep: 3854.85 | bwd_inner_microstep: 3847.27 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.36 [2024-11-14 00:07:31,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.93 | bwd: 3854.87 | bwd_inner: 3847.27 | bwd_allreduce: 7.55 | step: 21.37 5%|▌ | 2669/50750 [7:24:53<79:17:24, 5.94s/it] {'loss': 0.0032, 'learning_rate': 3.994653511335162e-05, 'epoch': 2.63} 5%|▌ | 2669/50750 
[7:24:53<79:17:24, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:07:37,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.93 [2024-11-14 00:07:37,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2036.42 | bwd_microstep: 3849.68 | bwd_inner_microstep: 3841.36 | bwd_allreduce_microstep: 8.27 | step_microstep: 22.47 [2024-11-14 00:07:37,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2036.40 | bwd: 3849.69 | bwd_inner: 3841.36 | bwd_allreduce: 8.29 | step: 22.47 5%|▌ | 2670/50750 [7:24:59<79:19:29, 5.94s/it] {'loss': 0.0029, 'learning_rate': 3.994644180736936e-05, 'epoch': 2.63} 5%|▌ | 2670/50750 [7:24:59<79:19:29, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:07:43,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-14 00:07:43,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | bwd_microstep: 3862.65 | bwd_inner_microstep: 3854.37 | bwd_allreduce_microstep: 8.24 | step_microstep: 22.11 [2024-11-14 00:07:43,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.91 | bwd: 3862.67 | bwd_inner: 3854.37 | bwd_allreduce: 8.26 | step: 22.12 5%|▌ | 2671/50750 [7:25:05<79:19:53, 5.94s/it] {'loss': 0.0619, 'learning_rate': 3.9946348420149245e-05, 'epoch': 2.63} 5%|▌ | 2671/50750 [7:25:05<79:19:53, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:07:49,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:07:49,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.97 | bwd_microstep: 3856.12 | bwd_inner_microstep: 3848.58 | 
bwd_allreduce_microstep: 7.50 | step_microstep: 21.12 [2024-11-14 00:07:49,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.95 | bwd: 3856.14 | bwd_inner: 3848.58 | bwd_allreduce: 7.51 | step: 21.12 5%|▌ | 2672/50750 [7:25:11<79:18:26, 5.94s/it] {'loss': 0.0059, 'learning_rate': 3.9946254951691655e-05, 'epoch': 2.63} 5%|▌ | 2672/50750 [7:25:11<79:18:26, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:07:55,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:07:55,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.50 | bwd_microstep: 3862.55 | bwd_inner_microstep: 3854.83 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.55 [2024-11-14 00:07:55,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.50 | bwd: 3862.57 | bwd_inner: 3854.83 | bwd_allreduce: 7.69 | step: 21.55 5%|▌ | 2673/50750 [7:25:17<79:17:38, 5.94s/it] {'loss': 0.402, 'learning_rate': 3.994616140199697e-05, 'epoch': 2.63} 5%|▌ | 2673/50750 [7:25:17<79:17:38, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:08:01,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 00:08:01,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.40 | bwd_microstep: 3848.47 | bwd_inner_microstep: 3841.00 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.10 [2024-11-14 00:08:01,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.40 | bwd: 3848.48 | bwd_inner: 3841.00 | bwd_allreduce: 7.44 | step: 21.11 5%|▌ | 2674/50750 [7:25:22<79:14:02, 5.93s/it] {'loss': 0.2076, 'learning_rate': 3.994606777106558e-05, 'epoch': 2.63} 5%|▌ | 2674/50750 [7:25:22<79:14:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2201 [2024-11-14 00:08:06,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:08:06,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.00 | bwd_microstep: 3852.45 | bwd_inner_microstep: 3844.96 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.12 [2024-11-14 00:08:06,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.00 | bwd: 3852.46 | bwd_inner: 3844.96 | bwd_allreduce: 7.46 | step: 21.12 5%|▌ | 2675/50750 [7:25:28<79:13:31, 5.93s/it] {'loss': 0.0297, 'learning_rate': 3.9945974058897856e-05, 'epoch': 2.64} 5%|▌ | 2675/50750 [7:25:28<79:13:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:08:12,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:08:12,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.13 | bwd_microstep: 3853.67 | bwd_inner_microstep: 3846.18 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.94 [2024-11-14 00:08:12,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.13 | bwd: 3853.68 | bwd_inner: 3846.18 | bwd_allreduce: 7.46 | step: 20.94 5%|▌ | 2676/50750 [7:25:34<79:12:29, 5.93s/it] {'loss': 0.0053, 'learning_rate': 3.994588026549418e-05, 'epoch': 2.64} 5%|▌ | 2676/50750 [7:25:34<79:12:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:08:18,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:08:18,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.42 | bwd_microstep: 3855.03 | bwd_inner_microstep: 3847.59 | bwd_allreduce_microstep: 7.40 | step_microstep: 20.87 [2024-11-14 00:08:18,824] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.42 | bwd: 3855.04 | bwd_inner: 3847.59 | bwd_allreduce: 7.42 | step: 20.87 5%|▌ | 2677/50750 [7:25:40<79:12:14, 5.93s/it] {'loss': 0.0075, 'learning_rate': 3.9945786390854946e-05, 'epoch': 2.64} 5%|▌ | 2677/50750 [7:25:40<79:12:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:08:24,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:08:24,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.99 | bwd_microstep: 3848.13 | bwd_inner_microstep: 3840.67 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.07 [2024-11-14 00:08:24,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.99 | bwd: 3848.14 | bwd_inner: 3840.67 | bwd_allreduce: 7.43 | step: 21.09 5%|▌ | 2678/50750 [7:25:46<79:09:31, 5.93s/it] {'loss': 0.0125, 'learning_rate': 3.994569243498053e-05, 'epoch': 2.64} 5%|▌ | 2678/50750 [7:25:46<79:09:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:08:30,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:08:30,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.15 | bwd_microstep: 3852.86 | bwd_inner_microstep: 3845.33 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.92 [2024-11-14 00:08:30,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.15 | bwd: 3852.87 | bwd_inner: 3845.33 | bwd_allreduce: 7.50 | step: 20.93 5%|▌ | 2679/50750 [7:25:52<79:08:31, 5.93s/it] {'loss': 0.0017, 'learning_rate': 3.99455983978713e-05, 'epoch': 2.64} 5%|▌ | 2679/50750 [7:25:52<79:08:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:08:36,597] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:08:36,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.68 | bwd_microstep: 3853.59 | bwd_inner_microstep: 3845.92 | bwd_allreduce_microstep: 7.62 | step_microstep: 20.83 [2024-11-14 00:08:36,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.68 | bwd: 3853.60 | bwd_inner: 3845.92 | bwd_allreduce: 7.64 | step: 20.84 5%|▌ | 2680/50750 [7:25:58<79:09:04, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.994550427952765e-05, 'epoch': 2.64} 5%|▌ | 2680/50750 [7:25:58<79:09:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:08:42,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:08:42,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.75 | bwd_microstep: 3849.95 | bwd_inner_microstep: 3842.46 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.88 [2024-11-14 00:08:42,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.73 | bwd: 3849.96 | bwd_inner: 3842.46 | bwd_allreduce: 7.46 | step: 20.88 5%|▌ | 2681/50750 [7:26:04<79:09:48, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.9945410079949976e-05, 'epoch': 2.64} 5%|▌ | 2681/50750 [7:26:04<79:09:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2193 [2024-11-14 00:08:48,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:08:48,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3848.00 | bwd_inner_microstep: 3840.52 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.90 [2024-11-14 00:08:48,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.88 | bwd: 3848.01 | bwd_inner: 3840.52 | 
bwd_allreduce: 7.45 | step: 20.90 5%|▌ | 2682/50750 [7:26:10<79:07:13, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.994531579913865e-05, 'epoch': 2.64} 5%|▌ | 2682/50750 [7:26:10<79:07:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:08:54,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:08:54,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.19 | bwd_microstep: 3850.42 | bwd_inner_microstep: 3842.94 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.88 [2024-11-14 00:08:54,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.19 | bwd: 3850.43 | bwd_inner: 3842.94 | bwd_allreduce: 7.45 | step: 20.88 5%|▌ | 2683/50750 [7:26:16<79:05:58, 5.92s/it] {'loss': 0.0283, 'learning_rate': 3.9945221437094055e-05, 'epoch': 2.64} 5%|▌ | 2683/50750 [7:26:16<79:05:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:09:00,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:09:00,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.44 | bwd_microstep: 3846.68 | bwd_inner_microstep: 3839.05 | bwd_allreduce_microstep: 7.59 | step_microstep: 20.94 [2024-11-14 00:09:00,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.44 | bwd: 3846.69 | bwd_inner: 3839.05 | bwd_allreduce: 7.60 | step: 20.95 5%|▌ | 2684/50750 [7:26:22<79:04:22, 5.92s/it] {'loss': 0.0387, 'learning_rate': 3.994512699381657e-05, 'epoch': 2.64} 5%|▌ | 2684/50750 [7:26:22<79:04:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:09:06,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 
00:09:06,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.27 | bwd_microstep: 3855.00 | bwd_inner_microstep: 3847.35 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.47 [2024-11-14 00:09:06,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.27 | bwd: 3855.01 | bwd_inner: 3847.35 | bwd_allreduce: 7.63 | step: 21.47 5%|▌ | 2685/50750 [7:26:28<79:06:48, 5.93s/it] {'loss': 0.1655, 'learning_rate': 3.994503246930659e-05, 'epoch': 2.65} 5%|▌ | 2685/50750 [7:26:28<79:06:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:09:12,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:09:12,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.52 | bwd_microstep: 3858.53 | bwd_inner_microstep: 3851.02 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-14 00:09:12,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.50 | bwd: 3858.54 | bwd_inner: 3851.02 | bwd_allreduce: 7.48 | step: 21.08 5%|▌ | 2686/50750 [7:26:34<79:08:34, 5.93s/it] {'loss': 0.0093, 'learning_rate': 3.99449378635645e-05, 'epoch': 2.65} 5%|▌ | 2686/50750 [7:26:34<79:08:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:09:18,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.04 [2024-11-14 00:09:18,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.35 | bwd_microstep: 3850.45 | bwd_inner_microstep: 3842.97 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.06 [2024-11-14 00:09:18,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.35 | bwd: 3850.46 | bwd_inner: 3842.97 | bwd_allreduce: 7.45 | step: 21.07 5%|▌ | 2687/50750 [7:26:40<79:07:23, 5.93s/it] {'loss': 0.0664, 
'learning_rate': 3.9944843176590684e-05, 'epoch': 2.65} 5%|▌ | 2687/50750 [7:26:40<79:07:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:09:24,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:09:24,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.32 | bwd_microstep: 3853.32 | bwd_inner_microstep: 3845.83 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.97 [2024-11-14 00:09:24,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.32 | bwd: 3853.33 | bwd_inner: 3845.83 | bwd_allreduce: 7.46 | step: 20.97 5%|▌ | 2688/50750 [7:26:45<79:06:56, 5.93s/it] {'loss': 0.0036, 'learning_rate': 3.994474840838552e-05, 'epoch': 2.65} 5%|▌ | 2688/50750 [7:26:45<79:06:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:09:29,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 00:09:29,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.65 | bwd_microstep: 3856.65 | bwd_inner_microstep: 3848.98 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.50 [2024-11-14 00:09:29,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.65 | bwd: 3856.67 | bwd_inner: 3848.98 | bwd_allreduce: 7.65 | step: 21.50 5%|▌ | 2689/50750 [7:26:51<79:08:53, 5.93s/it] {'loss': 0.3069, 'learning_rate': 3.9944653558949394e-05, 'epoch': 2.65} 5%|▌ | 2689/50750 [7:26:51<79:08:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:09:35,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:09:35,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.86 | 
bwd_microstep: 3853.28 | bwd_inner_microstep: 3845.78 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.15 [2024-11-14 00:09:35,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.85 | bwd: 3853.30 | bwd_inner: 3845.78 | bwd_allreduce: 7.47 | step: 21.15 5%|▌ | 2690/50750 [7:26:57<79:08:30, 5.93s/it] {'loss': 0.0155, 'learning_rate': 3.9944558628282706e-05, 'epoch': 2.65} 5%|▌ | 2690/50750 [7:26:57<79:08:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:09:41,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:09:41,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.55 | bwd_microstep: 3850.20 | bwd_inner_microstep: 3842.74 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.84 [2024-11-14 00:09:41,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.55 | bwd: 3850.21 | bwd_inner: 3842.74 | bwd_allreduce: 7.43 | step: 20.84 5%|▌ | 2691/50750 [7:27:03<79:06:15, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.994446361638583e-05, 'epoch': 2.65} 5%|▌ | 2691/50750 [7:27:03<79:06:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:09:47,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:09:47,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.56 | bwd_microstep: 3849.62 | bwd_inner_microstep: 3842.16 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.93 [2024-11-14 00:09:47,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.56 | bwd: 3849.63 | bwd_inner: 3842.16 | bwd_allreduce: 7.43 | step: 20.93 5%|▌ | 2692/50750 [7:27:09<79:04:18, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.9944368523259166e-05, 'epoch': 2.65} 5%|▌ | 2692/50750 [7:27:09<79:04:18, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:09:53,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:09:53,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.85 | bwd_microstep: 3852.37 | bwd_inner_microstep: 3844.89 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-14 00:09:53,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.85 | bwd: 3852.38 | bwd_inner: 3844.89 | bwd_allreduce: 7.45 | step: 20.99 5%|▌ | 2693/50750 [7:27:15<79:03:55, 5.92s/it] {'loss': 0.0402, 'learning_rate': 3.9944273348903085e-05, 'epoch': 2.65} 5%|▌ | 2693/50750 [7:27:15<79:03:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:09:59,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:09:59,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.40 | bwd_microstep: 3852.00 | bwd_inner_microstep: 3844.52 | bwd_allreduce_microstep: 7.43 | step_microstep: 23.15 [2024-11-14 00:09:59,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.40 | bwd: 3852.01 | bwd_inner: 3844.52 | bwd_allreduce: 7.44 | step: 23.15 5%|▌ | 2694/50750 [7:27:21<79:05:22, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.9944178093317975e-05, 'epoch': 2.65} 5%|▌ | 2694/50750 [7:27:21<79:05:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:10:05,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:10:05,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.12 | bwd_microstep: 3852.17 | bwd_inner_microstep: 3844.69 | bwd_allreduce_microstep: 7.44 | 
step_microstep: 20.92 [2024-11-14 00:10:05,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.12 | bwd: 3852.18 | bwd_inner: 3844.69 | bwd_allreduce: 7.45 | step: 20.92 5%|▌ | 2695/50750 [7:27:27<79:05:46, 5.93s/it] {'loss': 0.6627, 'learning_rate': 3.994408275650425e-05, 'epoch': 2.66} 5%|▌ | 2695/50750 [7:27:27<79:05:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:10:11,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:10:11,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.81 | bwd_microstep: 3845.54 | bwd_inner_microstep: 3838.03 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.00 [2024-11-14 00:10:11,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.81 | bwd: 3845.55 | bwd_inner: 3838.03 | bwd_allreduce: 7.48 | step: 21.00 5%|▌ | 2696/50750 [7:27:33<79:04:17, 5.92s/it] {'loss': 0.1314, 'learning_rate': 3.9943987338462264e-05, 'epoch': 2.66} 5%|▌ | 2696/50750 [7:27:33<79:04:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:10:17,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-14 00:10:17,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.61 | bwd_microstep: 3853.98 | bwd_inner_microstep: 3846.26 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.66 [2024-11-14 00:10:17,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.61 | bwd: 3853.99 | bwd_inner: 3846.27 | bwd_allreduce: 7.68 | step: 21.68 5%|▌ | 2697/50750 [7:27:39<79:05:04, 5.92s/it] {'loss': 0.0123, 'learning_rate': 3.994389183919242e-05, 'epoch': 2.66} 5%|▌ | 2697/50750 [7:27:39<79:05:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 
00:10:23,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 00:10:23,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.74 | bwd_microstep: 3857.65 | bwd_inner_microstep: 3850.02 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.36 [2024-11-14 00:10:23,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.72 | bwd: 3857.66 | bwd_inner: 3850.02 | bwd_allreduce: 7.60 | step: 21.37 5%|▌ | 2698/50750 [7:27:45<79:08:18, 5.93s/it] {'loss': 0.5854, 'learning_rate': 3.9943796258695115e-05, 'epoch': 2.66} 5%|▌ | 2698/50750 [7:27:45<79:08:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:10:29,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 00:10:29,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.60 | bwd_microstep: 3853.79 | bwd_inner_microstep: 3845.91 | bwd_allreduce_microstep: 7.82 | step_microstep: 24.09 [2024-11-14 00:10:29,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.59 | bwd: 3853.81 | bwd_inner: 3845.91 | bwd_allreduce: 7.85 | step: 24.08 5%|▌ | 2699/50750 [7:27:51<79:12:03, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.994370059697073e-05, 'epoch': 2.66} 5%|▌ | 2699/50750 [7:27:51<79:12:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:10:35,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:10:35,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.57 | bwd_microstep: 3854.88 | bwd_inner_microstep: 3847.23 | bwd_allreduce_microstep: 7.61 | step_microstep: 20.95 [2024-11-14 00:10:35,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2030.55 | bwd: 3854.89 | bwd_inner: 3847.23 | bwd_allreduce: 7.63 | step: 20.96 5%|▌ | 2700/50750 [7:27:57<79:12:21, 5.93s/it] {'loss': 0.352, 'learning_rate': 3.9943604854019654e-05, 'epoch': 2.66} 5%|▌ | 2700/50750 [7:27:57<79:12:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:10:41,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 00:10:41,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.23 | bwd_microstep: 3860.92 | bwd_inner_microstep: 3853.20 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.24 [2024-11-14 00:10:41,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3860.93 | bwd_inner: 3853.20 | bwd_allreduce: 7.69 | step: 21.25 5%|▌ | 2701/50750 [7:28:03<79:12:02, 5.93s/it] {'loss': 0.0082, 'learning_rate': 3.994350902984228e-05, 'epoch': 2.66} 5%|▌ | 2701/50750 [7:28:03<79:12:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:10:47,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:10:47,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.77 | bwd_microstep: 3852.18 | bwd_inner_microstep: 3844.67 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.16 [2024-11-14 00:10:47,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.75 | bwd: 3852.19 | bwd_inner: 3844.67 | bwd_allreduce: 7.48 | step: 21.16 5%|▌ | 2702/50750 [7:28:08<79:10:05, 5.93s/it] {'loss': 0.7527, 'learning_rate': 3.9943413124438996e-05, 'epoch': 2.66} 5%|▌ | 2702/50750 [7:28:08<79:10:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:10:52,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:10:52,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.24 | bwd_microstep: 3854.76 | bwd_inner_microstep: 3847.03 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.65 [2024-11-14 00:10:52,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.24 | bwd: 3854.77 | bwd_inner: 3847.03 | bwd_allreduce: 7.71 | step: 21.65 5%|▌ | 2703/50750 [7:28:14<79:09:11, 5.93s/it] {'loss': 0.0209, 'learning_rate': 3.99433171378102e-05, 'epoch': 2.66} 5%|▌ | 2703/50750 [7:28:14<79:09:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:10:58,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.93 [2024-11-14 00:10:58,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.19 | bwd_microstep: 3857.43 | bwd_inner_microstep: 3849.88 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.32 [2024-11-14 00:10:58,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.18 | bwd: 3857.44 | bwd_inner: 3849.88 | bwd_allreduce: 7.52 | step: 21.33 5%|▌ | 2704/50750 [7:28:20<79:10:41, 5.93s/it] {'loss': 0.7588, 'learning_rate': 3.994322106995627e-05, 'epoch': 2.66} 5%|▌ | 2704/50750 [7:28:20<79:10:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:11:04,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 00:11:04,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.30 | bwd_microstep: 3852.18 | bwd_inner_microstep: 3844.65 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.23 [2024-11-14 00:11:04,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.30 | bwd: 3852.19 | bwd_inner: 3844.65 | bwd_allreduce: 7.50 | step: 21.24 5%|▌ | 2705/50750 
[7:28:26<79:09:34, 5.93s/it] {'loss': 0.1035, 'learning_rate': 3.994312492087761e-05, 'epoch': 2.67} 5%|▌ | 2705/50750 [7:28:26<79:09:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:11:10,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 00:11:10,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.20 | bwd_microstep: 3849.99 | bwd_inner_microstep: 3842.43 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.03 [2024-11-14 00:11:10,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.20 | bwd: 3850.00 | bwd_inner: 3842.43 | bwd_allreduce: 7.53 | step: 22.04 5%|▌ | 2706/50750 [7:28:32<79:09:30, 5.93s/it] {'loss': 0.7675, 'learning_rate': 3.9943028690574595e-05, 'epoch': 2.67} 5%|▌ | 2706/50750 [7:28:32<79:09:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:11:16,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.92 [2024-11-14 00:11:16,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.86 | bwd_microstep: 3850.67 | bwd_inner_microstep: 3842.94 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.79 [2024-11-14 00:11:16,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.86 | bwd: 3850.69 | bwd_inner: 3842.95 | bwd_allreduce: 7.70 | step: 21.79 5%|▌ | 2707/50750 [7:28:38<79:08:02, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.994293237904764e-05, 'epoch': 2.67} 5%|▌ | 2707/50750 [7:28:38<79:08:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:11:22,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:11:22,584] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2030.41 | bwd_microstep: 3848.85 | bwd_inner_microstep: 3841.25 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.36 [2024-11-14 00:11:22,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.39 | bwd: 3848.86 | bwd_inner: 3841.25 | bwd_allreduce: 7.57 | step: 21.37 5%|▌ | 2708/50750 [7:28:44<79:07:48, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.994283598629711e-05, 'epoch': 2.67} 5%|▌ | 2708/50750 [7:28:44<79:07:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:11:28,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.94 [2024-11-14 00:11:28,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.65 | bwd_microstep: 3851.31 | bwd_inner_microstep: 3843.78 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.80 [2024-11-14 00:11:28,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.63 | bwd: 3851.33 | bwd_inner: 3843.78 | bwd_allreduce: 7.50 | step: 21.80 5%|▌ | 2709/50750 [7:28:50<79:07:39, 5.93s/it] {'loss': 0.9886, 'learning_rate': 3.9942739512323426e-05, 'epoch': 2.67} 5%|▌ | 2709/50750 [7:28:50<79:07:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:11:34,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.66 | optimizer_step: 4.94 [2024-11-14 00:11:34,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.02 | bwd_microstep: 3852.98 | bwd_inner_microstep: 3844.69 | bwd_allreduce_microstep: 8.22 | step_microstep: 29.04 [2024-11-14 00:11:34,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.01 | bwd: 3853.00 | bwd_inner: 3844.69 | bwd_allreduce: 8.25 | step: 29.05 5%|▌ | 2710/50750 [7:28:56<79:09:52, 5.93s/it] {'loss': 0.5962, 'learning_rate': 3.994264295712696e-05, 'epoch': 2.67} 
5%|▌ | 2710/50750 [7:28:56<79:09:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:11:40,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 00:11:40,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.74 | bwd_microstep: 3849.40 | bwd_inner_microstep: 3841.53 | bwd_allreduce_microstep: 7.82 | step_microstep: 21.74 [2024-11-14 00:11:40,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.72 | bwd: 3849.41 | bwd_inner: 3841.53 | bwd_allreduce: 7.84 | step: 21.74 5%|▌ | 2711/50750 [7:29:02<79:08:55, 5.93s/it] {'loss': 0.032, 'learning_rate': 3.9942546320708105e-05, 'epoch': 2.67} 5%|▌ | 2711/50750 [7:29:02<79:08:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:11:46,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.92 [2024-11-14 00:11:46,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.35 | bwd_microstep: 3849.74 | bwd_inner_microstep: 3842.19 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.54 [2024-11-14 00:11:46,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.34 | bwd: 3849.76 | bwd_inner: 3842.19 | bwd_allreduce: 7.53 | step: 21.54 5%|▌ | 2712/50750 [7:29:08<79:09:47, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.994244960306728e-05, 'epoch': 2.67} 5%|▌ | 2712/50750 [7:29:08<79:09:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:11:52,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:11:52,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.27 | bwd_microstep: 3849.90 | bwd_inner_microstep: 3842.36 | 
bwd_allreduce_microstep: 7.49 | step_microstep: 21.12 [2024-11-14 00:11:52,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.27 | bwd: 3849.91 | bwd_inner: 3842.36 | bwd_allreduce: 7.51 | step: 21.12 5%|▌ | 2713/50750 [7:29:14<79:07:53, 5.93s/it] {'loss': 0.0074, 'learning_rate': 3.994235280420485e-05, 'epoch': 2.67} 5%|▌ | 2713/50750 [7:29:14<79:07:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:11:58,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 00:11:58,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.53 | bwd_microstep: 3862.86 | bwd_inner_microstep: 3855.32 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.44 [2024-11-14 00:11:58,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.53 | bwd: 3862.87 | bwd_inner: 3855.32 | bwd_allreduce: 7.51 | step: 21.44 5%|▌ | 2714/50750 [7:29:20<79:10:31, 5.93s/it] {'loss': 0.1837, 'learning_rate': 3.994225592412122e-05, 'epoch': 2.67} 5%|▌ | 2714/50750 [7:29:20<79:10:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:12:04,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:12:04,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.57 | bwd_microstep: 3861.96 | bwd_inner_microstep: 3854.44 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.87 [2024-11-14 00:12:04,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.57 | bwd: 3861.97 | bwd_inner: 3854.44 | bwd_allreduce: 7.49 | step: 20.88 5%|▌ | 2715/50750 [7:29:26<79:10:54, 5.93s/it] {'loss': 0.1306, 'learning_rate': 3.9942158962816785e-05, 'epoch': 2.67} 5%|▌ | 2715/50750 [7:29:26<79:10:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2201 [2024-11-14 00:12:10,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:12:10,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.81 | bwd_microstep: 3858.22 | bwd_inner_microstep: 3850.72 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.05 [2024-11-14 00:12:10,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.81 | bwd: 3858.23 | bwd_inner: 3850.72 | bwd_allreduce: 7.48 | step: 21.06 5%|▌ | 2716/50750 [7:29:32<79:09:48, 5.93s/it] {'loss': 0.13, 'learning_rate': 3.9942061920291944e-05, 'epoch': 2.68} 5%|▌ | 2716/50750 [7:29:32<79:09:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:12:15,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:12:15,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.68 | bwd_microstep: 3857.41 | bwd_inner_microstep: 3849.85 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.56 [2024-11-14 00:12:15,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.68 | bwd: 3857.42 | bwd_inner: 3849.85 | bwd_allreduce: 7.53 | step: 21.56 5%|▌ | 2717/50750 [7:29:37<79:09:26, 5.93s/it] {'loss': 0.0071, 'learning_rate': 3.9941964796547086e-05, 'epoch': 2.68} 5%|▌ | 2717/50750 [7:29:37<79:09:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:12:21,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:12:21,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.80 | bwd_microstep: 3863.05 | bwd_inner_microstep: 3855.49 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.37 [2024-11-14 00:12:21,915] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.80 | bwd: 3863.06 | bwd_inner: 3855.49 | bwd_allreduce: 7.53 | step: 21.38 5%|▌ | 2718/50750 [7:29:43<79:09:53, 5.93s/it] {'loss': 0.2106, 'learning_rate': 3.9941867591582604e-05, 'epoch': 2.68} 5%|▌ | 2718/50750 [7:29:43<79:09:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:12:27,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:12:27,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.73 | bwd_microstep: 3852.26 | bwd_inner_microstep: 3844.58 | bwd_allreduce_microstep: 7.64 | step_microstep: 20.82 [2024-11-14 00:12:27,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.73 | bwd: 3852.27 | bwd_inner: 3844.58 | bwd_allreduce: 7.65 | step: 20.83 5%|▌ | 2719/50750 [7:29:49<79:07:53, 5.93s/it] {'loss': 0.0662, 'learning_rate': 3.9941770305398904e-05, 'epoch': 2.68} 5%|▌ | 2719/50750 [7:29:49<79:07:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:12:33,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:12:33,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.11 | bwd_microstep: 3852.49 | bwd_inner_microstep: 3844.85 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.14 [2024-11-14 00:12:33,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.11 | bwd: 3852.51 | bwd_inner: 3844.85 | bwd_allreduce: 7.62 | step: 21.14 5%|▌ | 2720/50750 [7:29:55<79:07:25, 5.93s/it] {'loss': 0.0102, 'learning_rate': 3.994167293799637e-05, 'epoch': 2.68} 5%|▌ | 2720/50750 [7:29:55<79:07:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:12:39,712] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:12:39,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.11 | bwd_microstep: 3857.39 | bwd_inner_microstep: 3849.32 | bwd_allreduce_microstep: 8.03 | step_microstep: 21.40 [2024-11-14 00:12:39,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.09 | bwd: 3857.41 | bwd_inner: 3849.32 | bwd_allreduce: 8.04 | step: 21.41 5%|▌ | 2721/50750 [7:30:01<79:10:21, 5.93s/it] {'loss': 0.0168, 'learning_rate': 3.994157548937541e-05, 'epoch': 2.68} 5%|▌ | 2721/50750 [7:30:01<79:10:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:12:45,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:12:45,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.72 | bwd_microstep: 3850.79 | bwd_inner_microstep: 3843.14 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.01 [2024-11-14 00:12:45,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.71 | bwd: 3850.80 | bwd_inner: 3843.14 | bwd_allreduce: 7.62 | step: 21.01 5%|▌ | 2722/50750 [7:30:07<79:09:15, 5.93s/it] {'loss': 0.146, 'learning_rate': 3.994147795953642e-05, 'epoch': 2.68} 5%|▌ | 2722/50750 [7:30:07<79:09:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:12:51,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.98 [2024-11-14 00:12:51,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.07 | bwd_microstep: 3846.01 | bwd_inner_microstep: 3838.51 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.27 [2024-11-14 00:12:51,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3846.03 | bwd_inner: 3838.51 | 
bwd_allreduce: 7.48 | step: 21.28 5%|▌ | 2723/50750 [7:30:13<79:05:15, 5.93s/it] {'loss': 0.0632, 'learning_rate': 3.9941380348479785e-05, 'epoch': 2.68} 5%|▌ | 2723/50750 [7:30:13<79:05:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:12:57,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 00:12:57,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.23 | bwd_microstep: 3845.91 | bwd_inner_microstep: 3838.45 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-14 00:12:57,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.23 | bwd: 3845.92 | bwd_inner: 3838.45 | bwd_allreduce: 7.43 | step: 20.91 5%|▌ | 2724/50750 [7:30:19<79:01:33, 5.92s/it] {'loss': 0.5461, 'learning_rate': 3.994128265620592e-05, 'epoch': 2.68} 5%|▌ | 2724/50750 [7:30:19<79:01:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:13:03,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.92 [2024-11-14 00:13:03,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.68 | bwd_microstep: 3847.76 | bwd_inner_microstep: 3840.21 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.26 [2024-11-14 00:13:03,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.68 | bwd: 3847.77 | bwd_inner: 3840.21 | bwd_allreduce: 7.52 | step: 21.27 5%|▌ | 2725/50750 [7:30:25<79:00:00, 5.92s/it] {'loss': 0.004, 'learning_rate': 3.9941184882715206e-05, 'epoch': 2.68} 5%|▌ | 2725/50750 [7:30:25<79:00:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:13:09,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 
[2024-11-14 00:13:09,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.33 | bwd_microstep: 3844.89 | bwd_inner_microstep: 3837.42 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.86 [2024-11-14 00:13:09,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.33 | bwd: 3844.90 | bwd_inner: 3837.42 | bwd_allreduce: 7.44 | step: 20.87 5%|▌ | 2726/50750 [7:30:31<78:57:59, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.9941087028008046e-05, 'epoch': 2.69} 5%|▌ | 2726/50750 [7:30:31<78:57:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:13:15,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:13:15,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.56 | bwd_microstep: 3843.54 | bwd_inner_microstep: 3836.06 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.80 [2024-11-14 00:13:15,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.54 | bwd: 3843.55 | bwd_inner: 3836.06 | bwd_allreduce: 7.46 | step: 20.80 5%|▌ | 2727/50750 [7:30:37<78:56:05, 5.92s/it] {'loss': 0.0185, 'learning_rate': 3.994098909208485e-05, 'epoch': 2.69} 5%|▌ | 2727/50750 [7:30:37<78:56:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:13:21,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-14 00:13:21,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.75 | bwd_microstep: 3851.20 | bwd_inner_microstep: 3843.71 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.94 [2024-11-14 00:13:21,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.75 | bwd: 3851.21 | bwd_inner: 3843.71 | bwd_allreduce: 7.46 | step: 20.94 5%|▌ | 2728/50750 [7:30:43<78:57:27, 5.92s/it] {'loss': 0.0109, 
'learning_rate': 3.9940891074946e-05, 'epoch': 2.69} 5%|▌ | 2728/50750 [7:30:43<78:57:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:13:27,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:13:27,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.89 | bwd_microstep: 3853.35 | bwd_inner_microstep: 3845.87 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.14 [2024-11-14 00:13:27,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.89 | bwd: 3853.37 | bwd_inner: 3845.87 | bwd_allreduce: 7.46 | step: 21.15 5%|▌ | 2729/50750 [7:30:49<79:00:24, 5.92s/it] {'loss': 0.0037, 'learning_rate': 3.994079297659191e-05, 'epoch': 2.69} 5%|▌ | 2729/50750 [7:30:49<79:00:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:13:32,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-14 00:13:32,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.23 | bwd_microstep: 3846.30 | bwd_inner_microstep: 3838.71 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.46 [2024-11-14 00:13:32,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.23 | bwd: 3846.31 | bwd_inner: 3838.71 | bwd_allreduce: 7.56 | step: 21.46 5%|▌ | 2730/50750 [7:30:54<79:00:20, 5.92s/it] {'loss': 0.4126, 'learning_rate': 3.994069479702296e-05, 'epoch': 2.69} 5%|▌ | 2730/50750 [7:30:54<79:00:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:13:38,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 00:13:38,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.92 | 
bwd_microstep: 3844.06 | bwd_inner_microstep: 3836.35 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.36 [2024-11-14 00:13:38,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.92 | bwd: 3844.07 | bwd_inner: 3836.35 | bwd_allreduce: 7.69 | step: 21.36 5%|▌ | 2731/50750 [7:31:00<78:58:48, 5.92s/it] {'loss': 0.0165, 'learning_rate': 3.994059653623958e-05, 'epoch': 2.69} 5%|▌ | 2731/50750 [7:31:00<78:58:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:13:44,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 00:13:44,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3846.56 | bwd_inner_microstep: 3838.97 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.16 [2024-11-14 00:13:44,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3846.57 | bwd_inner: 3838.97 | bwd_allreduce: 7.56 | step: 22.17 5%|▌ | 2732/50750 [7:31:06<78:58:40, 5.92s/it] {'loss': 0.0062, 'learning_rate': 3.994049819424215e-05, 'epoch': 2.69} 5%|▌ | 2732/50750 [7:31:06<78:58:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:13:50,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 5.09 [2024-11-14 00:13:50,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.30 | bwd_microstep: 3846.96 | bwd_inner_microstep: 3839.32 | bwd_allreduce_microstep: 7.60 | step_microstep: 22.05 [2024-11-14 00:13:50,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.27 | bwd: 3846.97 | bwd_inner: 3839.32 | bwd_allreduce: 7.62 | step: 22.05 5%|▌ | 2733/50750 [7:31:12<79:01:04, 5.92s/it] {'loss': 0.0076, 'learning_rate': 3.994039977103107e-05, 'epoch': 2.69} 5%|▌ | 2733/50750 [7:31:12<79:01:04, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:13:56,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:13:56,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.34 | bwd_microstep: 3843.83 | bwd_inner_microstep: 3836.29 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.18 [2024-11-14 00:13:56,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.32 | bwd: 3843.84 | bwd_inner: 3836.29 | bwd_allreduce: 7.52 | step: 21.18 5%|▌ | 2734/50750 [7:31:18<78:59:57, 5.92s/it] {'loss': 0.1159, 'learning_rate': 3.994030126660674e-05, 'epoch': 2.69} 5%|▌ | 2734/50750 [7:31:18<78:59:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:14:02,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 00:14:02,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1991.72 | bwd_microstep: 3790.40 | bwd_inner_microstep: 3782.57 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.40 [2024-11-14 00:14:02,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1991.72 | bwd: 3790.41 | bwd_inner: 3782.57 | bwd_allreduce: 7.80 | step: 21.41 5%|▌ | 2735/50750 [7:31:24<78:37:35, 5.90s/it] {'loss': 0.505, 'learning_rate': 3.9940202680969576e-05, 'epoch': 2.69} 5%|▌ | 2735/50750 [7:31:24<78:37:35, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:14:08,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:14:08,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.92 | bwd_microstep: 3850.81 | bwd_inner_microstep: 3843.29 | bwd_allreduce_microstep: 7.47 | 
step_microstep: 21.49 [2024-11-14 00:14:08,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.91 | bwd: 3850.82 | bwd_inner: 3843.29 | bwd_allreduce: 7.49 | step: 21.49 5%|▌ | 2736/50750 [7:31:30<78:45:29, 5.91s/it] {'loss': 0.4721, 'learning_rate': 3.9940104014119956e-05, 'epoch': 2.7} 5%|▌ | 2736/50750 [7:31:30<78:45:29, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:14:14,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 00:14:14,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.91 | bwd_microstep: 3845.48 | bwd_inner_microstep: 3837.96 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.36 [2024-11-14 00:14:14,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.91 | bwd: 3845.49 | bwd_inner: 3837.96 | bwd_allreduce: 7.49 | step: 21.37 5%|▌ | 2737/50750 [7:31:36<78:48:33, 5.91s/it] {'loss': 0.0038, 'learning_rate': 3.994000526605831e-05, 'epoch': 2.7} 5%|▌ | 2737/50750 [7:31:36<78:48:33, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:14:20,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:14:20,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.96 | bwd_microstep: 3843.76 | bwd_inner_microstep: 3836.07 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.83 [2024-11-14 00:14:20,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.96 | bwd: 3843.78 | bwd_inner: 3836.07 | bwd_allreduce: 7.67 | step: 21.82 5%|▌ | 2738/50750 [7:31:42<78:50:08, 5.91s/it] {'loss': 0.3917, 'learning_rate': 3.993990643678501e-05, 'epoch': 2.7} 5%|▌ | 2738/50750 [7:31:42<78:50:08, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 
00:14:26,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 00:14:26,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.70 | bwd_microstep: 3849.91 | bwd_inner_microstep: 3842.26 | bwd_allreduce_microstep: 7.59 | step_microstep: 24.17 [2024-11-14 00:14:26,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.68 | bwd: 3849.92 | bwd_inner: 3842.26 | bwd_allreduce: 7.61 | step: 24.17 5%|▌ | 2739/50750 [7:31:48<78:54:05, 5.92s/it] {'loss': 0.0092, 'learning_rate': 3.993980752630049e-05, 'epoch': 2.7} 5%|▌ | 2739/50750 [7:31:48<78:54:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:14:32,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.39 | optimizer_step: 4.93 [2024-11-14 00:14:32,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.19 | bwd_microstep: 3851.23 | bwd_inner_microstep: 3843.30 | bwd_allreduce_microstep: 7.87 | step_microstep: 22.46 [2024-11-14 00:14:32,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3851.24 | bwd_inner: 3843.30 | bwd_allreduce: 7.89 | step: 22.46 5%|▌ | 2740/50750 [7:31:54<78:56:56, 5.92s/it] {'loss': 0.0358, 'learning_rate': 3.9939708534605124e-05, 'epoch': 2.7} 5%|▌ | 2740/50750 [7:31:54<78:56:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:14:38,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-14 00:14:38,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.32 | bwd_microstep: 3851.46 | bwd_inner_microstep: 3843.89 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.97 [2024-11-14 00:14:38,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2025.32 | bwd: 3851.47 | bwd_inner: 3843.89 | bwd_allreduce: 7.53 | step: 21.98 5%|▌ | 2741/50750 [7:32:00<78:58:35, 5.92s/it] {'loss': 0.062, 'learning_rate': 3.9939609461699334e-05, 'epoch': 2.7} 5%|▌ | 2741/50750 [7:32:00<78:58:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:14:43,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 00:14:43,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.74 | bwd_microstep: 3843.00 | bwd_inner_microstep: 3835.38 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.66 [2024-11-14 00:14:43,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.74 | bwd: 3843.01 | bwd_inner: 3835.38 | bwd_allreduce: 7.59 | step: 21.67 5%|▌ | 2742/50750 [7:32:05<78:58:04, 5.92s/it] {'loss': 0.3039, 'learning_rate': 3.9939510307583515e-05, 'epoch': 2.7} 5%|▌ | 2742/50750 [7:32:05<78:58:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:14:49,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:14:49,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.11 | bwd_microstep: 3841.85 | bwd_inner_microstep: 3834.30 | bwd_allreduce_microstep: 7.51 | step_microstep: 20.88 [2024-11-14 00:14:49,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.09 | bwd: 3841.86 | bwd_inner: 3834.29 | bwd_allreduce: 7.53 | step: 20.89 5%|▌ | 2743/50750 [7:32:11<78:57:28, 5.92s/it] {'loss': 0.1573, 'learning_rate': 3.9939411072258066e-05, 'epoch': 2.7} 5%|▌ | 2743/50750 [7:32:11<78:57:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:14:55,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:14:55,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.77 | bwd_microstep: 3845.54 | bwd_inner_microstep: 3837.76 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.83 [2024-11-14 00:14:55,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.76 | bwd: 3845.56 | bwd_inner: 3837.76 | bwd_allreduce: 7.75 | step: 21.83 5%|▌ | 2744/50750 [7:32:17<78:59:34, 5.92s/it] {'loss': 0.0374, 'learning_rate': 3.993931175572341e-05, 'epoch': 2.7} 5%|▌ | 2744/50750 [7:32:17<78:59:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:15:01,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 00:15:01,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.74 | bwd_microstep: 3846.51 | bwd_inner_microstep: 3838.87 | bwd_allreduce_microstep: 7.59 | step_microstep: 22.09 [2024-11-14 00:15:01,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.73 | bwd: 3846.52 | bwd_inner: 3838.87 | bwd_allreduce: 7.60 | step: 22.09 5%|▌ | 2745/50750 [7:32:23<79:00:03, 5.92s/it] {'loss': 0.0065, 'learning_rate': 3.9939212357979933e-05, 'epoch': 2.7} 5%|▌ | 2745/50750 [7:32:23<79:00:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:15:07,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:15:07,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.48 | bwd_microstep: 3845.66 | bwd_inner_microstep: 3838.06 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.33 [2024-11-14 00:15:07,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.46 | bwd: 3845.68 | bwd_inner: 3838.06 | bwd_allreduce: 7.57 | step: 21.33 5%|▌ | 2746/50750 
[7:32:29<78:58:17, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.993911287902805e-05, 'epoch': 2.71} 5%|▌ | 2746/50750 [7:32:29<78:58:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2193 [2024-11-14 00:15:13,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:15:13,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.44 | bwd_microstep: 3854.53 | bwd_inner_microstep: 3846.93 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.43 [2024-11-14 00:15:13,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.44 | bwd: 3854.55 | bwd_inner: 3846.93 | bwd_allreduce: 7.58 | step: 21.43 5%|▌ | 2747/50750 [7:32:35<78:58:57, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.993901331886816e-05, 'epoch': 2.71} 5%|▌ | 2747/50750 [7:32:35<78:58:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:15:19,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 00:15:19,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.23 | bwd_microstep: 3847.84 | bwd_inner_microstep: 3840.24 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.73 [2024-11-14 00:15:19,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3847.85 | bwd_inner: 3840.24 | bwd_allreduce: 7.57 | step: 21.73 5%|▌ | 2748/50750 [7:32:41<78:58:44, 5.92s/it] {'loss': 0.4816, 'learning_rate': 3.993891367750067e-05, 'epoch': 2.71} 5%|▌ | 2748/50750 [7:32:41<78:58:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:15:25,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:15:25,446] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2024.33 | bwd_microstep: 3848.36 | bwd_inner_microstep: 3840.77 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.54 [2024-11-14 00:15:25,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.32 | bwd: 3848.37 | bwd_inner: 3840.77 | bwd_allreduce: 7.56 | step: 21.54 5%|▌ | 2749/50750 [7:32:47<78:58:32, 5.92s/it] {'loss': 0.036, 'learning_rate': 3.993881395492599e-05, 'epoch': 2.71} 5%|▌ | 2749/50750 [7:32:47<78:58:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:15:31,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:15:31,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.51 | bwd_microstep: 3847.69 | bwd_inner_microstep: 3840.11 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.43 [2024-11-14 00:15:31,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.48 | bwd: 3847.71 | bwd_inner: 3840.11 | bwd_allreduce: 7.55 | step: 21.43 5%|▌ | 2750/50750 [7:32:53<78:57:46, 5.92s/it] {'loss': 0.058, 'learning_rate': 3.993871415114452e-05, 'epoch': 2.71} 5%|▌ | 2750/50750 [7:32:53<78:57:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:15:37,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:15:37,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.04 | bwd_microstep: 3847.34 | bwd_inner_microstep: 3839.77 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.29 [2024-11-14 00:15:37,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3847.35 | bwd_inner: 3839.77 | bwd_allreduce: 7.54 | step: 21.31 5%|▌ | 2751/50750 [7:32:59<78:57:20, 5.92s/it] {'loss': 0.0057, 'learning_rate': 3.9938614266156674e-05, 'epoch': 2.71} 5%|▌ | 
2751/50750 [7:32:59<78:57:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:15:43,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 4.93 [2024-11-14 00:15:43,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.87 | bwd_microstep: 3850.57 | bwd_inner_microstep: 3842.88 | bwd_allreduce_microstep: 7.63 | step_microstep: 23.33 [2024-11-14 00:15:43,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.87 | bwd: 3850.58 | bwd_inner: 3842.88 | bwd_allreduce: 7.66 | step: 23.34 5%|▌ | 2752/50750 [7:33:05<78:59:13, 5.92s/it] {'loss': 0.0644, 'learning_rate': 3.9938514299962856e-05, 'epoch': 2.71} 5%|▌ | 2752/50750 [7:33:05<78:59:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:15:49,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 00:15:49,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.78 | bwd_microstep: 3855.91 | bwd_inner_microstep: 3847.63 | bwd_allreduce_microstep: 8.23 | step_microstep: 22.04 [2024-11-14 00:15:49,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.76 | bwd: 3855.93 | bwd_inner: 3847.64 | bwd_allreduce: 8.25 | step: 22.04 5%|▌ | 2753/50750 [7:33:11<79:03:44, 5.93s/it] {'loss': 0.1288, 'learning_rate': 3.993841425256347e-05, 'epoch': 2.71} 5%|▌ | 2753/50750 [7:33:11<79:03:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:15:55,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:15:55,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.30 | bwd_microstep: 3855.00 | bwd_inner_microstep: 3847.52 | 
bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-14 00:15:55,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.30 | bwd: 3855.01 | bwd_inner: 3847.52 | bwd_allreduce: 7.45 | step: 20.86 5%|▌ | 2754/50750 [7:33:17<79:03:55, 5.93s/it] {'loss': 0.0064, 'learning_rate': 3.993831412395892e-05, 'epoch': 2.71} 5%|▌ | 2754/50750 [7:33:17<79:03:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:16:01,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 00:16:01,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.98 | bwd_microstep: 3851.51 | bwd_inner_microstep: 3843.70 | bwd_allreduce_microstep: 7.75 | step_microstep: 25.13 [2024-11-14 00:16:01,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.98 | bwd: 3851.53 | bwd_inner: 3843.70 | bwd_allreduce: 7.78 | step: 25.14 5%|▌ | 2755/50750 [7:33:22<79:03:14, 5.93s/it] {'loss': 0.3034, 'learning_rate': 3.993821391414962e-05, 'epoch': 2.71} 5%|▌ | 2755/50750 [7:33:22<79:03:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:16:06,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:16:06,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.34 | bwd_microstep: 3849.81 | bwd_inner_microstep: 3842.33 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.13 [2024-11-14 00:16:06,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.34 | bwd: 3849.82 | bwd_inner: 3842.33 | bwd_allreduce: 7.45 | step: 21.13 5%|▌ | 2756/50750 [7:33:28<79:02:24, 5.93s/it] {'loss': 0.5344, 'learning_rate': 3.993811362313598e-05, 'epoch': 2.72} 5%|▌ | 2756/50750 [7:33:28<79:02:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-14 00:16:12,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:16:12,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.35 | bwd_microstep: 3849.93 | bwd_inner_microstep: 3842.39 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.57 [2024-11-14 00:16:12,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.35 | bwd: 3849.95 | bwd_inner: 3842.39 | bwd_allreduce: 7.52 | step: 21.57 5%|▌ | 2757/50750 [7:33:34<79:01:01, 5.93s/it] {'loss': 0.0052, 'learning_rate': 3.9938013250918405e-05, 'epoch': 2.72} 5%|▌ | 2757/50750 [7:33:34<79:01:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:16:18,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 00:16:18,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.90 | bwd_microstep: 3858.63 | bwd_inner_microstep: 3851.00 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.20 [2024-11-14 00:16:18,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.90 | bwd: 3858.65 | bwd_inner: 3851.00 | bwd_allreduce: 7.61 | step: 21.20 5%|▌ | 2758/50750 [7:33:40<79:02:22, 5.93s/it] {'loss': 0.2813, 'learning_rate': 3.9937912797497304e-05, 'epoch': 2.72} 5%|▌ | 2758/50750 [7:33:40<79:02:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:16:24,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:16:24,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.68 | bwd_microstep: 3849.77 | bwd_inner_microstep: 3842.07 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.26 [2024-11-14 00:16:24,731] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.68 | bwd: 3849.79 | bwd_inner: 3842.07 | bwd_allreduce: 7.67 | step: 22.26 5%|▌ | 2759/50750 [7:33:46<79:01:36, 5.93s/it] {'loss': 0.0243, 'learning_rate': 3.9937812262873086e-05, 'epoch': 2.72} 5%|▌ | 2759/50750 [7:33:46<79:01:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:16:30,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 00:16:30,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.05 | bwd_microstep: 3860.66 | bwd_inner_microstep: 3852.89 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.89 [2024-11-14 00:16:30,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.04 | bwd: 3860.67 | bwd_inner: 3852.89 | bwd_allreduce: 7.74 | step: 21.90 5%|▌ | 2760/50750 [7:33:52<79:03:34, 5.93s/it] {'loss': 0.1401, 'learning_rate': 3.993771164704616e-05, 'epoch': 2.72} 5%|▌ | 2760/50750 [7:33:52<79:03:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:16:36,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-14 00:16:36,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.89 | bwd_microstep: 3854.48 | bwd_inner_microstep: 3846.74 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.45 [2024-11-14 00:16:36,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3854.49 | bwd_inner: 3846.74 | bwd_allreduce: 7.71 | step: 21.45 5%|▌ | 2761/50750 [7:33:58<79:02:58, 5.93s/it] {'loss': 0.0566, 'learning_rate': 3.9937610950016944e-05, 'epoch': 2.72} 5%|▌ | 2761/50750 [7:33:58<79:02:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:16:42,535] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.36 | optimizer_step: 4.92 [2024-11-14 00:16:42,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.61 | bwd_microstep: 3850.42 | bwd_inner_microstep: 3842.62 | bwd_allreduce_microstep: 7.75 | step_microstep: 26.37 [2024-11-14 00:16:42,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.59 | bwd: 3850.43 | bwd_inner: 3842.62 | bwd_allreduce: 7.77 | step: 26.37 5%|▌ | 2762/50750 [7:34:04<79:04:54, 5.93s/it] {'loss': 0.0061, 'learning_rate': 3.9937510171785826e-05, 'epoch': 2.72} 5%|▌ | 2762/50750 [7:34:04<79:04:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:16:48,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:16:48,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.88 | bwd_microstep: 3847.37 | bwd_inner_microstep: 3839.84 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.02 [2024-11-14 00:16:48,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.86 | bwd: 3847.39 | bwd_inner: 3839.84 | bwd_allreduce: 7.50 | step: 21.02 5%|▌ | 2763/50750 [7:34:10<79:03:07, 5.93s/it] {'loss': 0.0098, 'learning_rate': 3.9937409312353245e-05, 'epoch': 2.72} 5%|▌ | 2763/50750 [7:34:10<79:03:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:16:54,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.94 [2024-11-14 00:16:54,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.00 | bwd_microstep: 3847.48 | bwd_inner_microstep: 3839.96 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.19 [2024-11-14 00:16:54,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.00 | bwd: 3847.49 | bwd_inner: 3839.96 | 
bwd_allreduce: 7.49 | step: 21.19 5%|▌ | 2764/50750 [7:34:16<79:00:37, 5.93s/it] {'loss': 0.2785, 'learning_rate': 3.993730837171959e-05, 'epoch': 2.72} 5%|▌ | 2764/50750 [7:34:16<79:00:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:17:00,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:17:00,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.56 | bwd_microstep: 3846.45 | bwd_inner_microstep: 3838.74 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.62 [2024-11-14 00:17:00,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.56 | bwd: 3846.46 | bwd_inner: 3838.74 | bwd_allreduce: 7.68 | step: 21.62 5%|▌ | 2765/50750 [7:34:22<78:57:23, 5.92s/it] {'loss': 0.4158, 'learning_rate': 3.9937207349885275e-05, 'epoch': 2.72} 5%|▌ | 2765/50750 [7:34:22<78:57:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:17:06,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 00:17:06,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.11 | bwd_microstep: 3848.36 | bwd_inner_microstep: 3840.68 | bwd_allreduce_microstep: 7.63 | step_microstep: 24.59 [2024-11-14 00:17:06,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.10 | bwd: 3848.38 | bwd_inner: 3840.68 | bwd_allreduce: 7.65 | step: 24.60 5%|▌ | 2766/50750 [7:34:28<78:58:40, 5.93s/it] {'loss': 0.22, 'learning_rate': 3.993710624685073e-05, 'epoch': 2.73} 5%|▌ | 2766/50750 [7:34:28<78:58:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:17:12,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 
[2024-11-14 00:17:12,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.83 | bwd_microstep: 3850.74 | bwd_inner_microstep: 3843.23 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.18 [2024-11-14 00:17:12,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.83 | bwd: 3850.75 | bwd_inner: 3843.23 | bwd_allreduce: 7.48 | step: 21.19 5%|▌ | 2767/50750 [7:34:34<78:58:38, 5.93s/it] {'loss': 0.201, 'learning_rate': 3.993700506261635e-05, 'epoch': 2.73} 5%|▌ | 2767/50750 [7:34:34<78:58:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:17:18,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 5.02 [2024-11-14 00:17:18,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.97 | bwd_microstep: 3858.45 | bwd_inner_microstep: 3850.91 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.87 [2024-11-14 00:17:18,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.97 | bwd: 3858.46 | bwd_inner: 3850.91 | bwd_allreduce: 7.51 | step: 22.88 5%|▌ | 2768/50750 [7:34:40<79:03:21, 5.93s/it] {'loss': 0.0221, 'learning_rate': 3.9936903797182546e-05, 'epoch': 2.73} 5%|▌ | 2768/50750 [7:34:40<79:03:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:17:24,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:17:24,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.96 | bwd_microstep: 3855.33 | bwd_inner_microstep: 3847.80 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.10 [2024-11-14 00:17:24,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.96 | bwd: 3855.34 | bwd_inner: 3847.80 | bwd_allreduce: 7.50 | step: 21.10 5%|▌ | 2769/50750 [7:34:45<79:02:12, 5.93s/it] {'loss': 0.0002, 
'learning_rate': 3.993680245054973e-05, 'epoch': 2.73} 5%|▌ | 2769/50750 [7:34:45<79:02:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:17:29,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:17:29,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.35 | bwd_microstep: 3848.70 | bwd_inner_microstep: 3841.17 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-14 00:17:29,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.35 | bwd: 3848.71 | bwd_inner: 3841.17 | bwd_allreduce: 7.50 | step: 21.21 5%|▌ | 2770/50750 [7:34:51<78:59:47, 5.93s/it] {'loss': 0.0731, 'learning_rate': 3.9936701022718334e-05, 'epoch': 2.73} 5%|▌ | 2770/50750 [7:34:51<78:59:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:17:35,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:17:35,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3851.80 | bwd_inner_microstep: 3844.30 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.12 [2024-11-14 00:17:35,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.64 | bwd: 3851.81 | bwd_inner: 3844.30 | bwd_allreduce: 7.47 | step: 21.13 5%|▌ | 2771/50750 [7:34:57<78:58:49, 5.93s/it] {'loss': 0.2952, 'learning_rate': 3.9936599513688745e-05, 'epoch': 2.73} 5%|▌ | 2771/50750 [7:34:57<78:58:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:17:41,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:17:41,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.46 | 
bwd_microstep: 3855.73 | bwd_inner_microstep: 3848.18 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.38 [2024-11-14 00:17:41,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.46 | bwd: 3855.75 | bwd_inner: 3848.18 | bwd_allreduce: 7.53 | step: 21.38 5%|▌ | 2772/50750 [7:35:03<78:59:41, 5.93s/it] {'loss': 0.5146, 'learning_rate': 3.99364979234614e-05, 'epoch': 2.73} 5%|▌ | 2772/50750 [7:35:03<78:59:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:17:47,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:17:47,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.71 | bwd_microstep: 3850.65 | bwd_inner_microstep: 3843.07 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.87 [2024-11-14 00:17:47,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.71 | bwd: 3850.66 | bwd_inner: 3843.07 | bwd_allreduce: 7.54 | step: 21.88 5%|▌ | 2773/50750 [7:35:09<79:00:28, 5.93s/it] {'loss': 0.1211, 'learning_rate': 3.993639625203669e-05, 'epoch': 2.73} 5%|▌ | 2773/50750 [7:35:09<79:00:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:17:53,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 00:17:53,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.89 | bwd_microstep: 3853.38 | bwd_inner_microstep: 3845.82 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.31 [2024-11-14 00:17:53,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.88 | bwd: 3853.40 | bwd_inner: 3845.82 | bwd_allreduce: 7.53 | step: 22.32 5%|▌ | 2774/50750 [7:35:15<79:02:26, 5.93s/it] {'loss': 0.1654, 'learning_rate': 3.993629449941504e-05, 'epoch': 2.73} 5%|▌ | 2774/50750 [7:35:15<79:02:26, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:17:59,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:17:59,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.69 | bwd_microstep: 3850.57 | bwd_inner_microstep: 3843.09 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.38 [2024-11-14 00:17:59,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.67 | bwd: 3850.59 | bwd_inner: 3843.09 | bwd_allreduce: 7.46 | step: 21.39 5%|▌ | 2775/50750 [7:35:21<79:02:26, 5.93s/it] {'loss': 0.0588, 'learning_rate': 3.993619266559687e-05, 'epoch': 2.73} 5%|▌ | 2775/50750 [7:35:21<79:02:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:18:05,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:18:05,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.51 | bwd_microstep: 3859.53 | bwd_inner_microstep: 3852.00 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.27 [2024-11-14 00:18:05,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.50 | bwd: 3859.54 | bwd_inner: 3852.00 | bwd_allreduce: 7.50 | step: 21.28 5%|▌ | 2776/50750 [7:35:27<79:03:39, 5.93s/it] {'loss': 0.1094, 'learning_rate': 3.993609075058259e-05, 'epoch': 2.73} 5%|▌ | 2776/50750 [7:35:27<79:03:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:18:11,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:18:11,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.49 | bwd_microstep: 3856.26 | bwd_inner_microstep: 3848.78 | bwd_allreduce_microstep: 7.43 | 
step_microstep: 20.98 [2024-11-14 00:18:11,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.47 | bwd: 3856.27 | bwd_inner: 3848.78 | bwd_allreduce: 7.45 | step: 20.98 5%|▌ | 2777/50750 [7:35:33<79:03:40, 5.93s/it] {'loss': 0.0255, 'learning_rate': 3.9935988754372604e-05, 'epoch': 2.74} 5%|▌ | 2777/50750 [7:35:33<79:03:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:18:17,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:18:17,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.50 | bwd_microstep: 3855.59 | bwd_inner_microstep: 3847.86 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.56 [2024-11-14 00:18:17,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.50 | bwd: 3855.61 | bwd_inner: 3847.86 | bwd_allreduce: 7.70 | step: 22.57 5%|▌ | 2778/50750 [7:35:39<79:02:50, 5.93s/it] {'loss': 0.0152, 'learning_rate': 3.9935886676967346e-05, 'epoch': 2.74} 5%|▌ | 2778/50750 [7:35:39<79:02:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:18:23,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:18:23,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.81 | bwd_microstep: 3855.90 | bwd_inner_microstep: 3848.09 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.94 [2024-11-14 00:18:23,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.81 | bwd: 3855.93 | bwd_inner: 3848.09 | bwd_allreduce: 7.78 | step: 21.94 5%|▌ | 2779/50750 [7:35:45<79:04:43, 5.93s/it] {'loss': 0.044, 'learning_rate': 3.9935784518367224e-05, 'epoch': 2.74} 5%|▌ | 2779/50750 [7:35:45<79:04:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 
00:18:29,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.37 | optimizer_step: 4.92 [2024-11-14 00:18:29,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.21 | bwd_microstep: 3859.70 | bwd_inner_microstep: 3852.20 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.44 [2024-11-14 00:18:29,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.21 | bwd: 3859.71 | bwd_inner: 3852.20 | bwd_allreduce: 7.48 | step: 22.44 5%|▌ | 2780/50750 [7:35:51<79:06:05, 5.94s/it] {'loss': 0.2333, 'learning_rate': 3.9935682278572645e-05, 'epoch': 2.74} 5%|▌ | 2780/50750 [7:35:51<79:06:05, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:18:35,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:18:35,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.94 | bwd_microstep: 3861.12 | bwd_inner_microstep: 3853.60 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.95 [2024-11-14 00:18:35,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.93 | bwd: 3861.13 | bwd_inner: 3853.60 | bwd_allreduce: 7.49 | step: 20.95 5%|▌ | 2781/50750 [7:35:57<79:06:27, 5.94s/it] {'loss': 0.023, 'learning_rate': 3.993557995758404e-05, 'epoch': 2.74} 5%|▌ | 2781/50750 [7:35:57<79:06:27, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:18:41,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:18:41,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.43 | bwd_microstep: 3858.72 | bwd_inner_microstep: 3851.24 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.97 [2024-11-14 00:18:41,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2028.43 | bwd: 3858.73 | bwd_inner: 3851.23 | bwd_allreduce: 7.46 | step: 20.98 5%|▌ | 2782/50750 [7:36:03<79:05:37, 5.94s/it] {'loss': 0.0044, 'learning_rate': 3.993547755540182e-05, 'epoch': 2.74} 5%|▌ | 2782/50750 [7:36:03<79:05:37, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:18:47,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:18:47,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.20 | bwd_microstep: 3847.40 | bwd_inner_microstep: 3839.84 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.63 [2024-11-14 00:18:47,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.21 | bwd: 3847.42 | bwd_inner: 3839.84 | bwd_allreduce: 7.54 | step: 21.63 5%|▌ | 2783/50750 [7:36:09<79:03:02, 5.93s/it] {'loss': 0.2809, 'learning_rate': 3.9935375072026405e-05, 'epoch': 2.74} 5%|▌ | 2783/50750 [7:36:09<79:03:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:18:53,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:18:53,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.55 | bwd_microstep: 3852.17 | bwd_inner_microstep: 3844.45 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.21 [2024-11-14 00:18:53,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.55 | bwd: 3852.18 | bwd_inner: 3844.45 | bwd_allreduce: 7.69 | step: 21.21 5%|▌ | 2784/50750 [7:36:14<79:01:40, 5.93s/it] {'loss': 0.718, 'learning_rate': 3.9935272507458205e-05, 'epoch': 2.74} 5%|▌ | 2784/50750 [7:36:14<79:01:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:18:58,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:18:58,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.20 | bwd_microstep: 3846.49 | bwd_inner_microstep: 3839.01 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.94 [2024-11-14 00:18:58,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.20 | bwd: 3846.50 | bwd_inner: 3839.01 | bwd_allreduce: 7.45 | step: 20.95 5%|▌ | 2785/50750 [7:36:20<78:58:30, 5.93s/it] {'loss': 0.1866, 'learning_rate': 3.993516986169765e-05, 'epoch': 2.74} 5%|▌ | 2785/50750 [7:36:20<78:58:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:19:04,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:19:04,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.88 | bwd_microstep: 3849.67 | bwd_inner_microstep: 3842.18 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.85 [2024-11-14 00:19:04,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.87 | bwd: 3849.68 | bwd_inner: 3842.18 | bwd_allreduce: 7.46 | step: 20.85 5%|▌ | 2786/50750 [7:36:26<78:57:01, 5.93s/it] {'loss': 0.0079, 'learning_rate': 3.9935067134745133e-05, 'epoch': 2.74} 5%|▌ | 2786/50750 [7:36:26<78:57:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:19:10,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:19:10,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.98 | bwd_microstep: 3852.81 | bwd_inner_microstep: 3845.33 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.80 [2024-11-14 00:19:10,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.98 | bwd: 3852.82 | bwd_inner: 3845.33 | bwd_allreduce: 7.45 | step: 20.80 5%|▌ | 
2787/50750 [7:36:32<78:56:53, 5.93s/it] {'loss': 0.0062, 'learning_rate': 3.99349643266011e-05, 'epoch': 2.75} 5%|▌ | 2787/50750 [7:36:32<78:56:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:19:16,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:19:16,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.58 | bwd_microstep: 3854.17 | bwd_inner_microstep: 3846.70 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.84 [2024-11-14 00:19:16,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.58 | bwd: 3854.18 | bwd_inner: 3846.70 | bwd_allreduce: 7.44 | step: 20.84 5%|▌ | 2788/50750 [7:36:38<78:56:45, 5.93s/it] {'loss': 0.473, 'learning_rate': 3.9934861437265956e-05, 'epoch': 2.75} 5%|▌ | 2788/50750 [7:36:38<78:56:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:19:22,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:19:22,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.41 | bwd_microstep: 3845.53 | bwd_inner_microstep: 3838.03 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.95 [2024-11-14 00:19:22,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.41 | bwd: 3845.54 | bwd_inner: 3838.03 | bwd_allreduce: 7.47 | step: 20.95 5%|▌ | 2789/50750 [7:36:44<78:54:52, 5.92s/it] {'loss': 0.0043, 'learning_rate': 3.9934758466740125e-05, 'epoch': 2.75} 5%|▌ | 2789/50750 [7:36:44<78:54:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:19:28,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:19:28,536] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.48 | bwd_microstep: 3846.78 | bwd_inner_microstep: 3839.30 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.90 [2024-11-14 00:19:28,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.48 | bwd: 3846.79 | bwd_inner: 3839.30 | bwd_allreduce: 7.45 | step: 20.90 5%|▌ | 2790/50750 [7:36:50<78:54:41, 5.92s/it] {'loss': 0.0112, 'learning_rate': 3.993465541502402e-05, 'epoch': 2.75} 5%|▌ | 2790/50750 [7:36:50<78:54:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:19:34,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:19:34,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.73 | bwd_microstep: 3867.37 | bwd_inner_microstep: 3859.90 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.82 [2024-11-14 00:19:34,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.71 | bwd: 3867.38 | bwd_inner: 3859.90 | bwd_allreduce: 7.44 | step: 20.83 5%|▌ | 2791/50750 [7:36:56<78:58:09, 5.93s/it] {'loss': 0.0254, 'learning_rate': 3.993455228211807e-05, 'epoch': 2.75} 5%|▌ | 2791/50750 [7:36:56<78:58:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:19:40,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:19:40,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.02 | bwd_microstep: 3852.61 | bwd_inner_microstep: 3845.15 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.08 [2024-11-14 00:19:40,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.02 | bwd: 3852.62 | bwd_inner: 3845.15 | bwd_allreduce: 7.44 | step: 21.08 6%|▌ | 2792/50750 [7:37:02<78:56:37, 5.93s/it] {'loss': 0.0018, 'learning_rate': 
3.993444906802269e-05, 'epoch': 2.75} 6%|▌ | 2792/50750 [7:37:02<78:56:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:19:46,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:19:46,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.83 | bwd_microstep: 3845.93 | bwd_inner_microstep: 3838.45 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.02 [2024-11-14 00:19:46,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.83 | bwd: 3845.94 | bwd_inner: 3838.45 | bwd_allreduce: 7.45 | step: 21.02 6%|▌ | 2793/50750 [7:37:08<78:55:04, 5.92s/it] {'loss': 0.0119, 'learning_rate': 3.9934345772738296e-05, 'epoch': 2.75} 6%|▌ | 2793/50750 [7:37:08<78:55:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:19:52,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 00:19:52,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.07 | bwd_microstep: 3851.74 | bwd_inner_microstep: 3844.27 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.97 [2024-11-14 00:19:52,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.07 | bwd: 3851.75 | bwd_inner: 3844.27 | bwd_allreduce: 7.44 | step: 20.98 6%|▌ | 2794/50750 [7:37:14<78:54:38, 5.92s/it] {'loss': 0.3595, 'learning_rate': 3.993424239626532e-05, 'epoch': 2.75} 6%|▌ | 2794/50750 [7:37:14<78:54:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:19:58,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:19:58,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.68 | bwd_microstep: 
3846.10 | bwd_inner_microstep: 3838.63 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.91 [2024-11-14 00:19:58,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.68 | bwd: 3846.11 | bwd_inner: 3838.63 | bwd_allreduce: 7.44 | step: 20.91 6%|▌ | 2795/50750 [7:37:20<78:52:46, 5.92s/it] {'loss': 0.0764, 'learning_rate': 3.993413893860417e-05, 'epoch': 2.75} 6%|▌ | 2795/50750 [7:37:20<78:52:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:20:04,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:20:04,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3849.83 | bwd_inner_microstep: 3842.24 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.72 [2024-11-14 00:20:04,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.24 | bwd: 3849.84 | bwd_inner: 3842.24 | bwd_allreduce: 7.56 | step: 22.74 6%|▌ | 2796/50750 [7:37:26<78:52:50, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.993403539975528e-05, 'epoch': 2.75} 6%|▌ | 2796/50750 [7:37:26<78:52:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:20:09,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 00:20:09,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.08 | bwd_microstep: 3846.44 | bwd_inner_microstep: 3838.97 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.83 [2024-11-14 00:20:09,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.08 | bwd: 3846.45 | bwd_inner: 3838.97 | bwd_allreduce: 7.44 | step: 20.83 6%|▌ | 2797/50750 [7:37:31<78:51:45, 5.92s/it] {'loss': 0.2434, 'learning_rate': 3.9933931779719055e-05, 'epoch': 2.76} 6%|▌ | 2797/50750 [7:37:31<78:51:45, 5.92s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:20:15,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:20:15,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.98 | bwd_microstep: 3846.10 | bwd_inner_microstep: 3838.63 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-14 00:20:15,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.98 | bwd: 3846.11 | bwd_inner: 3838.63 | bwd_allreduce: 7.44 | step: 20.89 6%|▌ | 2798/50750 [7:37:37<78:51:45, 5.92s/it] {'loss': 0.0069, 'learning_rate': 3.9933828078495936e-05, 'epoch': 2.76} 6%|▌ | 2798/50750 [7:37:37<78:51:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:20:21,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:20:21,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.70 | bwd_microstep: 3843.74 | bwd_inner_microstep: 3836.27 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.76 [2024-11-14 00:20:21,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.70 | bwd: 3843.75 | bwd_inner: 3836.27 | bwd_allreduce: 7.45 | step: 20.76 6%|▌ | 2799/50750 [7:37:43<78:50:29, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.9933724296086336e-05, 'epoch': 2.76} 6%|▌ | 2799/50750 [7:37:43<78:50:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:20:27,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:20:27,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.40 | bwd_microstep: 3845.25 | bwd_inner_microstep: 3837.79 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 
[2024-11-14 00:20:27,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.40 | bwd: 3845.27 | bwd_inner: 3837.79 | bwd_allreduce: 7.44 | step: 20.88 6%|▌ | 2800/50750 [7:37:49<78:49:13, 5.92s/it] {'loss': 0.0106, 'learning_rate': 3.9933620432490674e-05, 'epoch': 2.76} 6%|▌ | 2800/50750 [7:37:49<78:49:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:20:33,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 00:20:33,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3853.82 | bwd_inner_microstep: 3846.18 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.45 [2024-11-14 00:20:33,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3853.83 | bwd_inner: 3846.18 | bwd_allreduce: 7.62 | step: 21.46 6%|▌ | 2801/50750 [7:37:55<78:51:01, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9933516487709384e-05, 'epoch': 2.76} 6%|▌ | 2801/50750 [7:37:55<78:51:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:20:39,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 00:20:39,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.99 | bwd_microstep: 3850.67 | bwd_inner_microstep: 3842.97 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.70 [2024-11-14 00:20:39,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.97 | bwd: 3850.68 | bwd_inner: 3842.97 | bwd_allreduce: 7.67 | step: 21.70 6%|▌ | 2802/50750 [7:38:01<78:55:07, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.993341246174287e-05, 'epoch': 2.76} 6%|▌ | 2802/50750 [7:38:01<78:55:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:20:45,526] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:20:45,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3845.70 | bwd_inner_microstep: 3838.24 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.96 [2024-11-14 00:20:45,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.57 | bwd: 3845.71 | bwd_inner: 3838.24 | bwd_allreduce: 7.43 | step: 20.96 6%|▌ | 2803/50750 [7:38:07<78:53:12, 5.92s/it] {'loss': 0.3852, 'learning_rate': 3.993330835459158e-05, 'epoch': 2.76} 6%|▌ | 2803/50750 [7:38:07<78:53:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:20:51,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:20:51,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.13 | bwd_microstep: 3846.33 | bwd_inner_microstep: 3838.82 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.08 [2024-11-14 00:20:51,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.13 | bwd: 3846.35 | bwd_inner: 3838.82 | bwd_allreduce: 7.48 | step: 21.08 6%|▌ | 2804/50750 [7:38:13<78:51:57, 5.92s/it] {'loss': 0.0019, 'learning_rate': 3.993320416625592e-05, 'epoch': 2.76} 6%|▌ | 2804/50750 [7:38:13<78:51:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:20:57,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.94 [2024-11-14 00:20:57,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.27 | bwd_microstep: 3846.40 | bwd_inner_microstep: 3838.93 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.87 [2024-11-14 00:20:57,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.27 | bwd: 3846.41 
| bwd_inner: 3838.93 | bwd_allreduce: 7.45 | step: 20.87 6%|▌ | 2805/50750 [7:38:19<78:50:57, 5.92s/it] {'loss': 0.0948, 'learning_rate': 3.993309989673633e-05, 'epoch': 2.76} 6%|▌ | 2805/50750 [7:38:19<78:50:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:21:03,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 00:21:03,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.19 | bwd_microstep: 3853.48 | bwd_inner_microstep: 3846.02 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.29 [2024-11-14 00:21:03,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.19 | bwd: 3853.50 | bwd_inner: 3846.02 | bwd_allreduce: 7.44 | step: 21.30 6%|▌ | 2806/50750 [7:38:25<78:51:38, 5.92s/it] {'loss': 0.1786, 'learning_rate': 3.9932995546033215e-05, 'epoch': 2.76} 6%|▌ | 2806/50750 [7:38:25<78:51:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:21:09,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:21:09,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.65 | bwd_microstep: 3846.02 | bwd_inner_microstep: 3838.56 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-14 00:21:09,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.65 | bwd: 3846.22 | bwd_inner: 3838.56 | bwd_allreduce: 7.43 | step: 20.91 6%|▌ | 2807/50750 [7:38:31<78:49:49, 5.92s/it] {'loss': 0.2536, 'learning_rate': 3.9932891114147014e-05, 'epoch': 2.77} 6%|▌ | 2807/50750 [7:38:31<78:49:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:21:15,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | 
optimizer_step: 4.93 [2024-11-14 00:21:15,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 3856.41 | bwd_inner_microstep: 3848.92 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.28 [2024-11-14 00:21:15,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3856.42 | bwd_inner: 3848.92 | bwd_allreduce: 7.46 | step: 21.29 6%|▌ | 2808/50750 [7:38:37<78:51:27, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.9932786601078154e-05, 'epoch': 2.77} 6%|▌ | 2808/50750 [7:38:37<78:51:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:21:21,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:21:21,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.10 | bwd_microstep: 3850.22 | bwd_inner_microstep: 3842.75 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-14 00:21:21,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.10 | bwd: 3850.23 | bwd_inner: 3842.75 | bwd_allreduce: 7.44 | step: 20.89 6%|▌ | 2809/50750 [7:38:43<78:52:00, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.993268200682705e-05, 'epoch': 2.77} 6%|▌ | 2809/50750 [7:38:43<78:52:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:21:26,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:21:26,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.38 | bwd_microstep: 3843.30 | bwd_inner_microstep: 3835.82 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.80 [2024-11-14 00:21:26,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.38 | bwd: 3843.31 | bwd_inner: 3835.82 | bwd_allreduce: 7.45 | step: 21.80 6%|▌ | 2810/50750 [7:38:48<78:50:05, 
5.92s/it] {'loss': 0.0378, 'learning_rate': 3.9932577331394134e-05, 'epoch': 2.77} 6%|▌ | 2810/50750 [7:38:48<78:50:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:21:32,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:21:32,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.84 | bwd_microstep: 3848.87 | bwd_inner_microstep: 3841.40 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.85 [2024-11-14 00:21:32,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.83 | bwd: 3848.89 | bwd_inner: 3841.40 | bwd_allreduce: 7.45 | step: 20.86 6%|▌ | 2811/50750 [7:38:54<78:50:23, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.993247257477983e-05, 'epoch': 2.77} 6%|▌ | 2811/50750 [7:38:54<78:50:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:21:38,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:21:38,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.75 | bwd_microstep: 3861.95 | bwd_inner_microstep: 3854.43 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.11 [2024-11-14 00:21:38,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.75 | bwd: 3861.96 | bwd_inner: 3854.43 | bwd_allreduce: 7.49 | step: 21.12 6%|▌ | 2812/50750 [7:39:00<78:53:44, 5.92s/it] {'loss': 0.0038, 'learning_rate': 3.993236773698457e-05, 'epoch': 2.77} 6%|▌ | 2812/50750 [7:39:00<78:53:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:21:44,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:21:44,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2029.21 | bwd_microstep: 3845.21 | bwd_inner_microstep: 3837.73 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-14 00:21:44,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.20 | bwd: 3845.23 | bwd_inner: 3837.73 | bwd_allreduce: 7.45 | step: 20.86 6%|▌ | 2813/50750 [7:39:06<78:53:18, 5.92s/it] {'loss': 0.0167, 'learning_rate': 3.9932262818008774e-05, 'epoch': 2.77} 6%|▌ | 2813/50750 [7:39:06<78:53:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:21:50,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:21:50,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.18 | bwd_microstep: 3853.75 | bwd_inner_microstep: 3846.28 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.84 [2024-11-14 00:21:50,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.17 | bwd: 3853.76 | bwd_inner: 3846.28 | bwd_allreduce: 7.44 | step: 20.85 6%|▌ | 2814/50750 [7:39:12<78:54:04, 5.93s/it] {'loss': 0.0259, 'learning_rate': 3.993215781785287e-05, 'epoch': 2.77} 6%|▌ | 2814/50750 [7:39:12<78:54:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:21:56,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:21:56,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.75 | bwd_microstep: 3845.93 | bwd_inner_microstep: 3838.44 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.90 [2024-11-14 00:21:56,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.75 | bwd: 3845.94 | bwd_inner: 3838.44 | bwd_allreduce: 7.46 | step: 20.90 6%|▌ | 2815/50750 [7:39:18<78:53:00, 5.92s/it] {'loss': 0.0128, 'learning_rate': 3.99320527365173e-05, 'epoch': 2.77} 6%|▌ | 2815/50750 
[7:39:18<78:53:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:22:02,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:22:02,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.00 | bwd_microstep: 3847.97 | bwd_inner_microstep: 3840.51 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.93 [2024-11-14 00:22:02,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.00 | bwd: 3847.98 | bwd_inner: 3840.51 | bwd_allreduce: 7.43 | step: 20.93 6%|▌ | 2816/50750 [7:39:24<78:51:36, 5.92s/it] {'loss': 0.6281, 'learning_rate': 3.9931947574002466e-05, 'epoch': 2.77} 6%|▌ | 2816/50750 [7:39:24<78:51:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:22:08,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:22:08,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.07 | bwd_microstep: 3853.50 | bwd_inner_microstep: 3846.02 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.85 [2024-11-14 00:22:08,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.05 | bwd: 3853.52 | bwd_inner: 3846.02 | bwd_allreduce: 7.45 | step: 20.86 6%|▌ | 2817/50750 [7:39:30<78:51:54, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.993184233030882e-05, 'epoch': 2.78} 6%|▌ | 2817/50750 [7:39:30<78:51:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:22:14,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:22:14,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.37 | bwd_microstep: 3850.93 | bwd_inner_microstep: 3843.41 | 
bwd_allreduce_microstep: 7.47 | step_microstep: 21.15 [2024-11-14 00:22:14,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.37 | bwd: 3850.94 | bwd_inner: 3843.41 | bwd_allreduce: 7.49 | step: 21.16 6%|▌ | 2818/50750 [7:39:36<78:52:39, 5.92s/it] {'loss': 0.8919, 'learning_rate': 3.993173700543677e-05, 'epoch': 2.78} 6%|▌ | 2818/50750 [7:39:36<78:52:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:22:20,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.00 [2024-11-14 00:22:20,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.22 | bwd_microstep: 3851.59 | bwd_inner_microstep: 3844.07 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.14 [2024-11-14 00:22:20,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.22 | bwd: 3851.60 | bwd_inner: 3844.07 | bwd_allreduce: 7.49 | step: 21.14 6%|▌ | 2819/50750 [7:39:42<78:54:13, 5.93s/it] {'loss': 0.1196, 'learning_rate': 3.993163159938677e-05, 'epoch': 2.78} 6%|▌ | 2819/50750 [7:39:42<78:54:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:22:26,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 00:22:26,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.19 | bwd_microstep: 3852.37 | bwd_inner_microstep: 3844.59 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.58 [2024-11-14 00:22:26,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.19 | bwd: 3852.39 | bwd_inner: 3844.59 | bwd_allreduce: 7.75 | step: 21.58 6%|▌ | 2820/50750 [7:39:48<78:54:17, 5.93s/it] {'loss': 0.3702, 'learning_rate': 3.993152611215923e-05, 'epoch': 2.78} 6%|▌ | 2820/50750 [7:39:48<78:54:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-14 00:22:32,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 5.00 [2024-11-14 00:22:32,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.47 | bwd_microstep: 3852.73 | bwd_inner_microstep: 3844.79 | bwd_allreduce_microstep: 7.89 | step_microstep: 23.19 [2024-11-14 00:22:32,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.46 | bwd: 3852.75 | bwd_inner: 3844.79 | bwd_allreduce: 7.91 | step: 23.19 6%|▌ | 2821/50750 [7:39:54<78:55:13, 5.93s/it] {'loss': 0.0692, 'learning_rate': 3.9931420543754586e-05, 'epoch': 2.78} 6%|▌ | 2821/50750 [7:39:54<78:55:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:22:38,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:22:38,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.95 | bwd_microstep: 3856.67 | bwd_inner_microstep: 3849.16 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.21 [2024-11-14 00:22:38,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.93 | bwd: 3856.69 | bwd_inner: 3849.16 | bwd_allreduce: 7.49 | step: 21.22 6%|▌ | 2822/50750 [7:40:00<78:57:56, 5.93s/it] {'loss': 0.0177, 'learning_rate': 3.9931314894173265e-05, 'epoch': 2.78} 6%|▌ | 2822/50750 [7:40:00<78:57:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:22:44,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.94 [2024-11-14 00:22:44,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.44 | bwd_microstep: 3850.75 | bwd_inner_microstep: 3843.23 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.52 [2024-11-14 00:22:44,026] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.42 | bwd: 3850.76 | bwd_inner: 3843.23 | bwd_allreduce: 7.49 | step: 21.52 6%|▌ | 2823/50750 [7:40:05<78:58:09, 5.93s/it] {'loss': 0.5688, 'learning_rate': 3.99312091634157e-05, 'epoch': 2.78} 6%|▌ | 2823/50750 [7:40:05<78:58:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:22:49,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:22:49,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.18 | bwd_microstep: 3852.49 | bwd_inner_microstep: 3845.01 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.29 [2024-11-14 00:22:49,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.18 | bwd: 3852.51 | bwd_inner: 3845.01 | bwd_allreduce: 7.46 | step: 21.30 6%|▌ | 2824/50750 [7:40:11<78:56:57, 5.93s/it] {'loss': 0.0167, 'learning_rate': 3.993110335148232e-05, 'epoch': 2.78} 6%|▌ | 2824/50750 [7:40:11<78:56:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:22:55,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:22:55,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.44 | bwd_microstep: 3849.96 | bwd_inner_microstep: 3842.23 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.77 [2024-11-14 00:22:55,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.44 | bwd: 3849.97 | bwd_inner: 3842.23 | bwd_allreduce: 7.70 | step: 21.78 6%|▌ | 2825/50750 [7:40:17<78:56:30, 5.93s/it] {'loss': 0.0138, 'learning_rate': 3.993099745837356e-05, 'epoch': 2.78} 6%|▌ | 2825/50750 [7:40:17<78:56:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:23:01,803] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:23:01,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.89 | bwd_microstep: 3848.73 | bwd_inner_microstep: 3841.26 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.96 [2024-11-14 00:23:01,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.89 | bwd: 3848.74 | bwd_inner: 3841.26 | bwd_allreduce: 7.44 | step: 20.96 6%|▌ | 2826/50750 [7:40:23<78:54:20, 5.93s/it] {'loss': 0.034, 'learning_rate': 3.993089148408984e-05, 'epoch': 2.78} 6%|▌ | 2826/50750 [7:40:23<78:54:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:23:07,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:23:07,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.30 | bwd_microstep: 3845.60 | bwd_inner_microstep: 3838.15 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.85 [2024-11-14 00:23:07,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.30 | bwd: 3845.61 | bwd_inner: 3838.15 | bwd_allreduce: 7.42 | step: 20.85 6%|▌ | 2827/50750 [7:40:29<78:52:06, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.993078542863161e-05, 'epoch': 2.79} 6%|▌ | 2827/50750 [7:40:29<78:52:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:23:13,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:23:13,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.20 | bwd_microstep: 3855.38 | bwd_inner_microstep: 3847.89 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.95 [2024-11-14 00:23:13,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.20 | bwd: 3855.39 | bwd_inner: 3847.89 | 
bwd_allreduce: 7.46 | step: 20.95 6%|▌ | 2828/50750 [7:40:35<78:53:02, 5.93s/it] {'loss': 0.4188, 'learning_rate': 3.993067929199929e-05, 'epoch': 2.79} 6%|▌ | 2828/50750 [7:40:35<78:53:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:23:19,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-14 00:23:19,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1994.75 | bwd_microstep: 3790.57 | bwd_inner_microstep: 3782.96 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.86 [2024-11-14 00:23:19,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1994.75 | bwd: 3790.58 | bwd_inner: 3782.96 | bwd_allreduce: 7.59 | step: 21.87 6%|▌ | 2829/50750 [7:40:41<78:30:50, 5.90s/it] {'loss': 0.5941, 'learning_rate': 3.993057307419331e-05, 'epoch': 2.79} 6%|▌ | 2829/50750 [7:40:41<78:30:50, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:23:25,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:23:25,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.16 | bwd_microstep: 3851.11 | bwd_inner_microstep: 3843.51 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.46 [2024-11-14 00:23:25,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.15 | bwd: 3851.12 | bwd_inner: 3843.51 | bwd_allreduce: 7.57 | step: 21.47 6%|▌ | 2830/50750 [7:40:47<78:38:13, 5.91s/it] {'loss': 0.5502, 'learning_rate': 3.993046677521411e-05, 'epoch': 2.79} 6%|▌ | 2830/50750 [7:40:47<78:38:13, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:23:31,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 
[2024-11-14 00:23:31,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.49 | bwd_microstep: 3854.25 | bwd_inner_microstep: 3845.70 | bwd_allreduce_microstep: 8.49 | step_microstep: 21.45 [2024-11-14 00:23:31,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.48 | bwd: 3854.26 | bwd_inner: 3845.70 | bwd_allreduce: 8.51 | step: 21.45 6%|▌ | 2831/50750 [7:40:53<78:43:24, 5.91s/it] {'loss': 0.1671, 'learning_rate': 3.993036039506211e-05, 'epoch': 2.79} 6%|▌ | 2831/50750 [7:40:53<78:43:24, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:23:37,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.92 [2024-11-14 00:23:37,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.02 | bwd_microstep: 3849.47 | bwd_inner_microstep: 3841.43 | bwd_allreduce_microstep: 7.98 | step_microstep: 23.19 [2024-11-14 00:23:37,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.02 | bwd: 3849.49 | bwd_inner: 3841.43 | bwd_allreduce: 8.01 | step: 23.19 6%|▌ | 2832/50750 [7:40:59<78:46:40, 5.92s/it] {'loss': 2.0869, 'learning_rate': 3.9930253933737765e-05, 'epoch': 2.79} 6%|▌ | 2832/50750 [7:40:59<78:46:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:23:43,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:23:43,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.84 | bwd_microstep: 3850.74 | bwd_inner_microstep: 3843.17 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.71 [2024-11-14 00:23:43,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.84 | bwd: 3850.76 | bwd_inner: 3843.17 | bwd_allreduce: 7.55 | step: 21.72 6%|▌ | 2833/50750 [7:41:05<78:49:25, 5.92s/it] {'loss': 0.012, 
'learning_rate': 3.99301473912415e-05, 'epoch': 2.79} 6%|▌ | 2833/50750 [7:41:05<78:49:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:23:49,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:23:49,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.95 | bwd_microstep: 3847.29 | bwd_inner_microstep: 3839.83 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.08 [2024-11-14 00:23:49,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.93 | bwd: 3847.31 | bwd_inner: 3839.83 | bwd_allreduce: 7.44 | step: 21.08 6%|▌ | 2834/50750 [7:41:11<78:50:50, 5.92s/it] {'loss': 0.0188, 'learning_rate': 3.993004076757373e-05, 'epoch': 2.79} 6%|▌ | 2834/50750 [7:41:11<78:50:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:23:55,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 5.08 [2024-11-14 00:23:55,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.39 | bwd_microstep: 3859.43 | bwd_inner_microstep: 3851.82 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.30 [2024-11-14 00:23:55,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.39 | bwd: 3859.44 | bwd_inner: 3851.82 | bwd_allreduce: 7.58 | step: 21.31 6%|▌ | 2835/50750 [7:41:17<78:52:29, 5.93s/it] {'loss': 0.0927, 'learning_rate': 3.992993406273491e-05, 'epoch': 2.79} 6%|▌ | 2835/50750 [7:41:17<78:52:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:24:01,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 00:24:01,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.46 | 
bwd_microstep: 3861.99 | bwd_inner_microstep: 3854.24 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.35 [2024-11-14 00:24:01,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.46 | bwd: 3862.00 | bwd_inner: 3854.24 | bwd_allreduce: 7.72 | step: 21.37 6%|▌ | 2836/50750 [7:41:22<78:56:42, 5.93s/it] {'loss': 0.6797, 'learning_rate': 3.992982727672547e-05, 'epoch': 2.79} 6%|▌ | 2836/50750 [7:41:22<78:56:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:24:06,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:24:06,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.35 | bwd_microstep: 3854.92 | bwd_inner_microstep: 3847.41 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98 [2024-11-14 00:24:06,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.35 | bwd: 3854.94 | bwd_inner: 3847.41 | bwd_allreduce: 7.48 | step: 20.98 6%|▌ | 2837/50750 [7:41:28<78:55:23, 5.93s/it] {'loss': 0.0038, 'learning_rate': 3.992972040954585e-05, 'epoch': 2.8} 6%|▌ | 2837/50750 [7:41:28<78:55:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:24:12,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:24:12,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.37 | bwd_microstep: 3854.62 | bwd_inner_microstep: 3847.10 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.15 [2024-11-14 00:24:12,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.37 | bwd: 3854.63 | bwd_inner: 3847.10 | bwd_allreduce: 7.50 | step: 21.15 6%|▌ | 2838/50750 [7:41:34<78:55:11, 5.93s/it] {'loss': 1.9547, 'learning_rate': 3.992961346119647e-05, 'epoch': 2.8} 6%|▌ | 2838/50750 [7:41:34<78:55:11, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:24:18,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:24:18,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.42 | bwd_microstep: 3851.75 | bwd_inner_microstep: 3844.22 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.18 [2024-11-14 00:24:18,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.42 | bwd: 3851.76 | bwd_inner: 3844.22 | bwd_allreduce: 7.50 | step: 21.18 6%|▌ | 2839/50750 [7:41:40<78:53:41, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.9929506431677786e-05, 'epoch': 2.8} 6%|▌ | 2839/50750 [7:41:40<78:53:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:24:24,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:24:24,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.64 | bwd_microstep: 3854.31 | bwd_inner_microstep: 3846.58 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.45 [2024-11-14 00:24:24,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.64 | bwd: 3854.32 | bwd_inner: 3846.58 | bwd_allreduce: 7.70 | step: 21.45 6%|▌ | 2840/50750 [7:41:46<78:54:18, 5.93s/it] {'loss': 1.6184, 'learning_rate': 3.992939932099021e-05, 'epoch': 2.8} 6%|▌ | 2840/50750 [7:41:46<78:54:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:24:30,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 00:24:30,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.10 | bwd_microstep: 3848.57 | bwd_inner_microstep: 3841.07 | bwd_allreduce_microstep: 7.45 | 
step_microstep: 20.98 [2024-11-14 00:24:30,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.10 | bwd: 3848.58 | bwd_inner: 3841.08 | bwd_allreduce: 7.47 | step: 20.99 6%|▌ | 2841/50750 [7:41:52<78:52:15, 5.93s/it] {'loss': 1.4034, 'learning_rate': 3.99292921291342e-05, 'epoch': 2.8} 6%|▌ | 2841/50750 [7:41:52<78:52:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:24:36,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-14 00:24:36,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.78 | bwd_microstep: 3849.81 | bwd_inner_microstep: 3842.14 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.83 [2024-11-14 00:24:36,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.77 | bwd: 3849.83 | bwd_inner: 3842.14 | bwd_allreduce: 7.65 | step: 21.84 6%|▌ | 2842/50750 [7:41:58<78:53:03, 5.93s/it] {'loss': 0.0103, 'learning_rate': 3.992918485611018e-05, 'epoch': 2.8} 6%|▌ | 2842/50750 [7:41:58<78:53:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:24:42,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.94 [2024-11-14 00:24:42,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.62 | bwd_microstep: 3850.25 | bwd_inner_microstep: 3842.74 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.31 [2024-11-14 00:24:42,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.61 | bwd: 3850.26 | bwd_inner: 3842.74 | bwd_allreduce: 7.49 | step: 21.31 6%|▌ | 2843/50750 [7:42:04<78:54:09, 5.93s/it] {'loss': 0.1881, 'learning_rate': 3.9929077501918596e-05, 'epoch': 2.8} 6%|▌ | 2843/50750 [7:42:04<78:54:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 
00:24:48,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 00:24:48,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.60 | bwd_microstep: 3855.73 | bwd_inner_microstep: 3848.23 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.08 [2024-11-14 00:24:48,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.60 | bwd: 3855.74 | bwd_inner: 3848.23 | bwd_allreduce: 7.47 | step: 21.09 6%|▌ | 2844/50750 [7:42:10<78:53:44, 5.93s/it] {'loss': 1.264, 'learning_rate': 3.992897006655987e-05, 'epoch': 2.8} 6%|▌ | 2844/50750 [7:42:10<78:53:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:24:54,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:24:54,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3845.83 | bwd_inner_microstep: 3838.31 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.02 [2024-11-14 00:24:54,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.33 | bwd: 3845.84 | bwd_inner: 3838.31 | bwd_allreduce: 7.49 | step: 21.02 6%|▌ | 2845/50750 [7:42:16<78:51:01, 5.93s/it] {'loss': 0.8108, 'learning_rate': 3.992886255003446e-05, 'epoch': 2.8} 6%|▌ | 2845/50750 [7:42:16<78:51:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:25:00,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:25:00,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.00 | bwd_microstep: 3853.61 | bwd_inner_microstep: 3846.09 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.02 [2024-11-14 00:25:00,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2028.00 | bwd: 3853.62 | bwd_inner: 3846.09 | bwd_allreduce: 7.49 | step: 21.02 6%|▌ | 2846/50750 [7:42:22<78:51:30, 5.93s/it] {'loss': 0.2959, 'learning_rate': 3.992875495234278e-05, 'epoch': 2.8} 6%|▌ | 2846/50750 [7:42:22<78:51:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:25:06,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:25:06,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.03 | bwd_microstep: 3849.07 | bwd_inner_microstep: 3841.53 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.15 [2024-11-14 00:25:06,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.02 | bwd: 3849.09 | bwd_inner: 3841.53 | bwd_allreduce: 7.51 | step: 21.15 6%|▌ | 2847/50750 [7:42:28<78:51:10, 5.93s/it] {'loss': 0.2769, 'learning_rate': 3.992864727348529e-05, 'epoch': 2.8} 6%|▌ | 2847/50750 [7:42:28<78:51:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:25:12,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:25:12,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.69 | bwd_microstep: 3848.28 | bwd_inner_microstep: 3840.77 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.97 [2024-11-14 00:25:12,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.69 | bwd: 3848.29 | bwd_inner: 3840.77 | bwd_allreduce: 7.49 | step: 20.97 6%|▌ | 2848/50750 [7:42:34<78:50:29, 5.93s/it] {'loss': 3.8107, 'learning_rate': 3.9928539513462424e-05, 'epoch': 2.81} 6%|▌ | 2848/50750 [7:42:34<78:50:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:25:18,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 00:25:18,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.87 | bwd_microstep: 3853.86 | bwd_inner_microstep: 3846.36 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.04 [2024-11-14 00:25:18,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3853.88 | bwd_inner: 3846.36 | bwd_allreduce: 7.48 | step: 21.05 6%|▌ | 2849/50750 [7:42:40<78:50:20, 5.93s/it] {'loss': 0.1786, 'learning_rate': 3.9928431672274614e-05, 'epoch': 2.81} 6%|▌ | 2849/50750 [7:42:40<78:50:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:25:23,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 00:25:23,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.80 | bwd_microstep: 3849.30 | bwd_inner_microstep: 3841.75 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.06 [2024-11-14 00:25:23,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.80 | bwd: 3849.31 | bwd_inner: 3841.75 | bwd_allreduce: 7.52 | step: 21.06 6%|▌ | 2850/50750 [7:42:45<78:50:29, 5.93s/it] {'loss': 0.3532, 'learning_rate': 3.99283237499223e-05, 'epoch': 2.81} 6%|▌ | 2850/50750 [7:42:45<78:50:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:25:29,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:25:29,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.12 | bwd_microstep: 3850.00 | bwd_inner_microstep: 3842.48 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-14 00:25:29,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.12 | bwd: 3850.01 | bwd_inner: 3842.48 | bwd_allreduce: 7.49 | step: 21.07 6%|▌ | 
2851/50750 [7:42:51<78:51:07, 5.93s/it] {'loss': 0.7039, 'learning_rate': 3.9928215746405924e-05, 'epoch': 2.81} 6%|▌ | 2851/50750 [7:42:51<78:51:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:25:35,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.38 | optimizer_step: 4.93 [2024-11-14 00:25:35,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.79 | bwd_microstep: 3850.82 | bwd_inner_microstep: 3842.93 | bwd_allreduce_microstep: 7.83 | step_microstep: 29.89 [2024-11-14 00:25:35,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.79 | bwd: 3850.84 | bwd_inner: 3842.93 | bwd_allreduce: 7.86 | step: 29.89 6%|▌ | 2852/50750 [7:42:57<78:54:03, 5.93s/it] {'loss': 0.4142, 'learning_rate': 3.9928107661725924e-05, 'epoch': 2.81} 6%|▌ | 2852/50750 [7:42:57<78:54:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:25:41,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:25:41,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.96 | bwd_microstep: 3854.76 | bwd_inner_microstep: 3846.96 | bwd_allreduce_microstep: 7.75 | step_microstep: 22.00 [2024-11-14 00:25:41,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.95 | bwd: 3854.77 | bwd_inner: 3846.96 | bwd_allreduce: 7.76 | step: 22.00 6%|▌ | 2853/50750 [7:43:03<78:54:40, 5.93s/it] {'loss': 0.2805, 'learning_rate': 3.9927999495882745e-05, 'epoch': 2.81} 6%|▌ | 2853/50750 [7:43:03<78:54:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:25:47,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:25:47,700] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.08 | bwd_microstep: 3850.32 | bwd_inner_microstep: 3842.47 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.30 [2024-11-14 00:25:47,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.08 | bwd: 3850.34 | bwd_inner: 3842.47 | bwd_allreduce: 7.81 | step: 22.30 6%|▌ | 2854/50750 [7:43:09<78:53:00, 5.93s/it] {'loss': 0.5022, 'learning_rate': 3.992789124887682e-05, 'epoch': 2.81} 6%|▌ | 2854/50750 [7:43:09<78:53:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:25:53,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:25:53,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.24 | bwd_microstep: 3853.15 | bwd_inner_microstep: 3845.66 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.05 [2024-11-14 00:25:53,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.23 | bwd: 3853.16 | bwd_inner: 3845.66 | bwd_allreduce: 7.46 | step: 21.06 6%|▌ | 2855/50750 [7:43:15<78:52:08, 5.93s/it] {'loss': 0.3025, 'learning_rate': 3.9927782920708596e-05, 'epoch': 2.81} 6%|▌ | 2855/50750 [7:43:15<78:52:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:25:59,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:25:59,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.63 | bwd_microstep: 3855.69 | bwd_inner_microstep: 3848.17 | bwd_allreduce_microstep: 7.48 | step_microstep: 23.08 [2024-11-14 00:25:59,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.63 | bwd: 3855.71 | bwd_inner: 3848.17 | bwd_allreduce: 7.50 | step: 23.09 6%|▌ | 2856/50750 [7:43:21<78:53:49, 5.93s/it] {'loss': 0.0102, 'learning_rate': 
3.992767451137851e-05, 'epoch': 2.81} 6%|▌ | 2856/50750 [7:43:21<78:53:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:26:05,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:26:05,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.57 | bwd_microstep: 3849.44 | bwd_inner_microstep: 3841.50 | bwd_allreduce_microstep: 7.90 | step_microstep: 21.22 [2024-11-14 00:26:05,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.57 | bwd: 3849.45 | bwd_inner: 3841.50 | bwd_allreduce: 7.91 | step: 21.22 6%|▌ | 2857/50750 [7:43:27<78:51:49, 5.93s/it] {'loss': 0.065, 'learning_rate': 3.992756602088701e-05, 'epoch': 2.81} 6%|▌ | 2857/50750 [7:43:27<78:51:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:26:11,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.92 [2024-11-14 00:26:11,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.01 | bwd_microstep: 3853.20 | bwd_inner_microstep: 3845.69 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.25 [2024-11-14 00:26:11,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.01 | bwd: 3853.21 | bwd_inner: 3845.69 | bwd_allreduce: 7.48 | step: 21.25 6%|▌ | 2858/50750 [7:43:33<78:51:46, 5.93s/it] {'loss': 0.0427, 'learning_rate': 3.992745744923454e-05, 'epoch': 2.82} 6%|▌ | 2858/50750 [7:43:33<78:51:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:26:17,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:26:17,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.28 | bwd_microstep: 
3845.11 | bwd_inner_microstep: 3837.63 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.05 [2024-11-14 00:26:17,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.27 | bwd: 3845.12 | bwd_inner: 3837.63 | bwd_allreduce: 7.45 | step: 21.05 6%|▌ | 2859/50750 [7:43:39<78:50:01, 5.93s/it] {'loss': 0.039, 'learning_rate': 3.992734879642153e-05, 'epoch': 2.82} 6%|▌ | 2859/50750 [7:43:39<78:50:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:26:23,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 00:26:23,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.31 | bwd_microstep: 3843.88 | bwd_inner_microstep: 3835.88 | bwd_allreduce_microstep: 7.93 | step_microstep: 23.99 [2024-11-14 00:26:23,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.30 | bwd: 3843.90 | bwd_inner: 3835.88 | bwd_allreduce: 7.96 | step: 24.00 6%|▌ | 2860/50750 [7:43:45<78:48:08, 5.92s/it] {'loss': 0.0455, 'learning_rate': 3.992724006244842e-05, 'epoch': 2.82} 6%|▌ | 2860/50750 [7:43:45<78:48:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:26:29,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.97 [2024-11-14 00:26:29,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.74 | bwd_microstep: 3853.31 | bwd_inner_microstep: 3845.84 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.17 [2024-11-14 00:26:29,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.73 | bwd: 3853.32 | bwd_inner: 3845.84 | bwd_allreduce: 7.44 | step: 21.17 6%|▌ | 2861/50750 [7:43:51<78:49:16, 5.93s/it] {'loss': 0.0174, 'learning_rate': 3.992713124731567e-05, 'epoch': 2.82} 6%|▌ | 2861/50750 [7:43:51<78:49:16, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:26:35,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:26:35,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.59 | bwd_microstep: 3846.46 | bwd_inner_microstep: 3839.00 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.75 [2024-11-14 00:26:35,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.59 | bwd: 3846.47 | bwd_inner: 3839.00 | bwd_allreduce: 7.43 | step: 21.75 6%|▌ | 2862/50750 [7:43:57<78:47:25, 5.92s/it] {'loss': 0.2299, 'learning_rate': 3.9927022351023714e-05, 'epoch': 2.82} 6%|▌ | 2862/50750 [7:43:57<78:47:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:26:41,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 00:26:41,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.81 | bwd_microstep: 3847.46 | bwd_inner_microstep: 3839.98 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.84 [2024-11-14 00:26:41,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.81 | bwd: 3847.48 | bwd_inner: 3839.98 | bwd_allreduce: 7.45 | step: 20.84 6%|▌ | 2863/50750 [7:44:02<78:46:00, 5.92s/it] {'loss': 0.0433, 'learning_rate': 3.992691337357299e-05, 'epoch': 2.82} 6%|▌ | 2863/50750 [7:44:02<78:46:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:26:46,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:26:46,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.49 | bwd_microstep: 3856.82 | bwd_inner_microstep: 3849.33 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.08 [2024-11-14 
00:26:46,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3856.83 | bwd_inner: 3849.33 | bwd_allreduce: 7.46 | step: 21.08 6%|▌ | 2864/50750 [7:44:08<78:47:48, 5.92s/it] {'loss': 0.0031, 'learning_rate': 3.992680431496395e-05, 'epoch': 2.82} 6%|▌ | 2864/50750 [7:44:08<78:47:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:26:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:26:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.38 | bwd_microstep: 3847.52 | bwd_inner_microstep: 3840.03 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.87 [2024-11-14 00:26:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.38 | bwd: 3847.53 | bwd_inner: 3840.03 | bwd_allreduce: 7.47 | step: 20.87 6%|▌ | 2865/50750 [7:44:14<78:45:57, 5.92s/it] {'loss': 0.272, 'learning_rate': 3.992669517519705e-05, 'epoch': 2.82} 6%|▌ | 2865/50750 [7:44:14<78:45:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:26:58,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 00:26:58,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3848.38 | bwd_inner_microstep: 3839.70 | bwd_allreduce_microstep: 8.60 | step_microstep: 21.61 [2024-11-14 00:26:58,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.24 | bwd: 3848.41 | bwd_inner: 3839.70 | bwd_allreduce: 8.63 | step: 21.60 6%|▌ | 2866/50750 [7:44:20<78:46:10, 5.92s/it] {'loss': 0.023, 'learning_rate': 3.99265859542727e-05, 'epoch': 2.82} 6%|▌ | 2866/50750 [7:44:20<78:46:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:27:04,711] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:27:04,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.38 | bwd_microstep: 3846.79 | bwd_inner_microstep: 3839.32 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-14 00:27:04,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.37 | bwd: 3846.80 | bwd_inner: 3839.32 | bwd_allreduce: 7.44 | step: 20.88 6%|▌ | 2867/50750 [7:44:26<78:47:02, 5.92s/it] {'loss': 0.3815, 'learning_rate': 3.992647665219138e-05, 'epoch': 2.82} 6%|▌ | 2867/50750 [7:44:26<78:47:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:27:10,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 00:27:10,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.72 | bwd_microstep: 3843.92 | bwd_inner_microstep: 3836.48 | bwd_allreduce_microstep: 7.40 | step_microstep: 21.23 [2024-11-14 00:27:10,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.72 | bwd: 3843.93 | bwd_inner: 3836.48 | bwd_allreduce: 7.42 | step: 21.24 6%|▌ | 2868/50750 [7:44:32<78:44:27, 5.92s/it] {'loss': 0.0103, 'learning_rate': 3.9926367268953514e-05, 'epoch': 2.83} 6%|▌ | 2868/50750 [7:44:32<78:44:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:27:16,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:27:16,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.93 | bwd_microstep: 3843.88 | bwd_inner_microstep: 3836.42 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.42 [2024-11-14 00:27:16,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.93 | bwd: 
3843.90 | bwd_inner: 3836.42 | bwd_allreduce: 7.44 | step: 21.42 6%|▌ | 2869/50750 [7:44:38<78:43:33, 5.92s/it] {'loss': 0.0194, 'learning_rate': 3.992625780455956e-05, 'epoch': 2.83} 6%|▌ | 2869/50750 [7:44:38<78:43:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:27:22,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:27:22,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.19 | bwd_microstep: 3848.38 | bwd_inner_microstep: 3840.91 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.49 [2024-11-14 00:27:22,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.19 | bwd: 3848.39 | bwd_inner: 3840.91 | bwd_allreduce: 7.44 | step: 21.49 6%|▌ | 2870/50750 [7:44:44<78:43:14, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.9926148259009944e-05, 'epoch': 2.83} 6%|▌ | 2870/50750 [7:44:44<78:43:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:27:28,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:27:28,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.11 | bwd_microstep: 3847.85 | bwd_inner_microstep: 3840.40 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.85 [2024-11-14 00:27:28,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.11 | bwd: 3847.86 | bwd_inner: 3840.40 | bwd_allreduce: 7.42 | step: 20.85 6%|▌ | 2871/50750 [7:44:50<78:43:26, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.992603863230514e-05, 'epoch': 2.83} 6%|▌ | 2871/50750 [7:44:50<78:43:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:27:34,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 
| optimizer_step: 4.93 [2024-11-14 00:27:34,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.74 | bwd_microstep: 3845.44 | bwd_inner_microstep: 3837.97 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.89 [2024-11-14 00:27:34,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.74 | bwd: 3845.45 | bwd_inner: 3837.97 | bwd_allreduce: 7.45 | step: 20.90 6%|▌ | 2872/50750 [7:44:56<78:42:06, 5.92s/it] {'loss': 0.0338, 'learning_rate': 3.992592892444557e-05, 'epoch': 2.83} 6%|▌ | 2872/50750 [7:44:56<78:42:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:27:40,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.92 [2024-11-14 00:27:40,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.20 | bwd_microstep: 3848.65 | bwd_inner_microstep: 3840.41 | bwd_allreduce_microstep: 8.18 | step_microstep: 22.22 [2024-11-14 00:27:40,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.20 | bwd: 3848.66 | bwd_inner: 3840.41 | bwd_allreduce: 8.20 | step: 22.23 6%|▌ | 2873/50750 [7:45:02<78:43:49, 5.92s/it] {'loss': 0.0802, 'learning_rate': 3.99258191354317e-05, 'epoch': 2.83} 6%|▌ | 2873/50750 [7:45:02<78:43:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:27:46,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:27:46,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.35 | bwd_microstep: 3855.62 | bwd_inner_microstep: 3848.10 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-14 00:27:46,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.34 | bwd: 3855.63 | bwd_inner: 3848.10 | bwd_allreduce: 7.49 | step: 21.11 6%|▌ | 2874/50750 [7:45:08<78:45:02, 
5.92s/it] {'loss': 0.5868, 'learning_rate': 3.992570926526397e-05, 'epoch': 2.83} 6%|▌ | 2874/50750 [7:45:08<78:45:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:27:52,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 00:27:52,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.55 | bwd_microstep: 3859.66 | bwd_inner_microstep: 3851.91 | bwd_allreduce_microstep: 7.69 | step_microstep: 23.55 [2024-11-14 00:27:52,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.55 | bwd: 3859.68 | bwd_inner: 3851.91 | bwd_allreduce: 7.71 | step: 23.55 6%|▌ | 2875/50750 [7:45:14<78:48:19, 5.93s/it] {'loss': 0.0861, 'learning_rate': 3.992559931394282e-05, 'epoch': 2.83} 6%|▌ | 2875/50750 [7:45:14<78:48:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:27:58,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:27:58,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.14 | bwd_microstep: 3852.71 | bwd_inner_microstep: 3845.10 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.65 [2024-11-14 00:27:58,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.14 | bwd: 3852.73 | bwd_inner: 3845.10 | bwd_allreduce: 7.58 | step: 21.65 6%|▌ | 2876/50750 [7:45:19<78:49:19, 5.93s/it] {'loss': 0.0134, 'learning_rate': 3.99254892814687e-05, 'epoch': 2.83} 6%|▌ | 2876/50750 [7:45:19<78:49:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:28:03,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 00:28:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2022.21 | bwd_microstep: 3846.11 | bwd_inner_microstep: 3838.60 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.13 [2024-11-14 00:28:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.19 | bwd: 3846.13 | bwd_inner: 3838.60 | bwd_allreduce: 7.49 | step: 21.13 6%|▌ | 2877/50750 [7:45:25<78:47:32, 5.93s/it] {'loss': 0.5939, 'learning_rate': 3.9925379167842064e-05, 'epoch': 2.83} 6%|▌ | 2877/50750 [7:45:25<78:47:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:28:09,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-14 00:28:09,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.92 | bwd_microstep: 3846.73 | bwd_inner_microstep: 3839.19 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.30 [2024-11-14 00:28:09,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.92 | bwd: 3846.74 | bwd_inner: 3839.19 | bwd_allreduce: 7.51 | step: 21.30 6%|▌ | 2878/50750 [7:45:31<78:45:41, 5.92s/it] {'loss': 0.0405, 'learning_rate': 3.992526897306337e-05, 'epoch': 2.84} 6%|▌ | 2878/50750 [7:45:31<78:45:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:28:15,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:28:15,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.69 | bwd_microstep: 3845.95 | bwd_inner_microstep: 3838.42 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.25 [2024-11-14 00:28:15,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.68 | bwd: 3845.96 | bwd_inner: 3838.42 | bwd_allreduce: 7.49 | step: 21.26 6%|▌ | 2879/50750 [7:45:37<78:44:57, 5.92s/it] {'loss': 0.0702, 'learning_rate': 3.9925158697133046e-05, 'epoch': 2.84} 6%|▌ | 2879/50750 
[7:45:37<78:44:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:28:21,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-14 00:28:21,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.25 | bwd_microstep: 3847.48 | bwd_inner_microstep: 3839.65 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.16 [2024-11-14 00:28:21,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.25 | bwd: 3847.50 | bwd_inner: 3839.65 | bwd_allreduce: 7.80 | step: 22.16 6%|▌ | 2880/50750 [7:45:43<78:46:31, 5.92s/it] {'loss': 0.0311, 'learning_rate': 3.992504834005155e-05, 'epoch': 2.84} 6%|▌ | 2880/50750 [7:45:43<78:46:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:28:27,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:28:27,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.56 | bwd_microstep: 3844.20 | bwd_inner_microstep: 3836.68 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.02 [2024-11-14 00:28:27,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.55 | bwd: 3844.21 | bwd_inner: 3836.68 | bwd_allreduce: 7.49 | step: 21.03 6%|▌ | 2881/50750 [7:45:49<78:46:01, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.9924937901819325e-05, 'epoch': 2.84} 6%|▌ | 2881/50750 [7:45:49<78:46:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:28:33,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:28:33,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.81 | bwd_microstep: 3843.06 | bwd_inner_microstep: 3835.55 | 
bwd_allreduce_microstep: 7.47 | step_microstep: 21.16 [2024-11-14 00:28:33,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.81 | bwd: 3843.07 | bwd_inner: 3835.55 | bwd_allreduce: 7.48 | step: 21.16 6%|▌ | 2882/50750 [7:45:55<78:43:30, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9924827382436834e-05, 'epoch': 2.84} 6%|▌ | 2882/50750 [7:45:55<78:43:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:28:39,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 00:28:39,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.53 | bwd_microstep: 3850.03 | bwd_inner_microstep: 3842.26 | bwd_allreduce_microstep: 7.72 | step_microstep: 25.43 [2024-11-14 00:28:39,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.53 | bwd: 3850.05 | bwd_inner: 3842.26 | bwd_allreduce: 7.74 | step: 25.43 6%|▌ | 2883/50750 [7:46:01<78:44:19, 5.92s/it] {'loss': 0.0021, 'learning_rate': 3.992471678190453e-05, 'epoch': 2.84} 6%|▌ | 2883/50750 [7:46:01<78:44:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:28:45,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:28:45,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.42 | bwd_microstep: 3847.40 | bwd_inner_microstep: 3839.70 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.35 [2024-11-14 00:28:45,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3847.42 | bwd_inner: 3839.70 | bwd_allreduce: 7.67 | step: 21.36 6%|▌ | 2884/50750 [7:46:07<78:43:41, 5.92s/it] {'loss': 0.3538, 'learning_rate': 3.992460610022284e-05, 'epoch': 2.84} 6%|▌ | 2884/50750 [7:46:07<78:43:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-14 00:28:51,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:28:51,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.93 | bwd_microstep: 3843.94 | bwd_inner_microstep: 3836.28 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.52 [2024-11-14 00:28:51,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.94 | bwd: 3843.95 | bwd_inner: 3836.28 | bwd_allreduce: 7.64 | step: 21.53 6%|▌ | 2885/50750 [7:46:13<78:41:52, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.992449533739224e-05, 'epoch': 2.84} 6%|▌ | 2885/50750 [7:46:13<78:41:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:28:57,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:28:57,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.48 | bwd_microstep: 3853.30 | bwd_inner_microstep: 3845.78 | bwd_allreduce_microstep: 7.48 | step_microstep: 20.99 [2024-11-14 00:28:57,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.48 | bwd: 3853.32 | bwd_inner: 3845.78 | bwd_allreduce: 7.49 | step: 20.99 6%|▌ | 2886/50750 [7:46:19<78:43:27, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.992438449341316e-05, 'epoch': 2.84} 6%|▌ | 2886/50750 [7:46:19<78:43:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:29:03,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 00:29:03,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.42 | bwd_microstep: 3846.31 | bwd_inner_microstep: 3838.56 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.90 [2024-11-14 00:29:03,137] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.40 | bwd: 3846.33 | bwd_inner: 3838.56 | bwd_allreduce: 7.73 | step: 21.91 6%|▌ | 2887/50750 [7:46:25<78:43:48, 5.92s/it] {'loss': 0.5653, 'learning_rate': 3.9924273568286066e-05, 'epoch': 2.84} 6%|▌ | 2887/50750 [7:46:25<78:43:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:29:09,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 00:29:09,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.34 | bwd_microstep: 3842.58 | bwd_inner_microstep: 3834.90 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.50 [2024-11-14 00:29:09,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.33 | bwd: 3842.60 | bwd_inner: 3834.91 | bwd_allreduce: 7.65 | step: 21.50 6%|▌ | 2888/50750 [7:46:31<78:43:36, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9924162562011405e-05, 'epoch': 2.85} 6%|▌ | 2888/50750 [7:46:31<78:43:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:29:14,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:29:14,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.65 | bwd_microstep: 3841.89 | bwd_inner_microstep: 3834.37 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.92 [2024-11-14 00:29:14,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.63 | bwd: 3841.91 | bwd_inner: 3834.37 | bwd_allreduce: 7.50 | step: 20.92 6%|▌ | 2889/50750 [7:46:36<78:43:02, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.992405147458963e-05, 'epoch': 2.85} 6%|▌ | 2889/50750 [7:46:36<78:43:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:29:20,900] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.94 [2024-11-14 00:29:20,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.22 | bwd_microstep: 3850.58 | bwd_inner_microstep: 3843.06 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.08 [2024-11-14 00:29:20,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.22 | bwd: 3850.59 | bwd_inner: 3843.06 | bwd_allreduce: 7.49 | step: 21.09 6%|▌ | 2890/50750 [7:46:42<78:42:30, 5.92s/it] {'loss': 0.0734, 'learning_rate': 3.9923940306021195e-05, 'epoch': 2.85} 6%|▌ | 2890/50750 [7:46:42<78:42:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:29:26,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:29:26,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.28 | bwd_microstep: 3848.45 | bwd_inner_microstep: 3840.94 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-14 00:29:26,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.27 | bwd: 3848.46 | bwd_inner: 3840.94 | bwd_allreduce: 7.48 | step: 21.09 6%|▌ | 2891/50750 [7:46:48<78:43:26, 5.92s/it] {'loss': 0.6062, 'learning_rate': 3.992382905630655e-05, 'epoch': 2.85} 6%|▌ | 2891/50750 [7:46:48<78:43:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:29:32,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 00:29:32,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.38 | bwd_microstep: 3842.88 | bwd_inner_microstep: 3835.38 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-14 00:29:32,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.38 | bwd: 3842.89 | bwd_inner: 3835.38 | 
bwd_allreduce: 7.47 | step: 21.02 6%|▌ | 2892/50750 [7:46:54<78:41:58, 5.92s/it] {'loss': 0.2963, 'learning_rate': 3.9923717725446137e-05, 'epoch': 2.85} 6%|▌ | 2892/50750 [7:46:54<78:41:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:29:38,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:29:38,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.85 | bwd_microstep: 3846.51 | bwd_inner_microstep: 3839.05 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.00 [2024-11-14 00:29:38,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.85 | bwd: 3846.52 | bwd_inner: 3839.05 | bwd_allreduce: 7.43 | step: 21.01 6%|▌ | 2893/50750 [7:47:00<78:40:59, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.992360631344043e-05, 'epoch': 2.85} 6%|▌ | 2893/50750 [7:47:00<78:40:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:29:44,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:29:44,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.06 | bwd_microstep: 3842.25 | bwd_inner_microstep: 3834.53 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.37 [2024-11-14 00:29:44,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.06 | bwd: 3842.27 | bwd_inner: 3834.53 | bwd_allreduce: 7.70 | step: 21.38 6%|▌ | 2894/50750 [7:47:06<78:39:25, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.9923494820289876e-05, 'epoch': 2.85} 6%|▌ | 2894/50750 [7:47:06<78:39:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:29:50,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 
[2024-11-14 00:29:50,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.62 | bwd_microstep: 3847.65 | bwd_inner_microstep: 3840.12 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.06 [2024-11-14 00:29:50,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.62 | bwd: 3847.66 | bwd_inner: 3840.12 | bwd_allreduce: 7.50 | step: 21.06 6%|▌ | 2895/50750 [7:47:12<78:39:59, 5.92s/it] {'loss': 0.0068, 'learning_rate': 3.992338324599493e-05, 'epoch': 2.85} 6%|▌ | 2895/50750 [7:47:12<78:39:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:29:56,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:29:56,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.25 | bwd_microstep: 3843.03 | bwd_inner_microstep: 3835.50 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.05 [2024-11-14 00:29:56,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.25 | bwd: 3843.05 | bwd_inner: 3835.50 | bwd_allreduce: 7.50 | step: 21.05 6%|▌ | 2896/50750 [7:47:18<78:39:06, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.992327159055604e-05, 'epoch': 2.85} 6%|▌ | 2896/50750 [7:47:18<78:39:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:30:02,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:30:02,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.53 | bwd_microstep: 3849.94 | bwd_inner_microstep: 3842.39 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.07 [2024-11-14 00:30:02,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.52 | bwd: 3849.95 | bwd_inner: 3842.39 | bwd_allreduce: 7.52 | step: 21.08 6%|▌ | 2897/50750 [7:47:24<78:40:20, 5.92s/it] {'loss': 0.0059, 
'learning_rate': 3.9923159853973656e-05, 'epoch': 2.85} 6%|▌ | 2897/50750 [7:47:24<78:40:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:30:08,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.35 | optimizer_step: 5.10 [2024-11-14 00:30:08,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.83 | bwd_microstep: 3856.38 | bwd_inner_microstep: 3848.13 | bwd_allreduce_microstep: 8.18 | step_microstep: 28.66 [2024-11-14 00:30:08,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.83 | bwd: 3856.40 | bwd_inner: 3848.13 | bwd_allreduce: 8.21 | step: 28.65 6%|▌ | 2898/50750 [7:47:30<78:45:48, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.992304803624825e-05, 'epoch': 2.86} 6%|▌ | 2898/50750 [7:47:30<78:45:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:30:14,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:30:14,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.97 | bwd_microstep: 3843.55 | bwd_inner_microstep: 3836.02 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.15 [2024-11-14 00:30:14,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.95 | bwd: 3843.56 | bwd_inner: 3836.02 | bwd_allreduce: 7.50 | step: 21.16 6%|▌ | 2899/50750 [7:47:36<78:43:40, 5.92s/it] {'loss': 0.1224, 'learning_rate': 3.992293613738027e-05, 'epoch': 2.86} 6%|▌ | 2899/50750 [7:47:36<78:43:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:30:20,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-14 00:30:20,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.85 | 
bwd_microstep: 3848.61 | bwd_inner_microstep: 3840.88 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.69 [2024-11-14 00:30:20,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.85 | bwd: 3848.63 | bwd_inner: 3840.88 | bwd_allreduce: 7.71 | step: 21.70 6%|▌ | 2900/50750 [7:47:42<78:43:40, 5.92s/it] {'loss': 0.1613, 'learning_rate': 3.992282415737016e-05, 'epoch': 2.86} 6%|▌ | 2900/50750 [7:47:42<78:43:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:30:26,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:30:26,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.21 | bwd_microstep: 3851.22 | bwd_inner_microstep: 3843.70 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.08 [2024-11-14 00:30:26,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.20 | bwd: 3851.24 | bwd_inner: 3843.70 | bwd_allreduce: 7.50 | step: 21.09 6%|▌ | 2901/50750 [7:47:48<78:45:38, 5.93s/it] {'loss': 0.0135, 'learning_rate': 3.99227120962184e-05, 'epoch': 2.86} 6%|▌ | 2901/50750 [7:47:48<78:45:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:30:31,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:30:31,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.02 | bwd_microstep: 3842.67 | bwd_inner_microstep: 3835.20 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.01 [2024-11-14 00:30:31,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.02 | bwd: 3842.68 | bwd_inner: 3835.20 | bwd_allreduce: 7.44 | step: 21.01 6%|▌ | 2902/50750 [7:47:53<78:43:39, 5.92s/it] {'loss': 0.002, 'learning_rate': 3.992259995392542e-05, 'epoch': 2.86} 6%|▌ | 2902/50750 [7:47:53<78:43:39, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:30:37,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 00:30:37,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.05 | bwd_microstep: 3858.06 | bwd_inner_microstep: 3850.50 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.71 [2024-11-14 00:30:37,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.05 | bwd: 3858.07 | bwd_inner: 3850.50 | bwd_allreduce: 7.53 | step: 21.72 6%|▌ | 2903/50750 [7:47:59<78:47:09, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.99224877304917e-05, 'epoch': 2.86} 6%|▌ | 2903/50750 [7:47:59<78:47:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:30:43,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 00:30:43,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.53 | bwd_microstep: 3844.31 | bwd_inner_microstep: 3836.64 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.66 [2024-11-14 00:30:43,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.51 | bwd: 3844.32 | bwd_inner: 3836.64 | bwd_allreduce: 7.64 | step: 21.67 6%|▌ | 2904/50750 [7:48:05<78:45:50, 5.93s/it] {'loss': 0.3244, 'learning_rate': 3.992237542591768e-05, 'epoch': 2.86} 6%|▌ | 2904/50750 [7:48:05<78:45:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:30:49,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.94 [2024-11-14 00:30:49,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.13 | bwd_microstep: 3843.65 | bwd_inner_microstep: 3836.17 | bwd_allreduce_microstep: 7.44 | 
step_microstep: 20.91 [2024-11-14 00:30:49,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.11 | bwd: 3843.66 | bwd_inner: 3836.17 | bwd_allreduce: 7.45 | step: 20.92 6%|▌ | 2905/50750 [7:48:11<78:43:08, 5.92s/it] {'loss': 0.4625, 'learning_rate': 3.9922263040203824e-05, 'epoch': 2.86} 6%|▌ | 2905/50750 [7:48:11<78:43:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:30:55,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 00:30:55,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.44 | bwd_microstep: 3849.49 | bwd_inner_microstep: 3841.87 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.62 [2024-11-14 00:30:55,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.44 | bwd: 3849.50 | bwd_inner: 3841.86 | bwd_allreduce: 7.60 | step: 21.62 6%|▌ | 2906/50750 [7:48:17<78:42:30, 5.92s/it] {'loss': 0.0632, 'learning_rate': 3.99221505733506e-05, 'epoch': 2.86} 6%|▌ | 2906/50750 [7:48:17<78:42:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:31:01,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:31:01,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.04 | bwd_microstep: 3842.37 | bwd_inner_microstep: 3834.89 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.88 [2024-11-14 00:31:01,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.03 | bwd: 3842.38 | bwd_inner: 3834.89 | bwd_allreduce: 7.45 | step: 20.88 6%|▌ | 2907/50750 [7:48:23<78:40:11, 5.92s/it] {'loss': 0.0017, 'learning_rate': 3.9922038025358444e-05, 'epoch': 2.86} 6%|▌ | 2907/50750 [7:48:23<78:40:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 
00:31:07,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.92 [2024-11-14 00:31:07,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.26 | bwd_microstep: 3847.30 | bwd_inner_microstep: 3839.44 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.52 [2024-11-14 00:31:07,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.26 | bwd: 3847.32 | bwd_inner: 3839.44 | bwd_allreduce: 7.83 | step: 21.52 6%|▌ | 2908/50750 [7:48:29<78:39:44, 5.92s/it] {'loss': 0.0115, 'learning_rate': 3.9921925396227826e-05, 'epoch': 2.87} 6%|▌ | 2908/50750 [7:48:29<78:39:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:31:13,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-14 00:31:13,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.05 | bwd_microstep: 3845.71 | bwd_inner_microstep: 3837.89 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.68 [2024-11-14 00:31:13,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.05 | bwd: 3845.72 | bwd_inner: 3837.89 | bwd_allreduce: 7.79 | step: 21.69 6%|▌ | 2909/50750 [7:48:35<78:39:14, 5.92s/it] {'loss': 0.0032, 'learning_rate': 3.992181268595921e-05, 'epoch': 2.87} 6%|▌ | 2909/50750 [7:48:35<78:39:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:31:19,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:31:19,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.57 | bwd_microstep: 3845.22 | bwd_inner_microstep: 3837.77 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.90 [2024-11-14 00:31:19,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2025.57 | bwd: 3845.24 | bwd_inner: 3837.77 | bwd_allreduce: 7.43 | step: 20.91 6%|▌ | 2910/50750 [7:48:41<78:39:04, 5.92s/it] {'loss': 0.3411, 'learning_rate': 3.9921699894553046e-05, 'epoch': 2.87} 6%|▌ | 2910/50750 [7:48:41<78:39:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:31:25,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:31:25,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.20 | bwd_microstep: 3849.21 | bwd_inner_microstep: 3841.70 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.93 [2024-11-14 00:31:25,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.20 | bwd: 3849.22 | bwd_inner: 3841.70 | bwd_allreduce: 7.48 | step: 20.94 6%|▌ | 2911/50750 [7:48:47<78:38:48, 5.92s/it] {'loss': 0.1652, 'learning_rate': 3.9921587022009804e-05, 'epoch': 2.87} 6%|▌ | 2911/50750 [7:48:47<78:38:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:31:31,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:31:31,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.16 | bwd_microstep: 3852.06 | bwd_inner_microstep: 3844.59 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.87 [2024-11-14 00:31:31,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.16 | bwd: 3852.07 | bwd_inner: 3844.58 | bwd_allreduce: 7.44 | step: 20.87 6%|▌ | 2912/50750 [7:48:53<78:39:44, 5.92s/it] {'loss': 0.0134, 'learning_rate': 3.992147406832993e-05, 'epoch': 2.87} 6%|▌ | 2912/50750 [7:48:53<78:39:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:31:37,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 00:31:37,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3847.13 | bwd_inner_microstep: 3839.66 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.88 [2024-11-14 00:31:37,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3847.15 | bwd_inner: 3839.66 | bwd_allreduce: 7.44 | step: 20.89 6%|▌ | 2913/50750 [7:48:59<78:38:58, 5.92s/it] {'loss': 0.0397, 'learning_rate': 3.9921361033513896e-05, 'epoch': 2.87} 6%|▌ | 2913/50750 [7:48:59<78:38:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:31:42,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 00:31:42,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.75 | bwd_microstep: 3844.18 | bwd_inner_microstep: 3836.39 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.60 [2024-11-14 00:31:43,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.75 | bwd: 3844.19 | bwd_inner: 3836.39 | bwd_allreduce: 7.75 | step: 21.61 6%|▌ | 2914/50750 [7:49:04<78:39:08, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.992124791756215e-05, 'epoch': 2.87} 6%|▌ | 2914/50750 [7:49:04<78:39:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:31:48,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 00:31:48,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.15 | bwd_microstep: 3848.43 | bwd_inner_microstep: 3840.75 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.79 [2024-11-14 00:31:48,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.14 | bwd: 3848.45 | bwd_inner: 3840.75 | bwd_allreduce: 7.66 | step: 21.79 6%|▌ | 
2915/50750 [7:49:10<78:42:49, 5.92s/it] {'loss': 0.2013, 'learning_rate': 3.9921134720475165e-05, 'epoch': 2.87} 6%|▌ | 2915/50750 [7:49:10<78:42:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:31:54,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 00:31:54,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.12 | bwd_microstep: 3852.60 | bwd_inner_microstep: 3845.11 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.34 [2024-11-14 00:31:54,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.12 | bwd: 3852.61 | bwd_inner: 3845.11 | bwd_allreduce: 7.46 | step: 21.34 6%|▌ | 2916/50750 [7:49:16<78:43:39, 5.93s/it] {'loss': 0.2942, 'learning_rate': 3.99210214422534e-05, 'epoch': 2.87} 6%|▌ | 2916/50750 [7:49:16<78:43:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:32:00,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 00:32:00,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.60 | bwd_microstep: 3849.13 | bwd_inner_microstep: 3841.13 | bwd_allreduce_microstep: 7.94 | step_microstep: 23.85 [2024-11-14 00:32:00,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.60 | bwd: 3849.15 | bwd_inner: 3841.13 | bwd_allreduce: 7.97 | step: 23.85 6%|▌ | 2917/50750 [7:49:22<78:44:09, 5.93s/it] {'loss': 0.6066, 'learning_rate': 3.9920908082897306e-05, 'epoch': 2.87} 6%|▌ | 2917/50750 [7:49:22<78:44:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:32:06,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:32:06,715] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.36 | bwd_microstep: 3848.10 | bwd_inner_microstep: 3840.64 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.03 [2024-11-14 00:32:06,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.34 | bwd: 3848.12 | bwd_inner: 3840.64 | bwd_allreduce: 7.44 | step: 21.04 6%|▌ | 2918/50750 [7:49:28<78:43:44, 5.93s/it] {'loss': 0.0093, 'learning_rate': 3.992079464240736e-05, 'epoch': 2.87} 6%|▌ | 2918/50750 [7:49:28<78:43:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:32:12,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:32:12,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.97 | bwd_microstep: 3856.17 | bwd_inner_microstep: 3848.39 | bwd_allreduce_microstep: 7.71 | step_microstep: 24.31 [2024-11-14 00:32:12,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.96 | bwd: 3856.18 | bwd_inner: 3848.39 | bwd_allreduce: 7.74 | step: 24.31 6%|▌ | 2919/50750 [7:49:34<78:46:17, 5.93s/it] {'loss': 0.0266, 'learning_rate': 3.992068112078402e-05, 'epoch': 2.88} 6%|▌ | 2919/50750 [7:49:34<78:46:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:32:18,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-14 00:32:18,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2036.93 | bwd_microstep: 3860.23 | bwd_inner_microstep: 3852.69 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.95 [2024-11-14 00:32:18,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2036.92 | bwd: 3860.24 | bwd_inner: 3852.69 | bwd_allreduce: 7.51 | step: 21.95 6%|▌ | 2920/50750 [7:49:40<78:50:07, 5.93s/it] {'loss': 0.0625, 'learning_rate': 
3.9920567518027734e-05, 'epoch': 2.88} 6%|▌ | 2920/50750 [7:49:40<78:50:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:32:24,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.92 [2024-11-14 00:32:24,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.18 | bwd_microstep: 3854.96 | bwd_inner_microstep: 3847.24 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.67 [2024-11-14 00:32:24,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.18 | bwd: 3854.97 | bwd_inner: 3847.24 | bwd_allreduce: 7.69 | step: 21.68 6%|▌ | 2921/50750 [7:49:46<78:49:34, 5.93s/it] {'loss': 0.0225, 'learning_rate': 3.9920453834138974e-05, 'epoch': 2.88} 6%|▌ | 2921/50750 [7:49:46<78:49:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:32:30,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.65 | optimizer_step: 4.93 [2024-11-14 00:32:30,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.62 | bwd_microstep: 3853.80 | bwd_inner_microstep: 3845.91 | bwd_allreduce_microstep: 7.83 | step_microstep: 28.55 [2024-11-14 00:32:30,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.61 | bwd: 3853.82 | bwd_inner: 3845.91 | bwd_allreduce: 7.86 | step: 28.57 6%|▌ | 2922/50750 [7:49:52<78:49:45, 5.93s/it] {'loss': 0.0209, 'learning_rate': 3.992034006911821e-05, 'epoch': 2.88} 6%|▌ | 2922/50750 [7:49:52<78:49:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:32:36,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 00:32:36,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.13 | bwd_microstep: 
3853.91 | bwd_inner_microstep: 3846.41 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.93 [2024-11-14 00:32:36,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.12 | bwd: 3853.92 | bwd_inner: 3846.41 | bwd_allreduce: 7.47 | step: 20.94 6%|▌ | 2923/50750 [7:49:58<78:48:35, 5.93s/it] {'loss': 0.014, 'learning_rate': 3.99202262229659e-05, 'epoch': 2.88} 6%|▌ | 2923/50750 [7:49:58<78:48:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:32:42,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 5.05 [2024-11-14 00:32:42,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.47 | bwd_microstep: 3848.41 | bwd_inner_microstep: 3840.52 | bwd_allreduce_microstep: 7.82 | step_microstep: 25.25 [2024-11-14 00:32:42,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.47 | bwd: 3848.44 | bwd_inner: 3840.52 | bwd_allreduce: 7.85 | step: 25.25 6%|▌ | 2924/50750 [7:50:04<78:47:09, 5.93s/it] {'loss': 0.1298, 'learning_rate': 3.992011229568251e-05, 'epoch': 2.88} 6%|▌ | 2924/50750 [7:50:04<78:47:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:32:48,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:32:48,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.65 | bwd_microstep: 3846.91 | bwd_inner_microstep: 3839.44 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-14 00:32:48,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.65 | bwd: 3846.92 | bwd_inner: 3839.44 | bwd_allreduce: 7.44 | step: 20.88 6%|▌ | 2925/50750 [7:50:10<78:45:21, 5.93s/it] {'loss': 0.0206, 'learning_rate': 3.99199982872685e-05, 'epoch': 2.88} 6%|▌ | 2925/50750 [7:50:10<78:45:21, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:32:54,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:32:54,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.80 | bwd_microstep: 3850.98 | bwd_inner_microstep: 3843.49 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-14 00:32:54,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.80 | bwd: 3850.99 | bwd_inner: 3843.49 | bwd_allreduce: 7.46 | step: 20.88 6%|▌ | 2926/50750 [7:50:16<78:44:25, 5.93s/it] {'loss': 0.6339, 'learning_rate': 3.991988419772433e-05, 'epoch': 2.88} 6%|▌ | 2926/50750 [7:50:16<78:44:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:33:00,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:33:00,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.59 | bwd_microstep: 3856.17 | bwd_inner_microstep: 3848.72 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.82 [2024-11-14 00:33:00,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.59 | bwd: 3856.18 | bwd_inner: 3848.72 | bwd_allreduce: 7.43 | step: 20.82 6%|▌ | 2927/50750 [7:50:22<78:44:54, 5.93s/it] {'loss': 0.0411, 'learning_rate': 3.991977002705047e-05, 'epoch': 2.88} 6%|▌ | 2927/50750 [7:50:22<78:44:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:33:06,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:33:06,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.06 | bwd_microstep: 3856.39 | bwd_inner_microstep: 3848.92 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.93 [2024-11-14 
00:33:06,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3856.40 | bwd_inner: 3848.92 | bwd_allreduce: 7.45 | step: 20.93 6%|▌ | 2928/50750 [7:50:27<78:44:51, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.991965577524739e-05, 'epoch': 2.88} 6%|▌ | 2928/50750 [7:50:27<78:44:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:33:11,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:33:11,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.59 | bwd_microstep: 3858.82 | bwd_inner_microstep: 3851.30 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.96 [2024-11-14 00:33:11,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.59 | bwd: 3858.83 | bwd_inner: 3851.30 | bwd_allreduce: 7.48 | step: 20.97 6%|▌ | 2929/50750 [7:50:33<78:45:34, 5.93s/it] {'loss': 0.0078, 'learning_rate': 3.9919541442315545e-05, 'epoch': 2.89} 6%|▌ | 2929/50750 [7:50:33<78:45:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:33:17,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:33:17,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.70 | bwd_microstep: 3845.90 | bwd_inner_microstep: 3838.43 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-14 00:33:17,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3845.91 | bwd_inner: 3838.43 | bwd_allreduce: 7.44 | step: 20.91 6%|▌ | 2930/50750 [7:50:39<78:43:25, 5.93s/it] {'loss': 0.0011, 'learning_rate': 3.991942702825541e-05, 'epoch': 2.89} 6%|▌ | 2930/50750 [7:50:39<78:43:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:33:23,806] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:33:23,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.18 | bwd_microstep: 3854.14 | bwd_inner_microstep: 3846.68 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-14 00:33:23,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.18 | bwd: 3854.15 | bwd_inner: 3846.68 | bwd_allreduce: 7.43 | step: 20.89 6%|▌ | 2931/50750 [7:50:45<78:44:01, 5.93s/it] {'loss': 0.4783, 'learning_rate': 3.991931253306745e-05, 'epoch': 2.89} 6%|▌ | 2931/50750 [7:50:45<78:44:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:33:29,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.97 [2024-11-14 00:33:29,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.81 | bwd_microstep: 3847.52 | bwd_inner_microstep: 3840.00 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.06 [2024-11-14 00:33:29,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.81 | bwd: 3847.53 | bwd_inner: 3840.00 | bwd_allreduce: 7.49 | step: 21.06 6%|▌ | 2932/50750 [7:50:51<78:42:27, 5.93s/it] {'loss': 0.0011, 'learning_rate': 3.991919795675213e-05, 'epoch': 2.89} 6%|▌ | 2932/50750 [7:50:51<78:42:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:33:35,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 00:33:35,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.71 | bwd_microstep: 3852.17 | bwd_inner_microstep: 3844.67 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.23 [2024-11-14 00:33:35,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.71 | bwd: 3852.18 
| bwd_inner: 3844.67 | bwd_allreduce: 7.47 | step: 21.24 6%|▌ | 2933/50750 [7:50:57<78:43:53, 5.93s/it] {'loss': 0.0034, 'learning_rate': 3.991908329930991e-05, 'epoch': 2.89} 6%|▌ | 2933/50750 [7:50:57<78:43:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:33:41,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:33:41,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.56 | bwd_microstep: 3847.36 | bwd_inner_microstep: 3839.90 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.87 [2024-11-14 00:33:41,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.56 | bwd: 3847.37 | bwd_inner: 3839.90 | bwd_allreduce: 7.43 | step: 20.88 6%|▌ | 2934/50750 [7:51:03<78:43:05, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.991896856074126e-05, 'epoch': 2.89} 6%|▌ | 2934/50750 [7:51:03<78:43:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 00:33:47,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:33:47,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.66 | bwd_microstep: 3860.99 | bwd_inner_microstep: 3853.50 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.94 [2024-11-14 00:33:47,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.66 | bwd: 3861.00 | bwd_inner: 3853.50 | bwd_allreduce: 7.46 | step: 20.94 6%|▌ | 2935/50750 [7:51:09<78:44:24, 5.93s/it] {'loss': 0.5978, 'learning_rate': 3.9918853741046655e-05, 'epoch': 2.89} 6%|▌ | 2935/50750 [7:51:09<78:44:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:33:53,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | 
optimizer_step: 4.93 [2024-11-14 00:33:53,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.46 | bwd_microstep: 3847.86 | bwd_inner_microstep: 3840.38 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-14 00:33:53,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.46 | bwd: 3847.87 | bwd_inner: 3840.38 | bwd_allreduce: 7.45 | step: 20.86 6%|▌ | 2936/50750 [7:51:15<78:42:13, 5.93s/it] {'loss': 0.1247, 'learning_rate': 3.9918738840226555e-05, 'epoch': 2.89} 6%|▌ | 2936/50750 [7:51:15<78:42:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:33:59,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:33:59,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.83 | bwd_microstep: 3845.64 | bwd_inner_microstep: 3838.17 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.87 [2024-11-14 00:33:59,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.83 | bwd: 3845.65 | bwd_inner: 3838.17 | bwd_allreduce: 7.44 | step: 20.88 6%|▌ | 2937/50750 [7:51:21<78:40:45, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.991862385828143e-05, 'epoch': 2.89} 6%|▌ | 2937/50750 [7:51:21<78:40:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:34:05,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:34:05,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.62 | bwd_microstep: 3853.44 | bwd_inner_microstep: 3845.95 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.92 [2024-11-14 00:34:05,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.62 | bwd: 3853.45 | bwd_inner: 3845.95 | bwd_allreduce: 7.46 | step: 20.93 6%|▌ | 2938/50750 [7:51:27<78:41:16, 
5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9918508795211745e-05, 'epoch': 2.89} 6%|▌ | 2938/50750 [7:51:27<78:41:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:34:11,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:34:11,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.46 | bwd_microstep: 3848.36 | bwd_inner_microstep: 3840.87 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.87 [2024-11-14 00:34:11,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.46 | bwd: 3848.37 | bwd_inner: 3840.87 | bwd_allreduce: 7.46 | step: 20.87 6%|▌ | 2939/50750 [7:51:33<78:40:47, 5.92s/it] {'loss': 0.0469, 'learning_rate': 3.991839365101798e-05, 'epoch': 2.9} 6%|▌ | 2939/50750 [7:51:33<78:40:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:34:17,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:34:17,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.98 | bwd_microstep: 3850.35 | bwd_inner_microstep: 3842.79 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.24 [2024-11-14 00:34:17,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.97 | bwd: 3850.36 | bwd_inner: 3842.79 | bwd_allreduce: 7.53 | step: 21.25 6%|▌ | 2940/50750 [7:51:39<78:40:04, 5.92s/it] {'loss': 0.3641, 'learning_rate': 3.991827842570059e-05, 'epoch': 2.9} 6%|▌ | 2940/50750 [7:51:39<78:40:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:34:23,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:34:23,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2025.88 | bwd_microstep: 3847.23 | bwd_inner_microstep: 3838.96 | bwd_allreduce_microstep: 8.21 | step_microstep: 22.92 [2024-11-14 00:34:23,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.87 | bwd: 3847.25 | bwd_inner: 3838.96 | bwd_allreduce: 8.24 | step: 22.92 6%|▌ | 2941/50750 [7:51:45<78:41:35, 5.93s/it] {'loss': 0.3329, 'learning_rate': 3.9918163119260056e-05, 'epoch': 2.9} 6%|▌ | 2941/50750 [7:51:45<78:41:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:34:28,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:34:28,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.67 | bwd_microstep: 3850.31 | bwd_inner_microstep: 3842.83 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.24 [2024-11-14 00:34:28,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.65 | bwd: 3850.32 | bwd_inner: 3842.83 | bwd_allreduce: 7.46 | step: 21.24 6%|▌ | 2942/50750 [7:51:50<78:41:21, 5.93s/it] {'loss': 0.0012, 'learning_rate': 3.9918047731696836e-05, 'epoch': 2.9} 6%|▌ | 2942/50750 [7:51:50<78:41:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:34:34,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:34:34,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.30 | bwd_microstep: 3853.21 | bwd_inner_microstep: 3845.72 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.97 [2024-11-14 00:34:34,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.30 | bwd: 3853.22 | bwd_inner: 3845.72 | bwd_allreduce: 7.47 | step: 20.97 6%|▌ | 2943/50750 [7:51:56<78:41:33, 5.93s/it] {'loss': 0.0056, 'learning_rate': 3.99179322630114e-05, 'epoch': 2.9} 6%|▌ | 2943/50750 
[7:51:56<78:41:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:34:40,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:34:40,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.46 | bwd_microstep: 3852.32 | bwd_inner_microstep: 3844.83 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-14 00:34:40,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.46 | bwd: 3852.33 | bwd_inner: 3844.83 | bwd_allreduce: 7.46 | step: 20.98 6%|▌ | 2944/50750 [7:52:02<78:40:32, 5.92s/it] {'loss': 0.2381, 'learning_rate': 3.991781671320423e-05, 'epoch': 2.9} 6%|▌ | 2944/50750 [7:52:02<78:40:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:34:46,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:34:46,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.85 | bwd_microstep: 3849.22 | bwd_inner_microstep: 3841.74 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.88 [2024-11-14 00:34:46,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.85 | bwd: 3849.24 | bwd_inner: 3841.74 | bwd_allreduce: 7.45 | step: 20.88 6%|▌ | 2945/50750 [7:52:08<78:40:38, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.99177010822758e-05, 'epoch': 2.9} 6%|▌ | 2945/50750 [7:52:08<78:40:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:34:52,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:34:52,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.75 | bwd_microstep: 3850.08 | bwd_inner_microstep: 3842.60 | 
bwd_allreduce_microstep: 7.43 | step_microstep: 21.13 [2024-11-14 00:34:52,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.75 | bwd: 3850.09 | bwd_inner: 3842.60 | bwd_allreduce: 7.45 | step: 21.13 6%|▌ | 2946/50750 [7:52:14<78:41:56, 5.93s/it] {'loss': 0.0743, 'learning_rate': 3.991758537022656e-05, 'epoch': 2.9} 6%|▌ | 2946/50750 [7:52:14<78:41:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:34:58,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:34:58,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.54 | bwd_microstep: 3851.53 | bwd_inner_microstep: 3843.84 | bwd_allreduce_microstep: 7.64 | step_microstep: 22.39 [2024-11-14 00:34:58,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.54 | bwd: 3851.55 | bwd_inner: 3843.84 | bwd_allreduce: 7.66 | step: 22.39 6%|▌ | 2947/50750 [7:52:20<78:41:47, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.9917469577057e-05, 'epoch': 2.9} 6%|▌ | 2947/50750 [7:52:20<78:41:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:35:04,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-14 00:35:04,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.52 | bwd_microstep: 3848.37 | bwd_inner_microstep: 3840.88 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.90 [2024-11-14 00:35:04,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.52 | bwd: 3848.38 | bwd_inner: 3840.88 | bwd_allreduce: 7.47 | step: 20.91 6%|▌ | 2948/50750 [7:52:26<78:40:16, 5.92s/it] {'loss': 0.017, 'learning_rate': 3.991735370276758e-05, 'epoch': 2.9} 6%|▌ | 2948/50750 [7:52:26<78:40:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token 
length: 2197 [2024-11-14 00:35:10,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:35:10,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.16 | bwd_microstep: 3848.07 | bwd_inner_microstep: 3840.56 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.97 [2024-11-14 00:35:10,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.16 | bwd: 3848.08 | bwd_inner: 3840.56 | bwd_allreduce: 7.48 | step: 20.98 6%|▌ | 2949/50750 [7:52:32<78:39:38, 5.92s/it] {'loss': 0.0035, 'learning_rate': 3.991723774735878e-05, 'epoch': 2.91} 6%|▌ | 2949/50750 [7:52:32<78:39:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:35:16,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 00:35:16,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.86 | bwd_microstep: 3850.71 | bwd_inner_microstep: 3843.24 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.83 [2024-11-14 00:35:16,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.86 | bwd: 3850.72 | bwd_inner: 3843.24 | bwd_allreduce: 7.45 | step: 20.84 6%|▌ | 2950/50750 [7:52:38<78:39:05, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.991712171083106e-05, 'epoch': 2.91} 6%|▌ | 2950/50750 [7:52:38<78:39:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:35:22,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:35:22,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.48 | bwd_microstep: 3849.98 | bwd_inner_microstep: 3842.48 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.09 [2024-11-14 00:35:22,307] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd: 2030.48 | bwd: 3849.99 | bwd_inner: 3842.48 | bwd_allreduce: 7.47 | step: 21.09 6%|▌ | 2951/50750 [7:52:44<78:39:45, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.991700559318491e-05, 'epoch': 2.91} 6%|▌ | 2951/50750 [7:52:44<78:39:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:35:28,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:35:28,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3846.87 | bwd_inner_microstep: 3839.38 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.07 [2024-11-14 00:35:28,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3846.88 | bwd_inner: 3839.38 | bwd_allreduce: 7.47 | step: 21.07 6%|▌ | 2952/50750 [7:52:50<78:38:01, 5.92s/it] {'loss': 0.1894, 'learning_rate': 3.991688939442079e-05, 'epoch': 2.91} 6%|▌ | 2952/50750 [7:52:50<78:38:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:35:34,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:35:34,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.39 | bwd_microstep: 3853.39 | bwd_inner_microstep: 3845.89 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.42 [2024-11-14 00:35:34,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.37 | bwd: 3853.41 | bwd_inner: 3845.89 | bwd_allreduce: 7.48 | step: 21.42 6%|▌ | 2953/50750 [7:52:56<78:40:17, 5.93s/it] {'loss': 0.3739, 'learning_rate': 3.991677311453918e-05, 'epoch': 2.91} 6%|▌ | 2953/50750 [7:52:56<78:40:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:35:40,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:35:40,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.23 | bwd_microstep: 3858.15 | bwd_inner_microstep: 3850.64 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.96 [2024-11-14 00:35:40,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.22 | bwd: 3858.16 | bwd_inner: 3850.64 | bwd_allreduce: 7.48 | step: 20.96 6%|▌ | 2954/50750 [7:53:02<78:42:11, 5.93s/it] {'loss': 0.0008, 'learning_rate': 3.991665675354055e-05, 'epoch': 2.91} 6%|▌ | 2954/50750 [7:53:02<78:42:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:35:46,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:35:46,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3857.17 | bwd_inner_microstep: 3849.67 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.92 [2024-11-14 00:35:46,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.17 | bwd: 3857.18 | bwd_inner: 3849.67 | bwd_allreduce: 7.48 | step: 20.92 6%|▌ | 2955/50750 [7:53:07<78:42:31, 5.93s/it] {'loss': 0.1092, 'learning_rate': 3.991654031142537e-05, 'epoch': 2.91} 6%|▌ | 2955/50750 [7:53:07<78:42:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:35:51,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 00:35:51,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.26 | bwd_microstep: 3853.86 | bwd_inner_microstep: 3846.25 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.89 [2024-11-14 00:35:51,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.26 | bwd: 3853.88 | bwd_inner: 3846.25 | bwd_allreduce: 7.58 | 
step: 21.89 6%|▌ | 2956/50750 [7:53:13<78:43:49, 5.93s/it] {'loss': 0.8843, 'learning_rate': 3.9916423788194135e-05, 'epoch': 2.91} 6%|▌ | 2956/50750 [7:53:13<78:43:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:35:57,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:35:57,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.34 | bwd_microstep: 3854.03 | bwd_inner_microstep: 3846.28 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.15 [2024-11-14 00:35:57,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.33 | bwd: 3854.04 | bwd_inner: 3846.28 | bwd_allreduce: 7.72 | step: 21.16 6%|▌ | 2957/50750 [7:53:19<78:45:31, 5.93s/it] {'loss': 0.0043, 'learning_rate': 3.991630718384729e-05, 'epoch': 2.91} 6%|▌ | 2957/50750 [7:53:19<78:45:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:36:03,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-14 00:36:03,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.96 | bwd_microstep: 3848.26 | bwd_inner_microstep: 3840.41 | bwd_allreduce_microstep: 7.80 | step_microstep: 21.29 [2024-11-14 00:36:03,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.96 | bwd: 3848.27 | bwd_inner: 3840.41 | bwd_allreduce: 7.82 | step: 21.29 6%|▌ | 2958/50750 [7:53:25<78:42:53, 5.93s/it] {'loss': 0.4644, 'learning_rate': 3.9916190498385325e-05, 'epoch': 2.91} 6%|▌ | 2958/50750 [7:53:25<78:42:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:36:09,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:36:09,748] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.62 | bwd_microstep: 3856.83 | bwd_inner_microstep: 3849.27 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.43 [2024-11-14 00:36:09,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.61 | bwd: 3856.84 | bwd_inner: 3849.27 | bwd_allreduce: 7.52 | step: 21.43 6%|▌ | 2959/50750 [7:53:31<78:43:58, 5.93s/it] {'loss': 0.3023, 'learning_rate': 3.991607373180871e-05, 'epoch': 2.92} 6%|▌ | 2959/50750 [7:53:31<78:43:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:36:15,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:36:15,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.44 | bwd_microstep: 3854.74 | bwd_inner_microstep: 3847.28 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.55 [2024-11-14 00:36:15,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.44 | bwd: 3854.75 | bwd_inner: 3847.28 | bwd_allreduce: 7.43 | step: 21.55 6%|▌ | 2960/50750 [7:53:37<78:43:02, 5.93s/it] {'loss': 0.0161, 'learning_rate': 3.991595688411794e-05, 'epoch': 2.92} 6%|▌ | 2960/50750 [7:53:37<78:43:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:36:21,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.42 | optimizer_step: 4.92 [2024-11-14 00:36:21,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.97 | bwd_microstep: 3853.23 | bwd_inner_microstep: 3844.82 | bwd_allreduce_microstep: 8.34 | step_microstep: 28.74 [2024-11-14 00:36:21,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.97 | bwd: 3853.25 | bwd_inner: 3844.82 | bwd_allreduce: 8.37 | step: 28.74 6%|▌ | 2961/50750 [7:53:43<78:44:40, 5.93s/it] {'loss': 0.4325, 'learning_rate': 
3.991583995531347e-05, 'epoch': 2.92} 6%|▌ | 2961/50750 [7:53:43<78:44:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:36:27,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:36:27,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2035.12 | bwd_microstep: 3851.28 | bwd_inner_microstep: 3843.74 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.04 [2024-11-14 00:36:27,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2035.11 | bwd: 3851.30 | bwd_inner: 3843.74 | bwd_allreduce: 7.52 | step: 21.04 6%|▌ | 2962/50750 [7:53:49<78:45:53, 5.93s/it] {'loss': 0.0068, 'learning_rate': 3.9915722945395775e-05, 'epoch': 2.92} 6%|▌ | 2962/50750 [7:53:49<78:45:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:36:33,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-14 00:36:33,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.11 | bwd_microstep: 3857.23 | bwd_inner_microstep: 3849.41 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.15 [2024-11-14 00:36:33,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.11 | bwd: 3857.25 | bwd_inner: 3849.41 | bwd_allreduce: 7.79 | step: 22.15 6%|▌ | 2963/50750 [7:53:55<78:47:43, 5.94s/it] {'loss': 0.1144, 'learning_rate': 3.991560585436534e-05, 'epoch': 2.92} 6%|▌ | 2963/50750 [7:53:55<78:47:43, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:36:39,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 00:36:39,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.94 | bwd_microstep: 
3845.94 | bwd_inner_microstep: 3838.41 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.16 [2024-11-14 00:36:39,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.90 | bwd: 3845.95 | bwd_inner: 3838.41 | bwd_allreduce: 7.50 | step: 21.17 6%|▌ | 2964/50750 [7:54:01<78:46:17, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.991548868222264e-05, 'epoch': 2.92} 6%|▌ | 2964/50750 [7:54:01<78:46:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:36:45,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:36:45,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.91 | bwd_microstep: 3845.85 | bwd_inner_microstep: 3838.10 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.41 [2024-11-14 00:36:45,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.91 | bwd: 3845.87 | bwd_inner: 3838.10 | bwd_allreduce: 7.72 | step: 21.42 6%|▌ | 2965/50750 [7:54:07<78:41:52, 5.93s/it] {'loss': 0.0029, 'learning_rate': 3.9915371428968155e-05, 'epoch': 2.92} 6%|▌ | 2965/50750 [7:54:07<78:41:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:36:51,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 00:36:51,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3848.81 | bwd_inner_microstep: 3841.31 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.28 [2024-11-14 00:36:51,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.85 | bwd: 3848.83 | bwd_inner: 3841.31 | bwd_allreduce: 7.48 | step: 21.28 6%|▌ | 2966/50750 [7:54:13<78:40:04, 5.93s/it] {'loss': 0.0059, 'learning_rate': 3.991525409460235e-05, 'epoch': 2.92} 6%|▌ | 2966/50750 [7:54:13<78:40:04, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:36:57,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:36:57,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.61 | bwd_microstep: 3851.04 | bwd_inner_microstep: 3843.53 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.08 [2024-11-14 00:36:57,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.61 | bwd: 3851.05 | bwd_inner: 3843.53 | bwd_allreduce: 7.48 | step: 21.09 6%|▌ | 2967/50750 [7:54:19<78:39:55, 5.93s/it] {'loss': 0.0334, 'learning_rate': 3.991513667912573e-05, 'epoch': 2.92} 6%|▌ | 2967/50750 [7:54:19<78:39:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:37:03,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:37:03,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.63 | bwd_microstep: 3856.04 | bwd_inner_microstep: 3848.48 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.49 [2024-11-14 00:37:03,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.62 | bwd: 3856.06 | bwd_inner: 3848.47 | bwd_allreduce: 7.54 | step: 21.49 6%|▌ | 2968/50750 [7:54:25<78:41:43, 5.93s/it] {'loss': 0.0025, 'learning_rate': 3.991501918253874e-05, 'epoch': 2.92} 6%|▌ | 2968/50750 [7:54:25<78:41:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:37:09,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.94 [2024-11-14 00:37:09,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.15 | bwd_microstep: 3854.84 | bwd_inner_microstep: 3847.23 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.53 [2024-11-14 
00:37:09,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3854.86 | bwd_inner: 3847.23 | bwd_allreduce: 7.58 | step: 21.53 6%|▌ | 2969/50750 [7:54:31<78:41:26, 5.93s/it] {'loss': 0.7284, 'learning_rate': 3.991490160484189e-05, 'epoch': 2.93} 6%|▌ | 2969/50750 [7:54:31<78:41:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:37:14,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:37:14,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.97 | bwd_microstep: 3855.72 | bwd_inner_microstep: 3848.20 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.02 [2024-11-14 00:37:14,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.96 | bwd: 3855.73 | bwd_inner: 3848.20 | bwd_allreduce: 7.49 | step: 21.02 6%|▌ | 2970/50750 [7:54:36<78:42:34, 5.93s/it] {'loss': 0.0123, 'learning_rate': 3.991478394603563e-05, 'epoch': 2.93} 6%|▌ | 2970/50750 [7:54:36<78:42:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:37:20,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:37:20,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.60 | bwd_microstep: 3851.49 | bwd_inner_microstep: 3843.95 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.15 [2024-11-14 00:37:20,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.60 | bwd: 3851.50 | bwd_inner: 3843.95 | bwd_allreduce: 7.51 | step: 21.15 6%|▌ | 2971/50750 [7:54:42<78:41:32, 5.93s/it] {'loss': 0.0447, 'learning_rate': 3.991466620612045e-05, 'epoch': 2.93} 6%|▌ | 2971/50750 [7:54:42<78:41:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:37:26,839] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 00:37:26,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.36 | bwd_microstep: 3851.95 | bwd_inner_microstep: 3844.44 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.48 [2024-11-14 00:37:26,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.36 | bwd: 3851.96 | bwd_inner: 3844.44 | bwd_allreduce: 7.48 | step: 21.49 6%|▌ | 2972/50750 [7:54:48<78:41:48, 5.93s/it] {'loss': 0.0143, 'learning_rate': 3.9914548385096845e-05, 'epoch': 2.93} 6%|▌ | 2972/50750 [7:54:48<78:41:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:37:32,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.47 | optimizer_step: 4.92 [2024-11-14 00:37:32,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.63 | bwd_microstep: 3862.16 | bwd_inner_microstep: 3854.57 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.49 [2024-11-14 00:37:32,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.63 | bwd: 3862.18 | bwd_inner: 3854.57 | bwd_allreduce: 7.57 | step: 22.50 6%|▌ | 2973/50750 [7:54:54<78:43:42, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.991443048296527e-05, 'epoch': 2.93} 6%|▌ | 2973/50750 [7:54:54<78:43:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:37:38,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:37:38,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.10 | bwd_microstep: 3853.62 | bwd_inner_microstep: 3846.10 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.05 [2024-11-14 00:37:38,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.09 | bwd: 
3853.63 | bwd_inner: 3846.10 | bwd_allreduce: 7.49 | step: 21.06 6%|▌ | 2974/50750 [7:55:00<78:42:25, 5.93s/it] {'loss': 0.005, 'learning_rate': 3.991431249972623e-05, 'epoch': 2.93} 6%|▌ | 2974/50750 [7:55:00<78:42:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:37:44,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:37:44,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.82 | bwd_microstep: 3849.16 | bwd_inner_microstep: 3841.68 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-14 00:37:44,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.83 | bwd: 3849.18 | bwd_inner: 3841.68 | bwd_allreduce: 7.45 | step: 20.86 6%|▌ | 2975/50750 [7:55:06<78:40:21, 5.93s/it] {'loss': 0.8571, 'learning_rate': 3.991419443538019e-05, 'epoch': 2.93} 6%|▌ | 2975/50750 [7:55:06<78:40:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:37:50,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.11 [2024-11-14 00:37:50,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.19 | bwd_microstep: 3847.07 | bwd_inner_microstep: 3839.61 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.04 [2024-11-14 00:37:50,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.18 | bwd: 3847.08 | bwd_inner: 3839.61 | bwd_allreduce: 7.43 | step: 21.05 6%|▌ | 2976/50750 [7:55:12<78:38:04, 5.93s/it] {'loss': 0.1902, 'learning_rate': 3.991407628992763e-05, 'epoch': 2.93} 6%|▌ | 2976/50750 [7:55:12<78:38:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:37:56,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | 
optimizer_step: 4.93 [2024-11-14 00:37:56,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.30 | bwd_microstep: 3850.56 | bwd_inner_microstep: 3843.09 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.88 [2024-11-14 00:37:56,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.30 | bwd: 3850.57 | bwd_inner: 3843.09 | bwd_allreduce: 7.44 | step: 20.88 6%|▌ | 2977/50750 [7:55:18<78:37:24, 5.92s/it] {'loss': 0.3244, 'learning_rate': 3.991395806336903e-05, 'epoch': 2.93} 6%|▌ | 2977/50750 [7:55:18<78:37:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:38:02,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 00:38:02,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.08 | bwd_microstep: 3856.16 | bwd_inner_microstep: 3848.52 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.47 [2024-11-14 00:38:02,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.08 | bwd: 3856.17 | bwd_inner: 3848.52 | bwd_allreduce: 7.61 | step: 21.47 6%|▌ | 2978/50750 [7:55:24<78:39:15, 5.93s/it] {'loss': 0.1132, 'learning_rate': 3.991383975570488e-05, 'epoch': 2.93} 6%|▌ | 2978/50750 [7:55:24<78:39:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:38:08,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:38:08,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.72 | bwd_microstep: 3850.92 | bwd_inner_microstep: 3843.41 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.02 [2024-11-14 00:38:08,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.70 | bwd: 3850.93 | bwd_inner: 3843.41 | bwd_allreduce: 7.48 | step: 21.02 6%|▌ | 2979/50750 [7:55:30<78:38:30, 
5.93s/it] {'loss': 0.0069, 'learning_rate': 3.991372136693566e-05, 'epoch': 2.93} 6%|▌ | 2979/50750 [7:55:30<78:38:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:38:14,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:38:14,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.19 | bwd_microstep: 3851.52 | bwd_inner_microstep: 3844.05 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.95 [2024-11-14 00:38:14,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.18 | bwd: 3851.54 | bwd_inner: 3844.05 | bwd_allreduce: 7.45 | step: 20.95 6%|▌ | 2980/50750 [7:55:36<78:37:53, 5.93s/it] {'loss': 0.0752, 'learning_rate': 3.9913602897061846e-05, 'epoch': 2.94} 6%|▌ | 2980/50750 [7:55:36<78:37:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:38:20,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:38:20,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.36 | bwd_microstep: 3848.56 | bwd_inner_microstep: 3841.10 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.75 [2024-11-14 00:38:20,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.36 | bwd: 3848.57 | bwd_inner: 3841.09 | bwd_allreduce: 7.44 | step: 20.75 6%|▌ | 2981/50750 [7:55:42<78:36:00, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.991348434608393e-05, 'epoch': 2.94} 6%|▌ | 2981/50750 [7:55:42<78:36:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:38:26,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:38:26,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2024.52 | bwd_microstep: 3849.14 | bwd_inner_microstep: 3841.69 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.07 [2024-11-14 00:38:26,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.52 | bwd: 3849.15 | bwd_inner: 3841.69 | bwd_allreduce: 7.42 | step: 21.08 6%|▌ | 2982/50750 [7:55:48<78:35:11, 5.92s/it] {'loss': 0.0573, 'learning_rate': 3.991336571400239e-05, 'epoch': 2.94} 6%|▌ | 2982/50750 [7:55:48<78:35:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:38:32,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 00:38:32,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.41 | bwd_microstep: 3844.07 | bwd_inner_microstep: 3836.60 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-14 00:38:32,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.41 | bwd: 3844.08 | bwd_inner: 3836.60 | bwd_allreduce: 7.44 | step: 20.88 6%|▌ | 2983/50750 [7:55:53<78:33:24, 5.92s/it] {'loss': 0.0103, 'learning_rate': 3.99132470008177e-05, 'epoch': 2.94} 6%|▌ | 2983/50750 [7:55:53<78:33:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:38:37,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:38:37,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.71 | bwd_microstep: 3856.84 | bwd_inner_microstep: 3849.27 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.95 [2024-11-14 00:38:37,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.71 | bwd: 3856.85 | bwd_inner: 3849.27 | bwd_allreduce: 7.54 | step: 21.96 6%|▌ | 2984/50750 [7:55:59<78:36:57, 5.93s/it] {'loss': 0.0159, 'learning_rate': 3.991312820653036e-05, 'epoch': 2.94} 6%|▌ | 2984/50750 
[7:55:59<78:36:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:38:43,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:38:43,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.81 | bwd_microstep: 3848.40 | bwd_inner_microstep: 3840.92 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.82 [2024-11-14 00:38:43,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.81 | bwd: 3848.41 | bwd_inner: 3840.92 | bwd_allreduce: 7.45 | step: 20.82 6%|▌ | 2985/50750 [7:56:05<78:37:41, 5.93s/it] {'loss': 0.0029, 'learning_rate': 3.991300933114085e-05, 'epoch': 2.94} 6%|▌ | 2985/50750 [7:56:05<78:37:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:38:49,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:38:49,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3846.22 | bwd_inner_microstep: 3838.74 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.82 [2024-11-14 00:38:49,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3846.23 | bwd_inner: 3838.74 | bwd_allreduce: 7.45 | step: 20.83 6%|▌ | 2986/50750 [7:56:11<78:35:11, 5.92s/it] {'loss': 0.0076, 'learning_rate': 3.991289037464964e-05, 'epoch': 2.94} 6%|▌ | 2986/50750 [7:56:11<78:35:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:38:55,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 00:38:55,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.20 | bwd_microstep: 3855.53 | bwd_inner_microstep: 3848.04 | 
bwd_allreduce_microstep: 7.45 | step_microstep: 20.89 [2024-11-14 00:38:55,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.20 | bwd: 3855.54 | bwd_inner: 3848.04 | bwd_allreduce: 7.47 | step: 20.89 6%|▌ | 2987/50750 [7:56:17<78:35:50, 5.92s/it] {'loss': 0.5721, 'learning_rate': 3.991277133705723e-05, 'epoch': 2.94} 6%|▌ | 2987/50750 [7:56:17<78:35:50, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:39:01,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:39:01,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.19 | bwd_microstep: 3846.37 | bwd_inner_microstep: 3838.91 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.04 [2024-11-14 00:39:01,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.19 | bwd: 3846.38 | bwd_inner: 3838.91 | bwd_allreduce: 7.43 | step: 21.04 6%|▌ | 2988/50750 [7:56:23<78:34:25, 5.92s/it] {'loss': 0.0236, 'learning_rate': 3.9912652218364095e-05, 'epoch': 2.94} 6%|▌ | 2988/50750 [7:56:23<78:34:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:39:07,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.59 | optimizer_step: 4.93 [2024-11-14 00:39:07,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.99 | bwd_microstep: 3854.90 | bwd_inner_microstep: 3846.86 | bwd_allreduce_microstep: 7.97 | step_microstep: 29.47 [2024-11-14 00:39:07,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.99 | bwd: 3854.92 | bwd_inner: 3846.86 | bwd_allreduce: 8.00 | step: 29.47 6%|▌ | 2989/50750 [7:56:29<78:37:36, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.991253301857073e-05, 'epoch': 2.94} 6%|▌ | 2989/50750 [7:56:29<78:37:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-14 00:39:13,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.94 [2024-11-14 00:39:13,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.28 | bwd_microstep: 3852.14 | bwd_inner_microstep: 3844.63 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.05 [2024-11-14 00:39:13,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.26 | bwd: 3852.15 | bwd_inner: 3844.63 | bwd_allreduce: 7.48 | step: 21.05 6%|▌ | 2990/50750 [7:56:35<78:37:52, 5.93s/it] {'loss': 0.0225, 'learning_rate': 3.9912413737677616e-05, 'epoch': 2.95} 6%|▌ | 2990/50750 [7:56:35<78:37:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:39:19,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 00:39:19,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.80 | bwd_microstep: 3857.13 | bwd_inner_microstep: 3849.59 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.41 [2024-11-14 00:39:19,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.80 | bwd: 3857.14 | bwd_inner: 3849.59 | bwd_allreduce: 7.50 | step: 21.41 6%|▌ | 2991/50750 [7:56:41<78:38:45, 5.93s/it] {'loss': 0.0907, 'learning_rate': 3.991229437568523e-05, 'epoch': 2.95} 6%|▌ | 2991/50750 [7:56:41<78:38:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:39:25,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-14 00:39:25,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3848.90 | bwd_inner_microstep: 3841.17 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.77 [2024-11-14 00:39:25,352] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.16 | bwd: 3848.91 | bwd_inner: 3841.17 | bwd_allreduce: 7.70 | step: 21.77 6%|▌ | 2992/50750 [7:56:47<78:38:14, 5.93s/it] {'loss': 0.562, 'learning_rate': 3.991217493259408e-05, 'epoch': 2.95} 6%|▌ | 2992/50750 [7:56:47<78:38:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:39:31,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.58 | optimizer_step: 4.93 [2024-11-14 00:39:31,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.19 | bwd_microstep: 3850.38 | bwd_inner_microstep: 3842.53 | bwd_allreduce_microstep: 7.79 | step_microstep: 29.91 [2024-11-14 00:39:31,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.18 | bwd: 3850.40 | bwd_inner: 3842.53 | bwd_allreduce: 7.82 | step: 29.91 6%|▌ | 2993/50750 [7:56:53<78:40:10, 5.93s/it] {'loss': 0.0329, 'learning_rate': 3.9912055408404624e-05, 'epoch': 2.95} 6%|▌ | 2993/50750 [7:56:53<78:40:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:39:37,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 00:39:37,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.60 | bwd_microstep: 3855.94 | bwd_inner_microstep: 3848.27 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.59 [2024-11-14 00:39:37,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.59 | bwd: 3855.95 | bwd_inner: 3848.27 | bwd_allreduce: 7.64 | step: 21.60 6%|▌ | 2994/50750 [7:56:59<78:41:26, 5.93s/it] {'loss': 0.0025, 'learning_rate': 3.991193580311737e-05, 'epoch': 2.95} 6%|▌ | 2994/50750 [7:56:59<78:41:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:39:43,160] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:39:43,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.38 | bwd_microstep: 3855.22 | bwd_inner_microstep: 3847.69 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.09 [2024-11-14 00:39:43,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.36 | bwd: 3855.23 | bwd_inner: 3847.69 | bwd_allreduce: 7.50 | step: 21.09 6%|▌ | 2995/50750 [7:57:05<78:41:56, 5.93s/it] {'loss': 0.2208, 'learning_rate': 3.99118161167328e-05, 'epoch': 2.95} 6%|▌ | 2995/50750 [7:57:05<78:41:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:39:49,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:39:49,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.90 | bwd_microstep: 3849.11 | bwd_inner_microstep: 3841.57 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.09 [2024-11-14 00:39:49,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.90 | bwd: 3849.12 | bwd_inner: 3841.57 | bwd_allreduce: 7.52 | step: 21.09 6%|▌ | 2996/50750 [7:57:11<78:39:11, 5.93s/it] {'loss': 0.0124, 'learning_rate': 3.99116963492514e-05, 'epoch': 2.95} 6%|▌ | 2996/50750 [7:57:11<78:39:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:39:55,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.92 [2024-11-14 00:39:55,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.24 | bwd_microstep: 3853.68 | bwd_inner_microstep: 3846.14 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.13 [2024-11-14 00:39:55,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.23 | bwd: 3853.69 | bwd_inner: 3846.14 | 
bwd_allreduce: 7.51 | step: 21.13 6%|▌ | 2997/50750 [7:57:16<78:39:42, 5.93s/it] {'loss': 0.0094, 'learning_rate': 3.991157650067365e-05, 'epoch': 2.95} 6%|▌ | 2997/50750 [7:57:16<78:39:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:40:00,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 00:40:00,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.69 | bwd_microstep: 3851.83 | bwd_inner_microstep: 3843.36 | bwd_allreduce_microstep: 8.43 | step_microstep: 21.80 [2024-11-14 00:40:00,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3851.85 | bwd_inner: 3843.36 | bwd_allreduce: 8.44 | step: 21.80 6%|▌ | 2998/50750 [7:57:22<78:40:00, 5.93s/it] {'loss': 0.048, 'learning_rate': 3.991145657100005e-05, 'epoch': 2.95} 6%|▌ | 2998/50750 [7:57:22<78:40:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:40:06,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:40:06,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.02 | bwd_microstep: 3848.71 | bwd_inner_microstep: 3840.69 | bwd_allreduce_microstep: 7.98 | step_microstep: 21.12 [2024-11-14 00:40:06,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.00 | bwd: 3848.72 | bwd_inner: 3840.69 | bwd_allreduce: 7.99 | step: 21.13 6%|▌ | 2999/50750 [7:57:28<78:37:51, 5.93s/it] {'loss': 0.0491, 'learning_rate': 3.9911336560231085e-05, 'epoch': 2.95} 6%|▌ | 2999/50750 [7:57:28<78:37:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:40:12,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 
[2024-11-14 00:40:12,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.28 | bwd_microstep: 3847.00 | bwd_inner_microstep: 3839.52 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.26 [2024-11-14 00:40:12,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.28 | bwd: 3847.01 | bwd_inner: 3839.52 | bwd_allreduce: 7.45 | step: 21.26 6%|▌ | 3000/50750 [7:57:34<78:36:54, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.991121646836725e-05, 'epoch': 2.96} 6%|▌ | 3000/50750 [7:57:34<78:36:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:40:18,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:40:18,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.73 | bwd_microstep: 3848.84 | bwd_inner_microstep: 3841.36 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.30 [2024-11-14 00:40:18,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.73 | bwd: 3848.85 | bwd_inner: 3841.36 | bwd_allreduce: 7.45 | step: 21.31 6%|▌ | 3001/50750 [7:57:40<78:35:17, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.9911096295409016e-05, 'epoch': 2.96} 6%|▌ | 3001/50750 [7:57:40<78:35:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:40:24,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:40:24,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.28 | bwd_microstep: 3846.01 | bwd_inner_microstep: 3838.49 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.15 [2024-11-14 00:40:24,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.28 | bwd: 3846.03 | bwd_inner: 3838.49 | bwd_allreduce: 7.49 | step: 21.15 6%|▌ | 3002/50750 [7:57:46<78:34:36, 5.92s/it] {'loss': 0.0435, 
'learning_rate': 3.991097604135689e-05, 'epoch': 2.96} 6%|▌ | 3002/50750 [7:57:46<78:34:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-14 00:40:30,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:40:30,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.09 | bwd_microstep: 3852.59 | bwd_inner_microstep: 3845.05 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.09 [2024-11-14 00:40:30,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.09 | bwd: 3852.60 | bwd_inner: 3845.05 | bwd_allreduce: 7.51 | step: 21.09 6%|▌ | 3003/50750 [7:57:52<78:35:03, 5.93s/it] {'loss': 0.2795, 'learning_rate': 3.9910855706211356e-05, 'epoch': 2.96} 6%|▌ | 3003/50750 [7:57:52<78:35:03, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:40:36,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-14 00:40:36,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.45 | bwd_microstep: 3848.60 | bwd_inner_microstep: 3840.60 | bwd_allreduce_microstep: 7.95 | step_microstep: 21.51 [2024-11-14 00:40:36,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.45 | bwd: 3848.61 | bwd_inner: 3840.60 | bwd_allreduce: 7.97 | step: 21.52 6%|▌ | 3004/50750 [7:57:58<78:35:20, 5.93s/it] {'loss': 0.0046, 'learning_rate': 3.99107352899729e-05, 'epoch': 2.96} 6%|▌ | 3004/50750 [7:57:58<78:35:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:40:42,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:40:42,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.50 | 
bwd_microstep: 3850.67 | bwd_inner_microstep: 3843.16 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.11 [2024-11-14 00:40:42,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.49 | bwd: 3850.68 | bwd_inner: 3843.16 | bwd_allreduce: 7.48 | step: 21.12 6%|▌ | 3005/50750 [7:58:04<78:35:43, 5.93s/it] {'loss': 0.0425, 'learning_rate': 3.9910614792642016e-05, 'epoch': 2.96} 6%|▌ | 3005/50750 [7:58:04<78:35:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:40:48,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:40:48,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.13 | bwd_microstep: 3846.16 | bwd_inner_microstep: 3838.62 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-14 00:40:48,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.13 | bwd: 3846.17 | bwd_inner: 3838.63 | bwd_allreduce: 7.51 | step: 21.22 6%|▌ | 3006/50750 [7:58:10<78:34:46, 5.93s/it] {'loss': 0.0031, 'learning_rate': 3.991049421421919e-05, 'epoch': 2.96} 6%|▌ | 3006/50750 [7:58:10<78:34:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:40:54,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 00:40:54,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.26 | bwd_microstep: 3852.22 | bwd_inner_microstep: 3844.48 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.67 [2024-11-14 00:40:54,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.26 | bwd: 3852.23 | bwd_inner: 3844.48 | bwd_allreduce: 7.71 | step: 21.68 6%|▌ | 3007/50750 [7:58:16<78:35:18, 5.93s/it] {'loss': 0.0066, 'learning_rate': 3.9910373554704926e-05, 'epoch': 2.96} 6%|▌ | 3007/50750 [7:58:16<78:35:18, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:41:00,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:41:00,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.71 | bwd_microstep: 3857.43 | bwd_inner_microstep: 3849.93 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.98 [2024-11-14 00:41:00,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.69 | bwd: 3857.44 | bwd_inner: 3849.93 | bwd_allreduce: 7.47 | step: 20.99 6%|▌ | 3008/50750 [7:58:22<78:37:58, 5.93s/it] {'loss': 0.0016, 'learning_rate': 3.991025281409971e-05, 'epoch': 2.96} 6%|▌ | 3008/50750 [7:58:22<78:37:58, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:41:06,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.57 | optimizer_step: 4.93 [2024-11-14 00:41:06,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.03 | bwd_microstep: 3861.27 | bwd_inner_microstep: 3853.32 | bwd_allreduce_microstep: 7.88 | step_microstep: 29.55 [2024-11-14 00:41:06,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.03 | bwd: 3861.29 | bwd_inner: 3853.32 | bwd_allreduce: 7.91 | step: 29.54 6%|▌ | 3009/50750 [7:58:28<78:41:42, 5.93s/it] {'loss': 0.003, 'learning_rate': 3.991013199240402e-05, 'epoch': 2.96} 6%|▌ | 3009/50750 [7:58:28<78:41:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:41:12,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 00:41:12,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.91 | bwd_microstep: 3846.28 | bwd_inner_microstep: 3838.80 | bwd_allreduce_microstep: 7.43 | 
step_microstep: 21.33 [2024-11-14 00:41:12,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.90 | bwd: 3846.29 | bwd_inner: 3838.80 | bwd_allreduce: 7.45 | step: 21.33 6%|▌ | 3010/50750 [7:58:34<78:38:10, 5.93s/it] {'loss': 0.0073, 'learning_rate': 3.991001108961837e-05, 'epoch': 2.97} 6%|▌ | 3010/50750 [7:58:34<78:38:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:41:17,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 00:41:17,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.69 | bwd_microstep: 3849.71 | bwd_inner_microstep: 3842.22 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.96 [2024-11-14 00:41:17,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.69 | bwd: 3849.72 | bwd_inner: 3842.22 | bwd_allreduce: 7.46 | step: 20.97 6%|▌ | 3011/50750 [7:58:39<78:37:01, 5.93s/it] {'loss': 0.4741, 'learning_rate': 3.9909890105743236e-05, 'epoch': 2.97} 6%|▌ | 3011/50750 [7:58:39<78:37:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:41:23,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:41:23,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.79 | bwd_microstep: 3844.62 | bwd_inner_microstep: 3837.15 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.31 [2024-11-14 00:41:23,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.79 | bwd: 3844.63 | bwd_inner: 3837.15 | bwd_allreduce: 7.45 | step: 21.31 6%|▌ | 3012/50750 [7:58:45<78:34:46, 5.93s/it] {'loss': 0.4463, 'learning_rate': 3.990976904077911e-05, 'epoch': 2.97} 6%|▌ | 3012/50750 [7:58:45<78:34:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 
00:41:29,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 00:41:29,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.10 | bwd_microstep: 3861.18 | bwd_inner_microstep: 3853.42 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.47 [2024-11-14 00:41:29,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.10 | bwd: 3861.19 | bwd_inner: 3853.42 | bwd_allreduce: 7.73 | step: 21.48 6%|▌ | 3013/50750 [7:58:51<78:37:37, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.99096478947265e-05, 'epoch': 2.97} 6%|▌ | 3013/50750 [7:58:51<78:37:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:41:35,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 00:41:35,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.28 | bwd_microstep: 3855.22 | bwd_inner_microstep: 3847.41 | bwd_allreduce_microstep: 7.76 | step_microstep: 22.33 [2024-11-14 00:41:35,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.29 | bwd: 3855.24 | bwd_inner: 3847.41 | bwd_allreduce: 7.78 | step: 22.33 6%|▌ | 3014/50750 [7:58:57<78:38:26, 5.93s/it] {'loss': 0.0079, 'learning_rate': 3.990952666758589e-05, 'epoch': 2.97} 6%|▌ | 3014/50750 [7:58:57<78:38:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:41:41,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:41:41,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.22 | bwd_microstep: 3843.94 | bwd_inner_microstep: 3836.47 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.85 [2024-11-14 00:41:41,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2024.20 | bwd: 3843.95 | bwd_inner: 3836.47 | bwd_allreduce: 7.44 | step: 20.85 6%|▌ | 3015/50750 [7:59:03<78:35:00, 5.93s/it] {'loss': 0.0038, 'learning_rate': 3.990940535935777e-05, 'epoch': 2.97} 6%|▌ | 3015/50750 [7:59:03<78:35:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:41:47,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:41:47,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.89 | bwd_microstep: 3849.21 | bwd_inner_microstep: 3841.75 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.82 [2024-11-14 00:41:47,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.89 | bwd: 3849.22 | bwd_inner: 3841.75 | bwd_allreduce: 7.43 | step: 20.82 6%|▌ | 3016/50750 [7:59:09<78:33:42, 5.92s/it] {'loss': 0.0085, 'learning_rate': 3.9909283970042644e-05, 'epoch': 2.97} 6%|▌ | 3016/50750 [7:59:09<78:33:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:41:53,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 5.00 [2024-11-14 00:41:53,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.96 | bwd_microstep: 3845.80 | bwd_inner_microstep: 3838.32 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.12 [2024-11-14 00:41:53,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.96 | bwd: 3845.81 | bwd_inner: 3838.32 | bwd_allreduce: 7.45 | step: 21.12 6%|▌ | 3017/50750 [7:59:15<78:31:20, 5.92s/it] {'loss': 0.0044, 'learning_rate': 3.9909162499641e-05, 'epoch': 2.97} 6%|▌ | 3017/50750 [7:59:15<78:31:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:41:59,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.32 | optimizer_step: 4.94 [2024-11-14 00:41:59,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.18 | bwd_microstep: 3864.35 | bwd_inner_microstep: 3856.45 | bwd_allreduce_microstep: 7.83 | step_microstep: 26.98 [2024-11-14 00:41:59,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.16 | bwd: 3864.37 | bwd_inner: 3856.45 | bwd_allreduce: 7.86 | step: 26.99 6%|▌ | 3018/50750 [7:59:21<78:41:11, 5.93s/it] {'loss': 0.0108, 'learning_rate': 3.990904094815333e-05, 'epoch': 2.97} 6%|▌ | 3018/50750 [7:59:21<78:41:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:42:05,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.90 | optimizer_step: 4.93 [2024-11-14 00:42:05,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.40 | bwd_microstep: 3848.12 | bwd_inner_microstep: 3840.45 | bwd_allreduce_microstep: 7.62 | step_microstep: 24.00 [2024-11-14 00:42:05,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.38 | bwd: 3848.13 | bwd_inner: 3840.46 | bwd_allreduce: 7.63 | step: 24.02 6%|▌ | 3019/50750 [7:59:27<78:40:26, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.9908919315580134e-05, 'epoch': 2.97} 6%|▌ | 3019/50750 [7:59:27<78:40:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:42:11,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.59 | optimizer_step: 4.93 [2024-11-14 00:42:11,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.63 | bwd_microstep: 3849.42 | bwd_inner_microstep: 3841.42 | bwd_allreduce_microstep: 7.94 | step_microstep: 26.34 [2024-11-14 00:42:11,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.62 | bwd: 3849.44 | bwd_inner: 3841.42 | bwd_allreduce: 7.96 | step: 26.35 6%|▌ | 
3020/50750 [7:59:33<78:41:41, 5.94s/it] {'loss': 0.0041, 'learning_rate': 3.990879760192191e-05, 'epoch': 2.98} 6%|▌ | 3020/50750 [7:59:33<78:41:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 00:42:17,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 5.03 [2024-11-14 00:42:17,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.20 | bwd_microstep: 3855.37 | bwd_inner_microstep: 3847.66 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.53 [2024-11-14 00:42:17,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.20 | bwd: 3855.38 | bwd_inner: 3847.66 | bwd_allreduce: 7.67 | step: 21.53 6%|▌ | 3021/50750 [7:59:39<78:39:56, 5.93s/it] {'loss': 0.1061, 'learning_rate': 3.9908675807179144e-05, 'epoch': 2.98} 6%|▌ | 3021/50750 [7:59:39<78:39:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:42:23,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:42:23,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.94 | bwd_microstep: 3853.70 | bwd_inner_microstep: 3845.90 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.79 [2024-11-14 00:42:23,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.94 | bwd: 3853.71 | bwd_inner: 3845.90 | bwd_allreduce: 7.76 | step: 21.80 6%|▌ | 3022/50750 [7:59:45<78:37:35, 5.93s/it] {'loss': 0.0053, 'learning_rate': 3.990855393135234e-05, 'epoch': 2.98} 6%|▌ | 3022/50750 [7:59:45<78:37:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:42:29,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:42:29,153] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.33 | bwd_microstep: 3852.44 | bwd_inner_microstep: 3844.92 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.09 [2024-11-14 00:42:29,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.33 | bwd: 3852.45 | bwd_inner: 3844.92 | bwd_allreduce: 7.49 | step: 21.10 6%|▌ | 3023/50750 [7:59:51<78:36:27, 5.93s/it] {'loss': 0.3933, 'learning_rate': 3.990843197444199e-05, 'epoch': 2.98} 6%|▌ | 3023/50750 [7:59:51<78:36:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:42:35,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:42:35,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.86 | bwd_microstep: 3848.71 | bwd_inner_microstep: 3841.19 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.06 [2024-11-14 00:42:35,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.86 | bwd: 3848.72 | bwd_inner: 3841.19 | bwd_allreduce: 7.49 | step: 21.06 6%|▌ | 3024/50750 [7:59:57<78:33:56, 5.93s/it] {'loss': 0.0036, 'learning_rate': 3.99083099364486e-05, 'epoch': 2.98} 6%|▌ | 3024/50750 [7:59:57<78:33:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:42:40,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:42:40,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.14 | bwd_microstep: 3848.86 | bwd_inner_microstep: 3841.35 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98 [2024-11-14 00:42:40,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.14 | bwd: 3848.87 | bwd_inner: 3841.35 | bwd_allreduce: 7.48 | step: 20.99 6%|▌ | 3025/50750 [8:00:02<78:32:56, 5.93s/it] {'loss': 1.1593, 'learning_rate': 
3.9908187817372655e-05, 'epoch': 2.98} 6%|▌ | 3025/50750 [8:00:02<78:32:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:42:46,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.93 [2024-11-14 00:42:46,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.47 | bwd_microstep: 3853.57 | bwd_inner_microstep: 3845.66 | bwd_allreduce_microstep: 7.85 | step_microstep: 26.83 [2024-11-14 00:42:46,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.47 | bwd: 3853.59 | bwd_inner: 3845.66 | bwd_allreduce: 7.87 | step: 26.83 6%|▌ | 3026/50750 [8:00:08<78:36:10, 5.93s/it] {'loss': 0.0013, 'learning_rate': 3.9908065617214656e-05, 'epoch': 2.98} 6%|▌ | 3026/50750 [8:00:08<78:36:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:42:52,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:42:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.73 | bwd_microstep: 3846.38 | bwd_inner_microstep: 3838.88 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.85 [2024-11-14 00:42:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.71 | bwd: 3846.39 | bwd_inner: 3838.88 | bwd_allreduce: 7.47 | step: 20.85 6%|▌ | 3027/50750 [8:00:14<78:35:39, 5.93s/it] {'loss': 0.1888, 'learning_rate': 3.99079433359751e-05, 'epoch': 2.98} 6%|▌ | 3027/50750 [8:00:14<78:35:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:42:58,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 00:42:58,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.20 | bwd_microstep: 
3850.73 | bwd_inner_microstep: 3843.27 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.86 [2024-11-14 00:42:58,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.21 | bwd: 3850.74 | bwd_inner: 3843.27 | bwd_allreduce: 7.43 | step: 20.86 6%|▌ | 3028/50750 [8:00:20<78:34:37, 5.93s/it] {'loss': 0.026, 'learning_rate': 3.9907820973654485e-05, 'epoch': 2.98} 6%|▌ | 3028/50750 [8:00:20<78:34:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 00:43:04,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.92 [2024-11-14 00:43:04,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.56 | bwd_microstep: 3851.77 | bwd_inner_microstep: 3844.12 | bwd_allreduce_microstep: 7.61 | step_microstep: 22.05 [2024-11-14 00:43:04,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.56 | bwd: 3851.79 | bwd_inner: 3844.12 | bwd_allreduce: 7.62 | step: 22.06 6%|▌ | 3029/50750 [8:00:26<78:34:37, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.990769853025332e-05, 'epoch': 2.98} 6%|▌ | 3029/50750 [8:00:26<78:34:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:43:10,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 00:43:10,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.19 | bwd_microstep: 3849.55 | bwd_inner_microstep: 3841.86 | bwd_allreduce_microstep: 7.65 | step_microstep: 20.94 [2024-11-14 00:43:10,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.18 | bwd: 3849.56 | bwd_inner: 3841.86 | bwd_allreduce: 7.66 | step: 20.94 6%|▌ | 3030/50750 [8:00:32<78:33:50, 5.93s/it] {'loss': 0.012, 'learning_rate': 3.9907576005772094e-05, 'epoch': 2.99} 6%|▌ | 3030/50750 [8:00:32<78:33:50, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:43:16,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 00:43:16,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.08 | bwd_microstep: 3848.72 | bwd_inner_microstep: 3841.06 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.62 [2024-11-14 00:43:16,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.08 | bwd: 3848.73 | bwd_inner: 3841.06 | bwd_allreduce: 7.63 | step: 21.62 6%|▌ | 3031/50750 [8:00:38<78:33:48, 5.93s/it] {'loss': 0.2903, 'learning_rate': 3.99074534002113e-05, 'epoch': 2.99} 6%|▌ | 3031/50750 [8:00:38<78:33:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:43:22,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 00:43:22,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.52 | bwd_microstep: 3853.57 | bwd_inner_microstep: 3845.76 | bwd_allreduce_microstep: 7.76 | step_microstep: 21.83 [2024-11-14 00:43:22,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.51 | bwd: 3853.58 | bwd_inner: 3845.76 | bwd_allreduce: 7.78 | step: 21.84 6%|▌ | 3032/50750 [8:00:44<78:35:28, 5.93s/it] {'loss': 0.0044, 'learning_rate': 3.990733071357145e-05, 'epoch': 2.99} 6%|▌ | 3032/50750 [8:00:44<78:35:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:43:28,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 00:43:28,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.70 | bwd_microstep: 3849.64 | bwd_inner_microstep: 3842.12 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.33 [2024-11-14 
00:43:28,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.68 | bwd: 3849.65 | bwd_inner: 3842.12 | bwd_allreduce: 7.50 | step: 21.33 6%|▌ | 3033/50750 [8:00:50<78:33:32, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.9907207945853044e-05, 'epoch': 2.99} 6%|▌ | 3033/50750 [8:00:50<78:33:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:43:34,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 00:43:34,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.28 | bwd_microstep: 3854.60 | bwd_inner_microstep: 3846.98 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.89 [2024-11-14 00:43:34,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.27 | bwd: 3854.61 | bwd_inner: 3846.98 | bwd_allreduce: 7.58 | step: 21.89 6%|▌ | 3034/50750 [8:00:56<78:35:20, 5.93s/it] {'loss': 0.0014, 'learning_rate': 3.990708509705656e-05, 'epoch': 2.99} 6%|▌ | 3034/50750 [8:00:56<78:35:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:43:40,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 00:43:40,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.90 | bwd_microstep: 3846.11 | bwd_inner_microstep: 3838.62 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.95 [2024-11-14 00:43:40,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.89 | bwd: 3846.12 | bwd_inner: 3838.62 | bwd_allreduce: 7.46 | step: 20.96 6%|▌ | 3035/50750 [8:01:02<78:32:13, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.990696216718253e-05, 'epoch': 2.99} 6%|▌ | 3035/50750 [8:01:02<78:32:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 00:43:46,200] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 00:43:46,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3855.12 | bwd_inner_microstep: 3847.62 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-14 00:43:46,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.58 | bwd: 3855.14 | bwd_inner: 3847.62 | bwd_allreduce: 7.48 | step: 21.10 6%|▌ | 3036/50750 [8:01:08<78:32:13, 5.93s/it] {'loss': 0.0091, 'learning_rate': 3.990683915623143e-05, 'epoch': 2.99} 6%|▌ | 3036/50750 [8:01:08<78:32:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:43:52,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:43:52,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.33 | bwd_microstep: 3851.21 | bwd_inner_microstep: 3843.71 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.03 [2024-11-14 00:43:52,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.33 | bwd: 3851.22 | bwd_inner: 3843.71 | bwd_allreduce: 7.47 | step: 21.03 6%|▌ | 3037/50750 [8:01:14<78:31:42, 5.93s/it] {'loss': 0.0408, 'learning_rate': 3.9906716064203775e-05, 'epoch': 2.99} 6%|▌ | 3037/50750 [8:01:14<78:31:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:43:58,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 00:43:58,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.15 | bwd_microstep: 3846.39 | bwd_inner_microstep: 3838.71 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.12 [2024-11-14 00:43:58,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.15 | bwd: 
3846.41 | bwd_inner: 3838.71 | bwd_allreduce: 7.65 | step: 21.12 6%|▌ | 3038/50750 [8:01:20<78:29:52, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.990659289110005e-05, 'epoch': 2.99} 6%|▌ | 3038/50750 [8:01:20<78:29:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:44:03,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 00:44:03,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.96 | bwd_microstep: 3847.06 | bwd_inner_microstep: 3839.53 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.26 [2024-11-14 00:44:03,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.96 | bwd: 3847.07 | bwd_inner: 3839.53 | bwd_allreduce: 7.51 | step: 21.26 6%|▌ | 3039/50750 [8:01:25<78:28:44, 5.92s/it] {'loss': 0.0338, 'learning_rate': 3.9906469636920783e-05, 'epoch': 2.99} 6%|▌ | 3039/50750 [8:01:25<78:28:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 00:44:09,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 00:44:09,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.54 | bwd_microstep: 3847.01 | bwd_inner_microstep: 3839.46 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.11 [2024-11-14 00:44:09,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.54 | bwd: 3847.02 | bwd_inner: 3839.46 | bwd_allreduce: 7.53 | step: 21.11 6%|▌ | 3040/50750 [8:01:31<78:27:40, 5.92s/it] {'loss': 0.1232, 'learning_rate': 3.9906346301666454e-05, 'epoch': 3.0} 6%|▌ | 3040/50750 [8:01:31<78:27:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 00:44:15,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | 
optimizer_step: 4.93 [2024-11-14 00:44:15,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.87 | bwd_microstep: 3853.33 | bwd_inner_microstep: 3845.50 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.77 [2024-11-14 00:44:15,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.87 | bwd: 3853.35 | bwd_inner: 3845.50 | bwd_allreduce: 7.79 | step: 21.77 6%|▌ | 3041/50750 [8:01:37<78:29:21, 5.92s/it] {'loss': 0.5395, 'learning_rate': 3.9906222885337574e-05, 'epoch': 3.0} 6%|▌ | 3041/50750 [8:01:37<78:29:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 00:44:21,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 00:44:21,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.70 | bwd_microstep: 3853.45 | bwd_inner_microstep: 3845.95 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.00 [2024-11-14 00:44:21,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.70 | bwd: 3853.46 | bwd_inner: 3845.95 | bwd_allreduce: 7.47 | step: 21.00 6%|▌ | 3042/50750 [8:01:43<78:29:34, 5.92s/it] {'loss': 0.1464, 'learning_rate': 3.990609938793465e-05, 'epoch': 3.0} 6%|▌ | 3042/50750 [8:01:43<78:29:34, 5.92s/it]evaluate! 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B Results saved to qa_abcd_lora.csv Accuracy: 0.905511811023622 dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:19:23,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:19:23,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2008.21 | bwd_microstep: 3846.00 | bwd_inner_microstep: 3838.08 | bwd_allreduce_microstep: 7.88 | step_microstep: 21.43 [2024-11-14 01:19:23,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2008.19 | bwd: 3846.02 | bwd_inner: 3838.08 | bwd_allreduce: 7.90 | step: 21.43 6%|▌ | 3043/50750 [8:36:45<8411:27:24, 
634.73s/it] {'loss': 0.0008, 'learning_rate': 3.990597580945817e-05, 'epoch': 3.0} 6%|▌ | 3043/50750 [8:36:45<8411:27:24, 634.73s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:19:29,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:19:29,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2013.02 | bwd_microstep: 3834.14 | bwd_inner_microstep: 3826.64 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.10 [2024-11-14 01:19:29,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2013.02 | bwd: 3834.15 | bwd_inner: 3826.64 | bwd_allreduce: 7.47 | step: 21.10 6%|▌ | 3044/50750 [8:36:51<5911:19:52, 446.08s/it] {'loss': 0.0009, 'learning_rate': 3.9905852149908645e-05, 'epoch': 3.0} 6%|▌ | 3044/50750 [8:36:51<5911:19:52, 446.08s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
dynamic ViT batch size: 16, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:19:33,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:19:33,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1008.82 | bwd_microstep: 1919.64 | bwd_inner_microstep: 1909.70 | bwd_allreduce_microstep: 9.90 | step_microstep: 20.95 [2024-11-14 01:19:33,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1008.80 | bwd: 1919.65 | bwd_inner: 1909.70 | bwd_allreduce: 9.91 | step: 20.96 6%|▌ | 3045/50750 [8:36:55<4155:16:52, 313.57s/it] {'loss': 1.0721, 'learning_rate': 3.990572840928659e-05, 'epoch': 3.0} 6%|▌ | 3045/50750 [8:36:55<4155:16:52, 313.57s/it][2024-11-14 01:19:36,687] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-14 01:19:41,883] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-14 01:19:47,136] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-11-14 01:19:52,511] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:20:11,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:20:11,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.13 | bwd_microstep: 3813.14 | bwd_inner_microstep: 3805.59 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.77 [2024-11-14 01:20:11,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.11 | bwd: 3813.15 | bwd_inner: 3805.59 | bwd_allreduce: 7.52 | step: 21.78 6%|▌ | 3046/50750 [8:37:33<3057:04:07, 230.70s/it] {'loss': 0.1041, 'learning_rate': 3.990560458759249e-05, 'epoch': 3.0} 
6%|▌ | 3046/50750 [8:37:33<3057:04:07, 230.70s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:20:17,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:20:17,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2006.44 | bwd_microstep: 3824.39 | bwd_inner_microstep: 3816.59 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.60 [2024-11-14 01:20:17,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2006.45 | bwd: 3824.40 | bwd_inner: 3816.59 | bwd_allreduce: 7.77 | step: 21.60 6%|▌ | 3047/50750 [8:37:39<2163:16:27, 163.26s/it] {'loss': 0.2736, 'learning_rate': 3.990548068482686e-05, 'epoch': 3.0} 6%|▌ | 3047/50750 [8:37:39<2163:16:27, 163.26s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:20:23,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.92 [2024-11-14 01:20:23,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.51 | bwd_microstep: 3834.99 | bwd_inner_microstep: 3826.61 | bwd_allreduce_microstep: 8.33 | step_microstep: 22.71 [2024-11-14 01:20:23,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.50 | bwd: 3835.01 | bwd_inner: 3826.61 | bwd_allreduce: 8.35 | step: 22.72 6%|▌ | 3048/50750 [8:37:45<1537:46:16, 116.05s/it] {'loss': 0.0404, 'learning_rate': 3.990535670099021e-05, 'epoch': 3.0} 6%|▌ | 3048/50750 [8:37:45<1537:46:16, 116.05s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:20:29,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.92 [2024-11-14 01:20:29,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.18 | bwd_microstep: 3834.21 | 
bwd_inner_microstep: 3826.65 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.34 [2024-11-14 01:20:29,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.17 | bwd: 3834.22 | bwd_inner: 3826.65 | bwd_allreduce: 7.53 | step: 22.34 6%|▌ | 3049/50750 [8:37:50<1099:55:51, 83.01s/it] {'loss': 0.1055, 'learning_rate': 3.9905232636083023e-05, 'epoch': 3.0} 6%|▌ | 3049/50750 [8:37:50<1099:55:51, 83.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:20:34,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.01 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:20:34,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.08 | bwd_microstep: 3843.70 | bwd_inner_microstep: 3835.42 | bwd_allreduce_microstep: 8.24 | step_microstep: 22.97 [2024-11-14 01:20:34,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.06 | bwd: 3843.72 | bwd_inner: 3835.42 | bwd_allreduce: 8.26 | step: 22.97 6%|▌ | 3050/50750 [8:37:56<793:28:28, 59.88s/it] {'loss': 0.0041, 'learning_rate': 3.9905108490105834e-05, 'epoch': 3.0} 6%|▌ | 3050/50750 [8:37:56<793:28:28, 59.88s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:20:40,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:20:40,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.66 | bwd_microstep: 3836.91 | bwd_inner_microstep: 3829.32 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.58 [2024-11-14 01:20:40,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.65 | bwd: 3836.92 | bwd_inner: 3829.32 | bwd_allreduce: 7.55 | step: 21.58 6%|▌ | 3051/50750 [8:38:02<578:55:31, 43.69s/it] {'loss': 0.005, 'learning_rate': 3.990498426305912e-05, 'epoch': 3.01} 6%|▌ | 3051/50750 [8:38:02<578:55:31, 43.69s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:20:46,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 01:20:46,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.29 | bwd_microstep: 3852.74 | bwd_inner_microstep: 3844.87 | bwd_allreduce_microstep: 7.82 | step_microstep: 22.65 [2024-11-14 01:20:46,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.29 | bwd: 3852.75 | bwd_inner: 3844.87 | bwd_allreduce: 7.84 | step: 22.65 6%|▌ | 3052/50750 [8:38:08<428:48:54, 32.36s/it] {'loss': 0.0055, 'learning_rate': 3.990485995494341e-05, 'epoch': 3.01} 6%|▌ | 3052/50750 [8:38:08<428:48:54, 32.36s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:20:52,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:20:52,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.53 | bwd_microstep: 3818.36 | bwd_inner_microstep: 3810.65 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.73 [2024-11-14 01:20:52,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.52 | bwd: 3818.38 | bwd_inner: 3810.65 | bwd_allreduce: 7.69 | step: 21.73 6%|▌ | 3053/50750 [8:38:14<323:36:55, 24.43s/it] {'loss': 0.1758, 'learning_rate': 3.9904735565759204e-05, 'epoch': 3.01} 6%|▌ | 3053/50750 [8:38:14<323:36:55, 24.43s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:20:58,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 5.06 [2024-11-14 01:20:58,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2011.91 | bwd_microstep: 3822.54 | bwd_inner_microstep: 3814.69 | bwd_allreduce_microstep: 7.80 | step_microstep: 
23.23 [2024-11-14 01:20:58,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2011.91 | bwd: 3822.56 | bwd_inner: 3814.69 | bwd_allreduce: 7.82 | step: 23.24 6%|▌ | 3054/50750 [8:38:20<249:58:17, 18.87s/it] {'loss': 0.0081, 'learning_rate': 3.9904611095507004e-05, 'epoch': 3.01} 6%|▌ | 3054/50750 [8:38:20<249:58:17, 18.87s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:21:04,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 5.07 [2024-11-14 01:21:04,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.04 | bwd_microstep: 3820.50 | bwd_inner_microstep: 3812.59 | bwd_allreduce_microstep: 7.86 | step_microstep: 22.36 [2024-11-14 01:21:04,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.02 | bwd: 3820.52 | bwd_inner: 3812.59 | bwd_allreduce: 7.88 | step: 22.37 6%|▌ | 3055/50750 [8:38:26<198:24:10, 14.98s/it] {'loss': 0.0208, 'learning_rate': 3.990448654418731e-05, 'epoch': 3.01} 6%|▌ | 3055/50750 [8:38:26<198:24:10, 14.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:21:10,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 5.05 [2024-11-14 01:21:10,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2016.41 | bwd_microstep: 3829.19 | bwd_inner_microstep: 3821.47 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.45 [2024-11-14 01:21:10,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2016.39 | bwd: 3829.21 | bwd_inner: 3821.47 | bwd_allreduce: 7.70 | step: 21.46 6%|▌ | 3056/50750 [8:38:32<162:18:48, 12.25s/it] {'loss': 0.0309, 'learning_rate': 3.990436191180064e-05, 'epoch': 3.01} 6%|▌ | 3056/50750 [8:38:32<162:18:48, 12.25s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 
01:21:16,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:21:16,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.91 | bwd_microstep: 3826.50 | bwd_inner_microstep: 3818.97 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.29 [2024-11-14 01:21:16,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.90 | bwd: 3826.51 | bwd_inner: 3818.97 | bwd_allreduce: 7.50 | step: 21.30 6%|▌ | 3057/50750 [8:38:38<137:03:10, 10.35s/it] {'loss': 0.1747, 'learning_rate': 3.990423719834751e-05, 'epoch': 3.01} 6%|▌ | 3057/50750 [8:38:38<137:03:10, 10.35s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:21:22,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:21:22,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.33 | bwd_microstep: 3827.71 | bwd_inner_microstep: 3820.22 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.63 [2024-11-14 01:21:22,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.33 | bwd: 3827.72 | bwd_inner: 3820.22 | bwd_allreduce: 7.46 | step: 21.64 6%|▌ | 3058/50750 [8:38:44<119:21:54, 9.01s/it] {'loss': 0.0069, 'learning_rate': 3.990411240382841e-05, 'epoch': 3.01} 6%|▌ | 3058/50750 [8:38:44<119:21:54, 9.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:21:28,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:21:28,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2014.03 | bwd_microstep: 3825.04 | bwd_inner_microstep: 3817.51 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.19 [2024-11-14 01:21:28,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd: 2014.03 | bwd: 3825.05 | bwd_inner: 3817.51 | bwd_allreduce: 7.50 | step: 21.20 6%|▌ | 3059/50750 [8:38:50<106:57:11, 8.07s/it] {'loss': 0.0078, 'learning_rate': 3.990398752824385e-05, 'epoch': 3.01} 6%|▌ | 3059/50750 [8:38:50<106:57:11, 8.07s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:21:33,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:21:33,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.72 | bwd_microstep: 3830.02 | bwd_inner_microstep: 3822.50 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.21 [2024-11-14 01:21:33,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.72 | bwd: 3830.03 | bwd_inner: 3822.50 | bwd_allreduce: 7.49 | step: 21.21 6%|▌ | 3060/50750 [8:38:55<98:18:11, 7.42s/it] {'loss': 0.017, 'learning_rate': 3.990386257159435e-05, 'epoch': 3.01} 6%|▌ | 3060/50750 [8:38:55<98:18:11, 7.42s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:21:39,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 01:21:39,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.21 | bwd_microstep: 3832.96 | bwd_inner_microstep: 3825.41 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.74 [2024-11-14 01:21:39,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.21 | bwd: 3832.98 | bwd_inner: 3825.41 | bwd_allreduce: 7.53 | step: 21.74 6%|▌ | 3061/50750 [8:39:01<92:17:35, 6.97s/it] {'loss': 0.0003, 'learning_rate': 3.990373753388042e-05, 'epoch': 3.02} 6%|▌ | 3061/50750 [8:39:01<92:17:35, 6.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:21:45,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 01:21:45,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3835.51 | bwd_inner_microstep: 3827.91 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.40 [2024-11-14 01:21:45,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.17 | bwd: 3835.53 | bwd_inner: 3827.91 | bwd_allreduce: 7.57 | step: 21.41 6%|▌ | 3062/50750 [8:39:07<88:05:31, 6.65s/it] {'loss': 0.0122, 'learning_rate': 3.990361241510255e-05, 'epoch': 3.02} 6%|▌ | 3062/50750 [8:39:07<88:05:31, 6.65s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:21:51,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:21:51,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.47 | bwd_microstep: 3838.24 | bwd_inner_microstep: 3830.73 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.09 [2024-11-14 01:21:51,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.46 | bwd: 3838.25 | bwd_inner: 3830.73 | bwd_allreduce: 7.48 | step: 21.10 6%|▌ | 3063/50750 [8:39:13<85:07:47, 6.43s/it] {'loss': 0.1449, 'learning_rate': 3.990348721526127e-05, 'epoch': 3.02} 6%|▌ | 3063/50750 [8:39:13<85:07:47, 6.43s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:21:57,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:21:57,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.21 | bwd_microstep: 3846.94 | bwd_inner_microstep: 3839.42 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.58 [2024-11-14 01:21:57,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.21 | bwd: 3846.95 | bwd_inner: 3839.42 | bwd_allreduce: 7.49 | step: 21.58 6%|▌ | 
3064/50750 [8:39:19<83:06:23, 6.27s/it] {'loss': 0.0003, 'learning_rate': 3.990336193435708e-05, 'epoch': 3.02} 6%|▌ | 3064/50750 [8:39:19<83:06:23, 6.27s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:22:03,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:22:03,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.79 | bwd_microstep: 3844.80 | bwd_inner_microstep: 3837.31 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.00 [2024-11-14 01:22:03,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.79 | bwd: 3844.81 | bwd_inner: 3837.31 | bwd_allreduce: 7.46 | step: 21.01 6%|▌ | 3065/50750 [8:39:25<81:40:16, 6.17s/it] {'loss': 0.0036, 'learning_rate': 3.990323657239049e-05, 'epoch': 3.02} 6%|▌ | 3065/50750 [8:39:25<81:40:16, 6.17s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:22:09,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:22:09,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.82 | bwd_microstep: 3844.76 | bwd_inner_microstep: 3837.18 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.47 [2024-11-14 01:22:09,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.82 | bwd: 3844.78 | bwd_inner: 3837.18 | bwd_allreduce: 7.55 | step: 21.48 6%|▌ | 3066/50750 [8:39:31<80:39:52, 6.09s/it] {'loss': 0.0051, 'learning_rate': 3.9903111129362013e-05, 'epoch': 3.02} 6%|▌ | 3066/50750 [8:39:31<80:39:52, 6.09s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:22:15,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:22:15,340] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.56 | bwd_microstep: 3850.36 | bwd_inner_microstep: 3842.72 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.51 [2024-11-14 01:22:15,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.55 | bwd: 3850.37 | bwd_inner: 3842.72 | bwd_allreduce: 7.61 | step: 21.52 6%|▌ | 3067/50750 [8:39:37<80:00:11, 6.04s/it] {'loss': 0.0062, 'learning_rate': 3.990298560527216e-05, 'epoch': 3.02} 6%|▌ | 3067/50750 [8:39:37<80:00:11, 6.04s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:22:21,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 01:22:21,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.16 | bwd_microstep: 3845.51 | bwd_inner_microstep: 3837.79 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.96 [2024-11-14 01:22:21,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.16 | bwd: 3845.52 | bwd_inner: 3837.79 | bwd_allreduce: 7.69 | step: 21.96 6%|▌ | 3068/50750 [8:39:43<79:34:09, 6.01s/it] {'loss': 0.0001, 'learning_rate': 3.990286000012145e-05, 'epoch': 3.02} 6%|▌ | 3068/50750 [8:39:43<79:34:09, 6.01s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:22:27,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.92 [2024-11-14 01:22:27,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.71 | bwd_microstep: 3851.39 | bwd_inner_microstep: 3843.11 | bwd_allreduce_microstep: 8.24 | step_microstep: 21.92 [2024-11-14 01:22:27,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.69 | bwd: 3851.40 | bwd_inner: 3843.11 | bwd_allreduce: 8.26 | step: 21.92 6%|▌ | 3069/50750 [8:39:49<79:15:06, 5.98s/it] {'loss': 0.2668, 'learning_rate': 
3.990273431391037e-05, 'epoch': 3.02} 6%|▌ | 3069/50750 [8:39:49<79:15:06, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:22:33,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:22:33,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.20 | bwd_microstep: 3854.47 | bwd_inner_microstep: 3846.98 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.37 [2024-11-14 01:22:33,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.18 | bwd: 3854.48 | bwd_inner: 3846.98 | bwd_allreduce: 7.46 | step: 21.37 6%|▌ | 3070/50750 [8:39:55<79:03:05, 5.97s/it] {'loss': 0.1264, 'learning_rate': 3.990260854663946e-05, 'epoch': 3.02} 6%|▌ | 3070/50750 [8:39:55<79:03:05, 5.97s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:22:39,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:22:39,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.65 | bwd_microstep: 3845.92 | bwd_inner_microstep: 3838.31 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.81 [2024-11-14 01:22:39,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.65 | bwd: 3845.94 | bwd_inner: 3838.31 | bwd_allreduce: 7.58 | step: 21.81 6%|▌ | 3071/50750 [8:40:01<78:51:45, 5.95s/it] {'loss': 0.0003, 'learning_rate': 3.990248269830922e-05, 'epoch': 3.03} 6%|▌ | 3071/50750 [8:40:01<78:51:45, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:22:44,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.92 [2024-11-14 01:22:44,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.28 | bwd_microstep: 
3848.39 | bwd_inner_microstep: 3840.89 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.36 [2024-11-14 01:22:44,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.26 | bwd: 3848.40 | bwd_inner: 3840.89 | bwd_allreduce: 7.47 | step: 21.36 6%|▌ | 3072/50750 [8:40:06<78:45:05, 5.95s/it] {'loss': 0.0552, 'learning_rate': 3.990235676892016e-05, 'epoch': 3.03} 6%|▌ | 3072/50750 [8:40:06<78:45:05, 5.95s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:22:50,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 5.09 [2024-11-14 01:22:50,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.38 | bwd_microstep: 3855.57 | bwd_inner_microstep: 3848.04 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.51 [2024-11-14 01:22:50,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.38 | bwd: 3855.59 | bwd_inner: 3848.04 | bwd_allreduce: 7.51 | step: 21.52 6%|▌ | 3073/50750 [8:40:12<78:41:03, 5.94s/it] {'loss': 0.0124, 'learning_rate': 3.9902230758472794e-05, 'epoch': 3.03} 6%|▌ | 3073/50750 [8:40:12<78:41:03, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:22:56,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 01:22:56,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.91 | bwd_microstep: 3853.26 | bwd_inner_microstep: 3845.26 | bwd_allreduce_microstep: 7.94 | step_microstep: 22.58 [2024-11-14 01:22:56,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | bwd: 3853.27 | bwd_inner: 3845.26 | bwd_allreduce: 7.97 | step: 22.59 6%|▌ | 3074/50750 [8:40:18<78:38:51, 5.94s/it] {'loss': 0.0017, 'learning_rate': 3.9902104666967644e-05, 'epoch': 3.03} 6%|▌ | 3074/50750 [8:40:18<78:38:51, 5.94s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:23:02,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.94 [2024-11-14 01:23:02,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.53 | bwd_microstep: 3849.33 | bwd_inner_microstep: 3841.45 | bwd_allreduce_microstep: 7.84 | step_microstep: 21.85 [2024-11-14 01:23:02,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.54 | bwd: 3849.34 | bwd_inner: 3841.45 | bwd_allreduce: 7.86 | step: 21.86 6%|▌ | 3075/50750 [8:40:24<78:36:09, 5.94s/it] {'loss': 0.0008, 'learning_rate': 3.990197849440521e-05, 'epoch': 3.03} 6%|▌ | 3075/50750 [8:40:24<78:36:09, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:23:08,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 01:23:08,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.56 | bwd_microstep: 3853.45 | bwd_inner_microstep: 3845.71 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.93 [2024-11-14 01:23:08,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.54 | bwd: 3853.46 | bwd_inner: 3845.71 | bwd_allreduce: 7.71 | step: 21.94 6%|▌ | 3076/50750 [8:40:30<78:35:00, 5.93s/it] {'loss': 0.3904, 'learning_rate': 3.9901852240786016e-05, 'epoch': 3.03} 6%|▌ | 3076/50750 [8:40:30<78:35:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:23:14,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:23:14,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.81 | bwd_microstep: 3852.85 | bwd_inner_microstep: 3844.98 | bwd_allreduce_microstep: 7.82 | step_microstep: 22.51 [2024-11-14 
01:23:14,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.81 | bwd: 3852.86 | bwd_inner: 3844.98 | bwd_allreduce: 7.84 | step: 22.51 6%|▌ | 3077/50750 [8:40:36<78:34:18, 5.93s/it] {'loss': 0.0155, 'learning_rate': 3.9901725906110574e-05, 'epoch': 3.03} 6%|▌ | 3077/50750 [8:40:36<78:34:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:23:20,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:23:20,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.91 | bwd_microstep: 3854.73 | bwd_inner_microstep: 3847.16 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.30 [2024-11-14 01:23:20,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.90 | bwd: 3854.75 | bwd_inner: 3847.16 | bwd_allreduce: 7.54 | step: 21.30 6%|▌ | 3078/50750 [8:40:42<78:34:38, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.990159949037939e-05, 'epoch': 3.03} 6%|▌ | 3078/50750 [8:40:42<78:34:38, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:23:26,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:23:26,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.98 | bwd_microstep: 3856.76 | bwd_inner_microstep: 3849.08 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.91 [2024-11-14 01:23:26,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.96 | bwd: 3856.77 | bwd_inner: 3849.08 | bwd_allreduce: 7.65 | step: 21.91 6%|▌ | 3079/50750 [8:40:48<78:40:00, 5.94s/it] {'loss': 0.0009, 'learning_rate': 3.9901472993593e-05, 'epoch': 3.03} 6%|▌ | 3079/50750 [8:40:48<78:40:00, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:23:32,455] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 01:23:32,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.80 | bwd_microstep: 3853.30 | bwd_inner_microstep: 3845.42 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.65 [2024-11-14 01:23:32,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.79 | bwd: 3853.32 | bwd_inner: 3845.42 | bwd_allreduce: 7.85 | step: 22.66 6%|▌ | 3080/50750 [8:40:54<78:38:07, 5.94s/it] {'loss': 0.328, 'learning_rate': 3.9901346415751895e-05, 'epoch': 3.03} 6%|▌ | 3080/50750 [8:40:54<78:38:07, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:23:38,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:23:38,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.15 | bwd_microstep: 3849.58 | bwd_inner_microstep: 3841.99 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.15 [2024-11-14 01:23:38,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3849.59 | bwd_inner: 3841.99 | bwd_allreduce: 7.56 | step: 22.16 6%|▌ | 3081/50750 [8:41:00<78:34:53, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.99012197568566e-05, 'epoch': 3.04} 6%|▌ | 3081/50750 [8:41:00<78:34:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:23:44,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:23:44,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.34 | bwd_microstep: 3853.42 | bwd_inner_microstep: 3845.17 | bwd_allreduce_microstep: 8.21 | step_microstep: 22.02 [2024-11-14 01:23:44,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.34 | bwd: 3853.44 
| bwd_inner: 3845.17 | bwd_allreduce: 8.23 | step: 22.03 6%|▌ | 3082/50750 [8:41:06<78:33:21, 5.93s/it] {'loss': 0.0821, 'learning_rate': 3.9901093016907636e-05, 'epoch': 3.04} 6%|▌ | 3082/50750 [8:41:06<78:33:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:23:50,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:23:50,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.06 | bwd_microstep: 3854.54 | bwd_inner_microstep: 3847.02 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.18 [2024-11-14 01:23:50,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.04 | bwd: 3854.55 | bwd_inner: 3847.02 | bwd_allreduce: 7.50 | step: 21.19 6%|▌ | 3083/50750 [8:41:12<78:33:53, 5.93s/it] {'loss': 0.0033, 'learning_rate': 3.990096619590551e-05, 'epoch': 3.04} 6%|▌ | 3083/50750 [8:41:12<78:33:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:23:56,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 01:23:56,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.06 | bwd_microstep: 3854.59 | bwd_inner_microstep: 3846.86 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.30 [2024-11-14 01:23:56,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.04 | bwd: 3854.60 | bwd_inner: 3846.86 | bwd_allreduce: 7.70 | step: 22.31 6%|▌ | 3084/50750 [8:41:18<78:33:49, 5.93s/it] {'loss': 0.0037, 'learning_rate': 3.990083929385075e-05, 'epoch': 3.04} 6%|▌ | 3084/50750 [8:41:18<78:33:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:24:02,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | 
optimizer_step: 4.92 [2024-11-14 01:24:02,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.71 | bwd_microstep: 3850.59 | bwd_inner_microstep: 3841.39 | bwd_allreduce_microstep: 9.15 | step_microstep: 22.00 [2024-11-14 01:24:02,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3850.60 | bwd_inner: 3841.39 | bwd_allreduce: 9.17 | step: 22.00 6%|▌ | 3085/50750 [8:41:24<78:34:30, 5.93s/it] {'loss': 0.7366, 'learning_rate': 3.9900712310743864e-05, 'epoch': 3.04} 6%|▌ | 3085/50750 [8:41:24<78:34:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:24:08,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 01:24:08,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.20 | bwd_microstep: 3848.12 | bwd_inner_microstep: 3838.99 | bwd_allreduce_microstep: 9.09 | step_microstep: 21.91 [2024-11-14 01:24:08,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.19 | bwd: 3848.14 | bwd_inner: 3838.99 | bwd_allreduce: 9.11 | step: 21.91 6%|▌ | 3086/50750 [8:41:30<78:32:48, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9900585246585366e-05, 'epoch': 3.04} 6%|▌ | 3086/50750 [8:41:30<78:32:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:24:13,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-14 01:24:13,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.15 | bwd_microstep: 3858.33 | bwd_inner_microstep: 3850.39 | bwd_allreduce_microstep: 7.89 | step_microstep: 22.62 [2024-11-14 01:24:13,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.13 | bwd: 3858.34 | bwd_inner: 3850.38 | bwd_allreduce: 7.91 | step: 22.63 6%|▌ | 3087/50750 [8:41:35<78:35:19, 
5.94s/it] {'loss': 0.0375, 'learning_rate': 3.990045810137579e-05, 'epoch': 3.04} 6%|▌ | 3087/50750 [8:41:35<78:35:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:24:19,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:24:19,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.10 | bwd_microstep: 3858.20 | bwd_inner_microstep: 3850.63 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.01 [2024-11-14 01:24:19,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.08 | bwd: 3858.21 | bwd_inner: 3850.63 | bwd_allreduce: 7.53 | step: 21.01 6%|▌ | 3088/50750 [8:41:41<78:37:19, 5.94s/it] {'loss': 0.0001, 'learning_rate': 3.990033087511563e-05, 'epoch': 3.04} 6%|▌ | 3088/50750 [8:41:41<78:37:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:24:25,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-14 01:24:25,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.98 | bwd_microstep: 3857.85 | bwd_inner_microstep: 3850.31 | bwd_allreduce_microstep: 7.49 | step_microstep: 20.92 [2024-11-14 01:24:25,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.96 | bwd: 3857.86 | bwd_inner: 3850.31 | bwd_allreduce: 7.51 | step: 20.93 6%|▌ | 3089/50750 [8:41:47<78:36:19, 5.94s/it] {'loss': 0.388, 'learning_rate': 3.990020356780543e-05, 'epoch': 3.04} 6%|▌ | 3089/50750 [8:41:47<78:36:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:24:31,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:24:31,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2024.16 | bwd_microstep: 3848.64 | bwd_inner_microstep: 3841.03 | bwd_allreduce_microstep: 7.56 | step_microstep: 22.70 [2024-11-14 01:24:31,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.16 | bwd: 3848.65 | bwd_inner: 3841.03 | bwd_allreduce: 7.58 | step: 22.70 6%|▌ | 3090/50750 [8:41:53<78:33:52, 5.93s/it] {'loss': 0.0159, 'learning_rate': 3.990007617944569e-05, 'epoch': 3.04} 6%|▌ | 3090/50750 [8:41:53<78:33:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:24:37,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:24:37,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.55 | bwd_microstep: 3859.41 | bwd_inner_microstep: 3851.43 | bwd_allreduce_microstep: 7.92 | step_microstep: 22.02 [2024-11-14 01:24:37,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.53 | bwd: 3859.42 | bwd_inner: 3851.43 | bwd_allreduce: 7.94 | step: 22.02 6%|▌ | 3091/50750 [8:41:59<78:34:05, 5.93s/it] {'loss': 0.0027, 'learning_rate': 3.989994871003693e-05, 'epoch': 3.05} 6%|▌ | 3091/50750 [8:41:59<78:34:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:24:43,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:24:43,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.22 | bwd_microstep: 3863.28 | bwd_inner_microstep: 3855.48 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.98 [2024-11-14 01:24:43,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.22 | bwd: 3863.30 | bwd_inner: 3855.48 | bwd_allreduce: 7.77 | step: 21.98 6%|▌ | 3092/50750 [8:42:05<78:35:14, 5.94s/it] {'loss': 0.0002, 'learning_rate': 3.989982115957968e-05, 'epoch': 3.05} 6%|▌ | 3092/50750 
[8:42:05<78:35:14, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:24:49,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 01:24:49,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.80 | bwd_microstep: 3856.15 | bwd_inner_microstep: 3848.37 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.95 [2024-11-14 01:24:49,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.80 | bwd: 3856.16 | bwd_inner: 3848.37 | bwd_allreduce: 7.75 | step: 21.96 6%|▌ | 3093/50750 [8:42:11<78:34:41, 5.94s/it] {'loss': 0.0664, 'learning_rate': 3.9899693528074445e-05, 'epoch': 3.05} 6%|▌ | 3093/50750 [8:42:11<78:34:41, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:24:55,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:24:55,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.29 | bwd_microstep: 3853.44 | bwd_inner_microstep: 3845.82 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.73 [2024-11-14 01:24:55,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3853.45 | bwd_inner: 3845.82 | bwd_allreduce: 7.59 | step: 21.73 6%|▌ | 3094/50750 [8:42:17<78:33:41, 5.93s/it] {'loss': 0.5028, 'learning_rate': 3.989956581552176e-05, 'epoch': 3.05} 6%|▌ | 3094/50750 [8:42:17<78:33:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:25:01,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:25:01,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.21 | bwd_microstep: 3853.36 | bwd_inner_microstep: 3845.84 | 
bwd_allreduce_microstep: 7.48 | step_microstep: 21.15 [2024-11-14 01:25:01,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.19 | bwd: 3853.37 | bwd_inner: 3845.84 | bwd_allreduce: 7.49 | step: 21.16 6%|▌ | 3095/50750 [8:42:23<78:31:34, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.989943802192214e-05, 'epoch': 3.05} 6%|▌ | 3095/50750 [8:42:23<78:31:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:25:07,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:25:07,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.01 | bwd_microstep: 3859.90 | bwd_inner_microstep: 3852.31 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.77 [2024-11-14 01:25:07,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.01 | bwd: 3859.92 | bwd_inner: 3852.31 | bwd_allreduce: 7.57 | step: 21.77 6%|▌ | 3096/50750 [8:42:29<78:32:36, 5.93s/it] {'loss': 0.0202, 'learning_rate': 3.98993101472761e-05, 'epoch': 3.05} 6%|▌ | 3096/50750 [8:42:29<78:32:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:25:13,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-14 01:25:13,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.13 | bwd_microstep: 3860.08 | bwd_inner_microstep: 3852.52 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.23 [2024-11-14 01:25:13,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.13 | bwd: 3860.10 | bwd_inner: 3852.52 | bwd_allreduce: 7.54 | step: 21.26 6%|▌ | 3097/50750 [8:42:35<78:35:04, 5.94s/it] {'loss': 0.362, 'learning_rate': 3.989918219158416e-05, 'epoch': 3.05} 6%|▌ | 3097/50750 [8:42:35<78:35:04, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2198 [2024-11-14 01:25:19,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:25:19,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.38 | bwd_microstep: 3856.48 | bwd_inner_microstep: 3848.93 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.74 [2024-11-14 01:25:19,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.37 | bwd: 3856.49 | bwd_inner: 3848.93 | bwd_allreduce: 7.52 | step: 21.74 6%|▌ | 3098/50750 [8:42:41<78:34:19, 5.94s/it] {'loss': 0.0023, 'learning_rate': 3.989905415484685e-05, 'epoch': 3.05} 6%|▌ | 3098/50750 [8:42:41<78:34:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:25:25,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:25:25,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.81 | bwd_microstep: 3857.10 | bwd_inner_microstep: 3849.58 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-14 01:25:25,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.79 | bwd: 3857.11 | bwd_inner: 3849.58 | bwd_allreduce: 7.49 | step: 21.10 6%|▌ | 3099/50750 [8:42:47<78:35:23, 5.94s/it] {'loss': 0.0007, 'learning_rate': 3.989892603706469e-05, 'epoch': 3.05} 6%|▌ | 3099/50750 [8:42:47<78:35:23, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:25:31,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:25:31,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.86 | bwd_microstep: 3852.65 | bwd_inner_microstep: 3845.01 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.59 [2024-11-14 01:25:31,148] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.86 | bwd: 3852.66 | bwd_inner: 3845.01 | bwd_allreduce: 7.62 | step: 21.60 6%|▌ | 3100/50750 [8:42:53<78:32:24, 5.93s/it] {'loss': 0.0478, 'learning_rate': 3.989879783823819e-05, 'epoch': 3.05} 6%|▌ | 3100/50750 [8:42:53<78:32:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:25:37,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:25:37,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.16 | bwd_microstep: 3851.00 | bwd_inner_microstep: 3843.31 | bwd_allreduce_microstep: 7.65 | step_microstep: 22.01 [2024-11-14 01:25:37,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.14 | bwd: 3851.02 | bwd_inner: 3843.31 | bwd_allreduce: 7.67 | step: 22.01 6%|▌ | 3101/50750 [8:42:59<78:31:09, 5.93s/it] {'loss': 0.0305, 'learning_rate': 3.9898669558367886e-05, 'epoch': 3.06} 6%|▌ | 3101/50750 [8:42:59<78:31:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:25:43,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:25:43,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.92 | bwd_microstep: 3846.81 | bwd_inner_microstep: 3839.08 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.94 [2024-11-14 01:25:43,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.91 | bwd: 3846.82 | bwd_inner: 3839.08 | bwd_allreduce: 7.70 | step: 21.94 6%|▌ | 3102/50750 [8:43:04<78:31:11, 5.93s/it] {'loss': 0.0018, 'learning_rate': 3.989854119745429e-05, 'epoch': 3.06} 6%|▌ | 3102/50750 [8:43:04<78:31:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:25:48,934] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:25:48,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.75 | bwd_microstep: 3847.89 | bwd_inner_microstep: 3839.85 | bwd_allreduce_microstep: 7.99 | step_microstep: 21.82 [2024-11-14 01:25:48,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.74 | bwd: 3847.90 | bwd_inner: 3839.85 | bwd_allreduce: 8.01 | step: 21.82 6%|▌ | 3103/50750 [8:43:10<78:29:41, 5.93s/it] {'loss': 0.0827, 'learning_rate': 3.9898412755497936e-05, 'epoch': 3.06} 6%|▌ | 3103/50750 [8:43:10<78:29:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:25:54,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:25:54,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.85 | bwd_microstep: 3855.17 | bwd_inner_microstep: 3847.64 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.84 [2024-11-14 01:25:54,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.84 | bwd: 3855.19 | bwd_inner: 3847.64 | bwd_allreduce: 7.50 | step: 21.85 6%|▌ | 3104/50750 [8:43:16<78:31:17, 5.93s/it] {'loss': 0.0094, 'learning_rate': 3.989828423249934e-05, 'epoch': 3.06} 6%|▌ | 3104/50750 [8:43:16<78:31:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:26:00,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:26:00,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.87 | bwd_microstep: 3851.37 | bwd_inner_microstep: 3843.82 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.07 [2024-11-14 01:26:00,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.85 | bwd: 3851.38 | bwd_inner: 3843.82 | 
bwd_allreduce: 7.52 | step: 21.07 6%|▌ | 3105/50750 [8:43:22<78:30:33, 5.93s/it] {'loss': 0.0627, 'learning_rate': 3.989815562845903e-05, 'epoch': 3.06} 6%|▌ | 3105/50750 [8:43:22<78:30:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:26:06,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:26:06,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.85 | bwd_microstep: 3848.40 | bwd_inner_microstep: 3840.90 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.44 [2024-11-14 01:26:06,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.85 | bwd: 3848.42 | bwd_inner: 3840.90 | bwd_allreduce: 7.48 | step: 21.45 6%|▌ | 3106/50750 [8:43:28<78:27:20, 5.93s/it] {'loss': 0.0057, 'learning_rate': 3.989802694337752e-05, 'epoch': 3.06} 6%|▌ | 3106/50750 [8:43:28<78:27:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:26:12,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-14 01:26:12,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.85 | bwd_microstep: 3850.46 | bwd_inner_microstep: 3842.44 | bwd_allreduce_microstep: 7.96 | step_microstep: 22.69 [2024-11-14 01:26:12,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.85 | bwd: 3850.47 | bwd_inner: 3842.44 | bwd_allreduce: 7.98 | step: 22.70 6%|▌ | 3107/50750 [8:43:34<78:27:21, 5.93s/it] {'loss': 0.0101, 'learning_rate': 3.989789817725534e-05, 'epoch': 3.06} 6%|▌ | 3107/50750 [8:43:34<78:27:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:26:18,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.84 | optimizer_step: 4.93 
[2024-11-14 01:26:18,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.69 | bwd_microstep: 3848.08 | bwd_inner_microstep: 3840.03 | bwd_allreduce_microstep: 7.98 | step_microstep: 29.59 [2024-11-14 01:26:18,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.67 | bwd: 3848.10 | bwd_inner: 3840.03 | bwd_allreduce: 8.01 | step: 29.59 6%|▌ | 3108/50750 [8:43:40<78:29:59, 5.93s/it] {'loss': 0.0259, 'learning_rate': 3.989776933009302e-05, 'epoch': 3.06} 6%|▌ | 3108/50750 [8:43:40<78:29:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:26:24,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 01:26:24,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.74 | bwd_microstep: 3852.52 | bwd_inner_microstep: 3844.67 | bwd_allreduce_microstep: 7.80 | step_microstep: 22.50 [2024-11-14 01:26:24,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.73 | bwd: 3852.53 | bwd_inner: 3844.67 | bwd_allreduce: 7.82 | step: 22.51 6%|▌ | 3109/50750 [8:43:46<78:32:07, 5.93s/it] {'loss': 0.058, 'learning_rate': 3.989764040189108e-05, 'epoch': 3.06} 6%|▌ | 3109/50750 [8:43:46<78:32:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:26:30,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:26:30,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.55 | bwd_microstep: 3850.23 | bwd_inner_microstep: 3841.69 | bwd_allreduce_microstep: 8.50 | step_microstep: 21.64 [2024-11-14 01:26:30,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.54 | bwd: 3850.25 | bwd_inner: 3841.69 | bwd_allreduce: 8.51 | step: 21.64 6%|▌ | 3110/50750 [8:43:52<78:30:07, 5.93s/it] {'loss': 0.056, 
'learning_rate': 3.989751139265004e-05, 'epoch': 3.06} 6%|▌ | 3110/50750 [8:43:52<78:30:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:26:36,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:26:36,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.34 | bwd_microstep: 3850.11 | bwd_inner_microstep: 3842.58 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.18 [2024-11-14 01:26:36,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.32 | bwd: 3850.12 | bwd_inner: 3842.58 | bwd_allreduce: 7.50 | step: 21.18 6%|▌ | 3111/50750 [8:43:58<78:28:19, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.989738230237043e-05, 'epoch': 3.07} 6%|▌ | 3111/50750 [8:43:58<78:28:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:26:42,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 [2024-11-14 01:26:42,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.37 | bwd_microstep: 3847.82 | bwd_inner_microstep: 3840.29 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.81 [2024-11-14 01:26:42,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.35 | bwd: 3847.83 | bwd_inner: 3840.29 | bwd_allreduce: 7.50 | step: 21.82 6%|▌ | 3112/50750 [8:44:04<78:25:53, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.989725313105278e-05, 'epoch': 3.07} 6%|▌ | 3112/50750 [8:44:04<78:25:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:26:48,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:26:48,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.61 | 
bwd_microstep: 3845.39 | bwd_inner_microstep: 3837.88 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.06 [2024-11-14 01:26:48,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.59 | bwd: 3845.40 | bwd_inner: 3837.88 | bwd_allreduce: 7.48 | step: 21.06 6%|▌ | 3113/50750 [8:44:10<78:23:40, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.989712387869761e-05, 'epoch': 3.07} 6%|▌ | 3113/50750 [8:44:10<78:23:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:26:54,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:26:54,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.77 | bwd_microstep: 3857.55 | bwd_inner_microstep: 3849.27 | bwd_allreduce_microstep: 8.23 | step_microstep: 21.71 [2024-11-14 01:26:54,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.77 | bwd: 3857.56 | bwd_inner: 3849.27 | bwd_allreduce: 8.25 | step: 21.71 6%|▌ | 3114/50750 [8:44:16<78:25:23, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.989699454530545e-05, 'epoch': 3.07} 6%|▌ | 3114/50750 [8:44:16<78:25:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:27:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:27:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.75 | bwd_microstep: 3853.92 | bwd_inner_microstep: 3846.41 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.15 [2024-11-14 01:27:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.72 | bwd: 3853.94 | bwd_inner: 3846.41 | bwd_allreduce: 7.49 | step: 21.15 6%|▌ | 3115/50750 [8:44:22<78:26:52, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.989686513087683e-05, 'epoch': 3.07} 6%|▌ | 3115/50750 [8:44:22<78:26:52, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:27:06,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.93 [2024-11-14 01:27:06,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.10 | bwd_microstep: 3848.22 | bwd_inner_microstep: 3840.71 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.41 [2024-11-14 01:27:06,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.10 | bwd: 3848.24 | bwd_inner: 3840.71 | bwd_allreduce: 7.49 | step: 21.42 6%|▌ | 3116/50750 [8:44:27<78:24:24, 5.93s/it] {'loss': 0.7463, 'learning_rate': 3.9896735635412265e-05, 'epoch': 3.07} 6%|▌ | 3116/50750 [8:44:27<78:24:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:27:11,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:27:11,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.15 | bwd_microstep: 3851.29 | bwd_inner_microstep: 3843.73 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.85 [2024-11-14 01:27:11,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.14 | bwd: 3851.31 | bwd_inner: 3843.73 | bwd_allreduce: 7.54 | step: 21.86 6%|▌ | 3117/50750 [8:44:33<78:24:17, 5.93s/it] {'loss': 1.4184, 'learning_rate': 3.989660605891229e-05, 'epoch': 3.07} 6%|▌ | 3117/50750 [8:44:33<78:24:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:27:17,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:27:17,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.83 | bwd_microstep: 3849.22 | bwd_inner_microstep: 3841.70 | bwd_allreduce_microstep: 7.49 | 
step_microstep: 21.45 [2024-11-14 01:27:17,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.81 | bwd: 3849.24 | bwd_inner: 3841.70 | bwd_allreduce: 7.50 | step: 21.45 6%|▌ | 3118/50750 [8:44:39<78:24:12, 5.93s/it] {'loss': 0.1461, 'learning_rate': 3.989647640137743e-05, 'epoch': 3.07} 6%|▌ | 3118/50750 [8:44:39<78:24:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:27:23,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:27:23,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.01 | bwd_microstep: 3870.30 | bwd_inner_microstep: 3862.78 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.19 [2024-11-14 01:27:23,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.01 | bwd: 3870.31 | bwd_inner: 3862.78 | bwd_allreduce: 7.49 | step: 21.19 6%|▌ | 3119/50750 [8:44:45<78:29:54, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.989634666280822e-05, 'epoch': 3.07} 6%|▌ | 3119/50750 [8:44:45<78:29:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:27:29,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.92 [2024-11-14 01:27:29,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.96 | bwd_microstep: 3846.77 | bwd_inner_microstep: 3838.94 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.44 [2024-11-14 01:27:29,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.96 | bwd: 3846.79 | bwd_inner: 3838.94 | bwd_allreduce: 7.80 | step: 22.45 6%|▌ | 3120/50750 [8:44:51<78:28:39, 5.93s/it] {'loss': 0.004, 'learning_rate': 3.989621684320518e-05, 'epoch': 3.07} 6%|▌ | 3120/50750 [8:44:51<78:28:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 
01:27:35,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.92 [2024-11-14 01:27:35,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.49 | bwd_microstep: 3856.13 | bwd_inner_microstep: 3848.59 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.53 [2024-11-14 01:27:35,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.48 | bwd: 3856.15 | bwd_inner: 3848.59 | bwd_allreduce: 7.51 | step: 21.54 6%|▌ | 3121/50750 [8:44:57<78:29:29, 5.93s/it] {'loss': 0.3201, 'learning_rate': 3.9896086942568845e-05, 'epoch': 3.07} 6%|▌ | 3121/50750 [8:44:57<78:29:29, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:27:41,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-14 01:27:41,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.36 | bwd_microstep: 3853.95 | bwd_inner_microstep: 3846.45 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.19 [2024-11-14 01:27:41,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.36 | bwd: 3853.97 | bwd_inner: 3846.45 | bwd_allreduce: 7.48 | step: 21.20 6%|▌ | 3122/50750 [8:45:03<78:27:50, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.989595696089975e-05, 'epoch': 3.08} 6%|▌ | 3122/50750 [8:45:03<78:27:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:27:47,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:27:47,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.29 | bwd_microstep: 3854.00 | bwd_inner_microstep: 3845.65 | bwd_allreduce_microstep: 8.31 | step_microstep: 20.94 [2024-11-14 01:27:47,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.29 | bwd: 3854.02 | bwd_inner: 3845.65 | bwd_allreduce: 8.33 | step: 20.94 6%|▌ | 3123/50750 [8:45:09<78:26:14, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.989582689819841e-05, 'epoch': 3.08} 6%|▌ | 3123/50750 [8:45:09<78:26:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:27:53,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:27:53,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.96 | bwd_microstep: 3847.82 | bwd_inner_microstep: 3839.91 | bwd_allreduce_microstep: 7.87 | step_microstep: 21.89 [2024-11-14 01:27:53,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.94 | bwd: 3847.83 | bwd_inner: 3839.91 | bwd_allreduce: 7.88 | step: 21.89 6%|▌ | 3124/50750 [8:45:15<78:26:43, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.989569675446536e-05, 'epoch': 3.08} 6%|▌ | 3124/50750 [8:45:15<78:26:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:27:59,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 01:27:59,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.39 | bwd_microstep: 3853.59 | bwd_inner_microstep: 3845.67 | bwd_allreduce_microstep: 7.87 | step_microstep: 22.49 [2024-11-14 01:27:59,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.37 | bwd: 3853.61 | bwd_inner: 3845.67 | bwd_allreduce: 7.90 | step: 22.49 6%|▌ | 3125/50750 [8:45:21<78:31:08, 5.94s/it] {'loss': 0.0007, 'learning_rate': 3.9895566529701125e-05, 'epoch': 3.08} 6%|▌ | 3125/50750 [8:45:21<78:31:08, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:28:05,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:28:05,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.55 | bwd_microstep: 3845.86 | bwd_inner_microstep: 3838.24 | bwd_allreduce_microstep: 7.58 | step_microstep: 22.95 [2024-11-14 01:28:05,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.53 | bwd: 3845.87 | bwd_inner: 3838.24 | bwd_allreduce: 7.59 | step: 22.95 6%|▌ | 3126/50750 [8:45:27<78:29:28, 5.93s/it] {'loss': 0.0031, 'learning_rate': 3.989543622390625e-05, 'epoch': 3.08} 6%|▌ | 3126/50750 [8:45:27<78:29:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:28:11,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 01:28:11,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3856.42 | bwd_inner_microstep: 3848.82 | bwd_allreduce_microstep: 7.56 | step_microstep: 22.72 [2024-11-14 01:28:11,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.15 | bwd: 3856.43 | bwd_inner: 3848.82 | bwd_allreduce: 7.57 | step: 22.72 6%|▌ | 3127/50750 [8:45:33<78:30:12, 5.93s/it] {'loss': 0.5348, 'learning_rate': 3.989530583708125e-05, 'epoch': 3.08} 6%|▌ | 3127/50750 [8:45:33<78:30:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:28:17,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:28:17,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.07 | bwd_microstep: 3854.83 | bwd_inner_microstep: 3847.32 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.01 [2024-11-14 01:28:17,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.06 | bwd: 3854.84 | bwd_inner: 3847.32 | bwd_allreduce: 7.49 | step: 21.02 6%|▌ | 
3128/50750 [8:45:39<78:28:00, 5.93s/it] {'loss': 0.0122, 'learning_rate': 3.989517536922667e-05, 'epoch': 3.08} 6%|▌ | 3128/50750 [8:45:39<78:28:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:28:23,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:28:23,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.47 | bwd_microstep: 3855.44 | bwd_inner_microstep: 3847.87 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.24 [2024-11-14 01:28:23,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.47 | bwd: 3855.45 | bwd_inner: 3847.87 | bwd_allreduce: 7.54 | step: 21.24 6%|▌ | 3129/50750 [8:45:45<78:26:20, 5.93s/it] {'loss': 0.0227, 'learning_rate': 3.9895044820343024e-05, 'epoch': 3.08} 6%|▌ | 3129/50750 [8:45:45<78:26:20, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:28:29,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:28:29,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.30 | bwd_microstep: 3845.45 | bwd_inner_microstep: 3837.90 | bwd_allreduce_microstep: 7.49 | step_microstep: 22.07 [2024-11-14 01:28:29,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.30 | bwd: 3845.46 | bwd_inner: 3837.90 | bwd_allreduce: 7.52 | step: 22.07 6%|▌ | 3130/50750 [8:45:51<78:24:21, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.9894914190430863e-05, 'epoch': 3.08} 6%|▌ | 3130/50750 [8:45:51<78:24:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:28:34,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-14 01:28:34,974] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.54 | bwd_microstep: 3859.85 | bwd_inner_microstep: 3852.37 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.13 [2024-11-14 01:28:34,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.55 | bwd: 3859.87 | bwd_inner: 3852.37 | bwd_allreduce: 7.46 | step: 21.13 6%|▌ | 3131/50750 [8:45:56<78:25:43, 5.93s/it] {'loss': 0.2458, 'learning_rate': 3.989478347949071e-05, 'epoch': 3.08} 6%|▌ | 3131/50750 [8:45:56<78:25:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:28:40,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:28:40,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.61 | bwd_microstep: 3842.83 | bwd_inner_microstep: 3835.35 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.94 [2024-11-14 01:28:40,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.61 | bwd: 3842.84 | bwd_inner: 3835.35 | bwd_allreduce: 7.45 | step: 20.96 6%|▌ | 3132/50750 [8:46:02<78:21:54, 5.92s/it] {'loss': 0.0033, 'learning_rate': 3.9894652687523096e-05, 'epoch': 3.09} 6%|▌ | 3132/50750 [8:46:02<78:21:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:28:46,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 01:28:46,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 3849.22 | bwd_inner_microstep: 3841.39 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.61 [2024-11-14 01:28:46,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.17 | bwd: 3849.24 | bwd_inner: 3841.39 | bwd_allreduce: 7.80 | step: 22.62 6%|▌ | 3133/50750 [8:46:08<78:22:27, 5.93s/it] {'loss': 0.4606, 'learning_rate': 
3.989452181452855e-05, 'epoch': 3.09} 6%|▌ | 3133/50750 [8:46:08<78:22:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:28:52,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:28:52,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.48 | bwd_microstep: 3854.83 | bwd_inner_microstep: 3847.31 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.22 [2024-11-14 01:28:52,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.46 | bwd: 3854.84 | bwd_inner: 3847.31 | bwd_allreduce: 7.49 | step: 21.23 6%|▌ | 3134/50750 [8:46:14<78:22:33, 5.93s/it] {'loss': 0.2298, 'learning_rate': 3.9894390860507617e-05, 'epoch': 3.09} 6%|▌ | 3134/50750 [8:46:14<78:22:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:28:58,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:28:58,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.48 | bwd_microstep: 3851.64 | bwd_inner_microstep: 3844.07 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.42 [2024-11-14 01:28:58,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.48 | bwd: 3851.65 | bwd_inner: 3844.07 | bwd_allreduce: 7.54 | step: 21.42 6%|▌ | 3135/50750 [8:46:20<78:22:39, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.989425982546082e-05, 'epoch': 3.09} 6%|▌ | 3135/50750 [8:46:20<78:22:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:29:04,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:29:04,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.18 | bwd_microstep: 
3847.50 | bwd_inner_microstep: 3839.93 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.51 [2024-11-14 01:29:04,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.18 | bwd: 3847.51 | bwd_inner: 3839.93 | bwd_allreduce: 7.54 | step: 21.51 6%|▌ | 3136/50750 [8:46:26<78:21:32, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.98941287093887e-05, 'epoch': 3.09} 6%|▌ | 3136/50750 [8:46:26<78:21:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:29:10,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:29:10,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.62 | bwd_microstep: 3842.98 | bwd_inner_microstep: 3835.50 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.49 [2024-11-14 01:29:10,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.62 | bwd: 3842.99 | bwd_inner: 3835.50 | bwd_allreduce: 7.46 | step: 21.49 6%|▌ | 3137/50750 [8:46:32<78:19:26, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.989399751229179e-05, 'epoch': 3.09} 6%|▌ | 3137/50750 [8:46:32<78:19:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:29:16,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:29:16,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.59 | bwd_microstep: 3847.01 | bwd_inner_microstep: 3839.47 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.39 [2024-11-14 01:29:16,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.59 | bwd: 3847.02 | bwd_inner: 3839.46 | bwd_allreduce: 7.52 | step: 21.40 6%|▌ | 3138/50750 [8:46:38<78:18:14, 5.92s/it] {'loss': 0.0019, 'learning_rate': 3.989386623417061e-05, 'epoch': 3.09} 6%|▌ | 3138/50750 [8:46:38<78:18:14, 5.92s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:29:22,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-14 01:29:22,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.17 | bwd_microstep: 3845.01 | bwd_inner_microstep: 3836.19 | bwd_allreduce_microstep: 8.73 | step_microstep: 21.47 [2024-11-14 01:29:22,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.17 | bwd: 3845.03 | bwd_inner: 3836.19 | bwd_allreduce: 8.77 | step: 21.46 6%|▌ | 3139/50750 [8:46:44<78:18:29, 5.92s/it] {'loss': 0.2133, 'learning_rate': 3.989373487502571e-05, 'epoch': 3.09} 6%|▌ | 3139/50750 [8:46:44<78:18:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:29:28,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:29:28,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.49 | bwd_microstep: 3851.89 | bwd_inner_microstep: 3844.34 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.75 [2024-11-14 01:29:28,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.49 | bwd: 3851.91 | bwd_inner: 3844.34 | bwd_allreduce: 7.52 | step: 21.75 6%|▌ | 3140/50750 [8:46:50<78:20:01, 5.92s/it] {'loss': 0.4784, 'learning_rate': 3.989360343485762e-05, 'epoch': 3.09} 6%|▌ | 3140/50750 [8:46:50<78:20:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:29:34,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:29:34,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.16 | bwd_microstep: 3846.69 | bwd_inner_microstep: 3839.20 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.60 [2024-11-14 
01:29:34,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.15 | bwd: 3846.70 | bwd_inner: 3839.20 | bwd_allreduce: 7.46 | step: 21.61 6%|▌ | 3141/50750 [8:46:56<78:19:00, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.989347191366689e-05, 'epoch': 3.09} 6%|▌ | 3141/50750 [8:46:56<78:19:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:29:40,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:29:40,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.58 | bwd_microstep: 3849.00 | bwd_inner_microstep: 3841.47 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.39 [2024-11-14 01:29:40,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.58 | bwd: 3849.01 | bwd_inner: 3841.47 | bwd_allreduce: 7.49 | step: 21.40 6%|▌ | 3142/50750 [8:47:02<78:18:10, 5.92s/it] {'loss': 0.0891, 'learning_rate': 3.9893340311454024e-05, 'epoch': 3.1} 6%|▌ | 3142/50750 [8:47:02<78:18:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:29:46,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 01:29:46,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.00 | bwd_microstep: 3844.39 | bwd_inner_microstep: 3836.65 | bwd_allreduce_microstep: 7.69 | step_microstep: 24.50 [2024-11-14 01:29:46,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.00 | bwd: 3844.41 | bwd_inner: 3836.65 | bwd_allreduce: 7.71 | step: 24.50 6%|▌ | 3143/50750 [8:47:07<78:17:27, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.989320862821959e-05, 'epoch': 3.1} 6%|▌ | 3143/50750 [8:47:07<78:17:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:29:51,946] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:29:51,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.59 | bwd_microstep: 3846.27 | bwd_inner_microstep: 3838.72 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.72 [2024-11-14 01:29:51,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.59 | bwd: 3846.29 | bwd_inner: 3838.72 | bwd_allreduce: 7.53 | step: 21.72 6%|▌ | 3144/50750 [8:47:13<78:18:45, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.9893076863964106e-05, 'epoch': 3.1} 6%|▌ | 3144/50750 [8:47:13<78:18:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:29:57,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:29:57,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.05 | bwd_microstep: 3845.37 | bwd_inner_microstep: 3837.60 | bwd_allreduce_microstep: 7.72 | step_microstep: 22.22 [2024-11-14 01:29:57,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.04 | bwd: 3845.39 | bwd_inner: 3837.60 | bwd_allreduce: 7.74 | step: 22.22 6%|▌ | 3145/50750 [8:47:19<78:21:18, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.989294501868811e-05, 'epoch': 3.1} 6%|▌ | 3145/50750 [8:47:19<78:21:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:30:03,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 01:30:03,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.79 | bwd_microstep: 3844.54 | bwd_inner_microstep: 3836.77 | bwd_allreduce_microstep: 7.73 | step_microstep: 21.91 [2024-11-14 01:30:03,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.78 | bwd: 3844.56 | 
bwd_inner: 3836.77 | bwd_allreduce: 7.75 | step: 21.92 6%|▌ | 3146/50750 [8:47:25<78:20:48, 5.92s/it] {'loss': 0.0242, 'learning_rate': 3.989281309239214e-05, 'epoch': 3.1} 6%|▌ | 3146/50750 [8:47:25<78:20:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:30:09,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:30:09,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.50 | bwd_microstep: 3851.50 | bwd_inner_microstep: 3843.80 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.20 [2024-11-14 01:30:09,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3851.52 | bwd_inner: 3843.80 | bwd_allreduce: 7.67 | step: 21.20 6%|▌ | 3147/50750 [8:47:31<78:20:08, 5.92s/it] {'loss': 1.3675, 'learning_rate': 3.989268108507674e-05, 'epoch': 3.1} 6%|▌ | 3147/50750 [8:47:31<78:20:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:30:15,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:30:15,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.21 | bwd_microstep: 3849.97 | bwd_inner_microstep: 3841.83 | bwd_allreduce_microstep: 8.08 | step_microstep: 25.57 [2024-11-14 01:30:15,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.21 | bwd: 3849.99 | bwd_inner: 3841.83 | bwd_allreduce: 8.10 | step: 25.57 6%|▌ | 3148/50750 [8:47:37<78:23:25, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.9892548996742445e-05, 'epoch': 3.1} 6%|▌ | 3148/50750 [8:47:37<78:23:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:30:21,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | 
optimizer_step: 4.93 [2024-11-14 01:30:21,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.87 | bwd_microstep: 3853.62 | bwd_inner_microstep: 3845.98 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.69 [2024-11-14 01:30:21,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.87 | bwd: 3853.63 | bwd_inner: 3845.98 | bwd_allreduce: 7.62 | step: 21.69 6%|▌ | 3149/50750 [8:47:43<78:25:28, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.989241682738979e-05, 'epoch': 3.1} 6%|▌ | 3149/50750 [8:47:43<78:25:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:30:27,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-14 01:30:27,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.70 | bwd_microstep: 3845.17 | bwd_inner_microstep: 3837.31 | bwd_allreduce_microstep: 7.82 | step_microstep: 22.15 [2024-11-14 01:30:27,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.68 | bwd: 3845.19 | bwd_inner: 3837.31 | bwd_allreduce: 7.84 | step: 22.15 6%|▌ | 3150/50750 [8:47:49<78:25:16, 5.93s/it] {'loss': 0.1857, 'learning_rate': 3.989228457701931e-05, 'epoch': 3.1} 6%|▌ | 3150/50750 [8:47:49<78:25:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:30:33,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-14 01:30:33,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.57 | bwd_microstep: 3857.20 | bwd_inner_microstep: 3848.95 | bwd_allreduce_microstep: 8.18 | step_microstep: 24.73 [2024-11-14 01:30:33,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.55 | bwd: 3857.22 | bwd_inner: 3848.95 | bwd_allreduce: 8.21 | step: 24.73 6%|▌ | 3151/50750 [8:47:55<78:28:26, 5.94s/it] 
{'loss': 0.0008, 'learning_rate': 3.989215224563155e-05, 'epoch': 3.1} 6%|▌ | 3151/50750 [8:47:55<78:28:26, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:30:39,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:30:39,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.28 | bwd_microstep: 3849.73 | bwd_inner_microstep: 3841.83 | bwd_allreduce_microstep: 7.85 | step_microstep: 21.50 [2024-11-14 01:30:39,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.26 | bwd: 3849.75 | bwd_inner: 3841.83 | bwd_allreduce: 7.87 | step: 21.51 6%|▌ | 3152/50750 [8:48:01<78:27:43, 5.93s/it] {'loss': 0.0044, 'learning_rate': 3.989201983322705e-05, 'epoch': 3.11} 6%|▌ | 3152/50750 [8:48:01<78:27:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:30:45,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:30:45,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.30 | bwd_microstep: 3851.61 | bwd_inner_microstep: 3844.10 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.05 [2024-11-14 01:30:45,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.28 | bwd: 3851.62 | bwd_inner: 3844.10 | bwd_allreduce: 7.49 | step: 21.06 6%|▌ | 3153/50750 [8:48:07<78:24:48, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.989188733980634e-05, 'epoch': 3.11} 6%|▌ | 3153/50750 [8:48:07<78:24:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:30:51,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 01:30:51,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2021.23 | bwd_microstep: 3850.05 | bwd_inner_microstep: 3842.47 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.81 [2024-11-14 01:30:51,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.23 | bwd: 3850.06 | bwd_inner: 3842.47 | bwd_allreduce: 7.55 | step: 21.81 6%|▌ | 3154/50750 [8:48:13<78:22:28, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.989175476536997e-05, 'epoch': 3.11} 6%|▌ | 3154/50750 [8:48:13<78:22:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:30:57,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:30:57,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.77 | bwd_microstep: 3845.03 | bwd_inner_microstep: 3837.53 | bwd_allreduce_microstep: 7.46 | step_microstep: 22.15 [2024-11-14 01:30:57,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.77 | bwd: 3845.05 | bwd_inner: 3837.54 | bwd_allreduce: 7.47 | step: 22.15 6%|▌ | 3155/50750 [8:48:19<78:20:39, 5.93s/it] {'loss': 0.0107, 'learning_rate': 3.989162210991848e-05, 'epoch': 3.11} 6%|▌ | 3155/50750 [8:48:19<78:20:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:31:03,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:31:03,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.14 | bwd_microstep: 3847.78 | bwd_inner_microstep: 3840.14 | bwd_allreduce_microstep: 7.59 | step_microstep: 22.02 [2024-11-14 01:31:03,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.14 | bwd: 3847.79 | bwd_inner: 3840.14 | bwd_allreduce: 7.61 | step: 22.03 6%|▌ | 3156/50750 [8:48:25<78:20:33, 5.93s/it] {'loss': 0.0321, 'learning_rate': 3.98914893734524e-05, 'epoch': 3.11} 6%|▌ | 3156/50750 
[8:48:25<78:20:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:31:09,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:31:09,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.75 | bwd_microstep: 3854.70 | bwd_inner_microstep: 3847.17 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-14 01:31:09,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.74 | bwd: 3854.72 | bwd_inner: 3847.17 | bwd_allreduce: 7.50 | step: 21.11 6%|▌ | 3157/50750 [8:48:31<78:21:01, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.989135655597228e-05, 'epoch': 3.11} 6%|▌ | 3157/50750 [8:48:31<78:21:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:31:14,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:31:14,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.27 | bwd_microstep: 3847.46 | bwd_inner_microstep: 3839.92 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.33 [2024-11-14 01:31:14,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.28 | bwd: 3847.47 | bwd_inner: 3839.92 | bwd_allreduce: 7.51 | step: 21.34 6%|▌ | 3158/50750 [8:48:36<78:20:22, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.9891223657478654e-05, 'epoch': 3.11} 6%|▌ | 3158/50750 [8:48:36<78:20:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:31:20,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:31:20,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.21 | bwd_microstep: 3845.11 | bwd_inner_microstep: 3837.51 | 
bwd_allreduce_microstep: 7.55 | step_microstep: 21.93 [2024-11-14 01:31:20,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.21 | bwd: 3845.12 | bwd_inner: 3837.51 | bwd_allreduce: 7.57 | step: 21.93 6%|▌ | 3159/50750 [8:48:42<78:18:24, 5.92s/it] {'loss': 0.0145, 'learning_rate': 3.989109067797207e-05, 'epoch': 3.11} 6%|▌ | 3159/50750 [8:48:42<78:18:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:31:26,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:31:26,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.94 | bwd_microstep: 3848.65 | bwd_inner_microstep: 3840.94 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.69 [2024-11-14 01:31:26,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.94 | bwd: 3848.66 | bwd_inner: 3840.94 | bwd_allreduce: 7.68 | step: 21.69 6%|▌ | 3160/50750 [8:48:48<78:18:16, 5.92s/it] {'loss': 0.0068, 'learning_rate': 3.989095761745307e-05, 'epoch': 3.11} 6%|▌ | 3160/50750 [8:48:48<78:18:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:31:32,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 01:31:32,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.56 | bwd_microstep: 3848.84 | bwd_inner_microstep: 3841.33 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.11 [2024-11-14 01:31:32,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.56 | bwd: 3848.85 | bwd_inner: 3841.33 | bwd_allreduce: 7.48 | step: 21.12 6%|▌ | 3161/50750 [8:48:54<78:17:56, 5.92s/it] {'loss': 0.0121, 'learning_rate': 3.9890824475922185e-05, 'epoch': 3.11} 6%|▌ | 3161/50750 [8:48:54<78:17:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2200 [2024-11-14 01:31:38,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:31:38,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.98 | bwd_microstep: 3846.37 | bwd_inner_microstep: 3838.81 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.70 [2024-11-14 01:31:38,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.98 | bwd: 3846.38 | bwd_inner: 3838.81 | bwd_allreduce: 7.53 | step: 21.71 6%|▌ | 3162/50750 [8:49:00<78:16:57, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.989069125337997e-05, 'epoch': 3.12} 6%|▌ | 3162/50750 [8:49:00<78:16:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:31:44,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:31:44,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.82 | bwd_microstep: 3848.85 | bwd_inner_microstep: 3841.15 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.61 [2024-11-14 01:31:44,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.82 | bwd: 3848.86 | bwd_inner: 3841.15 | bwd_allreduce: 7.67 | step: 21.62 6%|▌ | 3163/50750 [8:49:06<78:17:18, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.989055794982696e-05, 'epoch': 3.12} 6%|▌ | 3163/50750 [8:49:06<78:17:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:31:50,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.92 [2024-11-14 01:31:50,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.10 | bwd_microstep: 3849.55 | bwd_inner_microstep: 3841.85 | bwd_allreduce_microstep: 7.66 | step_microstep: 22.11 [2024-11-14 01:31:50,494] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.10 | bwd: 3849.56 | bwd_inner: 3841.85 | bwd_allreduce: 7.67 | step: 22.12 6%|▌ | 3164/50750 [8:49:12<78:18:00, 5.92s/it] {'loss': 0.0251, 'learning_rate': 3.9890424565263696e-05, 'epoch': 3.12} 6%|▌ | 3164/50750 [8:49:12<78:18:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:31:56,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 01:31:56,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.08 | bwd_microstep: 3850.91 | bwd_inner_microstep: 3843.24 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.52 [2024-11-14 01:31:56,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.07 | bwd: 3850.92 | bwd_inner: 3843.24 | bwd_allreduce: 7.64 | step: 21.53 6%|▌ | 3165/50750 [8:49:18<78:18:51, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.989029109969073e-05, 'epoch': 3.12} 6%|▌ | 3165/50750 [8:49:18<78:18:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:32:02,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 01:32:02,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.02 | bwd_microstep: 3849.72 | bwd_inner_microstep: 3842.00 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.01 [2024-11-14 01:32:02,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.00 | bwd: 3849.73 | bwd_inner: 3842.00 | bwd_allreduce: 7.69 | step: 22.01 6%|▌ | 3166/50750 [8:49:24<78:21:27, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.98901575531086e-05, 'epoch': 3.12} 6%|▌ | 3166/50750 [8:49:24<78:21:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:32:08,286] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:32:08,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.76 | bwd_microstep: 3849.14 | bwd_inner_microstep: 3841.58 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.31 [2024-11-14 01:32:08,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.75 | bwd: 3849.16 | bwd_inner: 3841.58 | bwd_allreduce: 7.53 | step: 21.32 6%|▌ | 3167/50750 [8:49:30<78:21:07, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.989002392551785e-05, 'epoch': 3.12} 6%|▌ | 3167/50750 [8:49:30<78:21:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:32:14,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:32:14,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.02 | bwd_microstep: 3849.65 | bwd_inner_microstep: 3841.97 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.26 [2024-11-14 01:32:14,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.02 | bwd: 3849.67 | bwd_inner: 3841.97 | bwd_allreduce: 7.65 | step: 21.27 6%|▌ | 3168/50750 [8:49:36<78:19:43, 5.93s/it] {'loss': 0.002, 'learning_rate': 3.988989021691903e-05, 'epoch': 3.12} 6%|▌ | 3168/50750 [8:49:36<78:19:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:32:20,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 01:32:20,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.76 | bwd_microstep: 3839.52 | bwd_inner_microstep: 3831.98 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.34 [2024-11-14 01:32:20,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.76 | bwd: 3839.53 | bwd_inner: 3831.98 | 
bwd_allreduce: 7.51 | step: 21.34 6%|▌ | 3169/50750 [8:49:42<78:16:18, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.988975642731268e-05, 'epoch': 3.12} 6%|▌ | 3169/50750 [8:49:42<78:16:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:32:26,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:32:26,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.62 | bwd_microstep: 3848.23 | bwd_inner_microstep: 3840.65 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.28 [2024-11-14 01:32:26,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.63 | bwd: 3848.24 | bwd_inner: 3840.65 | bwd_allreduce: 7.55 | step: 21.28 6%|▌ | 3170/50750 [8:49:48<78:15:32, 5.92s/it] {'loss': 0.0338, 'learning_rate': 3.988962255669933e-05, 'epoch': 3.12} 6%|▌ | 3170/50750 [8:49:48<78:15:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:32:31,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:32:31,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.97 | bwd_microstep: 3846.46 | bwd_inner_microstep: 3838.95 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.97 [2024-11-14 01:32:31,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.97 | bwd: 3846.47 | bwd_inner: 3838.95 | bwd_allreduce: 7.48 | step: 20.97 6%|▌ | 3171/50750 [8:49:53<78:14:44, 5.92s/it] {'loss': 0.0266, 'learning_rate': 3.988948860507956e-05, 'epoch': 3.12} 6%|▌ | 3171/50750 [8:49:53<78:14:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:32:37,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 
[2024-11-14 01:32:37,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.96 | bwd_microstep: 3847.91 | bwd_inner_microstep: 3840.43 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.96 [2024-11-14 01:32:37,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.96 | bwd: 3847.92 | bwd_inner: 3840.43 | bwd_allreduce: 7.45 | step: 20.97 6%|▋ | 3172/50750 [8:49:59<78:14:20, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.988935457245388e-05, 'epoch': 3.13} 6%|▋ | 3172/50750 [8:49:59<78:14:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:32:43,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:32:43,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.34 | bwd_microstep: 3842.25 | bwd_inner_microstep: 3834.72 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.07 [2024-11-14 01:32:43,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.34 | bwd: 3842.26 | bwd_inner: 3834.72 | bwd_allreduce: 7.50 | step: 21.07 6%|▋ | 3173/50750 [8:50:05<78:13:04, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9889220458822866e-05, 'epoch': 3.13} 6%|▋ | 3173/50750 [8:50:05<78:13:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:32:49,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:32:49,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.86 | bwd_microstep: 3845.37 | bwd_inner_microstep: 3837.68 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.06 [2024-11-14 01:32:49,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.86 | bwd: 3845.38 | bwd_inner: 3837.68 | bwd_allreduce: 7.66 | step: 21.06 6%|▋ | 3174/50750 [8:50:11<78:12:11, 5.92s/it] {'loss': 0.4588, 
'learning_rate': 3.9889086264187034e-05, 'epoch': 3.13} 6%|▋ | 3174/50750 [8:50:11<78:12:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:32:55,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:32:55,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.73 | bwd_microstep: 3847.64 | bwd_inner_microstep: 3840.06 | bwd_allreduce_microstep: 7.54 | step_microstep: 22.01 [2024-11-14 01:32:55,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.73 | bwd: 3847.65 | bwd_inner: 3840.06 | bwd_allreduce: 7.55 | step: 22.04 6%|▋ | 3175/50750 [8:50:17<78:14:15, 5.92s/it] {'loss': 0.0215, 'learning_rate': 3.9888951988546955e-05, 'epoch': 3.13} 6%|▋ | 3175/50750 [8:50:17<78:14:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:33:01,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:33:01,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.04 | bwd_microstep: 3851.45 | bwd_inner_microstep: 3843.92 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.13 [2024-11-14 01:33:01,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.02 | bwd: 3851.47 | bwd_inner: 3843.92 | bwd_allreduce: 7.51 | step: 21.13 6%|▋ | 3176/50750 [8:50:23<78:15:44, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.988881763190317e-05, 'epoch': 3.13} 6%|▋ | 3176/50750 [8:50:23<78:15:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:33:07,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:33:07,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.68 | 
bwd_microstep: 3851.68 | bwd_inner_microstep: 3844.04 | bwd_allreduce_microstep: 7.58 | step_microstep: 20.91 [2024-11-14 01:33:07,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.68 | bwd: 3851.69 | bwd_inner: 3844.04 | bwd_allreduce: 7.60 | step: 20.91 6%|▋ | 3177/50750 [8:50:29<78:16:58, 5.92s/it] {'loss': 0.7038, 'learning_rate': 3.9888683194256215e-05, 'epoch': 3.13} 6%|▋ | 3177/50750 [8:50:29<78:16:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:33:13,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:33:13,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.89 | bwd_microstep: 3853.99 | bwd_inner_microstep: 3846.32 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.26 [2024-11-14 01:33:13,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.88 | bwd: 3854.01 | bwd_inner: 3846.32 | bwd_allreduce: 7.64 | step: 21.27 6%|▋ | 3178/50750 [8:50:35<78:18:40, 5.93s/it] {'loss': 0.0111, 'learning_rate': 3.9888548675606646e-05, 'epoch': 3.13} 6%|▋ | 3178/50750 [8:50:35<78:18:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:33:19,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.96 [2024-11-14 01:33:19,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.23 | bwd_microstep: 3851.68 | bwd_inner_microstep: 3844.03 | bwd_allreduce_microstep: 7.60 | step_microstep: 20.92 [2024-11-14 01:33:19,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.22 | bwd: 3851.69 | bwd_inner: 3844.03 | bwd_allreduce: 7.62 | step: 20.92 6%|▋ | 3179/50750 [8:50:41<78:18:32, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.988841407595502e-05, 'epoch': 3.13} 6%|▋ | 3179/50750 [8:50:41<78:18:32, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:33:25,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.94 [2024-11-14 01:33:25,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.76 | bwd_microstep: 3861.01 | bwd_inner_microstep: 3853.48 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-14 01:33:25,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.76 | bwd: 3861.03 | bwd_inner: 3853.48 | bwd_allreduce: 7.50 | step: 21.11 6%|▋ | 3180/50750 [8:50:47<78:21:08, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.988827939530186e-05, 'epoch': 3.13} 6%|▋ | 3180/50750 [8:50:47<78:21:08, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:33:31,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:33:31,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.38 | bwd_microstep: 3855.72 | bwd_inner_microstep: 3848.23 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.10 [2024-11-14 01:33:31,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.38 | bwd: 3855.73 | bwd_inner: 3848.23 | bwd_allreduce: 7.47 | step: 21.10 6%|▋ | 3181/50750 [8:50:53<78:21:46, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.988814463364774e-05, 'epoch': 3.13} 6%|▋ | 3181/50750 [8:50:53<78:21:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:33:37,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:33:37,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.57 | bwd_microstep: 3856.70 | bwd_inner_microstep: 3849.23 | bwd_allreduce_microstep: 7.43 | 
step_microstep: 21.02 [2024-11-14 01:33:37,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.56 | bwd: 3856.72 | bwd_inner: 3849.23 | bwd_allreduce: 7.45 | step: 21.02 6%|▋ | 3182/50750 [8:50:59<78:22:13, 5.93s/it] {'loss': 0.024, 'learning_rate': 3.98880097909932e-05, 'epoch': 3.13} 6%|▋ | 3182/50750 [8:50:59<78:22:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:33:43,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:33:43,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.32 | bwd_microstep: 3857.56 | bwd_inner_microstep: 3849.77 | bwd_allreduce_microstep: 7.75 | step_microstep: 21.52 [2024-11-14 01:33:43,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.32 | bwd: 3857.57 | bwd_inner: 3849.77 | bwd_allreduce: 7.77 | step: 21.52 6%|▋ | 3183/50750 [8:51:05<78:23:28, 5.93s/it] {'loss': 0.0132, 'learning_rate': 3.988787486733879e-05, 'epoch': 3.14} 6%|▋ | 3183/50750 [8:51:05<78:23:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:33:49,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:33:49,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.41 | bwd_microstep: 3852.71 | bwd_inner_microstep: 3845.24 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.12 [2024-11-14 01:33:49,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.40 | bwd: 3852.73 | bwd_inner: 3845.24 | bwd_allreduce: 7.44 | step: 21.13 6%|▋ | 3184/50750 [8:51:10<78:23:23, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.988773986268505e-05, 'epoch': 3.14} 6%|▋ | 3184/50750 [8:51:10<78:23:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 
01:33:54,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 01:33:54,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.81 | bwd_microstep: 3851.25 | bwd_inner_microstep: 3843.78 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.04 [2024-11-14 01:33:54,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.81 | bwd: 3851.26 | bwd_inner: 3843.78 | bwd_allreduce: 7.44 | step: 21.04 6%|▋ | 3185/50750 [8:51:16<78:21:23, 5.93s/it] {'loss': 0.6261, 'learning_rate': 3.9887604777032544e-05, 'epoch': 3.14} 6%|▋ | 3185/50750 [8:51:16<78:21:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:34:00,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 01:34:00,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.96 | bwd_microstep: 3852.33 | bwd_inner_microstep: 3844.61 | bwd_allreduce_microstep: 7.67 | step_microstep: 22.09 [2024-11-14 01:34:00,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.96 | bwd: 3852.34 | bwd_inner: 3844.61 | bwd_allreduce: 7.69 | step: 22.10 6%|▋ | 3186/50750 [8:51:22<78:21:45, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.988746961038182e-05, 'epoch': 3.14} 6%|▋ | 3186/50750 [8:51:22<78:21:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:34:06,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:34:06,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.12 | bwd_microstep: 3847.24 | bwd_inner_microstep: 3839.75 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.87 [2024-11-14 01:34:06,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2033.11 | bwd: 3847.26 | bwd_inner: 3839.75 | bwd_allreduce: 7.47 | step: 20.87 6%|▋ | 3187/50750 [8:51:28<78:21:15, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.988733436273341e-05, 'epoch': 3.14} 6%|▋ | 3187/50750 [8:51:28<78:21:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:34:12,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:34:12,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3851.41 | bwd_inner_microstep: 3843.75 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.61 [2024-11-14 01:34:12,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3851.43 | bwd_inner: 3843.75 | bwd_allreduce: 7.64 | step: 21.61 6%|▋ | 3188/50750 [8:51:34<78:20:12, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.988719903408789e-05, 'epoch': 3.14} 6%|▋ | 3188/50750 [8:51:34<78:20:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:34:18,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:34:18,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.84 | bwd_microstep: 3850.81 | bwd_inner_microstep: 3843.35 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.08 [2024-11-14 01:34:18,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.84 | bwd: 3850.82 | bwd_inner: 3843.35 | bwd_allreduce: 7.44 | step: 21.09 6%|▋ | 3189/50750 [8:51:40<78:18:59, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.98870636244458e-05, 'epoch': 3.14} 6%|▋ | 3189/50750 [8:51:40<78:18:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:34:24,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:34:24,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3853.89 | bwd_inner_microstep: 3846.35 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.12 [2024-11-14 01:34:24,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3853.90 | bwd_inner: 3846.35 | bwd_allreduce: 7.51 | step: 21.12 6%|▋ | 3190/50750 [8:51:46<78:18:27, 5.93s/it] {'loss': 0.0342, 'learning_rate': 3.988692813380769e-05, 'epoch': 3.14} 6%|▋ | 3190/50750 [8:51:46<78:18:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:34:30,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:34:30,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.38 | bwd_microstep: 3845.59 | bwd_inner_microstep: 3838.10 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-14 01:34:30,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.38 | bwd: 3845.60 | bwd_inner: 3838.10 | bwd_allreduce: 7.46 | step: 20.98 6%|▋ | 3191/50750 [8:51:52<78:16:24, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.988679256217412e-05, 'epoch': 3.14} 6%|▋ | 3191/50750 [8:51:52<78:16:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:34:36,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.35 | optimizer_step: 4.93 [2024-11-14 01:34:36,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.48 | bwd_microstep: 3847.91 | bwd_inner_microstep: 3839.92 | bwd_allreduce_microstep: 7.94 | step_microstep: 22.60 [2024-11-14 01:34:36,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.46 | bwd: 3847.92 | bwd_inner: 3839.92 | bwd_allreduce: 7.96 | step: 22.60 6%|▋ | 
3192/50750 [8:51:58<78:17:00, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9886656909545624e-05, 'epoch': 3.14} 6%|▋ | 3192/50750 [8:51:58<78:17:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:34:42,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:34:42,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2037.49 | bwd_microstep: 3852.18 | bwd_inner_microstep: 3844.08 | bwd_allreduce_microstep: 8.04 | step_microstep: 23.86 [2024-11-14 01:34:42,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2037.47 | bwd: 3852.20 | bwd_inner: 3844.08 | bwd_allreduce: 8.06 | step: 23.86 6%|▋ | 3193/50750 [8:52:04<78:21:56, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.988652117592277e-05, 'epoch': 3.15} 6%|▋ | 3193/50750 [8:52:04<78:21:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:34:48,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 4.93 [2024-11-14 01:34:48,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.57 | bwd_microstep: 3847.11 | bwd_inner_microstep: 3839.35 | bwd_allreduce_microstep: 7.72 | step_microstep: 23.95 [2024-11-14 01:34:48,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.55 | bwd: 3847.13 | bwd_inner: 3839.35 | bwd_allreduce: 7.74 | step: 23.95 6%|▋ | 3194/50750 [8:52:10<78:23:57, 5.93s/it] {'loss': 0.0071, 'learning_rate': 3.988638536130611e-05, 'epoch': 3.15} 6%|▋ | 3194/50750 [8:52:10<78:23:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:34:54,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-14 01:34:54,258] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.75 | bwd_microstep: 3851.04 | bwd_inner_microstep: 3843.40 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.90 [2024-11-14 01:34:54,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.73 | bwd: 3851.06 | bwd_inner: 3843.40 | bwd_allreduce: 7.62 | step: 21.91 6%|▋ | 3195/50750 [8:52:16<78:24:49, 5.94s/it] {'loss': 0.0003, 'learning_rate': 3.988624946569619e-05, 'epoch': 3.15} 6%|▋ | 3195/50750 [8:52:16<78:24:49, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:35:00,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:35:00,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.47 | bwd_microstep: 3855.72 | bwd_inner_microstep: 3848.11 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.42 [2024-11-14 01:35:00,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.46 | bwd: 3855.74 | bwd_inner: 3848.11 | bwd_allreduce: 7.59 | step: 21.43 6%|▋ | 3196/50750 [8:52:22<78:23:24, 5.93s/it] {'loss': 0.0028, 'learning_rate': 3.9886113489093574e-05, 'epoch': 3.15} 6%|▋ | 3196/50750 [8:52:22<78:23:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:35:06,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:35:06,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.37 | bwd_microstep: 3852.52 | bwd_inner_microstep: 3844.85 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.54 [2024-11-14 01:35:06,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.36 | bwd: 3852.53 | bwd_inner: 3844.85 | bwd_allreduce: 7.65 | step: 21.54 6%|▋ | 3197/50750 [8:52:28<78:23:50, 5.94s/it] {'loss': 0.0125, 'learning_rate': 
3.9885977431498805e-05, 'epoch': 3.15} 6%|▋ | 3197/50750 [8:52:28<78:23:50, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:35:12,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:35:12,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.20 | bwd_microstep: 3843.03 | bwd_inner_microstep: 3835.48 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.67 [2024-11-14 01:35:12,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.19 | bwd: 3843.05 | bwd_inner: 3835.48 | bwd_allreduce: 7.52 | step: 21.68 6%|▋ | 3198/50750 [8:52:34<78:20:16, 5.93s/it] {'loss': 0.1667, 'learning_rate': 3.988584129291244e-05, 'epoch': 3.15} 6%|▋ | 3198/50750 [8:52:34<78:20:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:35:17,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:35:17,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3851.71 | bwd_inner_microstep: 3844.16 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.77 [2024-11-14 01:35:17,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3851.72 | bwd_inner: 3844.16 | bwd_allreduce: 7.52 | step: 21.78 6%|▋ | 3199/50750 [8:52:39<78:18:48, 5.93s/it] {'loss': 0.4126, 'learning_rate': 3.988570507333504e-05, 'epoch': 3.15} 6%|▋ | 3199/50750 [8:52:39<78:18:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:35:23,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:35:23,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.72 | bwd_microstep: 
3853.79 | bwd_inner_microstep: 3846.21 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.90 [2024-11-14 01:35:23,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.70 | bwd: 3853.81 | bwd_inner: 3846.21 | bwd_allreduce: 7.56 | step: 21.90 6%|▋ | 3200/50750 [8:52:45<78:20:09, 5.93s/it] {'loss': 0.0048, 'learning_rate': 3.9885568772767154e-05, 'epoch': 3.15} 6%|▋ | 3200/50750 [8:52:45<78:20:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-14 01:35:29,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.92 [2024-11-14 01:35:29,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.94 | bwd_microstep: 3851.87 | bwd_inner_microstep: 3844.39 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.06 [2024-11-14 01:35:29,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.93 | bwd: 3851.88 | bwd_inner: 3844.39 | bwd_allreduce: 7.45 | step: 21.07 6%|▋ | 3201/50750 [8:52:51<78:20:22, 5.93s/it] {'loss': 0.412, 'learning_rate': 3.988543239120934e-05, 'epoch': 3.15} 6%|▋ | 3201/50750 [8:52:51<78:20:22, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:35:35,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:35:35,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.64 | bwd_microstep: 3846.84 | bwd_inner_microstep: 3839.33 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.14 [2024-11-14 01:35:35,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.63 | bwd: 3846.85 | bwd_inner: 3839.33 | bwd_allreduce: 7.48 | step: 21.14 6%|▋ | 3202/50750 [8:52:57<78:17:43, 5.93s/it] {'loss': 0.0054, 'learning_rate': 3.988529592866215e-05, 'epoch': 3.15} 6%|▋ | 3202/50750 [8:52:57<78:17:43, 5.93s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:35:41,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:35:41,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3851.91 | bwd_inner_microstep: 3844.02 | bwd_allreduce_microstep: 7.84 | step_microstep: 22.44 [2024-11-14 01:35:41,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.57 | bwd: 3851.92 | bwd_inner: 3844.02 | bwd_allreduce: 7.87 | step: 22.44 6%|▋ | 3203/50750 [8:53:03<78:18:25, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9885159385126145e-05, 'epoch': 3.16} 6%|▋ | 3203/50750 [8:53:03<78:18:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:35:47,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:35:47,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.83 | bwd_microstep: 3850.37 | bwd_inner_microstep: 3842.88 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.12 [2024-11-14 01:35:47,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.81 | bwd: 3850.38 | bwd_inner: 3842.88 | bwd_allreduce: 7.47 | step: 21.13 6%|▋ | 3204/50750 [8:53:09<78:18:18, 5.93s/it] {'loss': 0.7975, 'learning_rate': 3.988502276060187e-05, 'epoch': 3.16} 6%|▋ | 3204/50750 [8:53:09<78:18:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:35:53,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:35:53,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.15 | bwd_microstep: 3854.00 | bwd_inner_microstep: 3846.47 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.10 [2024-11-14 
01:35:53,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.15 | bwd: 3854.02 | bwd_inner: 3846.47 | bwd_allreduce: 7.51 | step: 21.10 6%|▋ | 3205/50750 [8:53:15<78:18:46, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.988488605508989e-05, 'epoch': 3.16} 6%|▋ | 3205/50750 [8:53:15<78:18:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:35:59,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:35:59,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3845.79 | bwd_inner_microstep: 3838.29 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.08 [2024-11-14 01:35:59,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.24 | bwd: 3845.81 | bwd_inner: 3838.29 | bwd_allreduce: 7.48 | step: 21.09 6%|▋ | 3206/50750 [8:53:21<78:15:45, 5.93s/it] {'loss': 0.0257, 'learning_rate': 3.988474926859076e-05, 'epoch': 3.16} 6%|▋ | 3206/50750 [8:53:21<78:15:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:36:05,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-14 01:36:05,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.53 | bwd_microstep: 3855.01 | bwd_inner_microstep: 3847.15 | bwd_allreduce_microstep: 7.81 | step_microstep: 22.85 [2024-11-14 01:36:05,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.53 | bwd: 3855.02 | bwd_inner: 3847.15 | bwd_allreduce: 7.83 | step: 22.86 6%|▋ | 3207/50750 [8:53:27<78:18:32, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.9884612401105046e-05, 'epoch': 3.16} 6%|▋ | 3207/50750 [8:53:27<78:18:32, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:36:11,328] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 5.11 [2024-11-14 01:36:11,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.97 | bwd_microstep: 3848.68 | bwd_inner_microstep: 3841.19 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.25 [2024-11-14 01:36:11,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.97 | bwd: 3848.69 | bwd_inner: 3841.19 | bwd_allreduce: 7.47 | step: 21.26 6%|▋ | 3208/50750 [8:53:33<78:16:55, 5.93s/it] {'loss': 0.0149, 'learning_rate': 3.988447545263329e-05, 'epoch': 3.16} 6%|▋ | 3208/50750 [8:53:33<78:16:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:36:17,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:36:17,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.44 | bwd_microstep: 3847.63 | bwd_inner_microstep: 3839.95 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.25 [2024-11-14 01:36:17,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.44 | bwd: 3847.64 | bwd_inner: 3839.95 | bwd_allreduce: 7.66 | step: 21.26 6%|▋ | 3209/50750 [8:53:39<78:14:20, 5.92s/it] {'loss': 0.0596, 'learning_rate': 3.988433842317605e-05, 'epoch': 3.16} 6%|▋ | 3209/50750 [8:53:39<78:14:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:36:23,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:36:23,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.50 | bwd_microstep: 3844.24 | bwd_inner_microstep: 3836.75 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.31 [2024-11-14 01:36:23,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.50 | bwd: 3844.25 
| bwd_inner: 3836.75 | bwd_allreduce: 7.47 | step: 21.32 6%|▋ | 3210/50750 [8:53:45<78:12:05, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9884201312733905e-05, 'epoch': 3.16} 6%|▋ | 3210/50750 [8:53:45<78:12:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:36:29,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 5.11 [2024-11-14 01:36:29,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.49 | bwd_microstep: 3843.75 | bwd_inner_microstep: 3836.19 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.63 [2024-11-14 01:36:29,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.49 | bwd: 3843.76 | bwd_inner: 3836.19 | bwd_allreduce: 7.54 | step: 21.64 6%|▋ | 3211/50750 [8:53:51<78:11:16, 5.92s/it] {'loss': 0.0601, 'learning_rate': 3.988406412130739e-05, 'epoch': 3.16} 6%|▋ | 3211/50750 [8:53:51<78:11:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:36:35,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:36:35,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.43 | bwd_microstep: 3849.21 | bwd_inner_microstep: 3841.54 | bwd_allreduce_microstep: 7.62 | step_microstep: 23.38 [2024-11-14 01:36:35,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.42 | bwd: 3849.23 | bwd_inner: 3841.54 | bwd_allreduce: 7.64 | step: 23.38 6%|▋ | 3212/50750 [8:53:56<78:12:16, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.988392684889708e-05, 'epoch': 3.16} 6%|▋ | 3212/50750 [8:53:56<78:12:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:36:40,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | 
optimizer_step: 4.92 [2024-11-14 01:36:40,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.08 | bwd_microstep: 3848.20 | bwd_inner_microstep: 3840.74 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.86 [2024-11-14 01:36:40,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.06 | bwd: 3848.22 | bwd_inner: 3840.74 | bwd_allreduce: 7.44 | step: 20.86 6%|▋ | 3213/50750 [8:54:02<78:11:52, 5.92s/it] {'loss': 0.0048, 'learning_rate': 3.988378949550352e-05, 'epoch': 3.17} 6%|▋ | 3213/50750 [8:54:02<78:11:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:36:46,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-14 01:36:46,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.47 | bwd_microstep: 3844.56 | bwd_inner_microstep: 3836.81 | bwd_allreduce_microstep: 7.71 | step_microstep: 21.78 [2024-11-14 01:36:46,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.46 | bwd: 3844.57 | bwd_inner: 3836.81 | bwd_allreduce: 7.73 | step: 21.78 6%|▋ | 3214/50750 [8:54:08<78:11:25, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.988365206112729e-05, 'epoch': 3.17} 6%|▋ | 3214/50750 [8:54:08<78:11:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:36:52,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:36:52,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.71 | bwd_microstep: 3848.64 | bwd_inner_microstep: 3841.08 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.53 [2024-11-14 01:36:52,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.70 | bwd: 3848.66 | bwd_inner: 3841.08 | bwd_allreduce: 7.53 | step: 21.53 6%|▋ | 3215/50750 [8:54:14<78:13:52, 
5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9883514545768925e-05, 'epoch': 3.17} 6%|▋ | 3215/50750 [8:54:14<78:13:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:36:58,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:36:58,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.70 | bwd_microstep: 3882.16 | bwd_inner_microstep: 3874.49 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.25 [2024-11-14 01:36:58,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.70 | bwd: 3882.17 | bwd_inner: 3874.49 | bwd_allreduce: 7.64 | step: 21.25 6%|▋ | 3216/50750 [8:54:20<78:21:55, 5.94s/it] {'loss': 0.0002, 'learning_rate': 3.9883376949429e-05, 'epoch': 3.17} 6%|▋ | 3216/50750 [8:54:20<78:21:55, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:37:04,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.94 [2024-11-14 01:37:04,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.08 | bwd_microstep: 3852.90 | bwd_inner_microstep: 3845.06 | bwd_allreduce_microstep: 7.80 | step_microstep: 21.53 [2024-11-14 01:37:04,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.06 | bwd: 3852.92 | bwd_inner: 3845.06 | bwd_allreduce: 7.82 | step: 21.54 6%|▋ | 3217/50750 [8:54:26<78:22:14, 5.94s/it] {'loss': 0.0029, 'learning_rate': 3.988323927210807e-05, 'epoch': 3.17} 6%|▋ | 3217/50750 [8:54:26<78:22:14, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:37:10,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 5.00 [2024-11-14 01:37:10,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2027.34 | bwd_microstep: 3866.67 | bwd_inner_microstep: 3859.17 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.53 [2024-11-14 01:37:10,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.33 | bwd: 3866.68 | bwd_inner: 3859.17 | bwd_allreduce: 7.47 | step: 21.54 6%|▋ | 3218/50750 [8:54:32<78:23:59, 5.94s/it] {'loss': 0.5613, 'learning_rate': 3.9883101513806696e-05, 'epoch': 3.17} 6%|▋ | 3218/50750 [8:54:32<78:23:59, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:37:16,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:37:16,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.12 | bwd_microstep: 3854.22 | bwd_inner_microstep: 3846.64 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.86 [2024-11-14 01:37:16,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.12 | bwd: 3854.24 | bwd_inner: 3846.64 | bwd_allreduce: 7.56 | step: 21.86 6%|▋ | 3219/50750 [8:54:38<78:22:40, 5.94s/it] {'loss': 0.0001, 'learning_rate': 3.988296367452544e-05, 'epoch': 3.17} 6%|▋ | 3219/50750 [8:54:38<78:22:40, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:37:22,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:37:22,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3846.07 | bwd_inner_microstep: 3838.54 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.21 [2024-11-14 01:37:22,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.62 | bwd: 3846.09 | bwd_inner: 3838.54 | bwd_allreduce: 7.51 | step: 21.21 6%|▋ | 3220/50750 [8:54:44<78:19:52, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9882825754264874e-05, 'epoch': 3.17} 6%|▋ | 3220/50750 
[8:54:44<78:19:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:37:28,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 01:37:28,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.70 | bwd_microstep: 3843.15 | bwd_inner_microstep: 3835.49 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.78 [2024-11-14 01:37:28,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.70 | bwd: 3843.16 | bwd_inner: 3835.49 | bwd_allreduce: 7.63 | step: 21.78 6%|▋ | 3221/50750 [8:54:50<78:15:57, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9882687753025546e-05, 'epoch': 3.17} 6%|▋ | 3221/50750 [8:54:50<78:15:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:37:34,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:37:34,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.84 | bwd_microstep: 3851.39 | bwd_inner_microstep: 3843.79 | bwd_allreduce_microstep: 7.56 | step_microstep: 22.14 [2024-11-14 01:37:34,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.84 | bwd: 3851.41 | bwd_inner: 3843.79 | bwd_allreduce: 7.58 | step: 22.15 6%|▋ | 3222/50750 [8:54:56<78:15:52, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.988254967080802e-05, 'epoch': 3.17} 6%|▋ | 3222/50750 [8:54:56<78:15:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:37:40,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:37:40,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.13 | bwd_microstep: 3846.99 | bwd_inner_microstep: 3839.51 | 
bwd_allreduce_microstep: 7.44 | step_microstep: 21.11 [2024-11-14 01:37:40,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.13 | bwd: 3847.00 | bwd_inner: 3839.51 | bwd_allreduce: 7.45 | step: 21.11 6%|▋ | 3223/50750 [8:55:02<78:13:49, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.988241150761286e-05, 'epoch': 3.18} 6%|▋ | 3223/50750 [8:55:02<78:13:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:37:46,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:37:46,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.93 | bwd_microstep: 3847.20 | bwd_inner_microstep: 3839.68 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-14 01:37:46,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.93 | bwd: 3847.21 | bwd_inner: 3839.68 | bwd_allreduce: 7.49 | step: 21.08 6%|▋ | 3224/50750 [8:55:08<78:12:13, 5.92s/it] {'loss': 0.0872, 'learning_rate': 3.9882273263440625e-05, 'epoch': 3.18} 6%|▋ | 3224/50750 [8:55:08<78:12:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:37:52,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:37:52,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3845.01 | bwd_inner_microstep: 3837.33 | bwd_allreduce_microstep: 7.64 | step_microstep: 22.27 [2024-11-14 01:37:52,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3845.02 | bwd_inner: 3837.33 | bwd_allreduce: 7.66 | step: 22.27 6%|▋ | 3225/50750 [8:55:14<78:11:48, 5.92s/it] {'loss': 0.0068, 'learning_rate': 3.9882134938291886e-05, 'epoch': 3.18} 6%|▋ | 3225/50750 [8:55:14<78:11:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2197 [2024-11-14 01:37:58,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:37:58,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.00 | bwd_microstep: 3850.65 | bwd_inner_microstep: 3843.04 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.79 [2024-11-14 01:37:58,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.96 | bwd: 3850.67 | bwd_inner: 3843.04 | bwd_allreduce: 7.59 | step: 21.79 6%|▋ | 3226/50750 [8:55:19<78:15:52, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.9881996532167203e-05, 'epoch': 3.18} 6%|▋ | 3226/50750 [8:55:19<78:15:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:38:03,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 01:38:03,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.32 | bwd_microstep: 3863.00 | bwd_inner_microstep: 3855.43 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.66 [2024-11-14 01:38:03,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.31 | bwd: 3863.01 | bwd_inner: 3855.43 | bwd_allreduce: 7.54 | step: 21.66 6%|▋ | 3227/50750 [8:55:25<78:19:49, 5.93s/it] {'loss': 0.0, 'learning_rate': 3.9881858045067134e-05, 'epoch': 3.18} 6%|▋ | 3227/50750 [8:55:25<78:19:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:38:09,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:38:09,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.16 | bwd_microstep: 3848.00 | bwd_inner_microstep: 3840.50 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.07 [2024-11-14 01:38:09,892] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.16 | bwd: 3848.01 | bwd_inner: 3840.50 | bwd_allreduce: 7.47 | step: 21.07 6%|▋ | 3228/50750 [8:55:31<78:16:59, 5.93s/it] {'loss': 0.5033, 'learning_rate': 3.988171947699226e-05, 'epoch': 3.18} 6%|▋ | 3228/50750 [8:55:31<78:16:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:38:15,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:38:15,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.42 | bwd_microstep: 3847.62 | bwd_inner_microstep: 3840.15 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.31 [2024-11-14 01:38:15,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.42 | bwd: 3847.64 | bwd_inner: 3840.15 | bwd_allreduce: 7.45 | step: 21.31 6%|▋ | 3229/50750 [8:55:37<78:14:49, 5.93s/it] {'loss': 0.1951, 'learning_rate': 3.988158082794312e-05, 'epoch': 3.18} 6%|▋ | 3229/50750 [8:55:37<78:14:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:38:21,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:38:21,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.43 | bwd_microstep: 3847.26 | bwd_inner_microstep: 3839.72 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.82 [2024-11-14 01:38:21,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.41 | bwd: 3847.27 | bwd_inner: 3839.72 | bwd_allreduce: 7.51 | step: 21.84 6%|▋ | 3230/50750 [8:55:43<78:14:53, 5.93s/it] {'loss': 0.5643, 'learning_rate': 3.9881442097920304e-05, 'epoch': 3.18} 6%|▋ | 3230/50750 [8:55:43<78:14:53, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:38:27,701] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:38:27,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.56 | bwd_microstep: 3872.10 | bwd_inner_microstep: 3864.42 | bwd_allreduce_microstep: 7.63 | step_microstep: 23.80 [2024-11-14 01:38:27,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.54 | bwd: 3872.11 | bwd_inner: 3864.42 | bwd_allreduce: 7.65 | step: 23.81 6%|▋ | 3231/50750 [8:55:49<78:23:02, 5.94s/it] {'loss': 0.0001, 'learning_rate': 3.9881303286924356e-05, 'epoch': 3.18} 6%|▋ | 3231/50750 [8:55:49<78:23:02, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:38:33,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:38:33,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.97 | bwd_microstep: 3851.57 | bwd_inner_microstep: 3844.02 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.64 [2024-11-14 01:38:33,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.95 | bwd: 3851.58 | bwd_inner: 3844.02 | bwd_allreduce: 7.52 | step: 21.64 6%|▋ | 3232/50750 [8:55:55<78:22:50, 5.94s/it] {'loss': 0.0018, 'learning_rate': 3.988116439495586e-05, 'epoch': 3.18} 6%|▋ | 3232/50750 [8:55:55<78:22:50, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:38:39,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-14 01:38:39,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.60 | bwd_microstep: 3854.97 | bwd_inner_microstep: 3847.23 | bwd_allreduce_microstep: 7.70 | step_microstep: 21.19 [2024-11-14 01:38:39,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.60 | bwd: 3854.99 | bwd_inner: 3847.23 | 
bwd_allreduce: 7.72 | step: 21.20 6%|▋ | 3233/50750 [8:56:01<78:20:26, 5.94s/it] {'loss': 0.0002, 'learning_rate': 3.9881025422015363e-05, 'epoch': 3.19} 6%|▋ | 3233/50750 [8:56:01<78:20:26, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:38:45,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:38:45,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.20 | bwd_microstep: 3858.26 | bwd_inner_microstep: 3850.30 | bwd_allreduce_microstep: 7.92 | step_microstep: 21.23 [2024-11-14 01:38:45,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.20 | bwd: 3858.27 | bwd_inner: 3850.30 | bwd_allreduce: 7.93 | step: 21.24 6%|▋ | 3234/50750 [8:56:07<78:20:00, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9880886368103444e-05, 'epoch': 3.19} 6%|▋ | 3234/50750 [8:56:07<78:20:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:38:51,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-14 01:38:51,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.28 | bwd_microstep: 3851.72 | bwd_inner_microstep: 3843.96 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.35 [2024-11-14 01:38:51,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.27 | bwd: 3851.74 | bwd_inner: 3843.96 | bwd_allreduce: 7.73 | step: 22.35 6%|▋ | 3235/50750 [8:56:13<78:22:31, 5.94s/it] {'loss': 0.0007, 'learning_rate': 3.9880747233220664e-05, 'epoch': 3.19} 6%|▋ | 3235/50750 [8:56:13<78:22:31, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:38:57,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.49 | optimizer_step: 4.93 
[2024-11-14 01:38:57,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.63 | bwd_microstep: 3853.09 | bwd_inner_microstep: 3844.90 | bwd_allreduce_microstep: 8.13 | step_microstep: 27.29 [2024-11-14 01:38:57,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.61 | bwd: 3853.11 | bwd_inner: 3844.90 | bwd_allreduce: 8.16 | step: 27.30 6%|▋ | 3236/50750 [8:56:19<78:27:26, 5.94s/it] {'loss': 0.0009, 'learning_rate': 3.988060801736759e-05, 'epoch': 3.19} 6%|▋ | 3236/50750 [8:56:19<78:27:26, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:39:03,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:39:03,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.64 | bwd_microstep: 3851.15 | bwd_inner_microstep: 3843.67 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.13 [2024-11-14 01:39:03,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.62 | bwd: 3851.16 | bwd_inner: 3843.67 | bwd_allreduce: 7.46 | step: 21.13 6%|▋ | 3237/50750 [8:56:25<78:25:57, 5.94s/it] {'loss': 0.0003, 'learning_rate': 3.98804687205448e-05, 'epoch': 3.19} 6%|▋ | 3237/50750 [8:56:25<78:25:57, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:39:09,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:39:09,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.69 | bwd_microstep: 3850.32 | bwd_inner_microstep: 3842.49 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.13 [2024-11-14 01:39:09,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.67 | bwd: 3850.33 | bwd_inner: 3842.49 | bwd_allreduce: 7.80 | step: 22.13 6%|▋ | 3238/50750 [8:56:31<78:24:05, 5.94s/it] {'loss': 0.0002, 
'learning_rate': 3.9880329342752835e-05, 'epoch': 3.19} 6%|▋ | 3238/50750 [8:56:31<78:24:05, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:39:15,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 01:39:15,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.14 | bwd_microstep: 3848.77 | bwd_inner_microstep: 3840.98 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.76 [2024-11-14 01:39:15,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.13 | bwd: 3848.78 | bwd_inner: 3840.98 | bwd_allreduce: 7.76 | step: 21.76 6%|▋ | 3239/50750 [8:56:37<78:22:19, 5.94s/it] {'loss': 0.0007, 'learning_rate': 3.988018988399229e-05, 'epoch': 3.19} 6%|▋ | 3239/50750 [8:56:37<78:22:19, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:39:21,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.81 | optimizer_step: 4.92 [2024-11-14 01:39:21,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.35 | bwd_microstep: 3843.59 | bwd_inner_microstep: 3835.79 | bwd_allreduce_microstep: 7.75 | step_microstep: 23.73 [2024-11-14 01:39:21,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.33 | bwd: 3843.60 | bwd_inner: 3835.79 | bwd_allreduce: 7.77 | step: 23.74 6%|▋ | 3240/50750 [8:56:43<78:20:17, 5.94s/it] {'loss': 0.0003, 'learning_rate': 3.988005034426373e-05, 'epoch': 3.19} 6%|▋ | 3240/50750 [8:56:43<78:20:17, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:39:27,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:39:27,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.11 | 
bwd_microstep: 3854.52 | bwd_inner_microstep: 3847.05 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.79 [2024-11-14 01:39:27,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.10 | bwd: 3854.53 | bwd_inner: 3847.05 | bwd_allreduce: 7.45 | step: 20.80 6%|▋ | 3241/50750 [8:56:49<78:20:00, 5.94s/it] {'loss': 0.0002, 'learning_rate': 3.98799107235677e-05, 'epoch': 3.19} 6%|▋ | 3241/50750 [8:56:49<78:20:00, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:39:33,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:39:33,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.83 | bwd_microstep: 3860.26 | bwd_inner_microstep: 3852.01 | bwd_allreduce_microstep: 8.18 | step_microstep: 27.41 [2024-11-14 01:39:33,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.83 | bwd: 3860.28 | bwd_inner: 3852.01 | bwd_allreduce: 8.21 | step: 27.40 6%|▋ | 3242/50750 [8:56:54<78:22:32, 5.94s/it] {'loss': 0.0001, 'learning_rate': 3.987977102190479e-05, 'epoch': 3.19} 6%|▋ | 3242/50750 [8:56:54<78:22:32, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:39:38,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-14 01:39:38,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.03 | bwd_microstep: 3850.46 | bwd_inner_microstep: 3842.62 | bwd_allreduce_microstep: 7.80 | step_microstep: 22.13 [2024-11-14 01:39:38,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.01 | bwd: 3850.47 | bwd_inner: 3842.62 | bwd_allreduce: 7.82 | step: 22.13 6%|▋ | 3243/50750 [8:57:00<78:22:12, 5.94s/it] {'loss': 0.6948, 'learning_rate': 3.987963123927556e-05, 'epoch': 3.2} 6%|▋ | 3243/50750 [8:57:00<78:22:12, 
5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:39:44,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 01:39:44,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.29 | bwd_microstep: 3852.82 | bwd_inner_microstep: 3845.17 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.37 [2024-11-14 01:39:44,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.28 | bwd: 3852.83 | bwd_inner: 3845.17 | bwd_allreduce: 7.62 | step: 21.37 6%|▋ | 3244/50750 [8:57:06<78:20:34, 5.94s/it] {'loss': 0.3515, 'learning_rate': 3.987949137568059e-05, 'epoch': 3.2} 6%|▋ | 3244/50750 [8:57:06<78:20:34, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:39:50,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:39:50,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.72 | bwd_microstep: 3843.29 | bwd_inner_microstep: 3835.77 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.00 [2024-11-14 01:39:50,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.71 | bwd: 3843.30 | bwd_inner: 3835.77 | bwd_allreduce: 7.49 | step: 21.01 6%|▋ | 3245/50750 [8:57:12<78:16:17, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.987935143112044e-05, 'epoch': 3.2} 6%|▋ | 3245/50750 [8:57:12<78:16:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:39:56,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:39:56,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.05 | bwd_microstep: 3852.17 | bwd_inner_microstep: 3844.60 | bwd_allreduce_microstep: 7.53 | 
step_microstep: 21.76 [2024-11-14 01:39:56,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.05 | bwd: 3852.19 | bwd_inner: 3844.60 | bwd_allreduce: 7.54 | step: 21.77 6%|▋ | 3246/50750 [8:57:18<78:14:31, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.987921140559568e-05, 'epoch': 3.2} 6%|▋ | 3246/50750 [8:57:18<78:14:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:40:02,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-14 01:40:02,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.65 | bwd_microstep: 3849.77 | bwd_inner_microstep: 3841.82 | bwd_allreduce_microstep: 7.90 | step_microstep: 22.31 [2024-11-14 01:40:02,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.64 | bwd: 3849.78 | bwd_inner: 3841.82 | bwd_allreduce: 7.92 | step: 22.31 6%|▋ | 3247/50750 [8:57:24<78:14:12, 5.93s/it] {'loss': 0.0015, 'learning_rate': 3.987907129910688e-05, 'epoch': 3.2} 6%|▋ | 3247/50750 [8:57:24<78:14:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:40:08,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-14 01:40:08,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.43 | bwd_microstep: 3850.87 | bwd_inner_microstep: 3843.04 | bwd_allreduce_microstep: 7.78 | step_microstep: 22.43 [2024-11-14 01:40:08,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.42 | bwd: 3850.88 | bwd_inner: 3843.04 | bwd_allreduce: 7.80 | step: 22.43 6%|▋ | 3248/50750 [8:57:30<78:15:14, 5.93s/it] {'loss': 0.009, 'learning_rate': 3.9878931111654616e-05, 'epoch': 3.2} 6%|▋ | 3248/50750 [8:57:30<78:15:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 
01:40:14,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:40:14,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.80 | bwd_microstep: 3847.31 | bwd_inner_microstep: 3839.64 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.66 [2024-11-14 01:40:14,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.79 | bwd: 3847.32 | bwd_inner: 3839.64 | bwd_allreduce: 7.64 | step: 21.66 6%|▋ | 3249/50750 [8:57:36<78:13:48, 5.93s/it] {'loss': 0.0011, 'learning_rate': 3.9878790843239454e-05, 'epoch': 3.2} 6%|▋ | 3249/50750 [8:57:36<78:13:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:40:20,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:40:20,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3846.09 | bwd_inner_microstep: 3838.53 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.98 [2024-11-14 01:40:20,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.15 | bwd: 3846.10 | bwd_inner: 3838.53 | bwd_allreduce: 7.53 | step: 21.98 6%|▋ | 3250/50750 [8:57:42<78:13:25, 5.93s/it] {'loss': 0.4768, 'learning_rate': 3.987865049386197e-05, 'epoch': 3.2} 6%|▋ | 3250/50750 [8:57:42<78:13:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:40:26,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:40:26,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.00 | bwd_microstep: 3854.83 | bwd_inner_microstep: 3845.70 | bwd_allreduce_microstep: 9.08 | step_microstep: 22.21 [2024-11-14 01:40:26,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2022.00 | bwd: 3854.85 | bwd_inner: 3845.70 | bwd_allreduce: 9.10 | step: 22.21 6%|▋ | 3251/50750 [8:57:48<78:13:48, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.987851006352273e-05, 'epoch': 3.2} 6%|▋ | 3251/50750 [8:57:48<78:13:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:40:32,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 01:40:32,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.03 | bwd_microstep: 3850.23 | bwd_inner_microstep: 3842.39 | bwd_allreduce_microstep: 7.79 | step_microstep: 21.95 [2024-11-14 01:40:32,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.01 | bwd: 3850.24 | bwd_inner: 3842.39 | bwd_allreduce: 7.81 | step: 21.96 6%|▋ | 3252/50750 [8:57:54<78:13:15, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.987836955222231e-05, 'epoch': 3.2} 6%|▋ | 3252/50750 [8:57:54<78:13:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:40:38,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:40:38,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.69 | bwd_microstep: 3859.35 | bwd_inner_microstep: 3851.66 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.22 [2024-11-14 01:40:38,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.68 | bwd: 3859.36 | bwd_inner: 3851.66 | bwd_allreduce: 7.67 | step: 21.22 6%|▋ | 3253/50750 [8:58:00<78:14:23, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.987822895996128e-05, 'epoch': 3.2} 6%|▋ | 3253/50750 [8:58:00<78:14:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:40:44,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:40:44,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.07 | bwd_microstep: 3855.46 | bwd_inner_microstep: 3847.91 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.52 [2024-11-14 01:40:44,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3855.47 | bwd_inner: 3847.91 | bwd_allreduce: 7.51 | step: 21.54 6%|▋ | 3254/50750 [8:58:06<78:14:15, 5.93s/it] {'loss': 0.4421, 'learning_rate': 3.9878088286740216e-05, 'epoch': 3.21} 6%|▋ | 3254/50750 [8:58:06<78:14:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:40:50,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:40:50,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.39 | bwd_microstep: 3844.33 | bwd_inner_microstep: 3836.77 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.09 [2024-11-14 01:40:50,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.39 | bwd: 3844.34 | bwd_inner: 3836.77 | bwd_allreduce: 7.54 | step: 21.09 6%|▋ | 3255/50750 [8:58:12<78:10:24, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.987794753255969e-05, 'epoch': 3.21} 6%|▋ | 3255/50750 [8:58:12<78:10:24, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:40:56,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 01:40:56,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.39 | bwd_microstep: 3852.60 | bwd_inner_microstep: 3844.88 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.33 [2024-11-14 01:40:56,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.39 | bwd: 3852.62 | bwd_inner: 3844.88 | bwd_allreduce: 7.70 | step: 22.33 6%|▋ | 
3256/50750 [8:58:17<78:12:36, 5.93s/it] {'loss': 1.0722, 'learning_rate': 3.9877806697420274e-05, 'epoch': 3.21} 6%|▋ | 3256/50750 [8:58:17<78:12:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:41:01,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 01:41:01,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.82 | bwd_microstep: 3843.48 | bwd_inner_microstep: 3835.97 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.83 [2024-11-14 01:41:01,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.81 | bwd: 3843.49 | bwd_inner: 3835.97 | bwd_allreduce: 7.49 | step: 20.84 6%|▋ | 3257/50750 [8:58:23<78:11:46, 5.93s/it] {'loss': 0.0011, 'learning_rate': 3.987766578132254e-05, 'epoch': 3.21} 6%|▋ | 3257/50750 [8:58:23<78:11:46, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:41:07,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:41:07,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.82 | bwd_microstep: 3858.48 | bwd_inner_microstep: 3850.79 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.85 [2024-11-14 01:41:07,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3858.49 | bwd_inner: 3850.79 | bwd_allreduce: 7.66 | step: 21.85 6%|▋ | 3258/50750 [8:58:29<78:12:56, 5.93s/it] {'loss': 0.0206, 'learning_rate': 3.987752478426706e-05, 'epoch': 3.21} 6%|▋ | 3258/50750 [8:58:29<78:12:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:41:13,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:41:13,818] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.68 | bwd_microstep: 3852.77 | bwd_inner_microstep: 3845.10 | bwd_allreduce_microstep: 7.62 | step_microstep: 22.06 [2024-11-14 01:41:13,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.68 | bwd: 3852.78 | bwd_inner: 3845.10 | bwd_allreduce: 7.63 | step: 22.06 6%|▋ | 3259/50750 [8:58:35<78:13:47, 5.93s/it] {'loss': 0.1406, 'learning_rate': 3.987738370625442e-05, 'epoch': 3.21} 6%|▋ | 3259/50750 [8:58:35<78:13:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:41:19,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:41:19,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.80 | bwd_microstep: 3851.39 | bwd_inner_microstep: 3843.56 | bwd_allreduce_microstep: 7.78 | step_microstep: 21.73 [2024-11-14 01:41:19,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.79 | bwd: 3851.40 | bwd_inner: 3843.56 | bwd_allreduce: 7.80 | step: 21.73 6%|▋ | 3260/50750 [8:58:41<78:15:07, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.987724254728518e-05, 'epoch': 3.21} 6%|▋ | 3260/50750 [8:58:41<78:15:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:41:25,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-14 01:41:25,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.08 | bwd_microstep: 3855.05 | bwd_inner_microstep: 3846.63 | bwd_allreduce_microstep: 8.35 | step_microstep: 24.79 [2024-11-14 01:41:25,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.06 | bwd: 3855.07 | bwd_inner: 3846.63 | bwd_allreduce: 8.38 | step: 24.79 6%|▋ | 3261/50750 [8:58:47<78:18:49, 5.94s/it] {'loss': 0.0016, 'learning_rate': 
3.987710130735992e-05, 'epoch': 3.21} 6%|▋ | 3261/50750 [8:58:47<78:18:49, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:41:31,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.45 | optimizer_step: 4.93 [2024-11-14 01:41:31,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.61 | bwd_microstep: 3863.73 | bwd_inner_microstep: 3855.56 | bwd_allreduce_microstep: 8.11 | step_microstep: 28.01 [2024-11-14 01:41:31,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.59 | bwd: 3863.75 | bwd_inner: 3855.56 | bwd_allreduce: 8.14 | step: 28.01 6%|▋ | 3262/50750 [8:58:53<78:23:27, 5.94s/it] {'loss': 1.0696, 'learning_rate': 3.9876959986479215e-05, 'epoch': 3.21} 6%|▋ | 3262/50750 [8:58:53<78:23:27, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:41:37,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 01:41:37,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.56 | bwd_microstep: 3851.08 | bwd_inner_microstep: 3842.65 | bwd_allreduce_microstep: 8.36 | step_microstep: 25.30 [2024-11-14 01:41:37,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.55 | bwd: 3851.10 | bwd_inner: 3842.65 | bwd_allreduce: 8.39 | step: 25.29 6%|▋ | 3263/50750 [8:58:59<78:23:51, 5.94s/it] {'loss': 0.0017, 'learning_rate': 3.987681858464364e-05, 'epoch': 3.21} 6%|▋ | 3263/50750 [8:58:59<78:23:51, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:41:43,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:41:43,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.20 | bwd_microstep: 
3854.69 | bwd_inner_microstep: 3847.06 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.53 [2024-11-14 01:41:43,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.19 | bwd: 3854.71 | bwd_inner: 3847.06 | bwd_allreduce: 7.61 | step: 21.54 6%|▋ | 3264/50750 [8:59:05<78:22:36, 5.94s/it] {'loss': 0.0873, 'learning_rate': 3.987667710185378e-05, 'epoch': 3.22} 6%|▋ | 3264/50750 [8:59:05<78:22:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:41:49,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.33 | optimizer_step: 4.93 [2024-11-14 01:41:49,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.58 | bwd_microstep: 3851.35 | bwd_inner_microstep: 3843.42 | bwd_allreduce_microstep: 7.86 | step_microstep: 27.51 [2024-11-14 01:41:49,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.57 | bwd: 3851.37 | bwd_inner: 3843.42 | bwd_allreduce: 7.89 | step: 27.53 6%|▋ | 3265/50750 [8:59:11<78:21:44, 5.94s/it] {'loss': 0.0003, 'learning_rate': 3.9876535538110205e-05, 'epoch': 3.22} 6%|▋ | 3265/50750 [8:59:11<78:21:44, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:41:55,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.33 | optimizer_step: 4.93 [2024-11-14 01:41:55,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.23 | bwd_microstep: 3850.95 | bwd_inner_microstep: 3842.40 | bwd_allreduce_microstep: 8.49 | step_microstep: 23.32 [2024-11-14 01:41:55,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.20 | bwd: 3850.96 | bwd_inner: 3842.40 | bwd_allreduce: 8.51 | step: 23.33 6%|▋ | 3266/50750 [8:59:17<78:21:47, 5.94s/it] {'loss': 0.0087, 'learning_rate': 3.987639389341349e-05, 'epoch': 3.22} 6%|▋ | 3266/50750 [8:59:17<78:21:47, 5.94s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:42:01,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:42:01,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.18 | bwd_microstep: 3848.06 | bwd_inner_microstep: 3840.46 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.80 [2024-11-14 01:42:01,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.16 | bwd: 3848.08 | bwd_inner: 3840.46 | bwd_allreduce: 7.57 | step: 21.81 6%|▋ | 3267/50750 [8:59:23<78:21:36, 5.94s/it] {'loss': 0.0055, 'learning_rate': 3.9876252167764204e-05, 'epoch': 3.22} 6%|▋ | 3267/50750 [8:59:23<78:21:36, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:42:07,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:42:07,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.10 | bwd_microstep: 3845.77 | bwd_inner_microstep: 3838.25 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.31 [2024-11-14 01:42:07,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.09 | bwd: 3845.78 | bwd_inner: 3838.25 | bwd_allreduce: 7.49 | step: 21.31 6%|▋ | 3268/50750 [8:59:29<78:15:54, 5.93s/it] {'loss': 0.658, 'learning_rate': 3.9876110361162946e-05, 'epoch': 3.22} 6%|▋ | 3268/50750 [8:59:29<78:15:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:42:13,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:42:13,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.57 | bwd_microstep: 3845.41 | bwd_inner_microstep: 3837.71 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.32 [2024-11-14 
01:42:13,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.54 | bwd: 3845.42 | bwd_inner: 3837.71 | bwd_allreduce: 7.67 | step: 21.33 6%|▋ | 3269/50750 [8:59:35<78:12:50, 5.93s/it] {'loss': 0.0758, 'learning_rate': 3.987596847361027e-05, 'epoch': 3.22} 6%|▋ | 3269/50750 [8:59:35<78:12:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:42:19,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:42:19,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.87 | bwd_microstep: 3849.52 | bwd_inner_microstep: 3841.81 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.74 [2024-11-14 01:42:19,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.86 | bwd: 3849.53 | bwd_inner: 3841.81 | bwd_allreduce: 7.69 | step: 21.74 6%|▋ | 3270/50750 [8:59:41<78:11:23, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.9875826505106766e-05, 'epoch': 3.22} 6%|▋ | 3270/50750 [8:59:41<78:11:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:42:25,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-14 01:42:25,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.95 | bwd_microstep: 3853.96 | bwd_inner_microstep: 3846.06 | bwd_allreduce_microstep: 7.84 | step_microstep: 22.72 [2024-11-14 01:42:25,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.95 | bwd: 3853.97 | bwd_inner: 3846.06 | bwd_allreduce: 7.86 | step: 22.73 6%|▋ | 3271/50750 [8:59:47<78:13:26, 5.93s/it] {'loss': 0.0556, 'learning_rate': 3.987568445565301e-05, 'epoch': 3.22} 6%|▋ | 3271/50750 [8:59:47<78:13:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:42:30,992] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:42:30,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.33 | bwd_microstep: 3845.01 | bwd_inner_microstep: 3837.34 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.72 [2024-11-14 01:42:30,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.32 | bwd: 3845.02 | bwd_inner: 3837.34 | bwd_allreduce: 7.64 | step: 21.72 6%|▋ | 3272/50750 [8:59:52<78:12:25, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.987554232524958e-05, 'epoch': 3.22} 6%|▋ | 3272/50750 [8:59:52<78:12:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:42:36,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:42:36,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.64 | bwd_microstep: 3851.80 | bwd_inner_microstep: 3844.29 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.15 [2024-11-14 01:42:36,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.62 | bwd: 3851.81 | bwd_inner: 3844.29 | bwd_allreduce: 7.48 | step: 21.16 6%|▋ | 3273/50750 [8:59:58<78:14:09, 5.93s/it] {'loss': 0.0104, 'learning_rate': 3.9875400113897066e-05, 'epoch': 3.22} 6%|▋ | 3273/50750 [8:59:58<78:14:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:42:42,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:42:42,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.41 | bwd_microstep: 3844.67 | bwd_inner_microstep: 3837.14 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.25 [2024-11-14 01:42:42,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.41 | bwd: 
3844.68 | bwd_inner: 3837.14 | bwd_allreduce: 7.50 | step: 21.25 6%|▋ | 3274/50750 [9:00:04<78:11:18, 5.93s/it] {'loss': 0.0006, 'learning_rate': 3.987525782159603e-05, 'epoch': 3.23} 6%|▋ | 3274/50750 [9:00:04<78:11:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:42:48,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 01:42:48,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.85 | bwd_microstep: 3856.24 | bwd_inner_microstep: 3848.51 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.79 [2024-11-14 01:42:48,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.83 | bwd: 3856.25 | bwd_inner: 3848.51 | bwd_allreduce: 7.70 | step: 21.79 6%|▋ | 3275/50750 [9:00:10<78:13:00, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.987511544834705e-05, 'epoch': 3.23} 6%|▋ | 3275/50750 [9:00:10<78:13:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:42:54,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:42:54,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.31 | bwd_microstep: 3849.88 | bwd_inner_microstep: 3842.41 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.01 [2024-11-14 01:42:54,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.28 | bwd: 3849.89 | bwd_inner: 3842.41 | bwd_allreduce: 7.44 | step: 21.01 6%|▋ | 3276/50750 [9:00:16<78:14:07, 5.93s/it] {'loss': 0.0025, 'learning_rate': 3.987497299415073e-05, 'epoch': 3.23} 6%|▋ | 3276/50750 [9:00:16<78:14:07, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:43:00,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | 
optimizer_step: 4.93 [2024-11-14 01:43:00,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.62 | bwd_microstep: 3841.02 | bwd_inner_microstep: 3833.46 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.72 [2024-11-14 01:43:00,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.62 | bwd: 3841.04 | bwd_inner: 3833.46 | bwd_allreduce: 7.53 | step: 21.72 6%|▋ | 3277/50750 [9:00:22<78:11:02, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.987483045900763e-05, 'epoch': 3.23} 6%|▋ | 3277/50750 [9:00:22<78:11:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:43:06,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:43:06,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.62 | bwd_microstep: 3846.48 | bwd_inner_microstep: 3838.96 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.32 [2024-11-14 01:43:06,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.60 | bwd: 3846.49 | bwd_inner: 3838.96 | bwd_allreduce: 7.49 | step: 21.33 6%|▋ | 3278/50750 [9:00:28<78:09:39, 5.93s/it] {'loss': 0.0176, 'learning_rate': 3.987468784291832e-05, 'epoch': 3.23} 6%|▋ | 3278/50750 [9:00:28<78:09:39, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:43:12,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 01:43:12,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.96 | bwd_microstep: 3849.96 | bwd_inner_microstep: 3842.45 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.27 [2024-11-14 01:43:12,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.94 | bwd: 3849.97 | bwd_inner: 3842.45 | bwd_allreduce: 7.48 | step: 21.27 6%|▋ | 3279/50750 [9:00:34<78:09:47, 
5.93s/it] {'loss': 0.001, 'learning_rate': 3.9874545145883416e-05, 'epoch': 3.23} 6%|▋ | 3279/50750 [9:00:34<78:09:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:43:18,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.69 | optimizer_step: 4.93 [2024-11-14 01:43:18,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.47 | bwd_microstep: 3851.58 | bwd_inner_microstep: 3843.69 | bwd_allreduce_microstep: 7.84 | step_microstep: 22.71 [2024-11-14 01:43:18,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3851.59 | bwd_inner: 3843.69 | bwd_allreduce: 7.86 | step: 22.73 6%|▋ | 3280/50750 [9:00:40<78:09:18, 5.93s/it] {'loss': 0.0025, 'learning_rate': 3.987440236790347e-05, 'epoch': 3.23} 6%|▋ | 3280/50750 [9:00:40<78:09:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:43:24,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:43:24,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.49 | bwd_microstep: 3853.69 | bwd_inner_microstep: 3845.01 | bwd_allreduce_microstep: 8.63 | step_microstep: 21.12 [2024-11-14 01:43:24,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.49 | bwd: 3853.71 | bwd_inner: 3845.01 | bwd_allreduce: 8.65 | step: 21.13 6%|▋ | 3281/50750 [9:00:46<78:08:28, 5.93s/it] {'loss': 0.0324, 'learning_rate': 3.9874259508979076e-05, 'epoch': 3.23} 6%|▋ | 3281/50750 [9:00:46<78:08:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:43:30,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:43:30,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2029.23 | bwd_microstep: 3851.05 | bwd_inner_microstep: 3843.58 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.73 [2024-11-14 01:43:30,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.23 | bwd: 3851.06 | bwd_inner: 3843.57 | bwd_allreduce: 7.45 | step: 20.73 6%|▋ | 3282/50750 [9:00:52<78:08:27, 5.93s/it] {'loss': 0.0871, 'learning_rate': 3.987411656911081e-05, 'epoch': 3.23} 6%|▋ | 3282/50750 [9:00:52<78:08:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:43:36,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:43:36,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.91 | bwd_microstep: 3855.23 | bwd_inner_microstep: 3847.74 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.00 [2024-11-14 01:43:36,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.91 | bwd: 3855.25 | bwd_inner: 3847.74 | bwd_allreduce: 7.46 | step: 21.00 6%|▋ | 3283/50750 [9:00:58<78:10:15, 5.93s/it] {'loss': 0.0871, 'learning_rate': 3.987397354829925e-05, 'epoch': 3.23} 6%|▋ | 3283/50750 [9:00:58<78:10:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:43:42,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.94 [2024-11-14 01:43:42,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.88 | bwd_microstep: 3860.32 | bwd_inner_microstep: 3852.74 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.66 [2024-11-14 01:43:42,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.88 | bwd: 3860.33 | bwd_inner: 3852.74 | bwd_allreduce: 7.56 | step: 21.67 6%|▋ | 3284/50750 [9:01:04<78:12:44, 5.93s/it] {'loss': 0.0021, 'learning_rate': 3.9873830446544995e-05, 'epoch': 3.24} 6%|▋ | 3284/50750 
[9:01:04<78:12:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:43:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:43:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.64 | bwd_microstep: 3856.95 | bwd_inner_microstep: 3849.18 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.56 [2024-11-14 01:43:48,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.64 | bwd: 3856.96 | bwd_inner: 3849.18 | bwd_allreduce: 7.73 | step: 22.55 6%|▋ | 3285/50750 [9:01:10<78:14:04, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.987368726384862e-05, 'epoch': 3.24} 6%|▋ | 3285/50750 [9:01:10<78:14:04, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:43:54,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-14 01:43:54,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.46 | bwd_microstep: 3857.14 | bwd_inner_microstep: 3849.62 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.34 [2024-11-14 01:43:54,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.45 | bwd: 3857.16 | bwd_inner: 3849.62 | bwd_allreduce: 7.50 | step: 21.34 6%|▋ | 3286/50750 [9:01:15<78:15:03, 5.94s/it] {'loss': 0.3286, 'learning_rate': 3.98735440002107e-05, 'epoch': 3.24} 6%|▋ | 3286/50750 [9:01:15<78:15:03, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:43:59,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:43:59,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.88 | bwd_microstep: 3842.08 | bwd_inner_microstep: 3834.60 | 
bwd_allreduce_microstep: 7.44 | step_microstep: 20.83 [2024-11-14 01:43:59,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.85 | bwd: 3842.09 | bwd_inner: 3834.60 | bwd_allreduce: 7.45 | step: 20.83 6%|▋ | 3287/50750 [9:01:21<78:10:40, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.987340065563182e-05, 'epoch': 3.24} 6%|▋ | 3287/50750 [9:01:21<78:10:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:44:05,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 01:44:05,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.34 | bwd_microstep: 3847.79 | bwd_inner_microstep: 3840.10 | bwd_allreduce_microstep: 7.65 | step_microstep: 21.77 [2024-11-14 01:44:05,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | bwd: 3847.80 | bwd_inner: 3840.10 | bwd_allreduce: 7.66 | step: 21.77 6%|▋ | 3288/50750 [9:01:27<78:07:56, 5.93s/it] {'loss': 0.0208, 'learning_rate': 3.987325723011258e-05, 'epoch': 3.24} 6%|▋ | 3288/50750 [9:01:27<78:07:56, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:44:11,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:44:11,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.17 | bwd_microstep: 3847.78 | bwd_inner_microstep: 3835.68 | bwd_allreduce_microstep: 12.01 | step_microstep: 21.47 [2024-11-14 01:44:11,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.17 | bwd: 3847.80 | bwd_inner: 3835.68 | bwd_allreduce: 12.05 | step: 21.46 6%|▋ | 3289/50750 [9:01:33<78:06:26, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.9873113723653544e-05, 'epoch': 3.24} 6%|▋ | 3289/50750 [9:01:33<78:06:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2199 [2024-11-14 01:44:17,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:44:17,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.32 | bwd_microstep: 3843.85 | bwd_inner_microstep: 3836.31 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.75 [2024-11-14 01:44:17,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.30 | bwd: 3843.86 | bwd_inner: 3836.31 | bwd_allreduce: 7.51 | step: 21.76 6%|▋ | 3290/50750 [9:01:39<78:04:42, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.9872970136255306e-05, 'epoch': 3.24} 6%|▋ | 3290/50750 [9:01:39<78:04:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:44:23,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-14 01:44:23,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.17 | bwd_microstep: 3847.16 | bwd_inner_microstep: 3839.51 | bwd_allreduce_microstep: 7.60 | step_microstep: 22.31 [2024-11-14 01:44:23,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.17 | bwd: 3847.17 | bwd_inner: 3839.51 | bwd_allreduce: 7.62 | step: 22.31 6%|▋ | 3291/50750 [9:01:45<78:05:34, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9872826467918456e-05, 'epoch': 3.24} 6%|▋ | 3291/50750 [9:01:45<78:05:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:44:29,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 01:44:29,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.44 | bwd_microstep: 3845.96 | bwd_inner_microstep: 3838.23 | bwd_allreduce_microstep: 7.69 | step_microstep: 22.26 [2024-11-14 01:44:29,553] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.43 | bwd: 3845.98 | bwd_inner: 3838.23 | bwd_allreduce: 7.70 | step: 22.27 6%|▋ | 3292/50750 [9:01:51<78:09:26, 5.93s/it] {'loss': 0.001, 'learning_rate': 3.987268271864356e-05, 'epoch': 3.24} 6%|▋ | 3292/50750 [9:01:51<78:09:26, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:44:35,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.22 | optimizer_step: 4.93 [2024-11-14 01:44:35,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.82 | bwd_microstep: 3845.81 | bwd_inner_microstep: 3837.90 | bwd_allreduce_microstep: 7.86 | step_microstep: 22.19 [2024-11-14 01:44:35,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.80 | bwd: 3845.83 | bwd_inner: 3837.90 | bwd_allreduce: 7.88 | step: 22.20 6%|▋ | 3293/50750 [9:01:57<78:10:51, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.987253888843123e-05, 'epoch': 3.24} 6%|▋ | 3293/50750 [9:01:57<78:10:51, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:44:41,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 01:44:41,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.12 | bwd_microstep: 3854.66 | bwd_inner_microstep: 3847.08 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.41 [2024-11-14 01:44:41,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.11 | bwd: 3854.67 | bwd_inner: 3847.08 | bwd_allreduce: 7.55 | step: 21.42 6%|▋ | 3294/50750 [9:02:03<78:11:09, 5.93s/it] {'loss': 0.1122, 'learning_rate': 3.987239497728203e-05, 'epoch': 3.25} 6%|▋ | 3294/50750 [9:02:03<78:11:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:44:47,356] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.92 [2024-11-14 01:44:47,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.07 | bwd_microstep: 3849.29 | bwd_inner_microstep: 3841.57 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.72 [2024-11-14 01:44:47,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.07 | bwd: 3849.30 | bwd_inner: 3841.57 | bwd_allreduce: 7.69 | step: 21.72 6%|▋ | 3295/50750 [9:02:09<78:10:02, 5.93s/it] {'loss': 0.1648, 'learning_rate': 3.987225098519656e-05, 'epoch': 3.25} 6%|▋ | 3295/50750 [9:02:09<78:10:02, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:44:53,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:44:53,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.74 | bwd_microstep: 3845.46 | bwd_inner_microstep: 3837.89 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.53 [2024-11-14 01:44:53,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.72 | bwd: 3845.47 | bwd_inner: 3837.89 | bwd_allreduce: 7.53 | step: 21.54 6%|▋ | 3296/50750 [9:02:15<78:07:49, 5.93s/it] {'loss': 0.01, 'learning_rate': 3.9872106912175395e-05, 'epoch': 3.25} 6%|▋ | 3296/50750 [9:02:15<78:07:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:44:59,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:44:59,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.77 | bwd_microstep: 3845.96 | bwd_inner_microstep: 3838.39 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.34 [2024-11-14 01:44:59,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.75 | bwd: 3845.97 | bwd_inner: 3838.39 | 
bwd_allreduce: 7.54 | step: 21.34 6%|▋ | 3297/50750 [9:02:21<78:06:37, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.9871962758219134e-05, 'epoch': 3.25} 6%|▋ | 3297/50750 [9:02:21<78:06:37, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:45:05,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:45:05,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.05 | bwd_microstep: 3851.27 | bwd_inner_microstep: 3843.72 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.43 [2024-11-14 01:45:05,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.04 | bwd: 3851.29 | bwd_inner: 3843.72 | bwd_allreduce: 7.53 | step: 21.44 6%|▋ | 3298/50750 [9:02:27<78:06:47, 5.93s/it] {'loss': 0.0041, 'learning_rate': 3.987181852332835e-05, 'epoch': 3.25} 6%|▋ | 3298/50750 [9:02:27<78:06:47, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:45:11,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:45:11,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.39 | bwd_microstep: 3846.81 | bwd_inner_microstep: 3839.26 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.42 [2024-11-14 01:45:11,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.39 | bwd: 3846.82 | bwd_inner: 3839.26 | bwd_allreduce: 7.52 | step: 21.43 7%|▋ | 3299/50750 [9:02:33<78:05:18, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.9871674207503647e-05, 'epoch': 3.25} 7%|▋ | 3299/50750 [9:02:33<78:05:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:45:16,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 
[2024-11-14 01:45:16,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.10 | bwd_microstep: 3855.07 | bwd_inner_microstep: 3847.52 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.42 [2024-11-14 01:45:16,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.08 | bwd: 3855.08 | bwd_inner: 3847.52 | bwd_allreduce: 7.52 | step: 21.42 7%|▋ | 3300/50750 [9:02:38<78:07:00, 5.93s/it] {'loss': 0.8445, 'learning_rate': 3.98715298107456e-05, 'epoch': 3.25} 7%|▋ | 3300/50750 [9:02:38<78:07:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:45:22,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 01:45:22,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.70 | bwd_microstep: 3845.38 | bwd_inner_microstep: 3837.83 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.82 [2024-11-14 01:45:22,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.70 | bwd: 3845.39 | bwd_inner: 3837.83 | bwd_allreduce: 7.52 | step: 21.82 7%|▋ | 3301/50750 [9:02:44<78:05:57, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9871385333054796e-05, 'epoch': 3.25} 7%|▋ | 3301/50750 [9:02:44<78:05:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:45:28,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 01:45:28,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.03 | bwd_microstep: 3849.51 | bwd_inner_microstep: 3842.00 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.44 [2024-11-14 01:45:28,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.03 | bwd: 3849.53 | bwd_inner: 3842.00 | bwd_allreduce: 7.49 | step: 21.44 7%|▋ | 3302/50750 [9:02:50<78:04:49, 5.92s/it] {'loss': 0.5335, 
'learning_rate': 3.987124077443184e-05, 'epoch': 3.25} 7%|▋ | 3302/50750 [9:02:50<78:04:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:45:34,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:45:34,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.53 | bwd_microstep: 3845.55 | bwd_inner_microstep: 3838.01 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.32 [2024-11-14 01:45:34,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.53 | bwd: 3845.56 | bwd_inner: 3838.01 | bwd_allreduce: 7.50 | step: 21.33 7%|▋ | 3303/50750 [9:02:56<78:03:15, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.98710961348773e-05, 'epoch': 3.25} 7%|▋ | 3303/50750 [9:02:56<78:03:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:45:40,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:45:40,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.32 | bwd_microstep: 3843.61 | bwd_inner_microstep: 3836.05 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.39 [2024-11-14 01:45:40,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.30 | bwd: 3843.62 | bwd_inner: 3836.05 | bwd_allreduce: 7.53 | step: 21.39 7%|▋ | 3304/50750 [9:03:02<78:02:57, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.987095141439177e-05, 'epoch': 3.26} 7%|▋ | 3304/50750 [9:03:02<78:02:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:45:46,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:45:46,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | 
bwd_microstep: 3843.96 | bwd_inner_microstep: 3836.39 | bwd_allreduce_microstep: 7.52 | step_microstep: 22.36 [2024-11-14 01:45:46,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.17 | bwd: 3843.97 | bwd_inner: 3836.39 | bwd_allreduce: 7.54 | step: 22.37 7%|▋ | 3305/50750 [9:03:08<78:02:44, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.987080661297585e-05, 'epoch': 3.26} 7%|▋ | 3305/50750 [9:03:08<78:02:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:45:52,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:45:52,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.78 | bwd_microstep: 3845.93 | bwd_inner_microstep: 3838.34 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.41 [2024-11-14 01:45:52,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.77 | bwd: 3845.94 | bwd_inner: 3838.34 | bwd_allreduce: 7.56 | step: 21.42 7%|▋ | 3306/50750 [9:03:14<78:02:06, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9870661730630126e-05, 'epoch': 3.26} 7%|▋ | 3306/50750 [9:03:14<78:02:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:45:58,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:45:58,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.07 | bwd_microstep: 3848.82 | bwd_inner_microstep: 3840.77 | bwd_allreduce_microstep: 7.99 | step_microstep: 21.74 [2024-11-14 01:45:58,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.07 | bwd: 3848.84 | bwd_inner: 3840.77 | bwd_allreduce: 8.01 | step: 21.74 7%|▋ | 3307/50750 [9:03:20<78:02:01, 5.92s/it] {'loss': 0.0019, 'learning_rate': 3.987051676735518e-05, 'epoch': 3.26} 7%|▋ | 3307/50750 [9:03:20<78:02:01, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:46:04,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:46:04,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.79 | bwd_microstep: 3845.93 | bwd_inner_microstep: 3838.17 | bwd_allreduce_microstep: 7.72 | step_microstep: 21.44 [2024-11-14 01:46:04,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.78 | bwd: 3845.94 | bwd_inner: 3838.17 | bwd_allreduce: 7.73 | step: 21.45 7%|▋ | 3308/50750 [9:03:26<78:02:35, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.987037172315161e-05, 'epoch': 3.26} 7%|▋ | 3308/50750 [9:03:26<78:02:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:46:10,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:46:10,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.05 | bwd_microstep: 3844.20 | bwd_inner_microstep: 3836.64 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.35 [2024-11-14 01:46:10,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.04 | bwd: 3844.22 | bwd_inner: 3836.64 | bwd_allreduce: 7.53 | step: 21.35 7%|▋ | 3309/50750 [9:03:32<78:01:23, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.987022659802001e-05, 'epoch': 3.26} 7%|▋ | 3309/50750 [9:03:32<78:01:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:46:16,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:46:16,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.08 | bwd_microstep: 3848.61 | bwd_inner_microstep: 3841.06 | bwd_allreduce_microstep: 7.50 | 
step_microstep: 21.43 [2024-11-14 01:46:16,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.06 | bwd: 3848.62 | bwd_inner: 3841.06 | bwd_allreduce: 7.52 | step: 21.44 7%|▋ | 3310/50750 [9:03:38<78:01:28, 5.92s/it] {'loss': 1.5676, 'learning_rate': 3.987008139196096e-05, 'epoch': 3.26} 7%|▋ | 3310/50750 [9:03:38<78:01:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:46:22,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:46:22,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.17 | bwd_microstep: 3850.07 | bwd_inner_microstep: 3842.47 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.40 [2024-11-14 01:46:22,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.16 | bwd: 3850.08 | bwd_inner: 3842.47 | bwd_allreduce: 7.57 | step: 21.42 7%|▋ | 3311/50750 [9:03:44<78:03:02, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.986993610497505e-05, 'epoch': 3.26} 7%|▋ | 3311/50750 [9:03:44<78:03:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:46:28,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:46:28,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.64 | bwd_microstep: 3849.59 | bwd_inner_microstep: 3841.96 | bwd_allreduce_microstep: 7.58 | step_microstep: 22.01 [2024-11-14 01:46:28,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.63 | bwd: 3849.61 | bwd_inner: 3841.96 | bwd_allreduce: 7.60 | step: 22.01 7%|▋ | 3312/50750 [9:03:50<78:05:09, 5.93s/it] {'loss': 0.1448, 'learning_rate': 3.986979073706289e-05, 'epoch': 3.26} 7%|▋ | 3312/50750 [9:03:50<78:05:09, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 
01:46:33,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:46:33,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.21 | bwd_microstep: 3845.71 | bwd_inner_microstep: 3838.17 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.46 [2024-11-14 01:46:33,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.18 | bwd: 3845.73 | bwd_inner: 3838.17 | bwd_allreduce: 7.51 | step: 21.46 7%|▋ | 3313/50750 [9:03:55<78:04:15, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9869645288225056e-05, 'epoch': 3.26} 7%|▋ | 3313/50750 [9:03:55<78:04:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:46:39,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:46:39,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.23 | bwd_microstep: 3844.82 | bwd_inner_microstep: 3837.33 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.08 [2024-11-14 01:46:39,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.23 | bwd: 3844.83 | bwd_inner: 3837.33 | bwd_allreduce: 7.46 | step: 21.08 7%|▋ | 3314/50750 [9:04:01<78:02:05, 5.92s/it] {'loss': 0.0035, 'learning_rate': 3.9869499758462146e-05, 'epoch': 3.27} 7%|▋ | 3314/50750 [9:04:01<78:02:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:46:45,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:46:45,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.30 | bwd_microstep: 3847.07 | bwd_inner_microstep: 3839.58 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.10 [2024-11-14 01:46:45,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2024.30 | bwd: 3847.08 | bwd_inner: 3839.58 | bwd_allreduce: 7.46 | step: 21.10 7%|▋ | 3315/50750 [9:04:07<78:01:11, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.986935414777476e-05, 'epoch': 3.27} 7%|▋ | 3315/50750 [9:04:07<78:01:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:46:51,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:46:51,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.41 | bwd_microstep: 3845.05 | bwd_inner_microstep: 3837.56 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.97 [2024-11-14 01:46:51,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.41 | bwd: 3845.06 | bwd_inner: 3837.56 | bwd_allreduce: 7.46 | step: 20.97 7%|▋ | 3316/50750 [9:04:13<78:00:00, 5.92s/it] {'loss': 0.0242, 'learning_rate': 3.986920845616347e-05, 'epoch': 3.27} 7%|▋ | 3316/50750 [9:04:13<78:00:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:46:57,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:46:57,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.09 | bwd_microstep: 3851.12 | bwd_inner_microstep: 3843.48 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.81 [2024-11-14 01:46:57,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.09 | bwd: 3851.13 | bwd_inner: 3843.48 | bwd_allreduce: 7.61 | step: 21.81 7%|▋ | 3317/50750 [9:04:19<78:02:03, 5.92s/it] {'loss': 0.7214, 'learning_rate': 3.986906268362889e-05, 'epoch': 3.27} 7%|▋ | 3317/50750 [9:04:19<78:02:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:47:03,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 01:47:03,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.55 | bwd_microstep: 3850.80 | bwd_inner_microstep: 3843.32 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.13 [2024-11-14 01:47:03,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.54 | bwd: 3850.81 | bwd_inner: 3843.32 | bwd_allreduce: 7.45 | step: 21.13 7%|▋ | 3318/50750 [9:04:25<78:01:52, 5.92s/it] {'loss': 0.1156, 'learning_rate': 3.98689168301716e-05, 'epoch': 3.27} 7%|▋ | 3318/50750 [9:04:25<78:01:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:47:09,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:47:09,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.91 | bwd_microstep: 3841.24 | bwd_inner_microstep: 3833.78 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.94 [2024-11-14 01:47:09,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.91 | bwd: 3841.25 | bwd_inner: 3833.78 | bwd_allreduce: 7.43 | step: 20.94 7%|▋ | 3319/50750 [9:04:31<78:00:52, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.986877089579221e-05, 'epoch': 3.27} 7%|▋ | 3319/50750 [9:04:31<78:00:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:47:15,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:47:15,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.27 | bwd_microstep: 3846.91 | bwd_inner_microstep: 3839.45 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.20 [2024-11-14 01:47:15,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.27 | bwd: 3846.92 | bwd_inner: 3839.45 | bwd_allreduce: 7.44 | step: 21.20 7%|▋ | 3320/50750 
[9:04:37<78:00:59, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.9868624880491296e-05, 'epoch': 3.27} 7%|▋ | 3320/50750 [9:04:37<78:00:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:47:21,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:47:21,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.46 | bwd_microstep: 3841.48 | bwd_inner_microstep: 3834.02 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.83 [2024-11-14 01:47:21,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.46 | bwd: 3841.49 | bwd_inner: 3834.02 | bwd_allreduce: 7.43 | step: 20.84 7%|▋ | 3321/50750 [9:04:43<77:58:48, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9868478784269465e-05, 'epoch': 3.27} 7%|▋ | 3321/50750 [9:04:43<77:58:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:47:27,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:47:27,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.03 | bwd_microstep: 3850.74 | bwd_inner_microstep: 3841.93 | bwd_allreduce_microstep: 8.77 | step_microstep: 21.57 [2024-11-14 01:47:27,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.03 | bwd: 3850.76 | bwd_inner: 3841.93 | bwd_allreduce: 8.78 | step: 21.57 7%|▋ | 3322/50750 [9:04:49<77:59:46, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.9868332607127304e-05, 'epoch': 3.27} 7%|▋ | 3322/50750 [9:04:49<77:59:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:47:33,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:47:33,185] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2024.92 | bwd_microstep: 3857.20 | bwd_inner_microstep: 3849.69 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.01 [2024-11-14 01:47:33,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | bwd: 3857.21 | bwd_inner: 3849.69 | bwd_allreduce: 7.48 | step: 21.01 7%|▋ | 3323/50750 [9:04:55<78:02:35, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9868186349065416e-05, 'epoch': 3.27} 7%|▋ | 3323/50750 [9:04:55<78:02:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:47:39,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:47:39,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.58 | bwd_microstep: 3860.72 | bwd_inner_microstep: 3853.21 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-14 01:47:39,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.58 | bwd: 3860.74 | bwd_inner: 3853.21 | bwd_allreduce: 7.49 | step: 21.13 7%|▋ | 3324/50750 [9:05:01<78:04:12, 5.93s/it] {'loss': 0.2687, 'learning_rate': 3.986804001008439e-05, 'epoch': 3.27} 7%|▋ | 3324/50750 [9:05:01<78:04:12, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:47:45,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:47:45,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3852.07 | bwd_inner_microstep: 3844.55 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.03 [2024-11-14 01:47:45,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3852.08 | bwd_inner: 3844.55 | bwd_allreduce: 7.49 | step: 21.03 7%|▋ | 3325/50750 [9:05:07<78:03:40, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.9867893590184835e-05, 'epoch': 3.28} 
7%|▋ | 3325/50750 [9:05:07<78:03:40, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:47:50,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:47:50,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.70 | bwd_microstep: 3853.90 | bwd_inner_microstep: 3846.38 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.27 [2024-11-14 01:47:50,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.70 | bwd: 3853.92 | bwd_inner: 3846.38 | bwd_allreduce: 7.49 | step: 21.28 7%|▋ | 3326/50750 [9:05:12<78:03:59, 5.93s/it] {'loss': 0.1438, 'learning_rate': 3.986774708936733e-05, 'epoch': 3.28} 7%|▋ | 3326/50750 [9:05:12<78:03:59, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:47:56,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:47:56,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.90 | bwd_microstep: 3854.15 | bwd_inner_microstep: 3846.60 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.43 [2024-11-14 01:47:56,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.90 | bwd: 3854.16 | bwd_inner: 3846.60 | bwd_allreduce: 7.52 | step: 21.43 7%|▋ | 3327/50750 [9:05:18<78:04:10, 5.93s/it] {'loss': 0.0085, 'learning_rate': 3.9867600507632474e-05, 'epoch': 3.28} 7%|▋ | 3327/50750 [9:05:18<78:04:10, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:48:02,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:48:02,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.05 | bwd_microstep: 3854.98 | bwd_inner_microstep: 3847.12 | 
bwd_allreduce_microstep: 7.81 | step_microstep: 21.69 [2024-11-14 01:48:02,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.05 | bwd: 3855.00 | bwd_inner: 3847.12 | bwd_allreduce: 7.83 | step: 21.69 7%|▋ | 3328/50750 [9:05:24<78:04:27, 5.93s/it] {'loss': 0.0107, 'learning_rate': 3.986745384498088e-05, 'epoch': 3.28} 7%|▋ | 3328/50750 [9:05:24<78:04:27, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:48:08,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:48:08,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.94 | bwd_microstep: 3846.35 | bwd_inner_microstep: 3838.66 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.05 [2024-11-14 01:48:08,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.94 | bwd: 3846.36 | bwd_inner: 3838.66 | bwd_allreduce: 7.66 | step: 21.06 7%|▋ | 3329/50750 [9:05:30<78:02:11, 5.92s/it] {'loss': 0.0955, 'learning_rate': 3.986730710141313e-05, 'epoch': 3.28} 7%|▋ | 3329/50750 [9:05:30<78:02:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:48:14,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.74 | optimizer_step: 4.93 [2024-11-14 01:48:14,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.89 | bwd_microstep: 3847.85 | bwd_inner_microstep: 3839.64 | bwd_allreduce_microstep: 8.14 | step_microstep: 30.62 [2024-11-14 01:48:14,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.89 | bwd: 3847.87 | bwd_inner: 3839.64 | bwd_allreduce: 8.17 | step: 30.62 7%|▋ | 3330/50750 [9:05:36<78:05:06, 5.93s/it] {'loss': 0.0806, 'learning_rate': 3.9867160276929827e-05, 'epoch': 3.28} 7%|▋ | 3330/50750 [9:05:36<78:05:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2202 [2024-11-14 01:48:20,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.39 | optimizer_step: 4.92 [2024-11-14 01:48:20,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.52 | bwd_microstep: 3848.71 | bwd_inner_microstep: 3840.44 | bwd_allreduce_microstep: 8.23 | step_microstep: 23.82 [2024-11-14 01:48:20,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.50 | bwd: 3848.73 | bwd_inner: 3840.44 | bwd_allreduce: 8.25 | step: 23.83 7%|▋ | 3331/50750 [9:05:42<78:12:17, 5.94s/it] {'loss': 0.0004, 'learning_rate': 3.9867013371531566e-05, 'epoch': 3.28} 7%|▋ | 3331/50750 [9:05:42<78:12:17, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:48:26,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:48:26,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.05 | bwd_microstep: 3847.47 | bwd_inner_microstep: 3839.97 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98 [2024-11-14 01:48:26,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.01 | bwd: 3847.49 | bwd_inner: 3839.97 | bwd_allreduce: 7.48 | step: 20.98 7%|▋ | 3332/50750 [9:05:48<78:08:25, 5.93s/it] {'loss': 0.0084, 'learning_rate': 3.986686638521895e-05, 'epoch': 3.28} 7%|▋ | 3332/50750 [9:05:48<78:08:25, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:48:32,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 01:48:32,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.79 | bwd_microstep: 3850.39 | bwd_inner_microstep: 3842.86 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.82 [2024-11-14 01:48:32,479] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.79 | bwd: 3850.40 | bwd_inner: 3842.86 | bwd_allreduce: 7.50 | step: 21.82 7%|▋ | 3333/50750 [9:05:54<78:06:30, 5.93s/it] {'loss': 0.5182, 'learning_rate': 3.986671931799257e-05, 'epoch': 3.28} 7%|▋ | 3333/50750 [9:05:54<78:06:30, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:48:38,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:48:38,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.13 | bwd_microstep: 3844.54 | bwd_inner_microstep: 3837.06 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.01 [2024-11-14 01:48:38,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.13 | bwd: 3844.55 | bwd_inner: 3837.06 | bwd_allreduce: 7.45 | step: 21.01 7%|▋ | 3334/50750 [9:06:00<78:03:23, 5.93s/it] {'loss': 0.2663, 'learning_rate': 3.9866572169853036e-05, 'epoch': 3.28} 7%|▋ | 3334/50750 [9:06:00<78:03:23, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:48:44,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:48:44,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.23 | bwd_microstep: 3842.36 | bwd_inner_microstep: 3834.89 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.09 [2024-11-14 01:48:44,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.23 | bwd: 3842.37 | bwd_inner: 3834.89 | bwd_allreduce: 7.44 | step: 21.09 7%|▋ | 3335/50750 [9:06:06<78:00:39, 5.92s/it] {'loss': 0.3138, 'learning_rate': 3.986642494080094e-05, 'epoch': 3.29} 7%|▋ | 3335/50750 [9:06:06<78:00:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:48:50,238] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:48:50,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.54 | bwd_microstep: 3846.66 | bwd_inner_microstep: 3839.15 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-14 01:48:50,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.54 | bwd: 3846.67 | bwd_inner: 3839.15 | bwd_allreduce: 7.47 | step: 21.10 7%|▋ | 3336/50750 [9:06:12<78:00:36, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.9866277630836887e-05, 'epoch': 3.29} 7%|▋ | 3336/50750 [9:06:12<78:00:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:48:56,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:48:56,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.04 | bwd_microstep: 3845.90 | bwd_inner_microstep: 3838.44 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.87 [2024-11-14 01:48:56,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.04 | bwd: 3845.91 | bwd_inner: 3838.44 | bwd_allreduce: 7.44 | step: 20.87 7%|▋ | 3337/50750 [9:06:18<77:58:45, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.986613023996146e-05, 'epoch': 3.29} 7%|▋ | 3337/50750 [9:06:18<77:58:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:49:02,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:49:02,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.36 | bwd_microstep: 3844.48 | bwd_inner_microstep: 3836.97 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.99 [2024-11-14 01:49:02,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.36 | bwd: 3844.49 | bwd_inner: 3836.97 | 
bwd_allreduce: 7.48 | step: 20.99 7%|▋ | 3338/50750 [9:06:24<77:57:38, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.986598276817528e-05, 'epoch': 3.29} 7%|▋ | 3338/50750 [9:06:24<77:57:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:49:07,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:49:07,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.56 | bwd_microstep: 3850.46 | bwd_inner_microstep: 3842.97 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.00 [2024-11-14 01:49:07,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.56 | bwd: 3850.47 | bwd_inner: 3842.97 | bwd_allreduce: 7.46 | step: 21.01 7%|▋ | 3339/50750 [9:06:29<77:58:06, 5.92s/it] {'loss': 0.5527, 'learning_rate': 3.9865835215478944e-05, 'epoch': 3.29} 7%|▋ | 3339/50750 [9:06:29<77:58:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:49:13,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.99 [2024-11-14 01:49:13,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.66 | bwd_microstep: 3842.33 | bwd_inner_microstep: 3834.84 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.13 [2024-11-14 01:49:13,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.66 | bwd: 3842.34 | bwd_inner: 3834.84 | bwd_allreduce: 7.46 | step: 21.14 7%|▋ | 3340/50750 [9:06:35<77:56:12, 5.92s/it] {'loss': 0.2451, 'learning_rate': 3.986568758187304e-05, 'epoch': 3.29} 7%|▋ | 3340/50750 [9:06:35<77:56:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:49:19,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 
[2024-11-14 01:49:19,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.52 | bwd_microstep: 3841.75 | bwd_inner_microstep: 3834.26 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.26 [2024-11-14 01:49:19,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.52 | bwd: 3841.76 | bwd_inner: 3834.26 | bwd_allreduce: 7.46 | step: 21.26 7%|▋ | 3341/50750 [9:06:41<77:55:30, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.986553986735819e-05, 'epoch': 3.29} 7%|▋ | 3341/50750 [9:06:41<77:55:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:49:25,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:49:25,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.80 | bwd_microstep: 3847.83 | bwd_inner_microstep: 3840.34 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.98 [2024-11-14 01:49:25,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.80 | bwd: 3847.84 | bwd_inner: 3840.34 | bwd_allreduce: 7.47 | step: 20.99 7%|▋ | 3342/50750 [9:06:47<77:55:11, 5.92s/it] {'loss': 0.041, 'learning_rate': 3.9865392071934975e-05, 'epoch': 3.29} 7%|▋ | 3342/50750 [9:06:47<77:55:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:49:31,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 01:49:31,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.37 | bwd_microstep: 3840.24 | bwd_inner_microstep: 3832.75 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.26 [2024-11-14 01:49:31,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.37 | bwd: 3840.25 | bwd_inner: 3832.75 | bwd_allreduce: 7.46 | step: 21.27 7%|▋ | 3343/50750 [9:06:53<77:53:46, 5.92s/it] {'loss': 0.0445, 
'learning_rate': 3.986524419560401e-05, 'epoch': 3.29} 7%|▋ | 3343/50750 [9:06:53<77:53:46, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:49:37,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:49:37,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.66 | bwd_microstep: 3845.37 | bwd_inner_microstep: 3837.55 | bwd_allreduce_microstep: 7.77 | step_microstep: 21.95 [2024-11-14 01:49:37,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.66 | bwd: 3845.38 | bwd_inner: 3837.55 | bwd_allreduce: 7.79 | step: 21.95 7%|▋ | 3344/50750 [9:06:59<77:56:19, 5.92s/it] {'loss': 0.0016, 'learning_rate': 3.9865096238365894e-05, 'epoch': 3.29} 7%|▋ | 3344/50750 [9:06:59<77:56:19, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:49:43,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:49:43,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.82 | bwd_microstep: 3846.94 | bwd_inner_microstep: 3839.41 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.28 [2024-11-14 01:49:43,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.82 | bwd: 3846.95 | bwd_inner: 3839.41 | bwd_allreduce: 7.50 | step: 21.29 7%|▋ | 3345/50750 [9:07:05<77:56:48, 5.92s/it] {'loss': 0.0309, 'learning_rate': 3.9864948200221224e-05, 'epoch': 3.3} 7%|▋ | 3345/50750 [9:07:05<77:56:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:49:49,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:49:49,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.02 | 
bwd_microstep: 3844.74 | bwd_inner_microstep: 3837.21 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.26 [2024-11-14 01:49:49,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.00 | bwd: 3844.75 | bwd_inner: 3837.21 | bwd_allreduce: 7.50 | step: 21.27 7%|▋ | 3346/50750 [9:07:11<77:56:58, 5.92s/it] {'loss': 0.3851, 'learning_rate': 3.986480008117062e-05, 'epoch': 3.3} 7%|▋ | 3346/50750 [9:07:11<77:56:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:49:55,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:49:55,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.78 | bwd_microstep: 3861.30 | bwd_inner_microstep: 3852.74 | bwd_allreduce_microstep: 8.52 | step_microstep: 21.74 [2024-11-14 01:49:55,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.78 | bwd: 3861.31 | bwd_inner: 3852.74 | bwd_allreduce: 8.53 | step: 21.74 7%|▋ | 3347/50750 [9:07:17<78:01:14, 5.93s/it] {'loss': 0.0298, 'learning_rate': 3.9864651881214654e-05, 'epoch': 3.3} 7%|▋ | 3347/50750 [9:07:17<78:01:14, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:50:01,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:50:01,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.86 | bwd_microstep: 3857.35 | bwd_inner_microstep: 3849.80 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.60 [2024-11-14 01:50:01,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.86 | bwd: 3857.36 | bwd_inner: 3849.80 | bwd_allreduce: 7.52 | step: 21.60 7%|▋ | 3348/50750 [9:07:23<78:02:41, 5.93s/it] {'loss': 0.0112, 'learning_rate': 3.9864503600353953e-05, 'epoch': 3.3} 7%|▋ | 3348/50750 [9:07:23<78:02:41, 
5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:50:07,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:50:07,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.21 | bwd_microstep: 3856.46 | bwd_inner_microstep: 3848.96 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.97 [2024-11-14 01:50:07,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.22 | bwd: 3856.48 | bwd_inner: 3848.96 | bwd_allreduce: 7.48 | step: 20.97 7%|▋ | 3349/50750 [9:07:29<78:02:28, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.9864355238589124e-05, 'epoch': 3.3} 7%|▋ | 3349/50750 [9:07:29<78:02:28, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:50:13,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:50:13,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.33 | bwd_microstep: 3858.62 | bwd_inner_microstep: 3851.09 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.26 [2024-11-14 01:50:13,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.33 | bwd: 3858.63 | bwd_inner: 3851.09 | bwd_allreduce: 7.50 | step: 21.26 7%|▋ | 3350/50750 [9:07:35<78:03:50, 5.93s/it] {'loss': 0.1099, 'learning_rate': 3.9864206795920763e-05, 'epoch': 3.3} 7%|▋ | 3350/50750 [9:07:35<78:03:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:50:19,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.92 [2024-11-14 01:50:19,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.59 | bwd_microstep: 3859.53 | bwd_inner_microstep: 3851.00 | bwd_allreduce_microstep: 8.46 | 
step_microstep: 25.37 [2024-11-14 01:50:19,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.59 | bwd: 3859.55 | bwd_inner: 3851.00 | bwd_allreduce: 8.49 | step: 25.38 7%|▋ | 3351/50750 [9:07:41<78:07:45, 5.93s/it] {'loss': 0.5976, 'learning_rate': 3.9864058272349473e-05, 'epoch': 3.3} 7%|▋ | 3351/50750 [9:07:41<78:07:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:50:25,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.94 [2024-11-14 01:50:25,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.19 | bwd_microstep: 3848.62 | bwd_inner_microstep: 3840.84 | bwd_allreduce_microstep: 7.73 | step_microstep: 27.75 [2024-11-14 01:50:25,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.17 | bwd: 3848.64 | bwd_inner: 3840.84 | bwd_allreduce: 7.75 | step: 27.77 7%|▋ | 3352/50750 [9:07:46<78:09:18, 5.94s/it] {'loss': 0.0525, 'learning_rate': 3.986390966787586e-05, 'epoch': 3.3} 7%|▋ | 3352/50750 [9:07:46<78:09:18, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:50:30,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:50:30,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.92 | bwd_microstep: 3844.19 | bwd_inner_microstep: 3836.68 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.12 [2024-11-14 01:50:30,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.92 | bwd: 3844.21 | bwd_inner: 3836.68 | bwd_allreduce: 7.49 | step: 21.12 7%|▋ | 3353/50750 [9:07:52<78:04:49, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.9863760982500526e-05, 'epoch': 3.3} 7%|▋ | 3353/50750 [9:07:52<78:04:49, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 
01:50:36,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:50:36,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.99 | bwd_microstep: 3845.09 | bwd_inner_microstep: 3837.49 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.62 [2024-11-14 01:50:36,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.99 | bwd: 3845.10 | bwd_inner: 3837.49 | bwd_allreduce: 7.57 | step: 21.62 7%|▋ | 3354/50750 [9:07:58<78:03:01, 5.93s/it] {'loss': 0.4065, 'learning_rate': 3.986361221622408e-05, 'epoch': 3.3} 7%|▋ | 3354/50750 [9:07:58<78:03:01, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:50:42,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:50:42,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.44 | bwd_microstep: 3848.44 | bwd_inner_microstep: 3840.61 | bwd_allreduce_microstep: 7.79 | step_microstep: 21.74 [2024-11-14 01:50:42,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.44 | bwd: 3848.46 | bwd_inner: 3840.61 | bwd_allreduce: 7.80 | step: 21.75 7%|▋ | 3355/50750 [9:08:04<78:02:36, 5.93s/it] {'loss': 0.0019, 'learning_rate': 3.986346336904713e-05, 'epoch': 3.31} 7%|▋ | 3355/50750 [9:08:04<78:02:36, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:50:48,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:50:48,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.62 | bwd_microstep: 3841.60 | bwd_inner_microstep: 3834.08 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.17 [2024-11-14 01:50:48,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2022.62 | bwd: 3841.62 | bwd_inner: 3834.08 | bwd_allreduce: 7.50 | step: 21.17 7%|▋ | 3356/50750 [9:08:10<78:00:43, 5.93s/it] {'loss': 0.3126, 'learning_rate': 3.986331444097029e-05, 'epoch': 3.31} 7%|▋ | 3356/50750 [9:08:10<78:00:43, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:50:54,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:50:54,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.25 | bwd_microstep: 3846.20 | bwd_inner_microstep: 3838.67 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.58 [2024-11-14 01:50:54,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.22 | bwd: 3846.21 | bwd_inner: 3838.67 | bwd_allreduce: 7.51 | step: 21.58 7%|▋ | 3357/50750 [9:08:16<77:58:43, 5.92s/it] {'loss': 0.0381, 'learning_rate': 3.9863165431994154e-05, 'epoch': 3.31} 7%|▋ | 3357/50750 [9:08:16<77:58:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:51:00,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:51:00,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.29 | bwd_microstep: 3846.58 | bwd_inner_microstep: 3839.08 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.23 [2024-11-14 01:51:00,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.29 | bwd: 3846.60 | bwd_inner: 3839.08 | bwd_allreduce: 7.48 | step: 21.24 7%|▋ | 3358/50750 [9:08:22<77:57:04, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.986301634211933e-05, 'epoch': 3.31} 7%|▋ | 3358/50750 [9:08:22<77:57:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:51:06,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:51:06,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.61 | bwd_microstep: 3858.78 | bwd_inner_microstep: 3849.24 | bwd_allreduce_microstep: 9.50 | step_microstep: 21.94 [2024-11-14 01:51:06,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3858.79 | bwd_inner: 3849.24 | bwd_allreduce: 9.52 | step: 21.95 7%|▋ | 3359/50750 [9:08:28<78:00:45, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.986286717134643e-05, 'epoch': 3.31} 7%|▋ | 3359/50750 [9:08:28<78:00:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:51:12,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 01:51:12,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.65 | bwd_microstep: 3850.41 | bwd_inner_microstep: 3842.56 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.17 [2024-11-14 01:51:12,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.65 | bwd: 3850.43 | bwd_inner: 3842.56 | bwd_allreduce: 7.82 | step: 22.17 7%|▋ | 3360/50750 [9:08:34<78:00:44, 5.93s/it] {'loss': 0.0209, 'learning_rate': 3.986271791967606e-05, 'epoch': 3.31} 7%|▋ | 3360/50750 [9:08:34<78:00:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:51:18,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 01:51:18,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.07 | bwd_microstep: 3842.55 | bwd_inner_microstep: 3834.99 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.30 [2024-11-14 01:51:18,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.05 | bwd: 3842.57 | bwd_inner: 3834.99 | bwd_allreduce: 7.53 | step: 21.31 7%|▋ | 
3361/50750 [9:08:40<77:58:58, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.9862568587108825e-05, 'epoch': 3.31} 7%|▋ | 3361/50750 [9:08:40<77:58:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:51:24,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.94 [2024-11-14 01:51:24,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.66 | bwd_microstep: 3841.70 | bwd_inner_microstep: 3834.08 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.73 [2024-11-14 01:51:24,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.64 | bwd: 3841.71 | bwd_inner: 3834.08 | bwd_allreduce: 7.59 | step: 21.73 7%|▋ | 3362/50750 [9:08:46<77:57:01, 5.92s/it] {'loss': 0.7497, 'learning_rate': 3.986241917364533e-05, 'epoch': 3.31} 7%|▋ | 3362/50750 [9:08:46<77:57:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:51:30,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 01:51:30,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.91 | bwd_microstep: 3850.19 | bwd_inner_microstep: 3842.61 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.59 [2024-11-14 01:51:30,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.91 | bwd: 3850.21 | bwd_inner: 3842.62 | bwd_allreduce: 7.55 | step: 21.60 7%|▋ | 3363/50750 [9:08:52<77:59:03, 5.92s/it] {'loss': 0.0741, 'learning_rate': 3.98622696792862e-05, 'epoch': 3.31} 7%|▋ | 3363/50750 [9:08:52<77:59:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:51:36,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:51:36,111] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.59 | bwd_microstep: 3847.88 | bwd_inner_microstep: 3840.38 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.12 [2024-11-14 01:51:36,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.59 | bwd: 3847.90 | bwd_inner: 3840.38 | bwd_allreduce: 7.48 | step: 21.12 7%|▋ | 3364/50750 [9:08:58<77:58:55, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.9862120104032025e-05, 'epoch': 3.31} 7%|▋ | 3364/50750 [9:08:58<77:58:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:51:42,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:51:42,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.96 | bwd_microstep: 3847.86 | bwd_inner_microstep: 3840.33 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.07 [2024-11-14 01:51:42,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.96 | bwd: 3847.87 | bwd_inner: 3840.33 | bwd_allreduce: 7.51 | step: 21.08 7%|▋ | 3365/50750 [9:09:03<77:57:34, 5.92s/it] {'loss': 0.0169, 'learning_rate': 3.986197044788342e-05, 'epoch': 3.32} 7%|▋ | 3365/50750 [9:09:03<77:57:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:51:47,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:51:47,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.17 | bwd_microstep: 3846.83 | bwd_inner_microstep: 3839.30 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.35 [2024-11-14 01:51:47,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3846.85 | bwd_inner: 3839.30 | bwd_allreduce: 7.51 | step: 21.35 7%|▋ | 3366/50750 [9:09:09<77:56:43, 5.92s/it] {'loss': 0.0016, 'learning_rate': 
3.9861820710841e-05, 'epoch': 3.32} 7%|▋ | 3366/50750 [9:09:09<77:56:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:51:53,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:51:53,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.14 | bwd_microstep: 3855.88 | bwd_inner_microstep: 3848.27 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.43 [2024-11-14 01:51:53,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.14 | bwd: 3855.89 | bwd_inner: 3848.27 | bwd_allreduce: 7.58 | step: 21.43 7%|▋ | 3367/50750 [9:09:15<77:58:23, 5.92s/it] {'loss': 0.0438, 'learning_rate': 3.986167089290537e-05, 'epoch': 3.32} 7%|▋ | 3367/50750 [9:09:15<77:58:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:51:59,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:51:59,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.61 | bwd_microstep: 3843.24 | bwd_inner_microstep: 3835.72 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.53 [2024-11-14 01:51:59,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.61 | bwd: 3843.25 | bwd_inner: 3835.72 | bwd_allreduce: 7.49 | step: 21.53 7%|▋ | 3368/50750 [9:09:21<77:56:36, 5.92s/it] {'loss': 0.002, 'learning_rate': 3.986152099407714e-05, 'epoch': 3.32} 7%|▋ | 3368/50750 [9:09:21<77:56:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:52:05,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 01:52:05,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.75 | bwd_microstep: 3846.16 
| bwd_inner_microstep: 3838.62 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.48 [2024-11-14 01:52:05,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.73 | bwd: 3846.17 | bwd_inner: 3838.62 | bwd_allreduce: 7.51 | step: 21.49 7%|▋ | 3369/50750 [9:09:27<77:58:00, 5.92s/it] {'loss': 0.0057, 'learning_rate': 3.986137101435692e-05, 'epoch': 3.32} 7%|▋ | 3369/50750 [9:09:27<77:58:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:52:11,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:52:11,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3842.96 | bwd_inner_microstep: 3835.46 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.35 [2024-11-14 01:52:11,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3842.97 | bwd_inner: 3835.46 | bwd_allreduce: 7.47 | step: 21.36 7%|▋ | 3370/50750 [9:09:33<77:56:01, 5.92s/it] {'loss': 0.0059, 'learning_rate': 3.986122095374533e-05, 'epoch': 3.32} 7%|▋ | 3370/50750 [9:09:33<77:56:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:52:17,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:52:17,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.11 | bwd_microstep: 3850.22 | bwd_inner_microstep: 3842.67 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.36 [2024-11-14 01:52:17,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.11 | bwd: 3850.23 | bwd_inner: 3842.67 | bwd_allreduce: 7.52 | step: 21.36 7%|▋ | 3371/50750 [9:09:39<77:56:48, 5.92s/it] {'loss': 0.0274, 'learning_rate': 3.986107081224297e-05, 'epoch': 3.32} 7%|▋ | 3371/50750 [9:09:39<77:56:48, 5.92s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:52:23,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:52:23,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.29 | bwd_microstep: 3847.23 | bwd_inner_microstep: 3839.75 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-14 01:52:23,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.29 | bwd: 3847.24 | bwd_inner: 3839.75 | bwd_allreduce: 7.46 | step: 20.99 7%|▋ | 3372/50750 [9:09:45<77:55:34, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.986092058985046e-05, 'epoch': 3.32} 7%|▋ | 3372/50750 [9:09:45<77:55:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:52:29,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:52:29,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.94 | bwd_microstep: 3846.89 | bwd_inner_microstep: 3839.42 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.10 [2024-11-14 01:52:29,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.94 | bwd: 3846.90 | bwd_inner: 3839.42 | bwd_allreduce: 7.45 | step: 21.10 7%|▋ | 3373/50750 [9:09:51<77:56:31, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.98607702865684e-05, 'epoch': 3.32} 7%|▋ | 3373/50750 [9:09:51<77:56:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:52:35,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.95 [2024-11-14 01:52:35,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.10 | bwd_microstep: 3851.31 | bwd_inner_microstep: 3843.83 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.96 [2024-11-14 
01:52:35,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.10 | bwd: 3851.32 | bwd_inner: 3843.84 | bwd_allreduce: 7.45 | step: 20.97 7%|▋ | 3374/50750 [9:09:57<77:56:09, 5.92s/it] {'loss': 0.3912, 'learning_rate': 3.9860619902397414e-05, 'epoch': 3.32} 7%|▋ | 3374/50750 [9:09:57<77:56:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:52:41,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:52:41,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.86 | bwd_microstep: 3842.25 | bwd_inner_microstep: 3834.80 | bwd_allreduce_microstep: 7.41 | step_microstep: 21.02 [2024-11-14 01:52:41,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.86 | bwd: 3842.26 | bwd_inner: 3834.80 | bwd_allreduce: 7.43 | step: 21.02 7%|▋ | 3375/50750 [9:10:03<77:54:39, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.9860469437338104e-05, 'epoch': 3.33} 7%|▋ | 3375/50750 [9:10:03<77:54:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:52:47,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:52:47,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.99 | bwd_microstep: 3845.31 | bwd_inner_microstep: 3837.83 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.90 [2024-11-14 01:52:47,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.99 | bwd: 3845.32 | bwd_inner: 3837.83 | bwd_allreduce: 7.45 | step: 20.90 7%|▋ | 3376/50750 [9:10:09<77:53:43, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.986031889139109e-05, 'epoch': 3.33} 7%|▋ | 3376/50750 [9:10:09<77:53:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:52:53,081] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:52:53,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.61 | bwd_microstep: 3846.67 | bwd_inner_microstep: 3839.21 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.81 [2024-11-14 01:52:53,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.61 | bwd: 3846.68 | bwd_inner: 3839.21 | bwd_allreduce: 7.44 | step: 20.81 7%|▋ | 3377/50750 [9:10:15<77:53:12, 5.92s/it] {'loss': 0.0142, 'learning_rate': 3.986016826455699e-05, 'epoch': 3.33} 7%|▋ | 3377/50750 [9:10:15<77:53:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:52:59,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:52:59,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.28 | bwd_microstep: 3851.09 | bwd_inner_microstep: 3843.63 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.88 [2024-11-14 01:52:59,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.28 | bwd: 3851.10 | bwd_inner: 3843.63 | bwd_allreduce: 7.43 | step: 20.88 7%|▋ | 3378/50750 [9:10:20<77:53:40, 5.92s/it] {'loss': 0.0132, 'learning_rate': 3.9860017556836404e-05, 'epoch': 3.33} 7%|▋ | 3378/50750 [9:10:20<77:53:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:53:04,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:53:04,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.37 | bwd_microstep: 3854.20 | bwd_inner_microstep: 3846.58 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.24 [2024-11-14 01:53:04,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.37 | bwd: 
3854.22 | bwd_inner: 3846.58 | bwd_allreduce: 7.59 | step: 21.24 7%|▋ | 3379/50750 [9:10:26<77:54:47, 5.92s/it] {'loss': 0.1264, 'learning_rate': 3.985986676822996e-05, 'epoch': 3.33} 7%|▋ | 3379/50750 [9:10:26<77:54:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:53:10,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.48 | optimizer_step: 4.93 [2024-11-14 01:53:10,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.24 | bwd_microstep: 3847.97 | bwd_inner_microstep: 3840.22 | bwd_allreduce_microstep: 7.70 | step_microstep: 28.89 [2024-11-14 01:53:10,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.24 | bwd: 3847.99 | bwd_inner: 3840.22 | bwd_allreduce: 7.72 | step: 28.88 7%|▋ | 3380/50750 [9:10:32<77:57:26, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.9859715898738266e-05, 'epoch': 3.33} 7%|▋ | 3380/50750 [9:10:32<77:57:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:53:16,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:53:16,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.19 | bwd_microstep: 3848.69 | bwd_inner_microstep: 3841.20 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.91 [2024-11-14 01:53:16,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.19 | bwd: 3848.70 | bwd_inner: 3841.20 | bwd_allreduce: 7.46 | step: 20.91 7%|▋ | 3381/50750 [9:10:38<77:57:28, 5.92s/it] {'loss': 0.0051, 'learning_rate': 3.985956494836193e-05, 'epoch': 3.33} 7%|▋ | 3381/50750 [9:10:38<77:57:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:53:22,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 
| optimizer_step: 4.93 [2024-11-14 01:53:22,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.47 | bwd_microstep: 3847.35 | bwd_inner_microstep: 3839.84 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.13 [2024-11-14 01:53:22,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.47 | bwd: 3847.36 | bwd_inner: 3839.84 | bwd_allreduce: 7.49 | step: 21.13 7%|▋ | 3382/50750 [9:10:44<77:55:47, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9859413917101575e-05, 'epoch': 3.33} 7%|▋ | 3382/50750 [9:10:44<77:55:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:53:28,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 01:53:28,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.59 | bwd_microstep: 3847.69 | bwd_inner_microstep: 3840.09 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.43 [2024-11-14 01:53:28,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.57 | bwd: 3847.71 | bwd_inner: 3840.09 | bwd_allreduce: 7.58 | step: 21.43 7%|▋ | 3383/50750 [9:10:50<77:57:06, 5.92s/it] {'loss': 0.0336, 'learning_rate': 3.985926280495781e-05, 'epoch': 3.33} 7%|▋ | 3383/50750 [9:10:50<77:57:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:53:34,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:53:34,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.28 | bwd_microstep: 3845.69 | bwd_inner_microstep: 3838.13 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.37 [2024-11-14 01:53:34,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.26 | bwd: 3845.70 | bwd_inner: 3838.13 | bwd_allreduce: 7.53 | step: 21.37 7%|▋ | 3384/50750 [9:10:56<77:58:15, 
5.93s/it] {'loss': 0.001, 'learning_rate': 3.985911161193126e-05, 'epoch': 3.33} 7%|▋ | 3384/50750 [9:10:56<77:58:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:53:40,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:53:40,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.45 | bwd_microstep: 3841.67 | bwd_inner_microstep: 3834.13 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.09 [2024-11-14 01:53:40,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.45 | bwd: 3841.68 | bwd_inner: 3834.13 | bwd_allreduce: 7.51 | step: 21.09 7%|▋ | 3385/50750 [9:11:02<77:55:11, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.985896033802253e-05, 'epoch': 3.33} 7%|▋ | 3385/50750 [9:11:02<77:55:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:53:46,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:53:46,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.77 | bwd_microstep: 3845.84 | bwd_inner_microstep: 3838.37 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.95 [2024-11-14 01:53:46,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.77 | bwd: 3845.86 | bwd_inner: 3838.37 | bwd_allreduce: 7.45 | step: 20.95 7%|▋ | 3386/50750 [9:11:08<77:54:16, 5.92s/it] {'loss': 0.0721, 'learning_rate': 3.985880898323224e-05, 'epoch': 3.34} 7%|▋ | 3386/50750 [9:11:08<77:54:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:53:52,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:53:52,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2029.16 | bwd_microstep: 3845.72 | bwd_inner_microstep: 3838.23 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.90 [2024-11-14 01:53:52,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.15 | bwd: 3845.73 | bwd_inner: 3838.23 | bwd_allreduce: 7.46 | step: 20.90 7%|▋ | 3387/50750 [9:11:14<77:54:23, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.985865754756101e-05, 'epoch': 3.34} 7%|▋ | 3387/50750 [9:11:14<77:54:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:53:58,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:53:58,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.32 | bwd_microstep: 3844.22 | bwd_inner_microstep: 3836.75 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-14 01:53:58,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.31 | bwd: 3844.23 | bwd_inner: 3836.75 | bwd_allreduce: 7.43 | step: 20.91 7%|▋ | 3388/50750 [9:11:20<77:54:28, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.985850603100946e-05, 'epoch': 3.34} 7%|▋ | 3388/50750 [9:11:20<77:54:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:54:04,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 4.93 [2024-11-14 01:54:04,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.10 | bwd_microstep: 3850.52 | bwd_inner_microstep: 3842.49 | bwd_allreduce_microstep: 7.97 | step_microstep: 25.46 [2024-11-14 01:54:04,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.09 | bwd: 3850.54 | bwd_inner: 3842.49 | bwd_allreduce: 8.00 | step: 25.48 7%|▋ | 3389/50750 [9:11:26<77:57:42, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.9858354433578194e-05, 'epoch': 3.34} 7%|▋ | 3389/50750 
[9:11:26<77:57:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:54:10,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:54:10,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.19 | bwd_microstep: 3851.23 | bwd_inner_microstep: 3843.74 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.95 [2024-11-14 01:54:10,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.17 | bwd: 3851.24 | bwd_inner: 3843.74 | bwd_allreduce: 7.46 | step: 20.96 7%|▋ | 3390/50750 [9:11:32<77:58:13, 5.93s/it] {'loss': 0.003, 'learning_rate': 3.985820275526783e-05, 'epoch': 3.34} 7%|▋ | 3390/50750 [9:11:32<77:58:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:54:16,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-14 01:54:16,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.16 | bwd_microstep: 3846.46 | bwd_inner_microstep: 3838.92 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.42 [2024-11-14 01:54:16,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.16 | bwd: 3846.47 | bwd_inner: 3838.92 | bwd_allreduce: 7.51 | step: 21.42 7%|▋ | 3391/50750 [9:11:37<77:56:00, 5.92s/it] {'loss': 0.0197, 'learning_rate': 3.9858050996079e-05, 'epoch': 3.34} 7%|▋ | 3391/50750 [9:11:37<77:56:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:54:21,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:54:21,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.21 | bwd_microstep: 3854.59 | bwd_inner_microstep: 3847.14 | 
bwd_allreduce_microstep: 7.41 | step_microstep: 20.90 [2024-11-14 01:54:21,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.21 | bwd: 3854.60 | bwd_inner: 3847.14 | bwd_allreduce: 7.43 | step: 20.91 7%|▋ | 3392/50750 [9:11:43<77:57:19, 5.93s/it] {'loss': 0.19, 'learning_rate': 3.985789915601232e-05, 'epoch': 3.34} 7%|▋ | 3392/50750 [9:11:43<77:57:19, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:54:27,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:54:27,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.57 | bwd_microstep: 3843.71 | bwd_inner_microstep: 3836.24 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.49 [2024-11-14 01:54:27,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.57 | bwd: 3843.72 | bwd_inner: 3836.24 | bwd_allreduce: 7.44 | step: 21.50 7%|▋ | 3393/50750 [9:11:49<77:53:59, 5.92s/it] {'loss': 0.0408, 'learning_rate': 3.98577472350684e-05, 'epoch': 3.34} 7%|▋ | 3393/50750 [9:11:49<77:53:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:54:33,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:54:33,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.89 | bwd_microstep: 3846.09 | bwd_inner_microstep: 3838.62 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.15 [2024-11-14 01:54:33,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.89 | bwd: 3846.10 | bwd_inner: 3838.62 | bwd_allreduce: 7.44 | step: 21.15 7%|▋ | 3394/50750 [9:11:55<77:52:10, 5.92s/it] {'loss': 0.2475, 'learning_rate': 3.985759523324785e-05, 'epoch': 3.34} 7%|▋ | 3394/50750 [9:11:55<77:52:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2199 [2024-11-14 01:54:39,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:54:39,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.69 | bwd_microstep: 3848.64 | bwd_inner_microstep: 3840.83 | bwd_allreduce_microstep: 7.76 | step_microstep: 22.12 [2024-11-14 01:54:39,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3848.65 | bwd_inner: 3840.83 | bwd_allreduce: 7.78 | step: 22.13 7%|▋ | 3395/50750 [9:12:01<77:53:40, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.985744315055131e-05, 'epoch': 3.34} 7%|▋ | 3395/50750 [9:12:01<77:53:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 01:54:45,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:54:45,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.38 | bwd_microstep: 3868.81 | bwd_inner_microstep: 3861.35 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.07 [2024-11-14 01:54:45,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.36 | bwd: 3868.83 | bwd_inner: 3861.35 | bwd_allreduce: 7.44 | step: 21.08 7%|▋ | 3396/50750 [9:12:07<77:58:50, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.985729098697939e-05, 'epoch': 3.35} 7%|▋ | 3396/50750 [9:12:07<77:58:50, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:54:51,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:54:51,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.44 | bwd_microstep: 3843.11 | bwd_inner_microstep: 3835.45 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.12 [2024-11-14 01:54:51,560] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.44 | bwd: 3843.12 | bwd_inner: 3835.45 | bwd_allreduce: 7.64 | step: 21.13 7%|▋ | 3397/50750 [9:12:13<77:54:58, 5.92s/it] {'loss': 0.0043, 'learning_rate': 3.9857138742532716e-05, 'epoch': 3.35} 7%|▋ | 3397/50750 [9:12:13<77:54:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:54:57,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:54:57,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.16 | bwd_microstep: 3848.28 | bwd_inner_microstep: 3840.62 | bwd_allreduce_microstep: 7.62 | step_microstep: 20.93 [2024-11-14 01:54:57,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.15 | bwd: 3848.29 | bwd_inner: 3840.62 | bwd_allreduce: 7.64 | step: 20.94 7%|▋ | 3398/50750 [9:12:19<77:54:39, 5.92s/it] {'loss': 0.4085, 'learning_rate': 3.9856986417211895e-05, 'epoch': 3.35} 7%|▋ | 3398/50750 [9:12:19<77:54:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:55:03,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:55:03,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.70 | bwd_microstep: 3839.54 | bwd_inner_microstep: 3832.05 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.04 [2024-11-14 01:55:03,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.70 | bwd: 3839.55 | bwd_inner: 3832.05 | bwd_allreduce: 7.46 | step: 21.04 7%|▋ | 3399/50750 [9:12:25<77:51:01, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.9856834011017555e-05, 'epoch': 3.35} 7%|▋ | 3399/50750 [9:12:25<77:51:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:55:09,304] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:55:09,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.21 | bwd_microstep: 3842.52 | bwd_inner_microstep: 3835.01 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.52 [2024-11-14 01:55:09,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.21 | bwd: 3842.53 | bwd_inner: 3835.01 | bwd_allreduce: 7.48 | step: 21.52 7%|▋ | 3400/50750 [9:12:31<77:49:37, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.985668152395032e-05, 'epoch': 3.35} 7%|▋ | 3400/50750 [9:12:31<77:49:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:55:15,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 01:55:15,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.76 | bwd_microstep: 3846.17 | bwd_inner_microstep: 3838.66 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.08 [2024-11-14 01:55:15,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.76 | bwd: 3846.18 | bwd_inner: 3838.66 | bwd_allreduce: 7.48 | step: 21.08 7%|▋ | 3401/50750 [9:12:37<77:48:53, 5.92s/it] {'loss': 0.6829, 'learning_rate': 3.98565289560108e-05, 'epoch': 3.35} 7%|▋ | 3401/50750 [9:12:37<77:48:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:55:21,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:55:21,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.27 | bwd_microstep: 3849.33 | bwd_inner_microstep: 3841.59 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.68 [2024-11-14 01:55:21,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.25 | bwd: 3849.34 | bwd_inner: 3841.59 | 
bwd_allreduce: 7.70 | step: 22.68 7%|▋ | 3402/50750 [9:12:43<77:51:37, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.985637630719963e-05, 'epoch': 3.35} 7%|▋ | 3402/50750 [9:12:43<77:51:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:55:27,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 01:55:27,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.65 | bwd_microstep: 3845.77 | bwd_inner_microstep: 3838.29 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-14 01:55:27,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.65 | bwd: 3845.78 | bwd_inner: 3838.29 | bwd_allreduce: 7.45 | step: 20.86 7%|▋ | 3403/50750 [9:12:49<77:52:31, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.9856223577517413e-05, 'epoch': 3.35} 7%|▋ | 3403/50750 [9:12:49<77:52:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:55:32,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 01:55:32,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.02 | bwd_microstep: 3844.84 | bwd_inner_microstep: 3837.37 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.96 [2024-11-14 01:55:32,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.02 | bwd: 3844.85 | bwd_inner: 3837.37 | bwd_allreduce: 7.44 | step: 20.97 7%|▋ | 3404/50750 [9:12:54<77:50:45, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.98560707669648e-05, 'epoch': 3.35} 7%|▋ | 3404/50750 [9:12:54<77:50:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:55:38,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 
01:55:38,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.64 | bwd_microstep: 3843.18 | bwd_inner_microstep: 3835.47 | bwd_allreduce_microstep: 7.66 | step_microstep: 22.43 [2024-11-14 01:55:38,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.64 | bwd: 3843.20 | bwd_inner: 3835.47 | bwd_allreduce: 7.68 | step: 22.43 7%|▋ | 3405/50750 [9:13:00<77:49:33, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.985591787554239e-05, 'epoch': 3.35} 7%|▋ | 3405/50750 [9:13:00<77:49:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:55:44,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 01:55:44,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.18 | bwd_microstep: 3839.82 | bwd_inner_microstep: 3832.35 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.37 [2024-11-14 01:55:44,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.18 | bwd: 3839.84 | bwd_inner: 3832.35 | bwd_allreduce: 7.45 | step: 21.38 7%|▋ | 3406/50750 [9:13:06<77:47:29, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.98557649032508e-05, 'epoch': 3.36} 7%|▋ | 3406/50750 [9:13:06<77:47:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:55:50,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:55:50,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.64 | bwd_microstep: 3843.97 | bwd_inner_microstep: 3836.26 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.23 [2024-11-14 01:55:50,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.64 | bwd: 3843.98 | bwd_inner: 3836.26 | bwd_allreduce: 7.68 | step: 21.23 7%|▋ | 3407/50750 [9:13:12<77:46:56, 5.91s/it] {'loss': 0.1059, 
'learning_rate': 3.9855611850090684e-05, 'epoch': 3.36} 7%|▋ | 3407/50750 [9:13:12<77:46:56, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:55:56,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:55:56,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.63 | bwd_microstep: 3858.25 | bwd_inner_microstep: 3850.72 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.36 [2024-11-14 01:55:56,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.63 | bwd: 3858.26 | bwd_inner: 3850.72 | bwd_allreduce: 7.50 | step: 21.37 7%|▋ | 3408/50750 [9:13:18<77:50:59, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.985545871606263e-05, 'epoch': 3.36} 7%|▋ | 3408/50750 [9:13:18<77:50:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:56:02,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:56:02,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.29 | bwd_microstep: 3854.72 | bwd_inner_microstep: 3847.23 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.05 [2024-11-14 01:56:02,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.29 | bwd: 3854.73 | bwd_inner: 3847.23 | bwd_allreduce: 7.46 | step: 21.05 7%|▋ | 3409/50750 [9:13:24<77:52:57, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.985530550116729e-05, 'epoch': 3.36} 7%|▋ | 3409/50750 [9:13:24<77:52:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:56:08,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:56:08,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.90 | 
bwd_microstep: 3852.68 | bwd_inner_microstep: 3844.80 | bwd_allreduce_microstep: 7.81 | step_microstep: 21.60 [2024-11-14 01:56:08,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.90 | bwd: 3852.69 | bwd_inner: 3844.79 | bwd_allreduce: 7.85 | step: 21.61 7%|▋ | 3410/50750 [9:13:30<77:53:10, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.985515220540527e-05, 'epoch': 3.36} 7%|▋ | 3410/50750 [9:13:30<77:53:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:56:14,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:56:14,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.18 | bwd_microstep: 3852.62 | bwd_inner_microstep: 3845.12 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.67 [2024-11-14 01:56:14,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.16 | bwd: 3852.64 | bwd_inner: 3845.11 | bwd_allreduce: 7.48 | step: 21.68 7%|▋ | 3411/50750 [9:13:36<77:54:57, 5.93s/it] {'loss': 0.003, 'learning_rate': 3.9854998828777204e-05, 'epoch': 3.36} 7%|▋ | 3411/50750 [9:13:36<77:54:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:56:20,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:56:20,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.45 | bwd_microstep: 3841.23 | bwd_inner_microstep: 3833.54 | bwd_allreduce_microstep: 7.63 | step_microstep: 21.93 [2024-11-14 01:56:20,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3841.25 | bwd_inner: 3833.54 | bwd_allreduce: 7.65 | step: 21.93 7%|▋ | 3412/50750 [9:13:42<77:52:42, 5.92s/it] {'loss': 0.008, 'learning_rate': 3.985484537128372e-05, 'epoch': 3.36} 7%|▋ | 3412/50750 [9:13:42<77:52:42, 
5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:56:26,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:56:26,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.49 | bwd_microstep: 3841.56 | bwd_inner_microstep: 3833.82 | bwd_allreduce_microstep: 7.69 | step_microstep: 23.50 [2024-11-14 01:56:26,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.48 | bwd: 3841.58 | bwd_inner: 3833.82 | bwd_allreduce: 7.71 | step: 23.49 7%|▋ | 3413/50750 [9:13:48<77:50:31, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.985469183292542e-05, 'epoch': 3.36} 7%|▋ | 3413/50750 [9:13:48<77:50:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:56:32,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:56:32,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.78 | bwd_microstep: 3841.75 | bwd_inner_microstep: 3833.69 | bwd_allreduce_microstep: 8.00 | step_microstep: 24.18 [2024-11-14 01:56:32,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.78 | bwd: 3841.77 | bwd_inner: 3833.69 | bwd_allreduce: 8.03 | step: 24.18 7%|▋ | 3414/50750 [9:13:54<77:49:49, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.9854538213702963e-05, 'epoch': 3.36} 7%|▋ | 3414/50750 [9:13:54<77:49:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:56:38,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:56:38,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.52 | bwd_microstep: 3844.25 | bwd_inner_microstep: 3836.68 | bwd_allreduce_microstep: 7.52 | 
step_microstep: 21.27 [2024-11-14 01:56:38,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.52 | bwd: 3844.26 | bwd_inner: 3836.68 | bwd_allreduce: 7.54 | step: 21.28 7%|▋ | 3415/50750 [9:14:00<77:48:58, 5.92s/it] {'loss': 0.0252, 'learning_rate': 3.9854384513616946e-05, 'epoch': 3.36} 7%|▋ | 3415/50750 [9:14:00<77:48:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:56:44,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:56:44,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.43 | bwd_microstep: 3844.89 | bwd_inner_microstep: 3837.24 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.22 [2024-11-14 01:56:44,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.43 | bwd: 3844.90 | bwd_inner: 3837.24 | bwd_allreduce: 7.63 | step: 21.23 7%|▋ | 3416/50750 [9:14:05<77:48:34, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.985423073266801e-05, 'epoch': 3.37} 7%|▋ | 3416/50750 [9:14:05<77:48:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:56:49,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:56:49,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.77 | bwd_microstep: 3844.95 | bwd_inner_microstep: 3836.83 | bwd_allreduce_microstep: 8.07 | step_microstep: 21.81 [2024-11-14 01:56:49,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.77 | bwd: 3844.96 | bwd_inner: 3836.83 | bwd_allreduce: 8.09 | step: 21.81 7%|▋ | 3417/50750 [9:14:11<77:49:08, 5.92s/it] {'loss': 1.583, 'learning_rate': 3.985407687085678e-05, 'epoch': 3.37} 7%|▋ | 3417/50750 [9:14:11<77:49:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 
01:56:55,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 01:56:55,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.23 | bwd_microstep: 3846.10 | bwd_inner_microstep: 3838.64 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.01 [2024-11-14 01:56:55,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.23 | bwd: 3846.12 | bwd_inner: 3838.64 | bwd_allreduce: 7.44 | step: 21.02 7%|▋ | 3418/50750 [9:14:17<77:51:01, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.9853922928183875e-05, 'epoch': 3.37} 7%|▋ | 3418/50750 [9:14:17<77:51:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:57:01,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.92 [2024-11-14 01:57:01,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.84 | bwd_microstep: 3847.22 | bwd_inner_microstep: 3839.67 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.25 [2024-11-14 01:57:01,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.84 | bwd: 3847.23 | bwd_inner: 3839.67 | bwd_allreduce: 7.52 | step: 22.06 7%|▋ | 3419/50750 [9:14:23<77:51:32, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.985376890464993e-05, 'epoch': 3.37} 7%|▋ | 3419/50750 [9:14:23<77:51:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:57:07,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 01:57:07,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.64 | bwd_microstep: 3847.41 | bwd_inner_microstep: 3839.89 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.21 [2024-11-14 01:57:07,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.63 | bwd: 3847.43 | bwd_inner: 3839.89 | bwd_allreduce: 7.50 | step: 21.22 7%|▋ | 3420/50750 [9:14:29<77:51:22, 5.92s/it] {'loss': 0.1112, 'learning_rate': 3.985361480025558e-05, 'epoch': 3.37} 7%|▋ | 3420/50750 [9:14:29<77:51:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:57:13,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:57:13,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.77 | bwd_microstep: 3848.37 | bwd_inner_microstep: 3840.83 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.13 [2024-11-14 01:57:13,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.77 | bwd: 3848.39 | bwd_inner: 3840.83 | bwd_allreduce: 7.51 | step: 21.13 7%|▋ | 3421/50750 [9:14:35<77:50:34, 5.92s/it] {'loss': 0.0019, 'learning_rate': 3.985346061500142e-05, 'epoch': 3.37} 7%|▋ | 3421/50750 [9:14:35<77:50:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:57:19,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 01:57:19,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.36 | bwd_microstep: 3848.33 | bwd_inner_microstep: 3840.74 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.32 [2024-11-14 01:57:19,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.36 | bwd: 3848.34 | bwd_inner: 3840.74 | bwd_allreduce: 7.56 | step: 21.33 7%|▋ | 3422/50750 [9:14:41<77:50:41, 5.92s/it] {'loss': 0.0021, 'learning_rate': 3.985330634888812e-05, 'epoch': 3.37} 7%|▋ | 3422/50750 [9:14:41<77:50:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:57:25,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:57:25,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.82 | bwd_microstep: 3842.78 | bwd_inner_microstep: 3835.28 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.28 [2024-11-14 01:57:25,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.82 | bwd: 3842.79 | bwd_inner: 3835.28 | bwd_allreduce: 7.47 | step: 21.28 7%|▋ | 3423/50750 [9:14:47<77:49:05, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.985315200191628e-05, 'epoch': 3.37} 7%|▋ | 3423/50750 [9:14:47<77:49:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:57:31,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 01:57:31,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.99 | bwd_microstep: 3846.56 | bwd_inner_microstep: 3838.91 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.70 [2024-11-14 01:57:31,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.99 | bwd: 3846.57 | bwd_inner: 3838.91 | bwd_allreduce: 7.62 | step: 21.70 7%|▋ | 3424/50750 [9:14:53<77:51:38, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.985299757408653e-05, 'epoch': 3.37} 7%|▋ | 3424/50750 [9:14:53<77:51:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:57:37,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:57:37,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.32 | bwd_microstep: 3850.17 | bwd_inner_microstep: 3842.70 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.04 [2024-11-14 01:57:37,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.31 | bwd: 3850.19 | bwd_inner: 3842.70 | bwd_allreduce: 7.45 | step: 21.04 7%|▋ | 
3425/50750 [9:14:59<77:54:05, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.985284306539952e-05, 'epoch': 3.37} 7%|▋ | 3425/50750 [9:14:59<77:54:05, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:57:43,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 01:57:43,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.55 | bwd_microstep: 3847.54 | bwd_inner_microstep: 3839.98 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.32 [2024-11-14 01:57:43,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.53 | bwd: 3847.55 | bwd_inner: 3839.98 | bwd_allreduce: 7.53 | step: 21.33 7%|▋ | 3426/50750 [9:15:05<77:52:34, 5.92s/it] {'loss': 0.6414, 'learning_rate': 3.985268847585586e-05, 'epoch': 3.38} 7%|▋ | 3426/50750 [9:15:05<77:52:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:57:49,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-14 01:57:49,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.78 | bwd_microstep: 3848.58 | bwd_inner_microstep: 3840.96 | bwd_allreduce_microstep: 7.57 | step_microstep: 22.11 [2024-11-14 01:57:49,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.78 | bwd: 3848.59 | bwd_inner: 3840.96 | bwd_allreduce: 7.59 | step: 22.12 7%|▋ | 3427/50750 [9:15:11<77:52:54, 5.92s/it] {'loss': 0.9689, 'learning_rate': 3.985253380545618e-05, 'epoch': 3.38} 7%|▋ | 3427/50750 [9:15:11<77:52:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:57:55,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.93 [2024-11-14 01:57:55,102] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.03 | bwd_microstep: 3847.96 | bwd_inner_microstep: 3840.18 | bwd_allreduce_microstep: 7.72 | step_microstep: 24.52 [2024-11-14 01:57:55,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.03 | bwd: 3847.98 | bwd_inner: 3840.18 | bwd_allreduce: 7.74 | step: 24.53 7%|▋ | 3428/50750 [9:15:17<77:52:52, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.9852379054201124e-05, 'epoch': 3.38} 7%|▋ | 3428/50750 [9:15:17<77:52:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:58:01,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:58:01,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.06 | bwd_microstep: 3845.24 | bwd_inner_microstep: 3837.49 | bwd_allreduce_microstep: 7.70 | step_microstep: 23.01 [2024-11-14 01:58:01,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.06 | bwd: 3845.26 | bwd_inner: 3837.48 | bwd_allreduce: 7.72 | step: 23.01 7%|▋ | 3429/50750 [9:15:22<77:52:02, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9852224222091304e-05, 'epoch': 3.38} 7%|▋ | 3429/50750 [9:15:22<77:52:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:58:06,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.39 | optimizer_step: 4.92 [2024-11-14 01:58:06,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.94 | bwd_microstep: 3849.65 | bwd_inner_microstep: 3841.56 | bwd_allreduce_microstep: 8.03 | step_microstep: 25.31 [2024-11-14 01:58:06,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.93 | bwd: 3849.67 | bwd_inner: 3841.56 | bwd_allreduce: 8.06 | step: 25.31 7%|▋ | 3430/50750 [9:15:28<77:55:16, 5.93s/it] {'loss': 0.004, 'learning_rate': 
3.9852069309127364e-05, 'epoch': 3.38} 7%|▋ | 3430/50750 [9:15:28<77:55:16, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:58:12,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:58:12,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.11 | bwd_microstep: 3844.18 | bwd_inner_microstep: 3836.65 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.13 [2024-11-14 01:58:12,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.10 | bwd: 3844.19 | bwd_inner: 3836.65 | bwd_allreduce: 7.50 | step: 21.14 7%|▋ | 3431/50750 [9:15:34<77:52:25, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.985191431530993e-05, 'epoch': 3.38} 7%|▋ | 3431/50750 [9:15:34<77:52:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:58:18,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 01:58:18,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.74 | bwd_microstep: 3850.93 | bwd_inner_microstep: 3843.38 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.26 [2024-11-14 01:58:18,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.74 | bwd: 3850.94 | bwd_inner: 3843.38 | bwd_allreduce: 7.52 | step: 21.26 7%|▋ | 3432/50750 [9:15:40<77:52:03, 5.92s/it] {'loss': 0.0194, 'learning_rate': 3.985175924063963e-05, 'epoch': 3.38} 7%|▋ | 3432/50750 [9:15:40<77:52:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 01:58:24,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 01:58:24,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.74 | bwd_microstep: 3861.60 
| bwd_inner_microstep: 3853.72 | bwd_allreduce_microstep: 7.83 | step_microstep: 22.37 [2024-11-14 01:58:24,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.74 | bwd: 3861.61 | bwd_inner: 3853.72 | bwd_allreduce: 7.85 | step: 22.37 7%|▋ | 3433/50750 [9:15:46<77:55:48, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.985160408511711e-05, 'epoch': 3.38} 7%|▋ | 3433/50750 [9:15:46<77:55:48, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:58:30,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:58:30,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.61 | bwd_microstep: 3857.37 | bwd_inner_microstep: 3849.87 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.04 [2024-11-14 01:58:30,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.59 | bwd: 3857.39 | bwd_inner: 3849.87 | bwd_allreduce: 7.48 | step: 21.05 7%|▋ | 3434/50750 [9:15:52<77:56:54, 5.93s/it] {'loss': 0.0025, 'learning_rate': 3.9851448848742985e-05, 'epoch': 3.38} 7%|▋ | 3434/50750 [9:15:52<77:56:54, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:58:36,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:58:36,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.73 | bwd_microstep: 3854.84 | bwd_inner_microstep: 3847.28 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.19 [2024-11-14 01:58:36,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.73 | bwd: 3854.85 | bwd_inner: 3847.28 | bwd_allreduce: 7.54 | step: 21.19 7%|▋ | 3435/50750 [9:15:58<77:55:54, 5.93s/it] {'loss': 0.0037, 'learning_rate': 3.9851293531517894e-05, 'epoch': 3.38} 7%|▋ | 3435/50750 [9:15:58<77:55:54, 5.93s/it]dynamic ViT batch size: 
32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:58:42,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 01:58:42,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3850.40 | bwd_inner_microstep: 3842.76 | bwd_allreduce_microstep: 7.60 | step_microstep: 22.38 [2024-11-14 01:58:42,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.32 | bwd: 3850.42 | bwd_inner: 3842.76 | bwd_allreduce: 7.62 | step: 22.39 7%|▋ | 3436/50750 [9:16:04<77:55:17, 5.93s/it] {'loss': 0.1226, 'learning_rate': 3.9851138133442464e-05, 'epoch': 3.39} 7%|▋ | 3436/50750 [9:16:04<77:55:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:58:48,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:58:48,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.33 | bwd_microstep: 3849.26 | bwd_inner_microstep: 3841.61 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.13 [2024-11-14 01:58:48,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.31 | bwd: 3849.27 | bwd_inner: 3841.61 | bwd_allreduce: 7.62 | step: 21.14 7%|▋ | 3437/50750 [9:16:10<77:54:55, 5.93s/it] {'loss': 0.0002, 'learning_rate': 3.985098265451735e-05, 'epoch': 3.39} 7%|▋ | 3437/50750 [9:16:10<77:54:55, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:58:54,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.99 [2024-11-14 01:58:54,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.07 | bwd_microstep: 3846.38 | bwd_inner_microstep: 3838.90 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.28 [2024-11-14 
01:58:54,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.07 | bwd: 3846.39 | bwd_inner: 3838.91 | bwd_allreduce: 7.45 | step: 21.28 7%|▋ | 3438/50750 [9:16:16<77:52:44, 5.93s/it] {'loss': 0.0879, 'learning_rate': 3.985082709474315e-05, 'epoch': 3.39} 7%|▋ | 3438/50750 [9:16:16<77:52:44, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:59:00,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:59:00,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.06 | bwd_microstep: 3845.69 | bwd_inner_microstep: 3838.22 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.88 [2024-11-14 01:59:00,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.05 | bwd: 3845.70 | bwd_inner: 3838.22 | bwd_allreduce: 7.45 | step: 20.88 7%|▋ | 3439/50750 [9:16:22<77:51:35, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.985067145412052e-05, 'epoch': 3.39} 7%|▋ | 3439/50750 [9:16:22<77:51:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:59:06,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 01:59:06,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.41 | bwd_microstep: 3853.79 | bwd_inner_microstep: 3846.32 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.96 [2024-11-14 01:59:06,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.41 | bwd: 3853.80 | bwd_inner: 3846.32 | bwd_allreduce: 7.44 | step: 20.97 7%|▋ | 3440/50750 [9:16:28<77:51:43, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.9850515732650094e-05, 'epoch': 3.39} 7%|▋ | 3440/50750 [9:16:28<77:51:43, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:59:12,146] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 01:59:12,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.07 | bwd_microstep: 3849.60 | bwd_inner_microstep: 3842.14 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.83 [2024-11-14 01:59:12,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.05 | bwd: 3849.61 | bwd_inner: 3842.14 | bwd_allreduce: 7.44 | step: 20.83 7%|▋ | 3441/50750 [9:16:34<77:50:40, 5.92s/it] {'loss': 0.0014, 'learning_rate': 3.98503599303325e-05, 'epoch': 3.39} 7%|▋ | 3441/50750 [9:16:34<77:50:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 01:59:18,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 01:59:18,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.45 | bwd_microstep: 3843.33 | bwd_inner_microstep: 3835.86 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.01 [2024-11-14 01:59:18,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.45 | bwd: 3843.34 | bwd_inner: 3835.86 | bwd_allreduce: 7.45 | step: 21.02 7%|▋ | 3442/50750 [9:16:40<77:49:39, 5.92s/it] {'loss': 0.1357, 'learning_rate': 3.9850204047168376e-05, 'epoch': 3.39} 7%|▋ | 3442/50750 [9:16:40<77:49:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:59:23,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 01:59:23,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.31 | bwd_microstep: 3846.29 | bwd_inner_microstep: 3838.80 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.35 [2024-11-14 01:59:23,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.29 | bwd: 3846.31 
| bwd_inner: 3838.80 | bwd_allreduce: 7.46 | step: 21.36 7%|▋ | 3443/50750 [9:16:45<77:49:25, 5.92s/it] {'loss': 0.0116, 'learning_rate': 3.985004808315835e-05, 'epoch': 3.39} 7%|▋ | 3443/50750 [9:16:45<77:49:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 01:59:29,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 01:59:29,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.80 | bwd_microstep: 3846.50 | bwd_inner_microstep: 3839.01 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.00 [2024-11-14 01:59:29,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.80 | bwd: 3846.51 | bwd_inner: 3839.01 | bwd_allreduce: 7.46 | step: 21.00 7%|▋ | 3444/50750 [9:16:51<77:48:05, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.984989203830307e-05, 'epoch': 3.39} 7%|▋ | 3444/50750 [9:16:51<77:48:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:59:35,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 01:59:35,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.11 | bwd_microstep: 3842.48 | bwd_inner_microstep: 3835.01 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.86 [2024-11-14 01:59:35,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.11 | bwd: 3842.49 | bwd_inner: 3835.01 | bwd_allreduce: 7.44 | step: 20.86 7%|▋ | 3445/50750 [9:16:57<77:46:40, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9849735912603165e-05, 'epoch': 3.39} 7%|▋ | 3445/50750 [9:16:57<77:46:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:59:41,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | 
optimizer_step: 5.00 [2024-11-14 01:59:41,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.74 | bwd_microstep: 3849.75 | bwd_inner_microstep: 3842.27 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.52 [2024-11-14 01:59:41,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.74 | bwd: 3849.76 | bwd_inner: 3842.27 | bwd_allreduce: 7.45 | step: 21.53 7%|▋ | 3446/50750 [9:17:03<77:46:53, 5.92s/it] {'loss': 0.0034, 'learning_rate': 3.984957970605927e-05, 'epoch': 3.4} 7%|▋ | 3446/50750 [9:17:03<77:46:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 01:59:47,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 01:59:47,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.17 | bwd_microstep: 3847.55 | bwd_inner_microstep: 3839.77 | bwd_allreduce_microstep: 7.72 | step_microstep: 27.02 [2024-11-14 01:59:47,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.17 | bwd: 3847.57 | bwd_inner: 3839.77 | bwd_allreduce: 7.74 | step: 27.01 7%|▋ | 3447/50750 [9:17:09<77:48:33, 5.92s/it] {'loss': 0.0043, 'learning_rate': 3.9849423418672016e-05, 'epoch': 3.4} 7%|▋ | 3447/50750 [9:17:09<77:48:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 01:59:53,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 4.93 [2024-11-14 01:59:53,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.12 | bwd_microstep: 3857.49 | bwd_inner_microstep: 3849.74 | bwd_allreduce_microstep: 7.70 | step_microstep: 25.53 [2024-11-14 01:59:53,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.12 | bwd: 3857.51 | bwd_inner: 3849.74 | bwd_allreduce: 7.72 | step: 25.53 7%|▋ | 3448/50750 [9:17:15<77:52:34, 
5.93s/it] {'loss': 0.0291, 'learning_rate': 3.9849267050442046e-05, 'epoch': 3.4} 7%|▋ | 3448/50750 [9:17:15<77:52:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 01:59:59,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 01:59:59,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.91 | bwd_microstep: 3843.18 | bwd_inner_microstep: 3835.65 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.20 [2024-11-14 01:59:59,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.91 | bwd: 3843.19 | bwd_inner: 3835.65 | bwd_allreduce: 7.50 | step: 21.21 7%|▋ | 3449/50750 [9:17:21<77:50:13, 5.92s/it] {'loss': 0.012, 'learning_rate': 3.984911060137e-05, 'epoch': 3.4} 7%|▋ | 3449/50750 [9:17:21<77:50:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:00:05,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:00:05,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.00 | bwd_microstep: 3852.03 | bwd_inner_microstep: 3844.50 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.49 [2024-11-14 02:00:05,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.00 | bwd: 3852.04 | bwd_inner: 3844.51 | bwd_allreduce: 7.50 | step: 21.49 7%|▋ | 3450/50750 [9:17:27<77:50:57, 5.93s/it] {'loss': 0.0018, 'learning_rate': 3.9848954071456504e-05, 'epoch': 3.4} 7%|▋ | 3450/50750 [9:17:27<77:50:57, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:00:11,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 02:00:11,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2027.47 | bwd_microstep: 3852.07 | bwd_inner_microstep: 3844.49 | bwd_allreduce_microstep: 7.54 | step_microstep: 21.63 [2024-11-14 02:00:11,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.47 | bwd: 3852.09 | bwd_inner: 3844.49 | bwd_allreduce: 7.56 | step: 21.63 7%|▋ | 3451/50750 [9:17:33<77:51:42, 5.93s/it] {'loss': 0.1439, 'learning_rate': 3.984879746070221e-05, 'epoch': 3.4} 7%|▋ | 3451/50750 [9:17:33<77:51:42, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:00:17,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 02:00:17,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.35 | bwd_microstep: 3849.13 | bwd_inner_microstep: 3841.63 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.01 [2024-11-14 02:00:17,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.35 | bwd: 3849.14 | bwd_inner: 3841.63 | bwd_allreduce: 7.47 | step: 21.02 7%|▋ | 3452/50750 [9:17:39<77:53:15, 5.93s/it] {'loss': 0.3814, 'learning_rate': 3.984864076910774e-05, 'epoch': 3.4} 7%|▋ | 3452/50750 [9:17:39<77:53:15, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:00:23,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:00:23,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3844.88 | bwd_inner_microstep: 3837.38 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.93 [2024-11-14 02:00:23,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.85 | bwd: 3844.89 | bwd_inner: 3837.39 | bwd_allreduce: 7.47 | step: 21.93 7%|▋ | 3453/50750 [9:17:45<77:50:32, 5.92s/it] {'loss': 0.7067, 'learning_rate': 3.984848399667375e-05, 'epoch': 3.4} 7%|▋ | 3453/50750 
[9:17:45<77:50:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:00:29,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.32 | optimizer_step: 4.93 [2024-11-14 02:00:29,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.76 | bwd_microstep: 3853.15 | bwd_inner_microstep: 3845.37 | bwd_allreduce_microstep: 7.74 | step_microstep: 21.88 [2024-11-14 02:00:29,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.76 | bwd: 3853.16 | bwd_inner: 3845.37 | bwd_allreduce: 7.76 | step: 21.88 7%|▋ | 3454/50750 [9:17:51<77:51:06, 5.93s/it] {'loss': 0.0001, 'learning_rate': 3.984832714340086e-05, 'epoch': 3.4} 7%|▋ | 3454/50750 [9:17:51<77:51:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:00:35,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 5.05 [2024-11-14 02:00:35,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.94 | bwd_microstep: 3847.68 | bwd_inner_microstep: 3840.20 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.24 [2024-11-14 02:00:35,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.93 | bwd: 3847.70 | bwd_inner: 3840.20 | bwd_allreduce: 7.46 | step: 21.24 7%|▋ | 3455/50750 [9:17:57<77:50:01, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.9848170209289726e-05, 'epoch': 3.4} 7%|▋ | 3455/50750 [9:17:57<77:50:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:00:41,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:00:41,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.98 | bwd_microstep: 3847.59 | bwd_inner_microstep: 3840.08 | 
bwd_allreduce_microstep: 7.46 | step_microstep: 21.19 [2024-11-14 02:00:41,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.97 | bwd: 3847.60 | bwd_inner: 3840.08 | bwd_allreduce: 7.48 | step: 21.19 7%|▋ | 3456/50750 [9:18:02<77:50:13, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.9848013194340974e-05, 'epoch': 3.4} 7%|▋ | 3456/50750 [9:18:02<77:50:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:00:46,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 4.93 [2024-11-14 02:00:46,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.28 | bwd_microstep: 3845.04 | bwd_inner_microstep: 3837.22 | bwd_allreduce_microstep: 7.77 | step_microstep: 22.55 [2024-11-14 02:00:46,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.28 | bwd: 3845.05 | bwd_inner: 3837.22 | bwd_allreduce: 7.79 | step: 22.55 7%|▋ | 3457/50750 [9:18:08<77:48:59, 5.92s/it] {'loss': 0.2397, 'learning_rate': 3.984785609855525e-05, 'epoch': 3.41} 7%|▋ | 3457/50750 [9:18:08<77:48:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:00:52,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:00:52,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.18 | bwd_microstep: 3849.18 | bwd_inner_microstep: 3841.66 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.18 [2024-11-14 02:00:52,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.16 | bwd: 3849.20 | bwd_inner: 3841.66 | bwd_allreduce: 7.50 | step: 21.19 7%|▋ | 3458/50750 [9:18:14<77:49:01, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.984769892193319e-05, 'epoch': 3.41} 7%|▋ | 3458/50750 [9:18:14<77:49:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic 
token length: 2197 [2024-11-14 02:00:58,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:00:58,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.19 | bwd_microstep: 3841.81 | bwd_inner_microstep: 3834.35 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.05 [2024-11-14 02:00:58,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.19 | bwd: 3841.83 | bwd_inner: 3834.35 | bwd_allreduce: 7.43 | step: 21.05 7%|▋ | 3459/50750 [9:18:20<77:46:32, 5.92s/it] {'loss': 0.0291, 'learning_rate': 3.984754166447544e-05, 'epoch': 3.41} 7%|▋ | 3459/50750 [9:18:20<77:46:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:01:04,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-14 02:01:04,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.59 | bwd_microstep: 3842.34 | bwd_inner_microstep: 3834.35 | bwd_allreduce_microstep: 7.94 | step_microstep: 22.05 [2024-11-14 02:01:04,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.59 | bwd: 3842.35 | bwd_inner: 3834.34 | bwd_allreduce: 7.96 | step: 22.05 7%|▋ | 3460/50750 [9:18:26<77:45:15, 5.92s/it] {'loss': 0.003, 'learning_rate': 3.9847384326182634e-05, 'epoch': 3.41} 7%|▋ | 3460/50750 [9:18:26<77:45:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:01:10,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:01:10,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.23 | bwd_microstep: 3840.76 | bwd_inner_microstep: 3833.23 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.13 [2024-11-14 02:01:10,598] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.22 | bwd: 3840.77 | bwd_inner: 3833.23 | bwd_allreduce: 7.51 | step: 21.14 7%|▋ | 3461/50750 [9:18:32<77:45:14, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9847226907055414e-05, 'epoch': 3.41} 7%|▋ | 3461/50750 [9:18:32<77:45:14, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:01:16,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:01:16,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.16 | bwd_microstep: 3848.27 | bwd_inner_microstep: 3840.75 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-14 02:01:16,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.17 | bwd: 3848.28 | bwd_inner: 3840.75 | bwd_allreduce: 7.49 | step: 21.07 7%|▋ | 3462/50750 [9:18:38<77:45:00, 5.92s/it] {'loss': 0.002, 'learning_rate': 3.9847069407094424e-05, 'epoch': 3.41} 7%|▋ | 3462/50750 [9:18:38<77:45:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:01:22,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.17 | optimizer_step: 4.92 [2024-11-14 02:01:22,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.63 | bwd_microstep: 3844.44 | bwd_inner_microstep: 3836.71 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.97 [2024-11-14 02:01:22,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.63 | bwd: 3844.45 | bwd_inner: 3836.71 | bwd_allreduce: 7.70 | step: 21.98 7%|▋ | 3463/50750 [9:18:44<77:44:18, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.984691182630031e-05, 'epoch': 3.41} 7%|▋ | 3463/50750 [9:18:44<77:44:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:01:28,348] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:01:28,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.66 | bwd_microstep: 3842.00 | bwd_inner_microstep: 3834.52 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.92 [2024-11-14 02:01:28,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.65 | bwd: 3842.01 | bwd_inner: 3834.52 | bwd_allreduce: 7.46 | step: 20.93 7%|▋ | 3464/50750 [9:18:50<77:43:16, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.98467541646737e-05, 'epoch': 3.41} 7%|▋ | 3464/50750 [9:18:50<77:43:16, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:01:34,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 02:01:34,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.82 | bwd_microstep: 3845.51 | bwd_inner_microstep: 3838.02 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.95 [2024-11-14 02:01:34,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.82 | bwd: 3845.52 | bwd_inner: 3838.02 | bwd_allreduce: 7.46 | step: 20.95 7%|▋ | 3465/50750 [9:18:56<77:43:28, 5.92s/it] {'loss': 0.7661, 'learning_rate': 3.984659642221525e-05, 'epoch': 3.41} 7%|▋ | 3465/50750 [9:18:56<77:43:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:01:40,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:01:40,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.46 | bwd_microstep: 3844.50 | bwd_inner_microstep: 3837.02 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.16 [2024-11-14 02:01:40,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.46 | bwd: 3844.51 | bwd_inner: 3837.02 | 
bwd_allreduce: 7.45 | step: 21.17 7%|▋ | 3466/50750 [9:19:02<77:42:36, 5.92s/it] {'loss': 0.3561, 'learning_rate': 3.9846438598925595e-05, 'epoch': 3.41} 7%|▋ | 3466/50750 [9:19:02<77:42:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:01:46,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.94 [2024-11-14 02:01:46,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.40 | bwd_microstep: 3840.46 | bwd_inner_microstep: 3833.01 | bwd_allreduce_microstep: 7.40 | step_microstep: 20.86 [2024-11-14 02:01:46,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.40 | bwd: 3840.47 | bwd_inner: 3833.01 | bwd_allreduce: 7.42 | step: 20.87 7%|▋ | 3467/50750 [9:19:08<77:40:34, 5.91s/it] {'loss': 0.0408, 'learning_rate': 3.984628069480538e-05, 'epoch': 3.42} 7%|▋ | 3467/50750 [9:19:08<77:40:34, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:01:52,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 02:01:52,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.25 | bwd_microstep: 3846.83 | bwd_inner_microstep: 3838.65 | bwd_allreduce_microstep: 8.12 | step_microstep: 24.16 [2024-11-14 02:01:52,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.26 | bwd: 3846.85 | bwd_inner: 3838.65 | bwd_allreduce: 8.14 | step: 24.16 7%|▋ | 3468/50750 [9:19:13<77:42:32, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.984612270985524e-05, 'epoch': 3.42} 7%|▋ | 3468/50750 [9:19:13<77:42:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:01:57,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 
[2024-11-14 02:01:57,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.94 | bwd_microstep: 3853.42 | bwd_inner_microstep: 3845.96 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-14 02:01:57,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.93 | bwd: 3853.43 | bwd_inner: 3845.96 | bwd_allreduce: 7.43 | step: 20.92 7%|▋ | 3469/50750 [9:19:19<77:47:12, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.9845964644075836e-05, 'epoch': 3.42} 7%|▋ | 3469/50750 [9:19:19<77:47:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:02:03,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:02:03,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.94 | bwd_microstep: 3839.78 | bwd_inner_microstep: 3832.27 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.06 [2024-11-14 02:02:03,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.94 | bwd: 3839.79 | bwd_inner: 3832.27 | bwd_allreduce: 7.48 | step: 21.06 7%|▋ | 3470/50750 [9:19:25<77:43:44, 5.92s/it] {'loss': 0.4112, 'learning_rate': 3.98458064974678e-05, 'epoch': 3.42} 7%|▋ | 3470/50750 [9:19:25<77:43:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:02:09,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:02:09,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.48 | bwd_microstep: 3840.79 | bwd_inner_microstep: 3833.31 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.96 [2024-11-14 02:02:09,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.48 | bwd: 3840.80 | bwd_inner: 3833.31 | bwd_allreduce: 7.45 | step: 20.96 7%|▋ | 3471/50750 [9:19:31<77:41:28, 5.92s/it] {'loss': 0.001, 
'learning_rate': 3.9845648270031774e-05, 'epoch': 3.42} 7%|▋ | 3471/50750 [9:19:31<77:41:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 02:02:15,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:02:15,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.32 | bwd_microstep: 3843.21 | bwd_inner_microstep: 3835.72 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.00 [2024-11-14 02:02:15,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.32 | bwd: 3843.22 | bwd_inner: 3835.72 | bwd_allreduce: 7.46 | step: 21.01 7%|▋ | 3472/50750 [9:19:37<77:40:23, 5.91s/it] {'loss': 0.0064, 'learning_rate': 3.984548996176841e-05, 'epoch': 3.42} 7%|▋ | 3472/50750 [9:19:37<77:40:23, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:02:21,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:02:21,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.09 | bwd_microstep: 3843.83 | bwd_inner_microstep: 3836.15 | bwd_allreduce_microstep: 7.64 | step_microstep: 21.65 [2024-11-14 02:02:21,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.09 | bwd: 3843.84 | bwd_inner: 3836.15 | bwd_allreduce: 7.66 | step: 21.66 7%|▋ | 3473/50750 [9:19:43<77:40:14, 5.91s/it] {'loss': 0.0012, 'learning_rate': 3.9845331572678346e-05, 'epoch': 3.42} 7%|▋ | 3473/50750 [9:19:43<77:40:14, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:02:27,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:02:27,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.63 | 
bwd_microstep: 3842.34 | bwd_inner_microstep: 3834.86 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.97 [2024-11-14 02:02:27,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.63 | bwd: 3842.35 | bwd_inner: 3834.86 | bwd_allreduce: 7.45 | step: 20.97 7%|▋ | 3474/50750 [9:19:49<77:39:31, 5.91s/it] {'loss': 0.0004, 'learning_rate': 3.984517310276223e-05, 'epoch': 3.42} 7%|▋ | 3474/50750 [9:19:49<77:39:31, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:02:33,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:02:33,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.95 | bwd_microstep: 3846.67 | bwd_inner_microstep: 3839.21 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.91 [2024-11-14 02:02:33,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.95 | bwd: 3846.69 | bwd_inner: 3839.21 | bwd_allreduce: 7.44 | step: 20.91 7%|▋ | 3475/50750 [9:19:55<77:39:28, 5.91s/it] {'loss': 0.0009, 'learning_rate': 3.984501455202071e-05, 'epoch': 3.42} 7%|▋ | 3475/50750 [9:19:55<77:39:28, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:02:39,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:02:39,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.95 | bwd_microstep: 3840.51 | bwd_inner_microstep: 3833.02 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.98 [2024-11-14 02:02:39,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.95 | bwd: 3840.52 | bwd_inner: 3833.02 | bwd_allreduce: 7.47 | step: 20.98 7%|▋ | 3476/50750 [9:20:01<77:38:48, 5.91s/it] {'loss': 0.0002, 'learning_rate': 3.9844855920454426e-05, 'epoch': 3.42} 7%|▋ | 3476/50750 [9:20:01<77:38:48, 
5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:02:45,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 02:02:45,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.71 | bwd_microstep: 3841.80 | bwd_inner_microstep: 3834.31 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.91 [2024-11-14 02:02:45,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.71 | bwd: 3841.81 | bwd_inner: 3834.31 | bwd_allreduce: 7.46 | step: 20.91 7%|▋ | 3477/50750 [9:20:07<77:40:24, 5.92s/it] {'loss': 0.0034, 'learning_rate': 3.9844697208064035e-05, 'epoch': 3.43} 7%|▋ | 3477/50750 [9:20:07<77:40:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:02:51,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:02:51,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.96 | bwd_microstep: 3839.87 | bwd_inner_microstep: 3832.36 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.14 [2024-11-14 02:02:51,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.96 | bwd: 3839.88 | bwd_inner: 3832.36 | bwd_allreduce: 7.48 | step: 21.15 7%|▋ | 3478/50750 [9:20:13<77:41:04, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.9844538414850166e-05, 'epoch': 3.43} 7%|▋ | 3478/50750 [9:20:13<77:41:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:02:57,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:02:57,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.73 | bwd_microstep: 3834.34 | bwd_inner_microstep: 3826.80 | bwd_allreduce_microstep: 7.49 | 
step_microstep: 21.11 [2024-11-14 02:02:57,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.73 | bwd: 3834.35 | bwd_inner: 3826.80 | bwd_allreduce: 7.50 | step: 21.11 7%|▋ | 3479/50750 [9:20:19<77:39:42, 5.91s/it] {'loss': 0.0002, 'learning_rate': 3.9844379540813484e-05, 'epoch': 3.43} 7%|▋ | 3479/50750 [9:20:19<77:39:42, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:03:02,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.20 | optimizer_step: 4.93 [2024-11-14 02:03:02,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.49 | bwd_microstep: 3842.00 | bwd_inner_microstep: 3834.12 | bwd_allreduce_microstep: 7.82 | step_microstep: 21.92 [2024-11-14 02:03:02,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3842.01 | bwd_inner: 3834.12 | bwd_allreduce: 7.84 | step: 21.93 7%|▋ | 3480/50750 [9:20:24<77:40:31, 5.92s/it] {'loss': 0.7047, 'learning_rate': 3.984422058595462e-05, 'epoch': 3.43} 7%|▋ | 3480/50750 [9:20:24<77:40:31, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:03:08,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:03:08,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.54 | bwd_microstep: 3840.54 | bwd_inner_microstep: 3833.03 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.04 [2024-11-14 02:03:08,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.52 | bwd: 3840.55 | bwd_inner: 3833.03 | bwd_allreduce: 7.48 | step: 21.04 7%|▋ | 3481/50750 [9:20:30<77:41:25, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.984406155027423e-05, 'epoch': 3.43} 7%|▋ | 3481/50750 [9:20:30<77:41:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 
02:03:14,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:03:14,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.54 | bwd_microstep: 3843.61 | bwd_inner_microstep: 3836.07 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.24 [2024-11-14 02:03:14,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.54 | bwd: 3843.62 | bwd_inner: 3836.07 | bwd_allreduce: 7.52 | step: 21.25 7%|▋ | 3482/50750 [9:20:36<77:40:26, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.984390243377297e-05, 'epoch': 3.43} 7%|▋ | 3482/50750 [9:20:36<77:40:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:03:20,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:03:20,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.06 | bwd_microstep: 3843.48 | bwd_inner_microstep: 3836.03 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.91 [2024-11-14 02:03:20,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.06 | bwd: 3843.49 | bwd_inner: 3836.03 | bwd_allreduce: 7.43 | step: 20.91 7%|▋ | 3483/50750 [9:20:42<77:39:30, 5.91s/it] {'loss': 0.0014, 'learning_rate': 3.984374323645147e-05, 'epoch': 3.43} 7%|▋ | 3483/50750 [9:20:42<77:39:30, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:03:26,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:03:26,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.09 | bwd_microstep: 3842.52 | bwd_inner_microstep: 3835.04 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.98 [2024-11-14 02:03:26,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2023.09 | bwd: 3842.53 | bwd_inner: 3835.04 | bwd_allreduce: 7.46 | step: 20.99 7%|▋ | 3484/50750 [9:20:48<77:38:51, 5.91s/it] {'loss': 0.2676, 'learning_rate': 3.984358395831039e-05, 'epoch': 3.43} 7%|▋ | 3484/50750 [9:20:48<77:38:51, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:03:32,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.94 [2024-11-14 02:03:32,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.44 | bwd_microstep: 3844.04 | bwd_inner_microstep: 3836.51 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.93 [2024-11-14 02:03:32,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.44 | bwd: 3844.06 | bwd_inner: 3836.51 | bwd_allreduce: 7.51 | step: 20.93 7%|▋ | 3485/50750 [9:20:54<77:38:32, 5.91s/it] {'loss': 0.0015, 'learning_rate': 3.9843424599350374e-05, 'epoch': 3.43} 7%|▋ | 3485/50750 [9:20:54<77:38:32, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:03:38,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:03:38,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.76 | bwd_microstep: 3840.65 | bwd_inner_microstep: 3833.20 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.96 [2024-11-14 02:03:38,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.76 | bwd: 3840.66 | bwd_inner: 3833.20 | bwd_allreduce: 7.42 | step: 20.96 7%|▋ | 3486/50750 [9:21:00<77:37:57, 5.91s/it] {'loss': 0.0038, 'learning_rate': 3.984326515957207e-05, 'epoch': 3.43} 7%|▋ | 3486/50750 [9:21:00<77:37:57, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:03:44,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:03:44,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.30 | bwd_microstep: 3848.75 | bwd_inner_microstep: 3841.23 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.29 [2024-11-14 02:03:44,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.30 | bwd: 3848.76 | bwd_inner: 3841.23 | bwd_allreduce: 7.49 | step: 21.30 7%|▋ | 3487/50750 [9:21:06<77:39:34, 5.92s/it] {'loss': 0.0763, 'learning_rate': 3.9843105638976134e-05, 'epoch': 3.44} 7%|▋ | 3487/50750 [9:21:06<77:39:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:03:50,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:03:50,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.35 | bwd_microstep: 3852.35 | bwd_inner_microstep: 3844.81 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.26 [2024-11-14 02:03:50,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.35 | bwd: 3852.36 | bwd_inner: 3844.81 | bwd_allreduce: 7.51 | step: 21.26 7%|▋ | 3488/50750 [9:21:12<77:40:56, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.984294603756322e-05, 'epoch': 3.44} 7%|▋ | 3488/50750 [9:21:12<77:40:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:03:56,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:03:56,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.44 | bwd_microstep: 3851.49 | bwd_inner_microstep: 3843.97 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.60 [2024-11-14 02:03:56,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.44 | bwd: 3851.50 | bwd_inner: 3843.97 | bwd_allreduce: 7.49 | step: 21.60 7%|▋ | 
3489/50750 [9:21:18<77:43:10, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.984278635533396e-05, 'epoch': 3.44} 7%|▋ | 3489/50750 [9:21:18<77:43:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:04:02,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:04:02,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.41 | bwd_microstep: 3841.56 | bwd_inner_microstep: 3834.01 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.07 [2024-11-14 02:04:02,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.41 | bwd: 3841.57 | bwd_inner: 3834.01 | bwd_allreduce: 7.52 | step: 21.07 7%|▋ | 3490/50750 [9:21:24<77:41:59, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9842626592289016e-05, 'epoch': 3.44} 7%|▋ | 3490/50750 [9:21:24<77:41:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:04:08,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.92 [2024-11-14 02:04:08,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.34 | bwd_microstep: 3844.88 | bwd_inner_microstep: 3835.28 | bwd_allreduce_microstep: 9.51 | step_microstep: 25.40 [2024-11-14 02:04:08,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.32 | bwd: 3844.91 | bwd_inner: 3835.28 | bwd_allreduce: 9.55 | step: 25.39 7%|▋ | 3491/50750 [9:21:30<77:44:22, 5.92s/it] {'loss': 0.0091, 'learning_rate': 3.984246674842903e-05, 'epoch': 3.44} 7%|▋ | 3491/50750 [9:21:30<77:44:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:04:14,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 4.93 [2024-11-14 02:04:14,014] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.79 | bwd_microstep: 3845.27 | bwd_inner_microstep: 3837.09 | bwd_allreduce_microstep: 8.13 | step_microstep: 22.67 [2024-11-14 02:04:14,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.78 | bwd: 3845.28 | bwd_inner: 3837.09 | bwd_allreduce: 8.15 | step: 22.68 7%|▋ | 3492/50750 [9:21:35<77:45:35, 5.92s/it] {'loss': 0.0065, 'learning_rate': 3.9842306823754665e-05, 'epoch': 3.44} 7%|▋ | 3492/50750 [9:21:35<77:45:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:04:19,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.24 | optimizer_step: 4.93 [2024-11-14 02:04:19,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2032.18 | bwd_microstep: 3845.76 | bwd_inner_microstep: 3837.88 | bwd_allreduce_microstep: 7.84 | step_microstep: 22.33 [2024-11-14 02:04:19,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2032.16 | bwd: 3845.78 | bwd_inner: 3837.88 | bwd_allreduce: 7.86 | step: 22.33 7%|▋ | 3493/50750 [9:21:41<77:49:18, 5.93s/it] {'loss': 0.0009, 'learning_rate': 3.984214681826657e-05, 'epoch': 3.44} 7%|▋ | 3493/50750 [9:21:41<77:49:18, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:04:25,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.43 | optimizer_step: 4.93 [2024-11-14 02:04:25,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.18 | bwd_microstep: 3842.98 | bwd_inner_microstep: 3835.47 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.86 [2024-11-14 02:04:25,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.17 | bwd: 3843.00 | bwd_inner: 3835.47 | bwd_allreduce: 7.49 | step: 21.86 7%|▋ | 3494/50750 [9:21:47<77:47:34, 5.93s/it] {'loss': 0.0003, 'learning_rate': 
3.9841986731965396e-05, 'epoch': 3.44} 7%|▋ | 3494/50750 [9:21:47<77:47:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:04:31,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:04:31,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1992.45 | bwd_microstep: 3781.65 | bwd_inner_microstep: 3774.15 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.01 [2024-11-14 02:04:31,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 1992.45 | bwd: 3781.66 | bwd_inner: 3774.15 | bwd_allreduce: 7.47 | step: 21.01 7%|▋ | 3495/50750 [9:21:53<77:22:46, 5.89s/it] {'loss': 0.0256, 'learning_rate': 3.984182656485179e-05, 'epoch': 3.44} 7%|▋ | 3495/50750 [9:21:53<77:22:46, 5.89s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:04:37,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:04:37,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.98 | bwd_microstep: 3862.00 | bwd_inner_microstep: 3854.43 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.47 [2024-11-14 02:04:37,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.98 | bwd: 3862.01 | bwd_inner: 3854.43 | bwd_allreduce: 7.54 | step: 21.47 7%|▋ | 3496/50750 [9:21:59<77:31:44, 5.91s/it] {'loss': 0.0108, 'learning_rate': 3.9841666316926405e-05, 'epoch': 3.44} 7%|▋ | 3496/50750 [9:21:59<77:31:44, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 02:04:43,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 02:04:43,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.46 | bwd_microstep: 
3851.43 | bwd_inner_microstep: 3842.70 | bwd_allreduce_microstep: 8.68 | step_microstep: 21.73 [2024-11-14 02:04:43,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.44 | bwd: 3851.44 | bwd_inner: 3842.70 | bwd_allreduce: 8.70 | step: 21.74 7%|▋ | 3497/50750 [9:22:05<77:40:13, 5.92s/it] {'loss': 0.0166, 'learning_rate': 3.9841505988189905e-05, 'epoch': 3.45} 7%|▋ | 3497/50750 [9:22:05<77:40:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:04:49,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 02:04:49,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.83 | bwd_microstep: 3845.94 | bwd_inner_microstep: 3838.34 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.48 [2024-11-14 02:04:49,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.81 | bwd: 3845.95 | bwd_inner: 3838.34 | bwd_allreduce: 7.57 | step: 21.49 7%|▋ | 3498/50750 [9:22:11<77:42:07, 5.92s/it] {'loss': 0.0209, 'learning_rate': 3.984134557864293e-05, 'epoch': 3.45} 7%|▋ | 3498/50750 [9:22:11<77:42:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:04:55,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.94 [2024-11-14 02:04:55,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.66 | bwd_microstep: 3844.13 | bwd_inner_microstep: 3836.63 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.42 [2024-11-14 02:04:55,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.66 | bwd: 3844.14 | bwd_inner: 3836.63 | bwd_allreduce: 7.48 | step: 21.43 7%|▋ | 3499/50750 [9:22:17<77:42:21, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.984118508828613e-05, 'epoch': 3.45} 7%|▋ | 3499/50750 [9:22:17<77:42:21, 5.92s/it]dynamic ViT batch 
size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:05:01,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 02:05:01,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.86 | bwd_microstep: 3857.96 | bwd_inner_microstep: 3850.23 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.67 [2024-11-14 02:05:01,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.86 | bwd: 3857.97 | bwd_inner: 3850.23 | bwd_allreduce: 7.70 | step: 21.68 7%|▋ | 3500/50750 [9:22:23<77:44:37, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.984102451712017e-05, 'epoch': 3.45} 7%|▋ | 3500/50750 [9:22:23<77:44:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:05:07,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.30 | optimizer_step: 4.93 [2024-11-14 02:05:07,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.00 | bwd_microstep: 3857.28 | bwd_inner_microstep: 3849.34 | bwd_allreduce_microstep: 7.89 | step_microstep: 22.10 [2024-11-14 02:05:07,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.98 | bwd: 3857.29 | bwd_inner: 3849.34 | bwd_allreduce: 7.91 | step: 22.10 7%|▋ | 3501/50750 [9:22:29<77:48:17, 5.93s/it] {'loss': 0.8231, 'learning_rate': 3.98408638651457e-05, 'epoch': 3.45} 7%|▋ | 3501/50750 [9:22:29<77:48:17, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:05:13,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:05:13,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.89 | bwd_microstep: 3857.17 | bwd_inner_microstep: 3849.65 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.26 [2024-11-14 
02:05:13,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.89 | bwd: 3857.18 | bwd_inner: 3849.65 | bwd_allreduce: 7.49 | step: 21.27 7%|▋ | 3502/50750 [9:22:35<77:49:11, 5.93s/it] {'loss': 0.0007, 'learning_rate': 3.984070313236337e-05, 'epoch': 3.45} 7%|▋ | 3502/50750 [9:22:35<77:49:11, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:05:19,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:05:19,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.90 | bwd_microstep: 3838.96 | bwd_inner_microstep: 3831.41 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.18 [2024-11-14 02:05:19,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.90 | bwd: 3838.97 | bwd_inner: 3831.41 | bwd_allreduce: 7.51 | step: 21.18 7%|▋ | 3503/50750 [9:22:41<77:43:57, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.984054231877385e-05, 'epoch': 3.45} 7%|▋ | 3503/50750 [9:22:41<77:43:57, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:05:25,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.23 | optimizer_step: 4.93 [2024-11-14 02:05:25,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.25 | bwd_microstep: 3850.31 | bwd_inner_microstep: 3842.58 | bwd_allreduce_microstep: 7.68 | step_microstep: 22.18 [2024-11-14 02:05:25,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.25 | bwd: 3850.33 | bwd_inner: 3842.58 | bwd_allreduce: 7.70 | step: 22.19 7%|▋ | 3504/50750 [9:22:47<77:44:30, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.9840381424377784e-05, 'epoch': 3.45} 7%|▋ | 3504/50750 [9:22:47<77:44:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:05:30,993] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 02:05:30,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.81 | bwd_microstep: 3847.30 | bwd_inner_microstep: 3839.58 | bwd_allreduce_microstep: 7.67 | step_microstep: 21.68 [2024-11-14 02:05:30,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.80 | bwd: 3847.32 | bwd_inner: 3839.58 | bwd_allreduce: 7.69 | step: 21.69 7%|▋ | 3505/50750 [9:22:52<77:46:33, 5.93s/it] {'loss': 0.0025, 'learning_rate': 3.984022044917581e-05, 'epoch': 3.45} 7%|▋ | 3505/50750 [9:22:52<77:46:33, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:05:36,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:05:36,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.01 | bwd_microstep: 3840.98 | bwd_inner_microstep: 3833.45 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.11 [2024-11-14 02:05:36,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.00 | bwd: 3840.99 | bwd_inner: 3833.45 | bwd_allreduce: 7.51 | step: 21.12 7%|▋ | 3506/50750 [9:22:58<77:43:54, 5.92s/it] {'loss': 0.0113, 'learning_rate': 3.984005939316862e-05, 'epoch': 3.45} 7%|▋ | 3506/50750 [9:22:58<77:43:54, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:05:42,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.49 | optimizer_step: 4.93 [2024-11-14 02:05:42,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.64 | bwd_microstep: 3840.57 | bwd_inner_microstep: 3832.83 | bwd_allreduce_microstep: 7.69 | step_microstep: 25.11 [2024-11-14 02:05:42,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.65 | bwd: 3840.59 
| bwd_inner: 3832.83 | bwd_allreduce: 7.71 | step: 25.13 7%|▋ | 3507/50750 [9:23:04<77:42:55, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.983989825635684e-05, 'epoch': 3.46} 7%|▋ | 3507/50750 [9:23:04<77:42:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:05:48,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 02:05:48,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.36 | bwd_microstep: 3842.28 | bwd_inner_microstep: 3834.75 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.38 [2024-11-14 02:05:48,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.36 | bwd: 3842.29 | bwd_inner: 3834.75 | bwd_allreduce: 7.50 | step: 21.38 7%|▋ | 3508/50750 [9:23:10<77:40:24, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.983973703874114e-05, 'epoch': 3.46} 7%|▋ | 3508/50750 [9:23:10<77:40:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:05:54,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:05:54,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.45 | bwd_microstep: 3845.53 | bwd_inner_microstep: 3838.00 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.26 [2024-11-14 02:05:54,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.45 | bwd: 3845.54 | bwd_inner: 3838.00 | bwd_allreduce: 7.50 | step: 21.27 7%|▋ | 3509/50750 [9:23:16<77:40:52, 5.92s/it] {'loss': 0.0462, 'learning_rate': 3.983957574032217e-05, 'epoch': 3.46} 7%|▋ | 3509/50750 [9:23:16<77:40:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:06:00,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | 
optimizer_step: 4.94 [2024-11-14 02:06:00,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.25 | bwd_microstep: 3842.55 | bwd_inner_microstep: 3834.93 | bwd_allreduce_microstep: 7.58 | step_microstep: 20.99 [2024-11-14 02:06:00,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.25 | bwd: 3842.57 | bwd_inner: 3834.93 | bwd_allreduce: 7.60 | step: 21.00 7%|▋ | 3510/50750 [9:23:22<77:39:09, 5.92s/it] {'loss': 0.0022, 'learning_rate': 3.983941436110059e-05, 'epoch': 3.46} 7%|▋ | 3510/50750 [9:23:22<77:39:09, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:06:06,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:06:06,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.77 | bwd_microstep: 3841.12 | bwd_inner_microstep: 3833.51 | bwd_allreduce_microstep: 7.56 | step_microstep: 21.17 [2024-11-14 02:06:06,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.77 | bwd: 3841.13 | bwd_inner: 3833.51 | bwd_allreduce: 7.57 | step: 21.18 7%|▋ | 3511/50750 [9:23:28<77:38:01, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.983925290107706e-05, 'epoch': 3.46} 7%|▋ | 3511/50750 [9:23:28<77:38:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:06:12,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:06:12,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.54 | bwd_microstep: 3844.17 | bwd_inner_microstep: 3836.70 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.11 [2024-11-14 02:06:12,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.53 | bwd: 3844.19 | bwd_inner: 3836.70 | bwd_allreduce: 7.45 | step: 21.12 7%|▋ | 3512/50750 [9:23:34<77:37:56, 
5.92s/it] {'loss': 0.6028, 'learning_rate': 3.983909136025224e-05, 'epoch': 3.46} 7%|▋ | 3512/50750 [9:23:34<77:37:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:06:18,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 02:06:18,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.28 | bwd_microstep: 3844.12 | bwd_inner_microstep: 3836.62 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.79 [2024-11-14 02:06:18,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.28 | bwd: 3844.14 | bwd_inner: 3836.62 | bwd_allreduce: 7.47 | step: 20.79 7%|▋ | 3513/50750 [9:23:40<77:38:42, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.983892973862678e-05, 'epoch': 3.46} 7%|▋ | 3513/50750 [9:23:40<77:38:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:06:24,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.92 | optimizer_step: 4.94 [2024-11-14 02:06:24,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.29 | bwd_microstep: 3843.60 | bwd_inner_microstep: 3836.00 | bwd_allreduce_microstep: 7.55 | step_microstep: 23.72 [2024-11-14 02:06:24,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.29 | bwd: 3843.61 | bwd_inner: 3836.00 | bwd_allreduce: 7.56 | step: 23.74 7%|▋ | 3514/50750 [9:23:46<77:39:03, 5.92s/it] {'loss': 0.0321, 'learning_rate': 3.983876803620134e-05, 'epoch': 3.46} 7%|▋ | 3514/50750 [9:23:46<77:39:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:06:30,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.38 | optimizer_step: 4.93 [2024-11-14 02:06:30,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2024.82 | bwd_microstep: 3841.80 | bwd_inner_microstep: 3834.18 | bwd_allreduce_microstep: 7.57 | step_microstep: 22.07 [2024-11-14 02:06:30,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.82 | bwd: 3841.81 | bwd_inner: 3834.18 | bwd_allreduce: 7.59 | step: 22.08 7%|▋ | 3515/50750 [9:23:52<77:38:37, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.9838606252976586e-05, 'epoch': 3.46} 7%|▋ | 3515/50750 [9:23:52<77:38:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:06:36,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:06:36,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.50 | bwd_microstep: 3846.36 | bwd_inner_microstep: 3838.87 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.08 [2024-11-14 02:06:36,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.49 | bwd: 3846.38 | bwd_inner: 3838.87 | bwd_allreduce: 7.47 | step: 21.09 7%|▋ | 3516/50750 [9:23:58<77:38:51, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.983844438895316e-05, 'epoch': 3.46} 7%|▋ | 3516/50750 [9:23:58<77:38:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:06:42,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 02:06:42,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.26 | bwd_microstep: 3848.38 | bwd_inner_microstep: 3840.82 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.54 [2024-11-14 02:06:42,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.26 | bwd: 3848.40 | bwd_inner: 3840.82 | bwd_allreduce: 7.53 | step: 21.55 7%|▋ | 3517/50750 [9:24:03<77:39:49, 5.92s/it] {'loss': 0.0053, 'learning_rate': 3.983828244413174e-05, 'epoch': 3.47} 7%|▋ | 3517/50750 
[9:24:03<77:39:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:06:47,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.19 | optimizer_step: 4.93 [2024-11-14 02:06:47,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.33 | bwd_microstep: 3846.41 | bwd_inner_microstep: 3838.51 | bwd_allreduce_microstep: 7.84 | step_microstep: 22.68 [2024-11-14 02:06:47,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.31 | bwd: 3846.43 | bwd_inner: 3838.51 | bwd_allreduce: 7.87 | step: 22.68 7%|▋ | 3518/50750 [9:24:09<77:41:52, 5.92s/it] {'loss': 0.0019, 'learning_rate': 3.9838120418512984e-05, 'epoch': 3.47} 7%|▋ | 3518/50750 [9:24:09<77:41:52, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:06:53,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:06:53,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3840.59 | bwd_inner_microstep: 3833.13 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.21 [2024-11-14 02:06:53,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.29 | bwd: 3840.61 | bwd_inner: 3833.13 | bwd_allreduce: 7.44 | step: 21.21 7%|▋ | 3519/50750 [9:24:15<77:40:12, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.9837958312097535e-05, 'epoch': 3.47} 7%|▋ | 3519/50750 [9:24:15<77:40:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:06:59,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:06:59,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.89 | bwd_microstep: 3846.77 | bwd_inner_microstep: 3839.30 | 
bwd_allreduce_microstep: 7.42 | step_microstep: 21.05 [2024-11-14 02:06:59,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.89 | bwd: 3846.78 | bwd_inner: 3839.30 | bwd_allreduce: 7.44 | step: 21.06 7%|▋ | 3520/50750 [9:24:21<77:39:39, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.9837796124886066e-05, 'epoch': 3.47} 7%|▋ | 3520/50750 [9:24:21<77:39:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:07:05,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:07:05,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.56 | bwd_microstep: 3839.40 | bwd_inner_microstep: 3831.95 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.94 [2024-11-14 02:07:05,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.56 | bwd: 3839.41 | bwd_inner: 3831.95 | bwd_allreduce: 7.43 | step: 20.95 7%|▋ | 3521/50750 [9:24:27<77:36:41, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.9837633856879234e-05, 'epoch': 3.47} 7%|▋ | 3521/50750 [9:24:27<77:36:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:07:11,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:07:11,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.75 | bwd_microstep: 3839.43 | bwd_inner_microstep: 3831.94 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.04 [2024-11-14 02:07:11,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.74 | bwd: 3839.44 | bwd_inner: 3831.94 | bwd_allreduce: 7.46 | step: 21.05 7%|▋ | 3522/50750 [9:24:33<77:34:44, 5.91s/it] {'loss': 0.0003, 'learning_rate': 3.9837471508077704e-05, 'epoch': 3.47} 7%|▋ | 3522/50750 [9:24:33<77:34:44, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, 
dynamic token length: 2199 [2024-11-14 02:07:17,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:07:17,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.31 | bwd_microstep: 3836.29 | bwd_inner_microstep: 3828.85 | bwd_allreduce_microstep: 7.40 | step_microstep: 21.02 [2024-11-14 02:07:17,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.31 | bwd: 3836.30 | bwd_inner: 3828.85 | bwd_allreduce: 7.41 | step: 21.03 7%|▋ | 3523/50750 [9:24:39<77:33:02, 5.91s/it] {'loss': 0.0002, 'learning_rate': 3.983730907848213e-05, 'epoch': 3.47} 7%|▋ | 3523/50750 [9:24:39<77:33:02, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:07:23,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:07:23,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.67 | bwd_microstep: 3841.43 | bwd_inner_microstep: 3833.93 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.68 [2024-11-14 02:07:23,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.67 | bwd: 3841.44 | bwd_inner: 3833.93 | bwd_allreduce: 7.47 | step: 21.68 7%|▋ | 3524/50750 [9:24:45<77:32:23, 5.91s/it] {'loss': 0.1223, 'learning_rate': 3.983714656809319e-05, 'epoch': 3.47} 7%|▋ | 3524/50750 [9:24:45<77:32:23, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:07:29,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.92 [2024-11-14 02:07:29,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.27 | bwd_microstep: 3845.57 | bwd_inner_microstep: 3838.09 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.20 [2024-11-14 02:07:29,319] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.27 | bwd: 3845.58 | bwd_inner: 3838.09 | bwd_allreduce: 7.45 | step: 21.20 7%|▋ | 3525/50750 [9:24:51<77:35:07, 5.91s/it] {'loss': 0.0004, 'learning_rate': 3.983698397691152e-05, 'epoch': 3.47} 7%|▋ | 3525/50750 [9:24:51<77:35:07, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:07:35,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:07:35,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.52 | bwd_microstep: 3850.79 | bwd_inner_microstep: 3842.87 | bwd_allreduce_microstep: 7.85 | step_microstep: 23.95 [2024-11-14 02:07:35,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.52 | bwd: 3850.81 | bwd_inner: 3842.87 | bwd_allreduce: 7.88 | step: 23.94 7%|▋ | 3526/50750 [9:24:57<77:36:45, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9836821304937806e-05, 'epoch': 3.47} 7%|▋ | 3526/50750 [9:24:57<77:36:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:07:41,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:07:41,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.78 | bwd_microstep: 3842.48 | bwd_inner_microstep: 3835.00 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.99 [2024-11-14 02:07:41,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.78 | bwd: 3842.49 | bwd_inner: 3835.00 | bwd_allreduce: 7.45 | step: 20.99 7%|▋ | 3527/50750 [9:25:03<77:35:10, 5.91s/it] {'loss': 0.0009, 'learning_rate': 3.98366585521727e-05, 'epoch': 3.47} 7%|▋ | 3527/50750 [9:25:03<77:35:10, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:07:47,060] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:07:47,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.04 | bwd_microstep: 3839.39 | bwd_inner_microstep: 3831.90 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.12 [2024-11-14 02:07:47,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.04 | bwd: 3839.40 | bwd_inner: 3831.90 | bwd_allreduce: 7.46 | step: 21.13 7%|▋ | 3528/50750 [9:25:09<77:33:55, 5.91s/it] {'loss': 0.0575, 'learning_rate': 3.983649571861686e-05, 'epoch': 3.48} 7%|▋ | 3528/50750 [9:25:09<77:33:55, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:07:52,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:07:52,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.76 | bwd_microstep: 3841.36 | bwd_inner_microstep: 3833.87 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.00 [2024-11-14 02:07:52,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.76 | bwd: 3841.37 | bwd_inner: 3833.87 | bwd_allreduce: 7.46 | step: 21.00 7%|▋ | 3529/50750 [9:25:14<77:32:54, 5.91s/it] {'loss': 0.0002, 'learning_rate': 3.983633280427096e-05, 'epoch': 3.48} 7%|▋ | 3529/50750 [9:25:14<77:32:54, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:07:58,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.49 | optimizer_step: 4.93 [2024-11-14 02:07:58,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.44 | bwd_microstep: 3844.42 | bwd_inner_microstep: 3836.82 | bwd_allreduce_microstep: 7.55 | step_microstep: 22.72 [2024-11-14 02:07:58,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.45 | bwd: 3844.44 | bwd_inner: 3836.82 | 
bwd_allreduce: 7.57 | step: 22.72 7%|▋ | 3530/50750 [9:25:20<77:33:58, 5.91s/it] {'loss': 0.0049, 'learning_rate': 3.9836169809135655e-05, 'epoch': 3.48} 7%|▋ | 3530/50750 [9:25:20<77:33:58, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:08:04,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.15 | optimizer_step: 4.93 [2024-11-14 02:08:04,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.89 | bwd_microstep: 3842.33 | bwd_inner_microstep: 3834.29 | bwd_allreduce_microstep: 7.99 | step_microstep: 21.86 [2024-11-14 02:08:04,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.89 | bwd: 3842.35 | bwd_inner: 3834.29 | bwd_allreduce: 8.01 | step: 21.86 7%|▋ | 3531/50750 [9:25:26<77:33:38, 5.91s/it] {'loss': 0.0063, 'learning_rate': 3.9836006733211613e-05, 'epoch': 3.48} 7%|▋ | 3531/50750 [9:25:26<77:33:38, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2192 [2024-11-14 02:08:10,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:08:10,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.68 | bwd_microstep: 3840.16 | bwd_inner_microstep: 3832.63 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-14 02:08:10,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.67 | bwd: 3840.17 | bwd_inner: 3832.64 | bwd_allreduce: 7.50 | step: 21.12 7%|▋ | 3532/50750 [9:25:32<77:32:56, 5.91s/it] {'loss': 0.0, 'learning_rate': 3.9835843576499494e-05, 'epoch': 3.48} 7%|▋ | 3532/50750 [9:25:32<77:32:56, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:08:16,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 
[2024-11-14 02:08:16,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.32 | bwd_microstep: 3839.82 | bwd_inner_microstep: 3832.06 | bwd_allreduce_microstep: 7.70 | step_microstep: 24.14 [2024-11-14 02:08:16,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.32 | bwd: 3839.83 | bwd_inner: 3832.06 | bwd_allreduce: 7.72 | step: 24.13 7%|▋ | 3533/50750 [9:25:38<77:33:18, 5.91s/it] {'loss': 0.0005, 'learning_rate': 3.9835680338999974e-05, 'epoch': 3.48} 7%|▋ | 3533/50750 [9:25:38<77:33:18, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:08:22,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:08:22,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.31 | bwd_microstep: 3839.53 | bwd_inner_microstep: 3831.99 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.13 [2024-11-14 02:08:22,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.29 | bwd: 3839.54 | bwd_inner: 3831.99 | bwd_allreduce: 7.51 | step: 21.13 7%|▋ | 3534/50750 [9:25:44<77:33:18, 5.91s/it] {'loss': 0.0005, 'learning_rate': 3.9835517020713704e-05, 'epoch': 3.48} 7%|▋ | 3534/50750 [9:25:44<77:33:18, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:08:28,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 02:08:28,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.84 | bwd_microstep: 3835.06 | bwd_inner_microstep: 3827.53 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.17 [2024-11-14 02:08:28,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.82 | bwd: 3835.08 | bwd_inner: 3827.53 | bwd_allreduce: 7.50 | step: 21.17 7%|▋ | 3535/50750 [9:25:50<77:31:30, 5.91s/it] {'loss': 0.2404, 
'learning_rate': 3.9835353621641356e-05, 'epoch': 3.48} 7%|▋ | 3535/50750 [9:25:50<77:31:30, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:08:34,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:08:34,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.31 | bwd_microstep: 3840.08 | bwd_inner_microstep: 3832.55 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.40 [2024-11-14 02:08:34,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.31 | bwd: 3840.10 | bwd_inner: 3832.55 | bwd_allreduce: 7.51 | step: 21.40 7%|▋ | 3536/50750 [9:25:56<77:31:35, 5.91s/it] {'loss': 0.5145, 'learning_rate': 3.983519014178359e-05, 'epoch': 3.48} 7%|▋ | 3536/50750 [9:25:56<77:31:35, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:08:40,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:08:40,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.72 | bwd_microstep: 3843.31 | bwd_inner_microstep: 3835.71 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.86 [2024-11-14 02:08:40,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.72 | bwd: 3843.32 | bwd_inner: 3835.71 | bwd_allreduce: 7.56 | step: 21.86 7%|▋ | 3537/50750 [9:26:02<77:33:15, 5.91s/it] {'loss': 0.0012, 'learning_rate': 3.9835026581141084e-05, 'epoch': 3.48} 7%|▋ | 3537/50750 [9:26:02<77:33:15, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:08:46,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 4.93 [2024-11-14 02:08:46,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.63 | 
bwd_microstep: 3839.61 | bwd_inner_microstep: 3831.90 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.51 [2024-11-14 02:08:46,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.63 | bwd: 3839.62 | bwd_inner: 3831.90 | bwd_allreduce: 7.68 | step: 21.52 7%|▋ | 3538/50750 [9:26:08<77:32:59, 5.91s/it] {'loss': 0.564, 'learning_rate': 3.9834862939714496e-05, 'epoch': 3.49} 7%|▋ | 3538/50750 [9:26:08<77:32:59, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:08:52,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.92 [2024-11-14 02:08:52,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.39 | bwd_microstep: 3842.87 | bwd_inner_microstep: 3835.31 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.49 [2024-11-14 02:08:52,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.38 | bwd: 3842.88 | bwd_inner: 3835.31 | bwd_allreduce: 7.53 | step: 21.50 7%|▋ | 3539/50750 [9:26:14<77:34:15, 5.92s/it] {'loss': 0.1425, 'learning_rate': 3.983469921750449e-05, 'epoch': 3.49} 7%|▋ | 3539/50750 [9:26:14<77:34:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:08:58,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:08:58,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.09 | bwd_microstep: 3840.67 | bwd_inner_microstep: 3833.16 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.22 [2024-11-14 02:08:58,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.09 | bwd: 3840.68 | bwd_inner: 3833.16 | bwd_allreduce: 7.48 | step: 21.22 7%|▋ | 3540/50750 [9:26:19<77:33:45, 5.91s/it] {'loss': 0.0065, 'learning_rate': 3.983453541451173e-05, 'epoch': 3.49} 7%|▋ | 3540/50750 [9:26:19<77:33:45, 
5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 02:09:03,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:09:03,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.32 | bwd_microstep: 3846.35 | bwd_inner_microstep: 3838.64 | bwd_allreduce_microstep: 7.66 | step_microstep: 22.55 [2024-11-14 02:09:03,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.32 | bwd: 3846.37 | bwd_inner: 3838.64 | bwd_allreduce: 7.68 | step: 22.55 7%|▋ | 3541/50750 [9:26:25<77:35:05, 5.92s/it] {'loss': 0.3579, 'learning_rate': 3.983437153073689e-05, 'epoch': 3.49} 7%|▋ | 3541/50750 [9:26:25<77:35:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:09:09,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:09:09,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.49 | bwd_microstep: 3850.47 | bwd_inner_microstep: 3842.99 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.09 [2024-11-14 02:09:09,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.49 | bwd: 3850.48 | bwd_inner: 3842.99 | bwd_allreduce: 7.45 | step: 21.10 7%|▋ | 3542/50750 [9:26:31<77:37:24, 5.92s/it] {'loss': 0.0095, 'learning_rate': 3.9834207566180645e-05, 'epoch': 3.49} 7%|▋ | 3542/50750 [9:26:31<77:37:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:09:15,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:09:15,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.97 | bwd_microstep: 3842.60 | bwd_inner_microstep: 3835.13 | bwd_allreduce_microstep: 7.42 | 
step_microstep: 21.01 [2024-11-14 02:09:15,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.97 | bwd: 3842.61 | bwd_inner: 3835.13 | bwd_allreduce: 7.44 | step: 21.02 7%|▋ | 3543/50750 [9:26:37<77:36:47, 5.92s/it] {'loss': 0.008, 'learning_rate': 3.9834043520843645e-05, 'epoch': 3.49} 7%|▋ | 3543/50750 [9:26:37<77:36:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-14 02:09:21,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:09:21,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.68 | bwd_microstep: 3846.21 | bwd_inner_microstep: 3838.74 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.17 [2024-11-14 02:09:21,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.67 | bwd: 3846.22 | bwd_inner: 3838.74 | bwd_allreduce: 7.44 | step: 21.17 7%|▋ | 3544/50750 [9:26:43<77:36:30, 5.92s/it] {'loss': 0.357, 'learning_rate': 3.983387939472658e-05, 'epoch': 3.49} 7%|▋ | 3544/50750 [9:26:43<77:36:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:09:27,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:09:27,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.48 | bwd_microstep: 3841.54 | bwd_inner_microstep: 3834.06 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.97 [2024-11-14 02:09:27,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.48 | bwd: 3841.56 | bwd_inner: 3834.06 | bwd_allreduce: 7.45 | step: 20.97 7%|▋ | 3545/50750 [9:26:49<77:34:41, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.983371518783009e-05, 'epoch': 3.49} 7%|▋ | 3545/50750 [9:26:49<77:34:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 
02:09:33,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:09:33,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.55 | bwd_microstep: 3848.14 | bwd_inner_microstep: 3840.48 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.01 [2024-11-14 02:09:33,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.52 | bwd: 3848.15 | bwd_inner: 3840.48 | bwd_allreduce: 7.64 | step: 21.01 7%|▋ | 3546/50750 [9:26:55<77:36:44, 5.92s/it] {'loss': 0.0606, 'learning_rate': 3.983355090015487e-05, 'epoch': 3.49} 7%|▋ | 3546/50750 [9:26:55<77:36:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:09:39,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:09:39,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.83 | bwd_microstep: 3844.64 | bwd_inner_microstep: 3837.15 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.80 [2024-11-14 02:09:39,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.83 | bwd: 3844.65 | bwd_inner: 3837.15 | bwd_allreduce: 7.46 | step: 20.81 7%|▋ | 3547/50750 [9:27:01<77:35:23, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.9833386531701574e-05, 'epoch': 3.49} 7%|▋ | 3547/50750 [9:27:01<77:35:23, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 02:09:45,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.85 | optimizer_step: 4.93 [2024-11-14 02:09:45,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.52 | bwd_microstep: 3849.88 | bwd_inner_microstep: 3842.28 | bwd_allreduce_microstep: 7.56 | step_microstep: 23.73 [2024-11-14 02:09:45,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 
2022.52 | bwd: 3849.90 | bwd_inner: 3842.28 | bwd_allreduce: 7.58 | step: 23.75 7%|▋ | 3548/50750 [9:27:07<77:36:40, 5.92s/it] {'loss': 0.0004, 'learning_rate': 3.983322208247088e-05, 'epoch': 3.5} 7%|▋ | 3548/50750 [9:27:07<77:36:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:09:51,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 02:09:51,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.38 | bwd_microstep: 3845.88 | bwd_inner_microstep: 3838.36 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.31 [2024-11-14 02:09:51,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.38 | bwd: 3845.90 | bwd_inner: 3838.36 | bwd_allreduce: 7.50 | step: 21.31 7%|▋ | 3549/50750 [9:27:13<77:37:16, 5.92s/it] {'loss': 0.2129, 'learning_rate': 3.983305755246344e-05, 'epoch': 3.5} 7%|▋ | 3549/50750 [9:27:13<77:37:16, 5.92s/it]evaluate! 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2190 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2193 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2192 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2194 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2191 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2195 C 
dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 C dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2197 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2202 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2200 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2198 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2196 A dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 E dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2201 D dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2199 B Results saved to qa_abcd_lora.csv Accuracy: 0.9153543307086615 dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:45:05,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.94 [2024-11-14 02:45:05,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2008.30 | bwd_microstep: 3831.85 | bwd_inner_microstep: 3824.28 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.92 [2024-11-14 02:45:05,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2008.29 | bwd: 3831.87 | bwd_inner: 3824.28 | bwd_allreduce: 7.55 | step: 21.92 7%|▋ | 3550/50750 [10:02:27<8368:35:49, 
638.28s/it] {'loss': 0.0014, 'learning_rate': 3.9832892941679956e-05, 'epoch': 3.5} 7%|▋ | 3550/50750 [10:02:27<8368:35:49, 638.28s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:45:10,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:45:10,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2013.36 | bwd_microstep: 3830.39 | bwd_inner_microstep: 3822.91 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.24 [2024-11-14 02:45:10,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2013.35 | bwd: 3830.41 | bwd_inner: 3822.91 | bwd_allreduce: 7.45 | step: 21.24 7%|▋ | 3551/50750 [10:02:32<5881:04:40, 448.57s/it] {'loss': 0.0004, 'learning_rate': 3.9832728250121076e-05, 'epoch': 3.5} 7%|▋ | 3551/50750 [10:02:32<5881:04:40, 448.57s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:45:16,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 02:45:16,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.68 | bwd_microstep: 3838.64 | bwd_inner_microstep: 3830.88 | bwd_allreduce_microstep: 7.71 | step_microstep: 22.11 [2024-11-14 02:45:16,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.68 | bwd: 3838.66 | bwd_inner: 3830.88 | bwd_allreduce: 7.73 | step: 22.12 7%|▋ | 3552/50750 [10:02:38<4139:54:50, 315.77s/it] {'loss': 0.0019, 'learning_rate': 3.983256347778747e-05, 'epoch': 3.5} 7%|▋ | 3552/50750 [10:02:38<4139:54:50, 315.77s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:45:22,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:45:22,818] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.06 | bwd_microstep: 3843.06 | bwd_inner_microstep: 3835.57 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.32 [2024-11-14 02:45:22,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.04 | bwd: 3843.07 | bwd_inner: 3835.57 | bwd_allreduce: 7.47 | step: 21.33 7%|▋ | 3553/50750 [10:02:44<2921:09:26, 222.81s/it] {'loss': 0.0001, 'learning_rate': 3.983239862467981e-05, 'epoch': 3.5} 7%|▋ | 3553/50750 [10:02:44<2921:09:26, 222.81s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:45:28,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.13 | optimizer_step: 4.93 [2024-11-14 02:45:28,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.54 | bwd_microstep: 3846.59 | bwd_inner_microstep: 3838.75 | bwd_allreduce_microstep: 7.79 | step_microstep: 22.49 [2024-11-14 02:45:28,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.54 | bwd: 3846.60 | bwd_inner: 3838.75 | bwd_allreduce: 7.81 | step: 22.50 7%|▋ | 3554/50750 [10:02:50<2068:04:59, 157.75s/it] {'loss': 0.0001, 'learning_rate': 3.9832233690798775e-05, 'epoch': 3.5} 7%|▋ | 3554/50750 [10:02:50<2068:04:59, 157.75s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:45:34,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:45:34,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2034.03 | bwd_microstep: 3857.58 | bwd_inner_microstep: 3850.02 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.73 [2024-11-14 02:45:34,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2034.02 | bwd: 3857.59 | bwd_inner: 3850.02 | bwd_allreduce: 7.54 | step: 21.73 7%|▋ | 3555/50750 [10:02:56<1471:00:22, 112.21s/it] {'loss': 0.0035, 
'learning_rate': 3.983206867614503e-05, 'epoch': 3.5} 7%|▋ | 3555/50750 [10:02:56<1471:00:22, 112.21s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:45:40,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:45:40,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2033.39 | bwd_microstep: 3821.35 | bwd_inner_microstep: 3813.81 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.20 [2024-11-14 02:45:40,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2033.39 | bwd: 3821.38 | bwd_inner: 3813.81 | bwd_allreduce: 7.52 | step: 21.20 7%|▋ | 3556/50750 [10:03:02<1052:53:52, 80.32s/it] {'loss': 0.0001, 'learning_rate': 3.983190358071926e-05, 'epoch': 3.5} 7%|▋ | 3556/50750 [10:03:02<1052:53:52, 80.32s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:45:46,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 02:45:46,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2015.14 | bwd_microstep: 3834.71 | bwd_inner_microstep: 3826.75 | bwd_allreduce_microstep: 7.90 | step_microstep: 23.98 [2024-11-14 02:45:46,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2015.14 | bwd: 3834.73 | bwd_inner: 3826.75 | bwd_allreduce: 7.93 | step: 23.98 7%|▋ | 3557/50750 [10:03:08<760:13:10, 57.99s/it] {'loss': 0.0002, 'learning_rate': 3.983173840452212e-05, 'epoch': 3.5} 7%|▋ | 3557/50750 [10:03:08<760:13:10, 57.99s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:45:52,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 02:45:52,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2026.37 | bwd_microstep: 3824.82 | bwd_inner_microstep: 3817.09 | bwd_allreduce_microstep: 7.68 | step_microstep: 21.96 [2024-11-14 02:45:52,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.36 | bwd: 3824.84 | bwd_inner: 3817.09 | bwd_allreduce: 7.70 | step: 21.96 7%|▋ | 3558/50750 [10:03:14<555:21:39, 42.37s/it] {'loss': 0.0133, 'learning_rate': 3.98315731475543e-05, 'epoch': 3.51} 7%|▋ | 3558/50750 [10:03:14<555:21:39, 42.37s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:45:58,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.92 [2024-11-14 02:45:58,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.61 | bwd_microstep: 3833.92 | bwd_inner_microstep: 3826.21 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.40 [2024-11-14 02:45:58,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.59 | bwd: 3833.93 | bwd_inner: 3826.21 | bwd_allreduce: 7.68 | step: 21.41 7%|▋ | 3559/50750 [10:03:20<411:58:02, 31.43s/it] {'loss': 0.0009, 'learning_rate': 3.983140780981645e-05, 'epoch': 3.51} 7%|▋ | 3559/50750 [10:03:20<411:58:02, 31.43s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:46:04,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.92 [2024-11-14 02:46:04,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.98 | bwd_microstep: 3837.77 | bwd_inner_microstep: 3830.28 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.11 [2024-11-14 02:46:04,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.97 | bwd: 3837.79 | bwd_inner: 3830.28 | bwd_allreduce: 7.47 | step: 21.11 7%|▋ | 3560/50750 [10:03:26<311:36:27, 23.77s/it] {'loss': 0.0005, 'learning_rate': 3.9831242391309266e-05, 'epoch': 3.51} 7%|▋ | 
3560/50750 [10:03:26<311:36:27, 23.77s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:46:10,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.18 | optimizer_step: 4.93 [2024-11-14 02:46:10,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2016.35 | bwd_microstep: 3834.71 | bwd_inner_microstep: 3827.19 | bwd_allreduce_microstep: 7.48 | step_microstep: 22.53 [2024-11-14 02:46:10,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2016.35 | bwd: 3834.72 | bwd_inner: 3827.19 | bwd_allreduce: 7.50 | step: 22.54 7%|▋ | 3561/50750 [10:03:32<241:20:03, 18.41s/it] {'loss': 0.0002, 'learning_rate': 3.983107689203341e-05, 'epoch': 3.51} 7%|▋ | 3561/50750 [10:03:32<241:20:03, 18.41s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:46:16,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:46:16,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.59 | bwd_microstep: 3835.29 | bwd_inner_microstep: 3827.81 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.66 [2024-11-14 02:46:16,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.58 | bwd: 3835.30 | bwd_inner: 3827.81 | bwd_allreduce: 7.45 | step: 20.66 7%|▋ | 3562/50750 [10:03:37<192:11:52, 14.66s/it] {'loss': 0.0001, 'learning_rate': 3.983091131198956e-05, 'epoch': 3.51} 7%|▋ | 3562/50750 [10:03:37<192:11:52, 14.66s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:46:21,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:46:21,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.11 | bwd_microstep: 3835.72 | bwd_inner_microstep: 
3828.26 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.82 [2024-11-14 02:46:21,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.09 | bwd: 3835.73 | bwd_inner: 3828.26 | bwd_allreduce: 7.44 | step: 20.83 7%|▋ | 3563/50750 [10:03:43<157:46:43, 12.04s/it] {'loss': 0.0001, 'learning_rate': 3.9830745651178385e-05, 'epoch': 3.51} 7%|▋ | 3563/50750 [10:03:43<157:46:43, 12.04s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 02:46:27,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 5.06 [2024-11-14 02:46:27,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.69 | bwd_microstep: 3834.12 | bwd_inner_microstep: 3826.60 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.50 [2024-11-14 02:46:27,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.69 | bwd: 3834.14 | bwd_inner: 3826.60 | bwd_allreduce: 7.49 | step: 21.50 7%|▋ | 3564/50750 [10:03:49<133:38:50, 10.20s/it] {'loss': 0.0392, 'learning_rate': 3.983057990960056e-05, 'epoch': 3.51} 7%|▋ | 3564/50750 [10:03:49<133:38:50, 10.20s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:46:33,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 02:46:33,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.77 | bwd_microstep: 3836.44 | bwd_inner_microstep: 3828.90 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.25 [2024-11-14 02:46:33,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.77 | bwd: 3836.45 | bwd_inner: 3828.90 | bwd_allreduce: 7.51 | step: 21.25 7%|▋ | 3565/50750 [10:03:55<116:46:04, 8.91s/it] {'loss': 0.657, 'learning_rate': 3.9830414087256776e-05, 'epoch': 3.51} 7%|▋ | 3565/50750 [10:03:55<116:46:04, 8.91s/it]dynamic ViT batch size: 32, 
images per sample: 8.0, dynamic token length: 2195 [2024-11-14 02:46:39,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:46:39,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2016.09 | bwd_microstep: 3828.36 | bwd_inner_microstep: 3820.85 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.04 [2024-11-14 02:46:39,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2016.09 | bwd: 3828.37 | bwd_inner: 3820.85 | bwd_allreduce: 7.48 | step: 21.05 7%|▋ | 3566/50750 [10:04:01<104:54:00, 8.00s/it] {'loss': 0.0, 'learning_rate': 3.983024818414769e-05, 'epoch': 3.51} 7%|▋ | 3566/50750 [10:04:01<104:54:00, 8.00s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:46:45,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 02:46:45,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.16 | bwd_microstep: 3832.20 | bwd_inner_microstep: 3824.70 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.17 [2024-11-14 02:46:45,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.16 | bwd: 3832.22 | bwd_inner: 3824.70 | bwd_allreduce: 7.47 | step: 21.17 7%|▋ | 3567/50750 [10:04:07<96:37:12, 7.37s/it] {'loss': 1.2193, 'learning_rate': 3.983008220027398e-05, 'epoch': 3.51} 7%|▋ | 3567/50750 [10:04:07<96:37:12, 7.37s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:46:51,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:46:51,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.46 | bwd_microstep: 3843.69 | bwd_inner_microstep: 3836.21 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.96 [2024-11-14 
02:46:51,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.46 | bwd: 3843.70 | bwd_inner: 3836.21 | bwd_allreduce: 7.46 | step: 20.96 7%|▋ | 3568/50750 [10:04:13<90:52:04, 6.93s/it] {'loss': 0.0101, 'learning_rate': 3.9829916135636335e-05, 'epoch': 3.52} 7%|▋ | 3568/50750 [10:04:13<90:52:04, 6.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:46:57,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:46:57,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.36 | bwd_microstep: 3831.99 | bwd_inner_microstep: 3824.48 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-14 02:46:57,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.36 | bwd: 3832.00 | bwd_inner: 3824.48 | bwd_allreduce: 7.48 | step: 21.10 7%|▋ | 3569/50750 [10:04:19<86:47:38, 6.62s/it] {'loss': 0.9726, 'learning_rate': 3.982974999023541e-05, 'epoch': 3.52} 7%|▋ | 3569/50750 [10:04:19<86:47:38, 6.62s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:47:03,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.36 [2024-11-14 02:47:03,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.87 | bwd_microstep: 3838.47 | bwd_inner_microstep: 3830.95 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.56 [2024-11-14 02:47:03,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.87 | bwd: 3838.48 | bwd_inner: 3830.95 | bwd_allreduce: 7.49 | step: 21.57 7%|▋ | 3570/50750 [10:04:25<83:58:05, 6.41s/it] {'loss': 0.0001, 'learning_rate': 3.98295837640719e-05, 'epoch': 3.52} 7%|▋ | 3570/50750 [10:04:25<83:58:05, 6.41s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:47:09,156] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 02:47:09,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.43 | bwd_microstep: 3842.44 | bwd_inner_microstep: 3834.90 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.44 [2024-11-14 02:47:09,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.43 | bwd: 3842.45 | bwd_inner: 3834.90 | bwd_allreduce: 7.51 | step: 21.44 7%|▋ | 3571/50750 [10:04:31<82:00:19, 6.26s/it] {'loss': 0.3518, 'learning_rate': 3.982941745714648e-05, 'epoch': 3.52} 7%|▋ | 3571/50750 [10:04:31<82:00:19, 6.26s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:47:15,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:47:15,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.50 | bwd_microstep: 3825.23 | bwd_inner_microstep: 3817.72 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.12 [2024-11-14 02:47:15,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.49 | bwd: 3825.24 | bwd_inner: 3817.72 | bwd_allreduce: 7.48 | step: 21.12 7%|▋ | 3572/50750 [10:04:37<80:33:56, 6.15s/it] {'loss': 0.0006, 'learning_rate': 3.982925106945981e-05, 'epoch': 3.52} 7%|▋ | 3572/50750 [10:04:37<80:33:56, 6.15s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:47:20,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 5.00 [2024-11-14 02:47:20,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.82 | bwd_microstep: 3835.49 | bwd_inner_microstep: 3827.99 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.13 [2024-11-14 02:47:20,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.82 | bwd: 
3835.50 | bwd_inner: 3827.99 | bwd_allreduce: 7.47 | step: 21.13 7%|▋ | 3573/50750 [10:04:42<79:35:57, 6.07s/it] {'loss': 0.001, 'learning_rate': 3.98290846010126e-05, 'epoch': 3.52} 7%|▋ | 3573/50750 [10:04:42<79:35:57, 6.07s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:47:26,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:47:26,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.41 | bwd_microstep: 3828.02 | bwd_inner_microstep: 3820.49 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.23 [2024-11-14 02:47:26,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.41 | bwd: 3828.04 | bwd_inner: 3820.49 | bwd_allreduce: 7.51 | step: 21.24 7%|▋ | 3574/50750 [10:04:48<78:53:19, 6.02s/it] {'loss': 0.0007, 'learning_rate': 3.9828918051805494e-05, 'epoch': 3.52} 7%|▋ | 3574/50750 [10:04:48<78:53:19, 6.02s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:47:32,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:47:32,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.85 | bwd_microstep: 3831.69 | bwd_inner_microstep: 3824.17 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.17 [2024-11-14 02:47:32,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.85 | bwd: 3831.70 | bwd_inner: 3824.17 | bwd_allreduce: 7.49 | step: 21.18 7%|▋ | 3575/50750 [10:04:54<78:24:34, 5.98s/it] {'loss': 0.0012, 'learning_rate': 3.982875142183919e-05, 'epoch': 3.52} 7%|▋ | 3575/50750 [10:04:54<78:24:34, 5.98s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:47:38,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
3.03 | optimizer_step: 4.92 [2024-11-14 02:47:38,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.10 | bwd_microstep: 3828.81 | bwd_inner_microstep: 3821.28 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.07 [2024-11-14 02:47:38,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.10 | bwd: 3828.82 | bwd_inner: 3821.28 | bwd_allreduce: 7.50 | step: 21.08 7%|▋ | 3576/50750 [10:05:00<78:03:50, 5.96s/it] {'loss': 0.0003, 'learning_rate': 3.9828584711114364e-05, 'epoch': 3.52} 7%|▋ | 3576/50750 [10:05:00<78:03:50, 5.96s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:47:44,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:47:44,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.50 | bwd_microstep: 3828.99 | bwd_inner_microstep: 3821.47 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.06 [2024-11-14 02:47:44,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.50 | bwd: 3829.01 | bwd_inner: 3821.47 | bwd_allreduce: 7.50 | step: 21.06 7%|▋ | 3577/50750 [10:05:06<77:48:58, 5.94s/it] {'loss': 0.0066, 'learning_rate': 3.982841791963169e-05, 'epoch': 3.52} 7%|▋ | 3577/50750 [10:05:06<77:48:58, 5.94s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:47:50,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 02:47:50,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.43 | bwd_microstep: 3846.42 | bwd_inner_microstep: 3838.80 | bwd_allreduce_microstep: 7.57 | step_microstep: 24.29 [2024-11-14 02:47:50,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.43 | bwd: 3846.44 | bwd_inner: 3838.80 | bwd_allreduce: 7.59 | step: 24.29 7%|▋ | 3578/50750 
[10:05:12<77:43:31, 5.93s/it] {'loss': 0.0032, 'learning_rate': 3.982825104739185e-05, 'epoch': 3.53} 7%|▋ | 3578/50750 [10:05:12<77:43:31, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:47:56,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:47:56,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2017.60 | bwd_microstep: 3838.57 | bwd_inner_microstep: 3831.01 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.05 [2024-11-14 02:47:56,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2017.60 | bwd: 3838.59 | bwd_inner: 3831.01 | bwd_allreduce: 7.54 | step: 21.06 7%|▋ | 3579/50750 [10:05:18<77:37:07, 5.92s/it] {'loss': 0.0881, 'learning_rate': 3.982808409439552e-05, 'epoch': 3.53} 7%|▋ | 3579/50750 [10:05:18<77:37:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:48:02,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:48:02,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.09 | bwd_microstep: 3846.00 | bwd_inner_microstep: 3838.48 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.17 [2024-11-14 02:48:02,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.09 | bwd: 3846.01 | bwd_inner: 3838.48 | bwd_allreduce: 7.50 | step: 21.17 7%|▋ | 3580/50750 [10:05:24<77:34:04, 5.92s/it] {'loss': 0.0017, 'learning_rate': 3.982791706064338e-05, 'epoch': 3.53} 7%|▋ | 3580/50750 [10:05:24<77:34:04, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:48:08,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:48:08,175] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.72 | bwd_microstep: 3841.24 | bwd_inner_microstep: 3833.75 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.07 [2024-11-14 02:48:08,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.72 | bwd: 3841.25 | bwd_inner: 3833.75 | bwd_allreduce: 7.46 | step: 21.07 7%|▋ | 3581/50750 [10:05:30<77:30:58, 5.92s/it] {'loss': 0.0401, 'learning_rate': 3.982774994613613e-05, 'epoch': 3.53} 7%|▋ | 3581/50750 [10:05:30<77:30:58, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2195 [2024-11-14 02:48:14,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:48:14,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.60 | bwd_microstep: 3831.04 | bwd_inner_microstep: 3823.52 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.06 [2024-11-14 02:48:14,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.60 | bwd: 3831.06 | bwd_inner: 3823.52 | bwd_allreduce: 7.49 | step: 21.06 7%|▋ | 3582/50750 [10:05:36<77:26:25, 5.91s/it] {'loss': 0.0004, 'learning_rate': 3.982758275087442e-05, 'epoch': 3.53} 7%|▋ | 3582/50750 [10:05:36<77:26:25, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:48:19,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:48:19,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.34 | bwd_microstep: 3831.61 | bwd_inner_microstep: 3824.11 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-14 02:48:19,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.34 | bwd: 3831.62 | bwd_inner: 3824.11 | bwd_allreduce: 7.48 | step: 21.02 7%|▋ | 3583/50750 [10:05:41<77:23:15, 5.91s/it] {'loss': 0.6633, 'learning_rate': 
3.982741547485896e-05, 'epoch': 3.53} 7%|▋ | 3583/50750 [10:05:41<77:23:15, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:48:25,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:48:25,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.89 | bwd_microstep: 3835.53 | bwd_inner_microstep: 3828.02 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.10 [2024-11-14 02:48:25,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.89 | bwd: 3835.54 | bwd_inner: 3828.02 | bwd_allreduce: 7.48 | step: 21.10 7%|▋ | 3584/50750 [10:05:47<77:22:06, 5.91s/it] {'loss': 0.0007, 'learning_rate': 3.982724811809041e-05, 'epoch': 3.53} 7%|▋ | 3584/50750 [10:05:47<77:22:06, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 02:48:31,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.94 [2024-11-14 02:48:31,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.34 | bwd_microstep: 3834.62 | bwd_inner_microstep: 3827.11 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.03 [2024-11-14 02:48:31,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.34 | bwd: 3834.64 | bwd_inner: 3827.11 | bwd_allreduce: 7.48 | step: 21.04 7%|▋ | 3585/50750 [10:05:53<77:21:02, 5.90s/it] {'loss': 0.0497, 'learning_rate': 3.9827080680569454e-05, 'epoch': 3.53} 7%|▋ | 3585/50750 [10:05:53<77:21:02, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:48:37,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 02:48:37,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2018.87 | bwd_microstep: 
3836.16 | bwd_inner_microstep: 3828.64 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.29 [2024-11-14 02:48:37,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2018.87 | bwd: 3836.17 | bwd_inner: 3828.64 | bwd_allreduce: 7.49 | step: 21.31 7%|▋ | 3586/50750 [10:05:59<77:20:25, 5.90s/it] {'loss': 0.0033, 'learning_rate': 3.982691316229678e-05, 'epoch': 3.53} 7%|▋ | 3586/50750 [10:05:59<77:20:25, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:48:43,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.94 [2024-11-14 02:48:43,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.38 | bwd_microstep: 3837.19 | bwd_inner_microstep: 3829.65 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.33 [2024-11-14 02:48:43,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.38 | bwd: 3837.20 | bwd_inner: 3829.65 | bwd_allreduce: 7.51 | step: 21.34 7%|▋ | 3587/50750 [10:06:05<77:20:25, 5.90s/it] {'loss': 0.0008, 'learning_rate': 3.9826745563273074e-05, 'epoch': 3.53} 7%|▋ | 3587/50750 [10:06:05<77:20:25, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:48:49,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 02:48:49,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.45 | bwd_microstep: 3832.93 | bwd_inner_microstep: 3825.32 | bwd_allreduce_microstep: 7.57 | step_microstep: 21.16 [2024-11-14 02:48:49,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.45 | bwd: 3832.94 | bwd_inner: 3825.32 | bwd_allreduce: 7.59 | step: 21.17 7%|▋ | 3588/50750 [10:06:11<77:19:33, 5.90s/it] {'loss': 0.0222, 'learning_rate': 3.9826577883499e-05, 'epoch': 3.53} 7%|▋ | 3588/50750 [10:06:11<77:19:33, 5.90s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 02:48:55,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:48:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.08 | bwd_microstep: 3831.06 | bwd_inner_microstep: 3823.53 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.00 [2024-11-14 02:48:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.07 | bwd: 3831.07 | bwd_inner: 3823.53 | bwd_allreduce: 7.50 | step: 21.01 7%|▋ | 3589/50750 [10:06:17<77:18:59, 5.90s/it] {'loss': 0.003, 'learning_rate': 3.982641012297527e-05, 'epoch': 3.54} 7%|▋ | 3589/50750 [10:06:17<77:18:59, 5.90s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:49:01,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:49:01,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2020.95 | bwd_microstep: 3846.40 | bwd_inner_microstep: 3838.79 | bwd_allreduce_microstep: 7.55 | step_microstep: 21.05 [2024-11-14 02:49:01,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2020.95 | bwd: 3846.41 | bwd_inner: 3838.79 | bwd_allreduce: 7.58 | step: 21.06 7%|▋ | 3590/50750 [10:06:23<77:21:40, 5.91s/it] {'loss': 0.0116, 'learning_rate': 3.982624228170254e-05, 'epoch': 3.54} 7%|▋ | 3590/50750 [10:06:23<77:21:40, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:49:07,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:49:07,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2019.74 | bwd_microstep: 3846.82 | bwd_inner_microstep: 3839.27 | bwd_allreduce_microstep: 7.50 | step_microstep: 20.85 
[2024-11-14 02:49:07,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2019.74 | bwd: 3846.83 | bwd_inner: 3839.27 | bwd_allreduce: 7.52 | step: 20.85 7%|▋ | 3591/50750 [10:06:29<77:23:21, 5.91s/it] {'loss': 0.003, 'learning_rate': 3.982607435968151e-05, 'epoch': 3.54} 7%|▋ | 3591/50750 [10:06:29<77:23:21, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202 [2024-11-14 02:49:13,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.25 | optimizer_step: 4.93 [2024-11-14 02:49:13,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.22 | bwd_microstep: 3848.45 | bwd_inner_microstep: 3840.96 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.78 [2024-11-14 02:49:13,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.22 | bwd: 3848.47 | bwd_inner: 3840.96 | bwd_allreduce: 7.47 | step: 21.78 7%|▋ | 3592/50750 [10:06:35<77:26:59, 5.91s/it] {'loss': 0.0024, 'learning_rate': 3.982590635691286e-05, 'epoch': 3.54} 7%|▋ | 3592/50750 [10:06:35<77:26:59, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:49:19,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:49:19,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.14 | bwd_microstep: 3847.27 | bwd_inner_microstep: 3839.78 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.99 [2024-11-14 02:49:19,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.14 | bwd: 3847.28 | bwd_inner: 3839.78 | bwd_allreduce: 7.46 | step: 21.00 7%|▋ | 3593/50750 [10:06:41<77:27:51, 5.91s/it] {'loss': 0.002, 'learning_rate': 3.982573827339727e-05, 'epoch': 3.54} 7%|▋ | 3593/50750 [10:06:41<77:27:51, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:49:24,955] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:49:24,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.06 | bwd_microstep: 3840.07 | bwd_inner_microstep: 3832.60 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.88 [2024-11-14 02:49:24,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.06 | bwd: 3840.08 | bwd_inner: 3832.60 | bwd_allreduce: 7.44 | step: 20.88 7%|▋ | 3594/50750 [10:06:46<77:26:42, 5.91s/it] {'loss': 0.0092, 'learning_rate': 3.982557010913543e-05, 'epoch': 3.54} 7%|▋ | 3594/50750 [10:06:46<77:26:42, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-14 02:49:30,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 5.08 [2024-11-14 02:49:30,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.65 | bwd_microstep: 3847.51 | bwd_inner_microstep: 3840.06 | bwd_allreduce_microstep: 7.42 | step_microstep: 21.05 [2024-11-14 02:49:30,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.65 | bwd: 3847.53 | bwd_inner: 3840.06 | bwd_allreduce: 7.43 | step: 21.06 7%|▋ | 3595/50750 [10:06:52<77:28:09, 5.91s/it] {'loss': 0.0014, 'learning_rate': 3.982540186412802e-05, 'epoch': 3.54} 7%|▋ | 3595/50750 [10:06:52<77:28:09, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:49:36,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:49:36,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.92 | bwd_microstep: 3849.17 | bwd_inner_microstep: 3841.67 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.95 [2024-11-14 02:49:36,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | 
bwd: 3849.18 | bwd_inner: 3841.67 | bwd_allreduce: 7.47 | step: 20.96 7%|▋ | 3596/50750 [10:06:58<77:29:40, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.982523353837574e-05, 'epoch': 3.54} 7%|▋ | 3596/50750 [10:06:58<77:29:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:49:42,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 02:49:42,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.90 | bwd_microstep: 3842.04 | bwd_inner_microstep: 3834.56 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.07 [2024-11-14 02:49:42,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.90 | bwd: 3842.05 | bwd_inner: 3834.56 | bwd_allreduce: 7.45 | step: 21.07 7%|▋ | 3597/50750 [10:07:04<77:30:27, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9825065131879246e-05, 'epoch': 3.54} 7%|▋ | 3597/50750 [10:07:04<77:30:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:49:48,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:49:48,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.68 | bwd_microstep: 3851.42 | bwd_inner_microstep: 3843.92 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.25 [2024-11-14 02:49:48,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.68 | bwd: 3851.43 | bwd_inner: 3843.92 | bwd_allreduce: 7.47 | step: 21.25 7%|▋ | 3598/50750 [10:07:10<77:32:24, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.982489664463925e-05, 'epoch': 3.54} 7%|▋ | 3598/50750 [10:07:10<77:32:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:49:54,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 02:49:54,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.61 | bwd_microstep: 3843.98 | bwd_inner_microstep: 3836.49 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.78 [2024-11-14 02:49:54,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.61 | bwd: 3843.99 | bwd_inner: 3836.49 | bwd_allreduce: 7.46 | step: 20.79 7%|▋ | 3599/50750 [10:07:16<77:32:01, 5.92s/it] {'loss': 0.0062, 'learning_rate': 3.982472807665643e-05, 'epoch': 3.55} 7%|▋ | 3599/50750 [10:07:16<77:32:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:50:00,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.93 [2024-11-14 02:50:00,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.52 | bwd_microstep: 3842.25 | bwd_inner_microstep: 3834.78 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.08 [2024-11-14 02:50:00,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.52 | bwd: 3842.26 | bwd_inner: 3834.78 | bwd_allreduce: 7.44 | step: 21.08 7%|▋ | 3600/50750 [10:07:22<77:31:29, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.9824559427931466e-05, 'epoch': 3.55} 7%|▋ | 3600/50750 [10:07:22<77:31:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:50:06,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:50:06,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.05 | bwd_microstep: 3846.55 | bwd_inner_microstep: 3839.00 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.46 [2024-11-14 02:50:06,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.05 | bwd: 3846.56 | bwd_inner: 3839.00 | bwd_allreduce: 7.52 | step: 21.46 7%|▋ | 
3601/50750 [10:07:28<77:31:27, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.982439069846505e-05, 'epoch': 3.55} 7%|▋ | 3601/50750 [10:07:28<77:31:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:50:12,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-14 02:50:12,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.39 | bwd_microstep: 3844.61 | bwd_inner_microstep: 3836.10 | bwd_allreduce_microstep: 8.47 | step_microstep: 22.11 [2024-11-14 02:50:12,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.39 | bwd: 3844.63 | bwd_inner: 3836.10 | bwd_allreduce: 8.48 | step: 22.11 7%|▋ | 3602/50750 [10:07:34<77:31:53, 5.92s/it] {'loss': 0.5034, 'learning_rate': 3.982422188825788e-05, 'epoch': 3.55} 7%|▋ | 3602/50750 [10:07:34<77:31:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:50:18,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:50:18,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.03 | bwd_microstep: 3842.13 | bwd_inner_microstep: 3834.65 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.94 [2024-11-14 02:50:18,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.03 | bwd: 3842.15 | bwd_inner: 3834.65 | bwd_allreduce: 7.46 | step: 20.94 7%|▋ | 3603/50750 [10:07:40<77:30:00, 5.92s/it] {'loss': 0.3297, 'learning_rate': 3.982405299731063e-05, 'epoch': 3.55} 7%|▋ | 3603/50750 [10:07:40<77:30:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:50:24,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:50:24,145] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.37 | bwd_microstep: 3842.68 | bwd_inner_microstep: 3835.17 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.93 [2024-11-14 02:50:24,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.37 | bwd: 3842.69 | bwd_inner: 3835.17 | bwd_allreduce: 7.48 | step: 20.93 7%|▋ | 3604/50750 [10:07:46<77:28:48, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.982388402562399e-05, 'epoch': 3.55} 7%|▋ | 3604/50750 [10:07:46<77:28:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:50:30,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:50:30,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3844.35 | bwd_inner_microstep: 3836.68 | bwd_allreduce_microstep: 7.62 | step_microstep: 21.67 [2024-11-14 02:50:30,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.24 | bwd: 3844.37 | bwd_inner: 3836.68 | bwd_allreduce: 7.64 | step: 21.67 7%|▋ | 3605/50750 [10:07:52<77:28:33, 5.92s/it] {'loss': 0.0727, 'learning_rate': 3.982371497319865e-05, 'epoch': 3.55} 7%|▋ | 3605/50750 [10:07:52<77:28:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:50:35,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:50:35,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.92 | bwd_microstep: 3842.15 | bwd_inner_microstep: 3834.63 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.27 [2024-11-14 02:50:35,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.92 | bwd: 3842.17 | bwd_inner: 3834.63 | bwd_allreduce: 7.50 | step: 21.28 7%|▋ | 3606/50750 [10:07:57<77:28:36, 5.92s/it] {'loss': 0.0063, 'learning_rate': 
3.9823545840035295e-05, 'epoch': 3.55} 7%|▋ | 3606/50750 [10:07:57<77:28:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:50:41,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:50:41,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.93 | bwd_microstep: 3842.92 | bwd_inner_microstep: 3835.45 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.88 [2024-11-14 02:50:41,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.93 | bwd: 3842.93 | bwd_inner: 3835.45 | bwd_allreduce: 7.44 | step: 20.88 7%|▋ | 3607/50750 [10:08:03<77:28:49, 5.92s/it] {'loss': 0.003, 'learning_rate': 3.9823376626134625e-05, 'epoch': 3.55} 7%|▋ | 3607/50750 [10:08:03<77:28:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:50:47,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:50:47,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.67 | bwd_microstep: 3848.40 | bwd_inner_microstep: 3840.91 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.15 [2024-11-14 02:50:47,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.67 | bwd: 3848.41 | bwd_inner: 3840.91 | bwd_allreduce: 7.47 | step: 21.15 7%|▋ | 3608/50750 [10:08:09<77:29:06, 5.92s/it] {'loss': 0.0081, 'learning_rate': 3.982320733149731e-05, 'epoch': 3.55} 7%|▋ | 3608/50750 [10:08:09<77:29:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:50:53,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:50:53,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.00 | bwd_microstep: 
3845.02 | bwd_inner_microstep: 3837.54 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.23 [2024-11-14 02:50:53,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.00 | bwd: 3845.03 | bwd_inner: 3837.54 | bwd_allreduce: 7.45 | step: 21.23 7%|▋ | 3609/50750 [10:08:15<77:28:41, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.982303795612406e-05, 'epoch': 3.56} 7%|▋ | 3609/50750 [10:08:15<77:28:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:50:59,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-14 02:50:59,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.64 | bwd_microstep: 3847.16 | bwd_inner_microstep: 3839.67 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.84 [2024-11-14 02:50:59,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.64 | bwd: 3847.18 | bwd_inner: 3839.67 | bwd_allreduce: 7.46 | step: 21.84 7%|▋ | 3610/50750 [10:08:21<77:28:48, 5.92s/it] {'loss': 0.1114, 'learning_rate': 3.982286850001555e-05, 'epoch': 3.56} 7%|▋ | 3610/50750 [10:08:21<77:28:48, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:51:05,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:51:05,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.21 | bwd_microstep: 3853.58 | bwd_inner_microstep: 3846.10 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.28 [2024-11-14 02:51:05,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.21 | bwd: 3853.59 | bwd_inner: 3846.10 | bwd_allreduce: 7.45 | step: 21.28 7%|▋ | 3611/50750 [10:08:27<77:30:52, 5.92s/it] {'loss': 0.2983, 'learning_rate': 3.9822698963172476e-05, 'epoch': 3.56} 7%|▋ | 3611/50750 [10:08:27<77:30:52, 5.92s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:51:11,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:51:11,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.98 | bwd_microstep: 3845.26 | bwd_inner_microstep: 3837.79 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.82 [2024-11-14 02:51:11,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.95 | bwd: 3845.27 | bwd_inner: 3837.79 | bwd_allreduce: 7.45 | step: 20.82 7%|▋ | 3612/50750 [10:08:33<77:30:25, 5.92s/it] {'loss': 0.0048, 'learning_rate': 3.9822529345595534e-05, 'epoch': 3.56} 7%|▋ | 3612/50750 [10:08:33<77:30:25, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:51:17,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.08 | optimizer_step: 4.93 [2024-11-14 02:51:17,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.68 | bwd_microstep: 3845.30 | bwd_inner_microstep: 3837.81 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.94 [2024-11-14 02:51:17,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.68 | bwd: 3845.31 | bwd_inner: 3837.81 | bwd_allreduce: 7.46 | step: 20.95 7%|▋ | 3613/50750 [10:08:39<77:30:02, 5.92s/it] {'loss': 0.2664, 'learning_rate': 3.982235964728541e-05, 'epoch': 3.56} 7%|▋ | 3613/50750 [10:08:39<77:30:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:51:23,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.14 | optimizer_step: 4.93 [2024-11-14 02:51:23,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3842.92 | bwd_inner_microstep: 3835.39 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.43 
[2024-11-14 02:51:23,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3842.93 | bwd_inner: 3835.39 | bwd_allreduce: 7.50 | step: 21.44 7%|▋ | 3614/50750 [10:08:45<77:29:03, 5.92s/it] {'loss': 0.0021, 'learning_rate': 3.9822189868242784e-05, 'epoch': 3.56} 7%|▋ | 3614/50750 [10:08:45<77:29:03, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:51:29,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:51:29,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.27 | bwd_microstep: 3846.62 | bwd_inner_microstep: 3839.09 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.12 [2024-11-14 02:51:29,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.27 | bwd: 3846.64 | bwd_inner: 3839.09 | bwd_allreduce: 7.51 | step: 21.12 7%|▋ | 3615/50750 [10:08:51<77:29:36, 5.92s/it] {'loss': 0.004, 'learning_rate': 3.9822020008468365e-05, 'epoch': 3.56} 7%|▋ | 3615/50750 [10:08:51<77:29:36, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:51:35,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.93 [2024-11-14 02:51:35,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.35 | bwd_microstep: 3848.65 | bwd_inner_microstep: 3841.12 | bwd_allreduce_microstep: 7.48 | step_microstep: 22.85 [2024-11-14 02:51:35,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.35 | bwd: 3848.66 | bwd_inner: 3841.12 | bwd_allreduce: 7.50 | step: 22.85 7%|▋ | 3616/50750 [10:08:57<77:30:35, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9821850067962836e-05, 'epoch': 3.56} 7%|▋ | 3616/50750 [10:08:57<77:30:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:51:41,089] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 02:51:41,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.01 | bwd_microstep: 3847.09 | bwd_inner_microstep: 3839.61 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.39 [2024-11-14 02:51:41,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.01 | bwd: 3847.11 | bwd_inner: 3839.61 | bwd_allreduce: 7.46 | step: 21.40 7%|▋ | 3617/50750 [10:09:03<77:31:07, 5.92s/it] {'loss': 0.0012, 'learning_rate': 3.982168004672689e-05, 'epoch': 3.56} 7%|▋ | 3617/50750 [10:09:03<77:31:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:51:47,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:51:47,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.78 | bwd_microstep: 3846.96 | bwd_inner_microstep: 3839.44 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.16 [2024-11-14 02:51:47,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.79 | bwd: 3846.97 | bwd_inner: 3839.44 | bwd_allreduce: 7.49 | step: 21.16 7%|▋ | 3618/50750 [10:09:08<77:30:38, 5.92s/it] {'loss': 0.0878, 'learning_rate': 3.982150994476122e-05, 'epoch': 3.56} 7%|▋ | 3618/50750 [10:09:08<77:30:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:51:52,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:51:52,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.85 | bwd_microstep: 3845.73 | bwd_inner_microstep: 3838.21 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.15 [2024-11-14 02:51:52,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.85 | 
bwd: 3845.74 | bwd_inner: 3838.21 | bwd_allreduce: 7.49 | step: 21.15 7%|▋ | 3619/50750 [10:09:14<77:29:42, 5.92s/it] {'loss': 0.0319, 'learning_rate': 3.982133976206652e-05, 'epoch': 3.57} 7%|▋ | 3619/50750 [10:09:14<77:29:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:51:58,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:51:58,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.51 | bwd_microstep: 3846.77 | bwd_inner_microstep: 3839.30 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.95 [2024-11-14 02:51:58,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.51 | bwd: 3846.79 | bwd_inner: 3839.30 | bwd_allreduce: 7.45 | step: 20.96 7%|▋ | 3620/50750 [10:09:20<77:29:39, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.982116949864348e-05, 'epoch': 3.57} 7%|▋ | 3620/50750 [10:09:20<77:29:39, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:52:04,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:52:04,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.81 | bwd_microstep: 3846.96 | bwd_inner_microstep: 3839.46 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.93 [2024-11-14 02:52:04,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.81 | bwd: 3846.97 | bwd_inner: 3839.47 | bwd_allreduce: 7.47 | step: 20.94 7%|▋ | 3621/50750 [10:09:26<77:29:00, 5.92s/it] {'loss': 0.6925, 'learning_rate': 3.98209991544928e-05, 'epoch': 3.57} 7%|▋ | 3621/50750 [10:09:26<77:29:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:52:10,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:52:10,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.04 | bwd_microstep: 3844.66 | bwd_inner_microstep: 3837.00 | bwd_allreduce_microstep: 7.61 | step_microstep: 21.43 [2024-11-14 02:52:10,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.04 | bwd: 3844.68 | bwd_inner: 3837.00 | bwd_allreduce: 7.63 | step: 21.43 7%|▋ | 3622/50750 [10:09:32<77:28:20, 5.92s/it] {'loss': 0.1916, 'learning_rate': 3.9820828729615166e-05, 'epoch': 3.57} 7%|▋ | 3622/50750 [10:09:32<77:28:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:52:16,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:52:16,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.40 | bwd_microstep: 3845.86 | bwd_inner_microstep: 3838.39 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.97 [2024-11-14 02:52:16,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.40 | bwd: 3845.87 | bwd_inner: 3838.39 | bwd_allreduce: 7.44 | step: 20.97 7%|▋ | 3623/50750 [10:09:38<77:28:20, 5.92s/it] {'loss': 0.0019, 'learning_rate': 3.9820658224011275e-05, 'epoch': 3.57} 7%|▋ | 3623/50750 [10:09:38<77:28:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:52:22,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:52:22,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.95 | bwd_microstep: 3845.66 | bwd_inner_microstep: 3838.20 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.93 [2024-11-14 02:52:22,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.95 | bwd: 3845.67 | bwd_inner: 3838.20 | bwd_allreduce: 7.44 | step: 20.94 7%|▋ | 
3624/50750 [10:09:44<77:28:24, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.982048763768182e-05, 'epoch': 3.57} 7%|▋ | 3624/50750 [10:09:44<77:28:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:52:28,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:52:28,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.88 | bwd_microstep: 3841.84 | bwd_inner_microstep: 3834.38 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.94 [2024-11-14 02:52:28,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.88 | bwd: 3841.85 | bwd_inner: 3834.38 | bwd_allreduce: 7.44 | step: 20.94 7%|▋ | 3625/50750 [10:09:50<77:27:10, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.98203169706275e-05, 'epoch': 3.57} 7%|▋ | 3625/50750 [10:09:50<77:27:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:52:34,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:52:34,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.09 | bwd_microstep: 3844.85 | bwd_inner_microstep: 3837.37 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.85 [2024-11-14 02:52:34,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.09 | bwd: 3844.87 | bwd_inner: 3837.37 | bwd_allreduce: 7.46 | step: 20.86 7%|▋ | 3626/50750 [10:09:56<77:26:34, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.982014622284901e-05, 'epoch': 3.57} 7%|▋ | 3626/50750 [10:09:56<77:26:34, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:52:40,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:52:40,265] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.62 | bwd_microstep: 3847.63 | bwd_inner_microstep: 3840.15 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 [2024-11-14 02:52:40,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.62 | bwd: 3847.65 | bwd_inner: 3840.15 | bwd_allreduce: 7.45 | step: 20.87 7%|▋ | 3627/50750 [10:10:02<77:27:13, 5.92s/it] {'loss': 0.017, 'learning_rate': 3.981997539434703e-05, 'epoch': 3.57} 7%|▋ | 3627/50750 [10:10:02<77:27:13, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:52:46,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:52:46,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3844.74 | bwd_inner_microstep: 3837.27 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.37 [2024-11-14 02:52:46,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.24 | bwd: 3844.75 | bwd_inner: 3837.27 | bwd_allreduce: 7.45 | step: 21.37 7%|▋ | 3628/50750 [10:10:08<77:26:44, 5.92s/it] {'loss': 0.002, 'learning_rate': 3.981980448512228e-05, 'epoch': 3.57} 7%|▋ | 3628/50750 [10:10:08<77:26:44, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:52:52,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:52:52,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.32 | bwd_microstep: 3840.96 | bwd_inner_microstep: 3833.38 | bwd_allreduce_microstep: 7.53 | step_microstep: 20.97 [2024-11-14 02:52:52,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.32 | bwd: 3840.97 | bwd_inner: 3833.38 | bwd_allreduce: 7.54 | step: 20.97 7%|▋ | 3629/50750 [10:10:14<77:26:28, 5.92s/it] {'loss': 0.0015, 'learning_rate': 
3.9819633495175445e-05, 'epoch': 3.58} 7%|▋ | 3629/50750 [10:10:14<77:26:28, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 02:52:58,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:52:58,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.37 | bwd_microstep: 3845.31 | bwd_inner_microstep: 3837.85 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.95 [2024-11-14 02:52:58,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.37 | bwd: 3845.33 | bwd_inner: 3837.85 | bwd_allreduce: 7.44 | step: 20.95 7%|▋ | 3630/50750 [10:10:19<77:26:53, 5.92s/it] {'loss': 0.1274, 'learning_rate': 3.981946242450722e-05, 'epoch': 3.58} 7%|▋ | 3630/50750 [10:10:19<77:26:53, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:53:03,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 5.06 [2024-11-14 02:53:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.84 | bwd_microstep: 3842.99 | bwd_inner_microstep: 3835.53 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.94 [2024-11-14 02:53:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.84 | bwd: 3843.01 | bwd_inner: 3835.53 | bwd_allreduce: 7.44 | step: 20.94 7%|▋ | 3631/50750 [10:10:25<77:26:12, 5.92s/it] {'loss': 0.0016, 'learning_rate': 3.981929127311831e-05, 'epoch': 3.58} 7%|▋ | 3631/50750 [10:10:25<77:26:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:53:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:53:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.06 | bwd_microstep: 
3841.53 | bwd_inner_microstep: 3834.07 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.89 [2024-11-14 02:53:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.05 | bwd: 3841.54 | bwd_inner: 3834.07 | bwd_allreduce: 7.43 | step: 20.89 7%|▋ | 3632/50750 [10:10:31<77:24:45, 5.91s/it] {'loss': 0.0326, 'learning_rate': 3.981912004100939e-05, 'epoch': 3.58} 7%|▋ | 3632/50750 [10:10:31<77:24:45, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:53:15,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:53:15,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.45 | bwd_microstep: 3843.47 | bwd_inner_microstep: 3835.99 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.02 [2024-11-14 02:53:15,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.45 | bwd: 3843.48 | bwd_inner: 3835.99 | bwd_allreduce: 7.45 | step: 21.02 7%|▋ | 3633/50750 [10:10:37<77:24:33, 5.91s/it] {'loss': 0.0003, 'learning_rate': 3.9818948728181184e-05, 'epoch': 3.58} 7%|▋ | 3633/50750 [10:10:37<77:24:33, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:53:21,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:53:21,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.52 | bwd_microstep: 3841.50 | bwd_inner_microstep: 3834.02 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.96 [2024-11-14 02:53:21,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.49 | bwd: 3841.51 | bwd_inner: 3834.02 | bwd_allreduce: 7.45 | step: 20.97 7%|▋ | 3634/50750 [10:10:43<77:24:11, 5.91s/it] {'loss': 0.0651, 'learning_rate': 3.9818777334634375e-05, 'epoch': 3.58} 7%|▋ | 3634/50750 [10:10:43<77:24:11, 5.91s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:53:27,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:53:27,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.07 | bwd_microstep: 3842.56 | bwd_inner_microstep: 3835.10 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.89 [2024-11-14 02:53:27,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.06 | bwd: 3842.57 | bwd_inner: 3835.10 | bwd_allreduce: 7.44 | step: 20.89 7%|▋ | 3635/50750 [10:10:49<77:24:28, 5.91s/it] {'loss': 0.041, 'learning_rate': 3.981860586036966e-05, 'epoch': 3.58} 7%|▋ | 3635/50750 [10:10:49<77:24:28, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:53:33,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:53:33,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.82 | bwd_microstep: 3843.63 | bwd_inner_microstep: 3836.18 | bwd_allreduce_microstep: 7.40 | step_microstep: 21.04 [2024-11-14 02:53:33,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.82 | bwd: 3843.64 | bwd_inner: 3836.18 | bwd_allreduce: 7.42 | step: 21.05 7%|▋ | 3636/50750 [10:10:55<77:24:05, 5.91s/it] {'loss': 0.0765, 'learning_rate': 3.9818434305387745e-05, 'epoch': 3.58} 7%|▋ | 3636/50750 [10:10:55<77:24:05, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:53:39,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:53:39,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.26 | bwd_microstep: 3842.17 | bwd_inner_microstep: 3834.69 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.87 
[2024-11-14 02:53:39,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.26 | bwd: 3842.18 | bwd_inner: 3834.69 | bwd_allreduce: 7.46 | step: 20.87 7%|▋ | 3637/50750 [10:11:01<77:23:28, 5.91s/it] {'loss': 0.3429, 'learning_rate': 3.981826266968933e-05, 'epoch': 3.58} 7%|▋ | 3637/50750 [10:11:01<77:23:28, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:53:45,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:53:45,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.61 | bwd_microstep: 3844.04 | bwd_inner_microstep: 3836.58 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.90 [2024-11-14 02:53:45,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.61 | bwd: 3844.06 | bwd_inner: 3836.58 | bwd_allreduce: 7.44 | step: 20.91 7%|▋ | 3638/50750 [10:11:07<77:23:35, 5.91s/it] {'loss': 0.0012, 'learning_rate': 3.98180909532751e-05, 'epoch': 3.58} 7%|▋ | 3638/50750 [10:11:07<77:23:35, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:53:51,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:53:51,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.57 | bwd_microstep: 3841.36 | bwd_inner_microstep: 3833.84 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.94 [2024-11-14 02:53:51,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.57 | bwd: 3841.37 | bwd_inner: 3833.84 | bwd_allreduce: 7.49 | step: 20.94 7%|▋ | 3639/50750 [10:11:13<77:22:56, 5.91s/it] {'loss': 0.0016, 'learning_rate': 3.981791915614577e-05, 'epoch': 3.59} 7%|▋ | 3639/50750 [10:11:13<77:22:56, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:53:57,160] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:53:57,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.56 | bwd_microstep: 3846.90 | bwd_inner_microstep: 3839.33 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.37 [2024-11-14 02:53:57,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.56 | bwd: 3846.92 | bwd_inner: 3839.33 | bwd_allreduce: 7.54 | step: 21.37 7%|▋ | 3640/50750 [10:11:19<77:25:56, 5.92s/it] {'loss': 0.0157, 'learning_rate': 3.981774727830203e-05, 'epoch': 3.59} 7%|▋ | 3640/50750 [10:11:19<77:25:56, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:54:03,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:54:03,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.24 | bwd_microstep: 3849.88 | bwd_inner_microstep: 3842.31 | bwd_allreduce_microstep: 7.52 | step_microstep: 21.84 [2024-11-14 02:54:03,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.24 | bwd: 3849.90 | bwd_inner: 3842.32 | bwd_allreduce: 7.54 | step: 21.84 7%|▋ | 3641/50750 [10:11:25<77:27:30, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.9817575319744585e-05, 'epoch': 3.59} 7%|▋ | 3641/50750 [10:11:25<77:27:30, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:54:09,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:54:09,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.08 | bwd_microstep: 3849.10 | bwd_inner_microstep: 3841.61 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.30 [2024-11-14 02:54:09,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.08 | 
bwd: 3849.12 | bwd_inner: 3841.61 | bwd_allreduce: 7.47 | step: 21.30 7%|▋ | 3642/50750 [10:11:30<77:28:10, 5.92s/it] {'loss': 0.0017, 'learning_rate': 3.981740328047414e-05, 'epoch': 3.59} 7%|▋ | 3642/50750 [10:11:30<77:28:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:54:14,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:54:14,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.70 | bwd_microstep: 3842.53 | bwd_inner_microstep: 3834.99 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.17 [2024-11-14 02:54:14,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.70 | bwd: 3842.54 | bwd_inner: 3834.99 | bwd_allreduce: 7.51 | step: 21.18 7%|▋ | 3643/50750 [10:11:36<77:26:51, 5.92s/it] {'loss': 0.0054, 'learning_rate': 3.9817231160491385e-05, 'epoch': 3.59} 7%|▋ | 3643/50750 [10:11:36<77:26:51, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:54:20,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 02:54:20,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.53 | bwd_microstep: 3841.09 | bwd_inner_microstep: 3833.58 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.02 [2024-11-14 02:54:20,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.53 | bwd: 3841.10 | bwd_inner: 3833.58 | bwd_allreduce: 7.48 | step: 21.02 7%|▋ | 3644/50750 [10:11:42<77:25:41, 5.92s/it] {'loss': 0.1428, 'learning_rate': 3.981705895979702e-05, 'epoch': 3.59} 7%|▋ | 3644/50750 [10:11:42<77:25:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:54:26,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:54:26,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.60 | bwd_microstep: 3835.36 | bwd_inner_microstep: 3827.93 | bwd_allreduce_microstep: 7.39 | step_microstep: 20.64 [2024-11-14 02:54:26,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.60 | bwd: 3835.38 | bwd_inner: 3827.93 | bwd_allreduce: 7.41 | step: 20.64 7%|▋ | 3645/50750 [10:11:48<77:22:44, 5.91s/it] {'loss': 0.0002, 'learning_rate': 3.9816886678391754e-05, 'epoch': 3.59} 7%|▋ | 3645/50750 [10:11:48<77:22:44, 5.91s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:54:32,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:54:32,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.14 | bwd_microstep: 3850.62 | bwd_inner_microstep: 3842.98 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.19 [2024-11-14 02:54:32,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.12 | bwd: 3850.63 | bwd_inner: 3842.98 | bwd_allreduce: 7.62 | step: 21.19 7%|▋ | 3646/50750 [10:11:54<77:24:41, 5.92s/it] {'loss': 0.0003, 'learning_rate': 3.981671431627629e-05, 'epoch': 3.59} 7%|▋ | 3646/50750 [10:11:54<77:24:41, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:54:38,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:54:38,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.38 | bwd_microstep: 3843.08 | bwd_inner_microstep: 3835.56 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-14 02:54:38,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.38 | bwd: 3843.10 | bwd_inner: 3835.56 | bwd_allreduce: 7.50 | step: 21.07 7%|▋ | 
3647/50750 [10:12:00<77:25:45, 5.92s/it] {'loss': 0.7713, 'learning_rate': 3.981654187345132e-05, 'epoch': 3.59} 7%|▋ | 3647/50750 [10:12:00<77:25:45, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:54:44,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:54:44,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.93 | bwd_microstep: 3843.86 | bwd_inner_microstep: 3836.40 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.67 [2024-11-14 02:54:44,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.93 | bwd: 3843.87 | bwd_inner: 3836.40 | bwd_allreduce: 7.43 | step: 20.67 7%|▋ | 3648/50750 [10:12:06<77:25:17, 5.92s/it] {'loss': 0.2141, 'learning_rate': 3.981636934991756e-05, 'epoch': 3.59} 7%|▋ | 3648/50750 [10:12:06<77:25:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:54:50,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 02:54:50,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.28 | bwd_microstep: 3844.45 | bwd_inner_microstep: 3836.82 | bwd_allreduce_microstep: 7.58 | step_microstep: 21.41 [2024-11-14 02:54:50,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.28 | bwd: 3844.46 | bwd_inner: 3836.82 | bwd_allreduce: 7.60 | step: 21.41 7%|▋ | 3649/50750 [10:12:12<77:25:26, 5.92s/it] {'loss': 0.0018, 'learning_rate': 3.9816196745675707e-05, 'epoch': 3.6} 7%|▋ | 3649/50750 [10:12:12<77:25:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:54:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.92 [2024-11-14 02:54:56,336] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.64 | bwd_microstep: 3841.60 | bwd_inner_microstep: 3834.14 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.88 [2024-11-14 02:54:56,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.64 | bwd: 3841.61 | bwd_inner: 3834.14 | bwd_allreduce: 7.43 | step: 20.88 7%|▋ | 3650/50750 [10:12:18<77:24:37, 5.92s/it] {'loss': 0.1221, 'learning_rate': 3.9816024060726454e-05, 'epoch': 3.6} 7%|▋ | 3650/50750 [10:12:18<77:24:37, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:55:02,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:55:02,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.70 | bwd_microstep: 3848.48 | bwd_inner_microstep: 3841.03 | bwd_allreduce_microstep: 7.41 | step_microstep: 20.73 [2024-11-14 02:55:02,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.70 | bwd: 3848.49 | bwd_inner: 3841.03 | bwd_allreduce: 7.42 | step: 20.73 7%|▋ | 3651/50750 [10:12:24<77:25:38, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.9815851295070526e-05, 'epoch': 3.6} 7%|▋ | 3651/50750 [10:12:24<77:25:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:55:08,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:55:08,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.08 | bwd_microstep: 3849.21 | bwd_inner_microstep: 3841.74 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.75 [2024-11-14 02:55:08,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.08 | bwd: 3849.22 | bwd_inner: 3841.74 | bwd_allreduce: 7.45 | step: 20.75 7%|▋ | 3652/50750 [10:12:30<77:27:00, 5.92s/it] {'loss': 0.0005, 'learning_rate': 
3.98156784487086e-05, 'epoch': 3.6} 7%|▋ | 3652/50750 [10:12:30<77:27:00, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:55:14,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:55:14,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.30 | bwd_microstep: 3847.34 | bwd_inner_microstep: 3839.85 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.86 [2024-11-14 02:55:14,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.30 | bwd: 3847.35 | bwd_inner: 3839.85 | bwd_allreduce: 7.46 | step: 20.87 7%|▋ | 3653/50750 [10:12:36<77:26:59, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.98155055216414e-05, 'epoch': 3.6} 7%|▋ | 3653/50750 [10:12:36<77:26:59, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:55:20,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.92 [2024-11-14 02:55:20,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.26 | bwd_microstep: 3837.39 | bwd_inner_microstep: 3829.91 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.74 [2024-11-14 02:55:20,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.26 | bwd: 3837.40 | bwd_inner: 3829.91 | bwd_allreduce: 7.44 | step: 20.74 7%|▋ | 3654/50750 [10:12:41<77:24:40, 5.92s/it] {'loss': 0.0006, 'learning_rate': 3.981533251386962e-05, 'epoch': 3.6} 7%|▋ | 3654/50750 [10:12:41<77:24:40, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2196 [2024-11-14 02:55:25,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:55:25,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.85 | bwd_microstep: 
3845.83 | bwd_inner_microstep: 3838.36 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.75 [2024-11-14 02:55:25,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.85 | bwd: 3845.84 | bwd_inner: 3838.36 | bwd_allreduce: 7.45 | step: 20.75 7%|▋ | 3655/50750 [10:12:47<77:24:49, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.981515942539397e-05, 'epoch': 3.6} 7%|▋ | 3655/50750 [10:12:47<77:24:49, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:55:31,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:55:31,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.47 | bwd_microstep: 3847.92 | bwd_inner_microstep: 3840.45 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.87 [2024-11-14 02:55:31,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.47 | bwd: 3847.93 | bwd_inner: 3840.45 | bwd_allreduce: 7.44 | step: 20.88 7%|▋ | 3656/50750 [10:12:53<77:25:29, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.981498625621515e-05, 'epoch': 3.6} 7%|▋ | 3656/50750 [10:12:53<77:25:29, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:55:37,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:55:37,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.61 | bwd_microstep: 3844.36 | bwd_inner_microstep: 3836.88 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-14 02:55:37,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.60 | bwd: 3844.37 | bwd_inner: 3836.88 | bwd_allreduce: 7.45 | step: 20.86 7%|▋ | 3657/50750 [10:12:59<77:24:46, 5.92s/it] {'loss': 0.005, 'learning_rate': 3.9814813006333874e-05, 'epoch': 3.6} 7%|▋ | 3657/50750 [10:12:59<77:24:46, 5.92s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:55:43,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:55:43,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.16 | bwd_microstep: 3851.22 | bwd_inner_microstep: 3843.75 | bwd_allreduce_microstep: 7.43 | step_microstep: 20.60 [2024-11-14 02:55:43,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.16 | bwd: 3851.23 | bwd_inner: 3843.75 | bwd_allreduce: 7.45 | step: 20.60 7%|▋ | 3658/50750 [10:13:05<77:25:27, 5.92s/it] {'loss': 0.0015, 'learning_rate': 3.981463967575084e-05, 'epoch': 3.6} 7%|▋ | 3658/50750 [10:13:05<77:25:27, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:55:49,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.09 | optimizer_step: 4.93 [2024-11-14 02:55:49,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.75 | bwd_microstep: 3851.29 | bwd_inner_microstep: 3843.75 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.08 [2024-11-14 02:55:49,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.75 | bwd: 3851.30 | bwd_inner: 3843.76 | bwd_allreduce: 7.51 | step: 21.08 7%|▋ | 3659/50750 [10:13:11<77:28:38, 5.92s/it] {'loss': 0.035, 'learning_rate': 3.9814466264466756e-05, 'epoch': 3.6} 7%|▋ | 3659/50750 [10:13:11<77:28:38, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:55:55,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:55:55,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.74 | bwd_microstep: 3849.60 | bwd_inner_microstep: 3841.97 | bwd_allreduce_microstep: 7.59 | step_microstep: 20.92 
[2024-11-14 02:55:55,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.73 | bwd: 3849.61 | bwd_inner: 3841.97 | bwd_allreduce: 7.61 | step: 20.93 7%|▋ | 3660/50750 [10:13:17<77:29:01, 5.92s/it] {'loss': 0.2142, 'learning_rate': 3.981429277248233e-05, 'epoch': 3.61} 7%|▋ | 3660/50750 [10:13:17<77:29:01, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:56:01,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:56:01,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.86 | bwd_microstep: 3841.52 | bwd_inner_microstep: 3834.06 | bwd_allreduce_microstep: 7.42 | step_microstep: 20.87 [2024-11-14 02:56:01,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.86 | bwd: 3841.53 | bwd_inner: 3834.06 | bwd_allreduce: 7.43 | step: 20.87 7%|▋ | 3661/50750 [10:13:23<77:27:21, 5.92s/it] {'loss': 0.0013, 'learning_rate': 3.981411919979826e-05, 'epoch': 3.61} 7%|▋ | 3661/50750 [10:13:23<77:27:21, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:56:07,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:56:07,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.97 | bwd_microstep: 3849.81 | bwd_inner_microstep: 3842.33 | bwd_allreduce_microstep: 7.44 | step_microstep: 20.86 [2024-11-14 02:56:07,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.97 | bwd: 3849.82 | bwd_inner: 3842.33 | bwd_allreduce: 7.45 | step: 20.86 7%|▋ | 3662/50750 [10:13:29<77:26:55, 5.92s/it] {'loss': 0.0068, 'learning_rate': 3.981394554641527e-05, 'epoch': 3.61} 7%|▋ | 3662/50750 [10:13:29<77:26:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2203 [2024-11-14 02:56:13,311] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:56:13,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.44 | bwd_microstep: 3850.51 | bwd_inner_microstep: 3842.20 | bwd_allreduce_microstep: 8.25 | step_microstep: 21.39 [2024-11-14 02:56:13,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.44 | bwd: 3850.52 | bwd_inner: 3842.20 | bwd_allreduce: 8.27 | step: 21.39 7%|▋ | 3663/50750 [10:13:35<77:28:32, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.9813771812334055e-05, 'epoch': 3.61} 7%|▋ | 3663/50750 [10:13:35<77:28:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:56:19,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.16 | optimizer_step: 4.93 [2024-11-14 02:56:19,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.44 | bwd_microstep: 3844.76 | bwd_inner_microstep: 3837.23 | bwd_allreduce_microstep: 7.50 | step_microstep: 22.26 [2024-11-14 02:56:19,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.44 | bwd: 3844.78 | bwd_inner: 3837.23 | bwd_allreduce: 7.51 | step: 22.27 7%|▋ | 3664/50750 [10:13:41<77:29:52, 5.93s/it] {'loss': 0.0005, 'learning_rate': 3.9813597997555316e-05, 'epoch': 3.61} 7%|▋ | 3664/50750 [10:13:41<77:29:52, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:56:25,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:56:25,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.36 | bwd_microstep: 3843.79 | bwd_inner_microstep: 3836.27 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.11 [2024-11-14 02:56:25,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.34 | 
bwd: 3843.81 | bwd_inner: 3836.27 | bwd_allreduce: 7.50 | step: 21.12 7%|▋ | 3665/50750 [10:13:47<77:27:47, 5.92s/it] {'loss': 0.0001, 'learning_rate': 3.9813424102079775e-05, 'epoch': 3.61} 7%|▋ | 3665/50750 [10:13:47<77:27:47, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2194 [2024-11-14 02:56:31,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:56:31,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2021.16 | bwd_microstep: 3843.43 | bwd_inner_microstep: 3835.90 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.21 [2024-11-14 02:56:31,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2021.16 | bwd: 3843.44 | bwd_inner: 3835.90 | bwd_allreduce: 7.50 | step: 21.21 7%|▋ | 3666/50750 [10:13:53<77:25:55, 5.92s/it] {'loss': 0.0, 'learning_rate': 3.981325012590814e-05, 'epoch': 3.61} 7%|▋ | 3666/50750 [10:13:53<77:25:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:56:36,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:56:36,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.19 | bwd_microstep: 3848.02 | bwd_inner_microstep: 3840.47 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.07 [2024-11-14 02:56:36,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.19 | bwd: 3848.03 | bwd_inner: 3840.47 | bwd_allreduce: 7.52 | step: 21.08 7%|▋ | 3667/50750 [10:13:58<77:26:08, 5.92s/it] {'loss': 0.0023, 'learning_rate': 3.981307606904111e-05, 'epoch': 3.61} 7%|▋ | 3667/50750 [10:13:58<77:26:08, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:56:42,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:56:42,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.85 | bwd_microstep: 3844.33 | bwd_inner_microstep: 3836.81 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.07 [2024-11-14 02:56:42,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.86 | bwd: 3844.34 | bwd_inner: 3836.81 | bwd_allreduce: 7.49 | step: 21.07 7%|▋ | 3668/50750 [10:14:04<77:25:20, 5.92s/it] {'loss': 0.0054, 'learning_rate': 3.98129019314794e-05, 'epoch': 3.61} 7%|▋ | 3668/50750 [10:14:04<77:25:20, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:56:48,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.12 | optimizer_step: 4.93 [2024-11-14 02:56:48,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.94 | bwd_microstep: 3850.14 | bwd_inner_microstep: 3842.51 | bwd_allreduce_microstep: 7.59 | step_microstep: 21.51 [2024-11-14 02:56:48,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.94 | bwd: 3850.16 | bwd_inner: 3842.51 | bwd_allreduce: 7.61 | step: 21.51 7%|▋ | 3669/50750 [10:14:10<77:27:05, 5.92s/it] {'loss': 0.0011, 'learning_rate': 3.981272771322372e-05, 'epoch': 3.61} 7%|▋ | 3669/50750 [10:14:10<77:27:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:56:54,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93 [2024-11-14 02:56:54,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.07 | bwd_microstep: 3848.89 | bwd_inner_microstep: 3841.36 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.12 [2024-11-14 02:56:54,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.07 | bwd: 3848.91 | bwd_inner: 3841.36 | bwd_allreduce: 7.50 | step: 21.13 7%|▋ | 
3670/50750 [10:14:16<77:27:42, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9812553414274765e-05, 'epoch': 3.62} 7%|▋ | 3670/50750 [10:14:16<77:27:42, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:57:00,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:57:00,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.06 | bwd_microstep: 3851.64 | bwd_inner_microstep: 3844.13 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.07 [2024-11-14 02:57:00,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.06 | bwd: 3851.65 | bwd_inner: 3844.13 | bwd_allreduce: 7.48 | step: 21.07 7%|▋ | 3671/50750 [10:14:22<77:27:35, 5.92s/it] {'loss': 0.0115, 'learning_rate': 3.981237903463326e-05, 'epoch': 3.62} 7%|▋ | 3671/50750 [10:14:22<77:27:35, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:57:06,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:57:06,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.69 | bwd_microstep: 3844.63 | bwd_inner_microstep: 3836.99 | bwd_allreduce_microstep: 7.60 | step_microstep: 21.86 [2024-11-14 02:57:06,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.69 | bwd: 3844.65 | bwd_inner: 3836.99 | bwd_allreduce: 7.61 | step: 21.87 7%|▋ | 3672/50750 [10:14:28<77:27:55, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.981220457429992e-05, 'epoch': 3.62} 7%|▋ | 3672/50750 [10:14:28<77:27:55, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:57:12,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:57:12,546] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.71 | bwd_microstep: 3857.12 | bwd_inner_microstep: 3849.44 | bwd_allreduce_microstep: 7.64 | step_microstep: 20.98 [2024-11-14 02:57:12,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3857.13 | bwd_inner: 3849.44 | bwd_allreduce: 7.66 | step: 20.98 7%|▋ | 3673/50750 [10:14:34<77:29:45, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.9812030033275445e-05, 'epoch': 3.62} 7%|▋ | 3673/50750 [10:14:34<77:29:45, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:57:18,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 02:57:18,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.50 | bwd_microstep: 3851.84 | bwd_inner_microstep: 3844.31 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.29 [2024-11-14 02:57:18,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.50 | bwd: 3851.86 | bwd_inner: 3844.31 | bwd_allreduce: 7.50 | step: 21.30 7%|▋ | 3674/50750 [10:14:40<77:29:34, 5.93s/it] {'loss': 0.0038, 'learning_rate': 3.981185541156055e-05, 'epoch': 3.62} 7%|▋ | 3674/50750 [10:14:40<77:29:34, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:57:24,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.93 [2024-11-14 02:57:24,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.51 | bwd_microstep: 3848.62 | bwd_inner_microstep: 3841.05 | bwd_allreduce_microstep: 7.53 | step_microstep: 21.57 [2024-11-14 02:57:24,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.51 | bwd: 3848.64 | bwd_inner: 3841.05 | bwd_allreduce: 7.55 | step: 21.57 7%|▋ | 3675/50750 [10:14:46<77:29:06, 5.93s/it] {'loss': 0.1892, 'learning_rate': 
3.981168070915594e-05, 'epoch': 3.62} 7%|▋ | 3675/50750 [10:14:46<77:29:06, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:57:30,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.92 [2024-11-14 02:57:30,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.75 | bwd_microstep: 3846.41 | bwd_inner_microstep: 3838.69 | bwd_allreduce_microstep: 7.66 | step_microstep: 21.34 [2024-11-14 02:57:30,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.75 | bwd: 3846.42 | bwd_inner: 3838.69 | bwd_allreduce: 7.69 | step: 21.33 7%|▋ | 3676/50750 [10:14:52<77:27:11, 5.92s/it] {'loss': 0.382, 'learning_rate': 3.981150592606234e-05, 'epoch': 3.62} 7%|▋ | 3676/50750 [10:14:52<77:27:11, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:57:36,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:57:36,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.09 | bwd_microstep: 3845.64 | bwd_inner_microstep: 3838.14 | bwd_allreduce_microstep: 7.46 | step_microstep: 20.96 [2024-11-14 02:57:36,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.09 | bwd: 3845.66 | bwd_inner: 3838.14 | bwd_allreduce: 7.48 | step: 20.96 7%|▋ | 3677/50750 [10:14:58<77:25:10, 5.92s/it] {'loss': 0.0005, 'learning_rate': 3.981133106228044e-05, 'epoch': 3.62} 7%|▋ | 3677/50750 [10:14:58<77:25:10, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:57:42,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:57:42,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.46 | bwd_microstep: 
3846.68 | bwd_inner_microstep: 3839.15 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.13 [2024-11-14 02:57:42,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.46 | bwd: 3846.69 | bwd_inner: 3839.15 | bwd_allreduce: 7.50 | step: 21.13 7%|▋ | 3678/50750 [10:15:04<77:24:18, 5.92s/it] {'loss': 0.0077, 'learning_rate': 3.981115611781098e-05, 'epoch': 3.62} 7%|▋ | 3678/50750 [10:15:04<77:24:18, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:57:48,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.11 | optimizer_step: 4.92 [2024-11-14 02:57:48,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.99 | bwd_microstep: 3854.34 | bwd_inner_microstep: 3846.60 | bwd_allreduce_microstep: 7.69 | step_microstep: 21.58 [2024-11-14 02:57:48,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.99 | bwd: 3854.35 | bwd_inner: 3846.60 | bwd_allreduce: 7.71 | step: 21.59 7%|▋ | 3679/50750 [10:15:10<77:27:05, 5.92s/it] {'loss': 0.0206, 'learning_rate': 3.981098109265465e-05, 'epoch': 3.62} 7%|▋ | 3679/50750 [10:15:10<77:27:05, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:57:54,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:57:54,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2022.39 | bwd_microstep: 3852.06 | bwd_inner_microstep: 3844.55 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.16 [2024-11-14 02:57:54,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2022.39 | bwd: 3852.08 | bwd_inner: 3844.55 | bwd_allreduce: 7.49 | step: 21.16 7%|▋ | 3680/50750 [10:15:15<77:26:30, 5.92s/it] {'loss': 0.0057, 'learning_rate': 3.9810805986812176e-05, 'epoch': 3.63} 7%|▋ | 3680/50750 [10:15:15<77:26:30, 5.92s/it]dynamic ViT 
batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:57:59,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:57:59,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.79 | bwd_microstep: 3842.99 | bwd_inner_microstep: 3835.49 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.00 [2024-11-14 02:57:59,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.79 | bwd: 3843.00 | bwd_inner: 3835.49 | bwd_allreduce: 7.48 | step: 21.00 7%|▋ | 3681/50750 [10:15:21<77:24:17, 5.92s/it] {'loss': 0.0218, 'learning_rate': 3.981063080028426e-05, 'epoch': 3.63} 7%|▋ | 3681/50750 [10:15:21<77:24:17, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:58:05,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:58:05,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.71 | bwd_microstep: 3851.56 | bwd_inner_microstep: 3844.06 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.02 [2024-11-14 02:58:05,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.71 | bwd: 3851.58 | bwd_inner: 3844.06 | bwd_allreduce: 7.47 | step: 21.02 7%|▋ | 3682/50750 [10:15:27<77:24:32, 5.92s/it] {'loss': 0.0008, 'learning_rate': 3.981045553307162e-05, 'epoch': 3.63} 7%|▋ | 3682/50750 [10:15:27<77:24:32, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:58:11,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.92 [2024-11-14 02:58:11,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.27 | bwd_microstep: 3854.09 | bwd_inner_microstep: 3846.58 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.05 
[2024-11-14 02:58:11,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.27 | bwd: 3854.11 | bwd_inner: 3846.58 | bwd_allreduce: 7.49 | step: 21.05 7%|▋ | 3683/50750 [10:15:33<77:25:12, 5.92s/it] {'loss': 0.004, 'learning_rate': 3.9810280185174974e-05, 'epoch': 3.63} 7%|▋ | 3683/50750 [10:15:33<77:25:12, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:58:17,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:58:17,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.15 | bwd_microstep: 3849.53 | bwd_inner_microstep: 3842.02 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.16 [2024-11-14 02:58:17,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.15 | bwd: 3849.55 | bwd_inner: 3842.02 | bwd_allreduce: 7.49 | step: 21.16 7%|▋ | 3684/50750 [10:15:39<77:25:02, 5.92s/it] {'loss': 0.0002, 'learning_rate': 3.9810104756595034e-05, 'epoch': 3.63} 7%|▋ | 3684/50750 [10:15:39<77:25:02, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:58:23,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:58:23,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.13 | bwd_microstep: 3848.88 | bwd_inner_microstep: 3841.33 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.13 [2024-11-14 02:58:23,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.13 | bwd: 3848.89 | bwd_inner: 3841.33 | bwd_allreduce: 7.52 | step: 21.13 7%|▋ | 3685/50750 [10:15:45<77:24:24, 5.92s/it] {'loss': 0.0009, 'learning_rate': 3.9809929247332515e-05, 'epoch': 3.63} 7%|▋ | 3685/50750 [10:15:45<77:24:24, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:58:29,520] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.94 [2024-11-14 02:58:29,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.51 | bwd_microstep: 3847.97 | bwd_inner_microstep: 3840.45 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.11 [2024-11-14 02:58:29,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.51 | bwd: 3847.98 | bwd_inner: 3840.45 | bwd_allreduce: 7.49 | step: 21.12 7%|▋ | 3686/50750 [10:15:51<77:24:15, 5.92s/it] {'loss': 0.0031, 'learning_rate': 3.980975365738813e-05, 'epoch': 3.63} 7%|▋ | 3686/50750 [10:15:51<77:24:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:58:35,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:58:35,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.45 | bwd_microstep: 3847.12 | bwd_inner_microstep: 3839.56 | bwd_allreduce_microstep: 7.51 | step_microstep: 21.37 [2024-11-14 02:58:35,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.45 | bwd: 3847.13 | bwd_inner: 3839.56 | bwd_allreduce: 7.53 | step: 21.37 7%|▋ | 3687/50750 [10:15:57<77:24:22, 5.92s/it] {'loss': 0.374, 'learning_rate': 3.9809577986762595e-05, 'epoch': 3.63} 7%|▋ | 3687/50750 [10:15:57<77:24:22, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:58:41,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.93 [2024-11-14 02:58:41,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.12 | bwd_microstep: 3854.63 | bwd_inner_microstep: 3847.10 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.15 [2024-11-14 02:58:41,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.12 | 
bwd: 3854.65 | bwd_inner: 3847.10 | bwd_allreduce: 7.50 | step: 21.15 7%|▋ | 3688/50750 [10:16:03<77:25:26, 5.92s/it] {'loss': 0.0031, 'learning_rate': 3.9809402235456624e-05, 'epoch': 3.63} 7%|▋ | 3688/50750 [10:16:03<77:25:26, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201 [2024-11-14 02:58:47,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:58:47,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.36 | bwd_microstep: 3850.06 | bwd_inner_microstep: 3842.55 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.06 [2024-11-14 02:58:47,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.36 | bwd: 3850.07 | bwd_inner: 3842.55 | bwd_allreduce: 7.49 | step: 21.06 7%|▋ | 3689/50750 [10:16:09<77:25:33, 5.92s/it] {'loss': 0.001, 'learning_rate': 3.9809226403470933e-05, 'epoch': 3.63} 7%|▋ | 3689/50750 [10:16:09<77:25:33, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198 [2024-11-14 02:58:53,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:58:53,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.39 | bwd_microstep: 3851.11 | bwd_inner_microstep: 3843.60 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-14 02:58:53,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.39 | bwd: 3851.12 | bwd_inner: 3843.60 | bwd_allreduce: 7.48 | step: 21.09 7%|▋ | 3690/50750 [10:16:15<77:25:15, 5.92s/it] {'loss': 0.0098, 'learning_rate': 3.980905049080624e-05, 'epoch': 3.64} 7%|▋ | 3690/50750 [10:16:15<77:25:15, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:58:59,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:58:59,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.69 | bwd_microstep: 3853.74 | bwd_inner_microstep: 3846.23 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.07 [2024-11-14 02:58:59,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.69 | bwd: 3853.75 | bwd_inner: 3846.23 | bwd_allreduce: 7.48 | step: 21.07 7%|▋ | 3691/50750 [10:16:21<77:26:07, 5.92s/it] {'loss': 0.0007, 'learning_rate': 3.9808874497463256e-05, 'epoch': 3.64} 7%|▋ | 3691/50750 [10:16:21<77:26:07, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:59:05,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93 [2024-11-14 02:59:05,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2028.31 | bwd_microstep: 3855.69 | bwd_inner_microstep: 3848.21 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.07 [2024-11-14 02:59:05,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2028.31 | bwd: 3855.70 | bwd_inner: 3848.21 | bwd_allreduce: 7.45 | step: 21.08 7%|▋ | 3692/50750 [10:16:27<77:27:35, 5.93s/it] {'loss': 0.0003, 'learning_rate': 3.980869842344271e-05, 'epoch': 3.64} 7%|▋ | 3692/50750 [10:16:27<77:27:35, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197 [2024-11-14 02:59:10,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92 [2024-11-14 02:59:10,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2027.01 | bwd_microstep: 3851.45 | bwd_inner_microstep: 3843.96 | bwd_allreduce_microstep: 7.45 | step_microstep: 21.06 [2024-11-14 02:59:10,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2027.01 | bwd: 3851.46 | bwd_inner: 3843.96 | bwd_allreduce: 7.46 | step: 21.06 7%|▋ | 
3693/50750 [10:16:32<77:27:13, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.9808522268745316e-05, 'epoch': 3.64} 7%|▋ | 3693/50750 [10:16:32<77:27:13, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199 [2024-11-14 02:59:16,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93 [2024-11-14 02:59:16,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.74 | bwd_microstep: 3850.53 | bwd_inner_microstep: 3843.02 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.09 [2024-11-14 02:59:16,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.74 | bwd: 3850.54 | bwd_inner: 3843.02 | bwd_allreduce: 7.48 | step: 21.09 7%|▋ | 3694/50750 [10:16:38<77:27:41, 5.93s/it] {'loss': 0.0004, 'learning_rate': 3.980834603337178e-05, 'epoch': 3.64} 7%|▋ | 3694/50750 [10:16:38<77:27:41, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:59:22,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.06 | optimizer_step: 4.93 [2024-11-14 02:59:22,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.29 | bwd_microstep: 3848.06 | bwd_inner_microstep: 3840.58 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.21 [2024-11-14 02:59:22,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.29 | bwd: 3848.08 | bwd_inner: 3840.58 | bwd_allreduce: 7.46 | step: 21.22 7%|▋ | 3695/50750 [10:16:44<77:27:00, 5.93s/it] {'loss': 0.0489, 'learning_rate': 3.9808169717322826e-05, 'epoch': 3.64} 7%|▋ | 3695/50750 [10:16:44<77:27:00, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:59:28,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:59:28,775] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.17 | bwd_microstep: 3856.95 | bwd_inner_microstep: 3849.43 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.11 [2024-11-14 02:59:28,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.17 | bwd: 3856.96 | bwd_inner: 3849.43 | bwd_allreduce: 7.50 | step: 21.12 7%|▋ | 3696/50750 [10:16:50<77:27:21, 5.93s/it] {'loss': 0.6882, 'learning_rate': 3.9807993320599175e-05, 'epoch': 3.64} 7%|▋ | 3696/50750 [10:16:50<77:27:21, 5.93s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:59:34,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.05 | optimizer_step: 4.92 [2024-11-14 02:59:34,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.60 | bwd_microstep: 3848.37 | bwd_inner_microstep: 3840.85 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.03 [2024-11-14 02:59:34,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.60 | bwd: 3848.39 | bwd_inner: 3840.85 | bwd_allreduce: 7.50 | step: 21.03 7%|▋ | 3697/50750 [10:16:56<77:26:06, 5.92s/it] {'loss': 0.0033, 'learning_rate': 3.980781684320154e-05, 'epoch': 3.64} 7%|▋ | 3697/50750 [10:16:56<77:26:06, 5.92s/it]dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200 [2024-11-14 02:59:40,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93 [2024-11-14 02:59:40,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.54 | bwd_microstep: 3848.70 | bwd_inner_microstep: 3841.18 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.10 [2024-11-14 02:59:40,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.54 | bwd: 3848.72 | bwd_inner: 3841.18 | bwd_allreduce: 7.49 | step: 21.10 7%|▋ | 3698/50750 [10:17:02<77:24:59, 5.92s/it] {'loss': 0.022, 'learning_rate': 
3.980764028513065e-05, 'epoch': 3.64}
7%|▋ | 3698/50750 [10:17:02<77:24:59, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2202
[2024-11-14 02:59:46,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93
[2024-11-14 02:59:46,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2031.98 | bwd_microstep: 3851.15 | bwd_inner_microstep: 3843.60 | bwd_allreduce_microstep: 7.50 | step_microstep: 21.07
[2024-11-14 02:59:46,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2031.98 | bwd: 3851.16 | bwd_inner: 3843.60 | bwd_allreduce: 7.52 | step: 21.07
7%|▋ | 3699/50750 [10:17:08<77:26:23, 5.93s/it]
{'loss': 0.0006, 'learning_rate': 3.9807463646387214e-05, 'epoch': 3.64}
7%|▋ | 3699/50750 [10:17:08<77:26:23, 5.93s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198
[2024-11-14 02:59:52,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.01 | optimizer_step: 4.92
[2024-11-14 02:59:52,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.36 | bwd_microstep: 3851.02 | bwd_inner_microstep: 3843.51 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.12
[2024-11-14 02:59:52,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.36 | bwd: 3851.03 | bwd_inner: 3843.51 | bwd_allreduce: 7.48 | step: 21.13
7%|▋ | 3700/50750 [10:17:14<77:26:42, 5.93s/it]
{'loss': 0.001, 'learning_rate': 3.980728692697195e-05, 'epoch': 3.65}
7%|▋ | 3700/50750 [10:17:14<77:26:42, 5.93s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201
[2024-11-14 02:59:58,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93
[2024-11-14 02:59:58,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2029.74 | bwd_microstep: 3845.72 | bwd_inner_microstep: 3838.23 | bwd_allreduce_microstep: 7.45 | step_microstep: 20.95
[2024-11-14 02:59:58,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2029.74 | bwd: 3845.73 | bwd_inner: 3838.23 | bwd_allreduce: 7.46 | step: 20.95
7%|▋ | 3701/50750 [10:17:20<77:25:43, 5.92s/it]
{'loss': 0.0004, 'learning_rate': 3.9807110126885584e-05, 'epoch': 3.65}
7%|▋ | 3701/50750 [10:17:20<77:25:43, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198
[2024-11-14 03:00:04,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93
[2024-11-14 03:00:04,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.92 | bwd_microstep: 3843.02 | bwd_inner_microstep: 3835.51 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.01
[2024-11-14 03:00:04,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.92 | bwd: 3843.04 | bwd_inner: 3835.51 | bwd_allreduce: 7.49 | step: 21.01
7%|▋ | 3702/50750 [10:17:26<77:23:11, 5.92s/it]
{'loss': 0.0055, 'learning_rate': 3.980693324612884e-05, 'epoch': 3.65}
7%|▋ | 3702/50750 [10:17:26<77:23:11, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197
[2024-11-14 03:00:10,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93
[2024-11-14 03:00:10,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2030.45 | bwd_microstep: 3858.98 | bwd_inner_microstep: 3851.49 | bwd_allreduce_microstep: 7.44 | step_microstep: 21.09
[2024-11-14 03:00:10,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2030.45 | bwd: 3858.99 | bwd_inner: 3851.49 | bwd_allreduce: 7.46 | step: 21.09
7%|▋ | 3703/50750 [10:17:32<77:26:30, 5.93s/it]
{'loss': 0.0, 'learning_rate': 3.980675628470243e-05, 'epoch': 3.65}
7%|▋ | 3703/50750 [10:17:32<77:26:30, 5.93s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197
[2024-11-14 03:00:16,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93
[2024-11-14 03:00:16,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.74 | bwd_microstep: 3855.52 | bwd_inner_microstep: 3848.05 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.18
[2024-11-14 03:00:16,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.74 | bwd: 3855.54 | bwd_inner: 3848.05 | bwd_allreduce: 7.45 | step: 21.19
7%|▋ | 3704/50750 [10:17:38<77:26:30, 5.93s/it]
{'loss': 0.0015, 'learning_rate': 3.980657924260707e-05, 'epoch': 3.65}
7%|▋ | 3704/50750 [10:17:38<77:26:30, 5.93s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2200
[2024-11-14 03:00:22,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.10 | optimizer_step: 4.98
[2024-11-14 03:00:22,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.24 | bwd_microstep: 3845.74 | bwd_inner_microstep: 3838.27 | bwd_allreduce_microstep: 7.43 | step_microstep: 21.32
[2024-11-14 03:00:22,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.24 | bwd: 3845.75 | bwd_inner: 3838.27 | bwd_allreduce: 7.45 | step: 21.32
7%|▋ | 3705/50750 [10:17:44<77:24:27, 5.92s/it]
{'loss': 0.0001, 'learning_rate': 3.98064021198435e-05, 'epoch': 3.65}
7%|▋ | 3705/50750 [10:17:44<77:24:27, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199
[2024-11-14 03:00:28,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93
[2024-11-14 03:00:28,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.23 | bwd_microstep: 3849.22 | bwd_inner_microstep: 3841.69 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.08
[2024-11-14 03:00:28,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.23 | bwd: 3849.23 | bwd_inner: 3841.69 | bwd_allreduce: 7.50 | step: 21.08
7%|▋ | 3706/50750 [10:17:49<77:23:37, 5.92s/it]
{'loss': 0.001, 'learning_rate': 3.9806224916412424e-05, 'epoch': 3.65}
7%|▋ | 3706/50750 [10:17:49<77:23:37, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198
[2024-11-14 03:00:33,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.92
[2024-11-14 03:00:33,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.55 | bwd_microstep: 3850.59 | bwd_inner_microstep: 3843.07 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.06
[2024-11-14 03:00:33,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.54 | bwd: 3850.60 | bwd_inner: 3843.07 | bwd_allreduce: 7.49 | step: 21.06
7%|▋ | 3707/50750 [10:17:55<77:23:23, 5.92s/it]
{'loss': 0.0016, 'learning_rate': 3.9806047632314574e-05, 'epoch': 3.65}
7%|▋ | 3707/50750 [10:17:55<77:23:23, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199
[2024-11-14 03:00:39,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.07 | optimizer_step: 4.93
[2024-11-14 03:00:39,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2026.97 | bwd_microstep: 3849.11 | bwd_inner_microstep: 3841.59 | bwd_allreduce_microstep: 7.48 | step_microstep: 21.05
[2024-11-14 03:00:39,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2026.96 | bwd: 3849.12 | bwd_inner: 3841.59 | bwd_allreduce: 7.49 | step: 21.05
7%|▋ | 3708/50750 [10:18:01<77:23:41, 5.92s/it]
{'loss': 0.2336, 'learning_rate': 3.9805870267550666e-05, 'epoch': 3.65}
7%|▋ | 3708/50750 [10:18:01<77:23:41, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2197
[2024-11-14 03:00:45,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93
[2024-11-14 03:00:45,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.68 | bwd_microstep: 3856.44 | bwd_inner_microstep: 3848.93 | bwd_allreduce_microstep: 7.46 | step_microstep: 21.00
[2024-11-14 03:00:45,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.68 | bwd: 3856.45 | bwd_inner: 3848.93 | bwd_allreduce: 7.48 | step: 21.00
7%|▋ | 3709/50750 [10:18:07<77:24:42, 5.92s/it]
{'loss': 0.0011, 'learning_rate': 3.980569282212142e-05, 'epoch': 3.65}
7%|▋ | 3709/50750 [10:18:07<77:24:42, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2201
[2024-11-14 03:00:51,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.03 | optimizer_step: 4.93
[2024-11-14 03:00:51,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2025.73 | bwd_microstep: 3857.68 | bwd_inner_microstep: 3850.17 | bwd_allreduce_microstep: 7.47 | step_microstep: 20.98
[2024-11-14 03:00:51,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2025.73 | bwd: 3857.69 | bwd_inner: 3850.17 | bwd_allreduce: 7.48 | step: 20.99
7%|▋ | 3710/50750 [10:18:13<77:25:45, 5.93s/it]
{'loss': 0.0006, 'learning_rate': 3.980551529602757e-05, 'epoch': 3.66}
7%|▋ | 3710/50750 [10:18:13<77:25:45, 5.93s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2198
[2024-11-14 03:00:57,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.04 | optimizer_step: 4.93
[2024-11-14 03:00:57,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.85 | bwd_microstep: 3849.84 | bwd_inner_microstep: 3842.31 | bwd_allreduce_microstep: 7.49 | step_microstep: 21.08
[2024-11-14 03:00:57,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.84 | bwd: 3849.85 | bwd_inner: 3842.31 | bwd_allreduce: 7.50 | step: 21.08
7%|▋ | 3711/50750 [10:18:19<77:24:19, 5.92s/it]
{'loss': 0.0002, 'learning_rate': 3.9805337689269826e-05, 'epoch': 3.66}
7%|▋ | 3711/50750 [10:18:19<77:24:19, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199
[2024-11-14 03:01:03,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.02 | optimizer_step: 4.93
[2024-11-14 03:01:03,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2024.31 | bwd_microstep: 3847.54 | bwd_inner_microstep: 3840.03 | bwd_allreduce_microstep: 7.47 | step_microstep: 21.05
[2024-11-14 03:01:03,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2024.31 | bwd: 3847.56 | bwd_inner: 3840.03 | bwd_allreduce: 7.49 | step: 21.05
7%|▋ | 3712/50750 [10:18:25<77:22:50, 5.92s/it]
{'loss': 0.0006, 'learning_rate': 3.980516000184891e-05, 'epoch': 3.66}
7%|▋ | 3712/50750 [10:18:25<77:22:50, 5.92s/it]
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2199
[2024-11-14 03:01:09,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.21 | optimizer_step: 5.09
[2024-11-14 03:01:09,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2023.48 | bwd_microstep: 3857.52 | bwd_inner_microstep: 3849.86 | bwd_allreduce_microstep: 7.62 | step_microstep: 22.55
[2024-11-14 03:01:09,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2023.48 | bwd: 3857.54 | bwd_inner: 3849.86 | bwd_allreduce: 7.63 | step: 22.55