cometadata/jina-reranker-v2-multilingual-affiliations-large

This is a Cross Encoder model finetuned from jinaai/jina-reranker-v2-base-multilingual using the sentence-transformers library. It computes a score for a pair of texts, which can be used for text reranking and semantic search. The model was finetuned on pairs of author-affiliation strings, so it is well suited to deciding whether two differently formatted affiliations refer to the same institution.

Model Details

Model Description

  • Model Type: Cross Encoder
  • Base model: jinaai/jina-reranker-v2-base-multilingual
  • Language: multilingual
  • Number of Output Labels: 1 label

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("cometadata/jina-reranker-v2-multilingual-affiliations-large")
# Get scores for pairs of texts
pairs = [
    ['Instituto Multidisciplinar para el Estudio del Medio ‘Ramón Margalef’, Universidad de Alicante, Alicante, Spain', 'Departamento de Matemática Aplicada, Universidad de Alicante, San Vicente del Raspeig (Alicante), España'],
    ['Instituto Multidisciplinar para el Estudio del Medio ‘Ramón Margalef’, Universidad de Alicante, Alicante, Spain', 'Research Institute of Physics and Aerospace Science, University of Vigo, Vigo, Spain'],
    ['Departamento de Patologia Básica, Setor de Ciências Biológicas, Universidade Federal do Paraná, 81531-970 Curitiba, PR, Brasil', 'Laboratory of Hematology, Department of Medical Pathology, Federal University of Paraná, Curitiba, Brazil'],
    ['Departamento de Patologia Básica, Setor de Ciências Biológicas, Universidade Federal do Paraná, 81531-970 Curitiba, PR, Brasil', 'Laboratório de Patologia Experimental Pontifícia Universidade Católica do Paraná Curitiba Brazil'],
    ['Institute of Information & Control, Hangzhou Dianzi University, Hangzhou 310018, P.R. China', 'College of Media & Design Hangzhou Dianzi University Hangzhou 310018 China'],
]
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'Instituto Multidisciplinar para el Estudio del Medio ‘Ramón Margalef’, Universidad de Alicante, Alicante, Spain',
    [
        'Departamento de Matemática Aplicada, Universidad de Alicante, San Vicente del Raspeig (Alicante), España',
        'Research Institute of Physics and Aerospace Science, University of Vigo, Vigo, Spain',
        'Laboratory of Hematology, Department of Medical Pathology, Federal University of Paraná, Curitiba, Brazil',
        'Laboratório de Patologia Experimental Pontifícia Universidade Católica do Paraná Curitiba Brazil',
        'College of Media & Design Hangzhou Dianzi University Hangzhou 310018 China',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
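
Because this model scores affiliation pairs, a common pattern is to keep only the top-ranked candidate and accept it as a match when its score clears a threshold. The following is a minimal sketch; the 0.5 cut-off and the decision logic are illustrative choices, not part of this card, and the threshold should be tuned on held-out affiliation pairs since the returned scores may be raw logits or sigmoid-squashed values depending on the configured activation.

from sentence_transformers import CrossEncoder

# Reuse the model loaded above, or load it again for a self-contained example
model = CrossEncoder("cometadata/jina-reranker-v2-multilingual-affiliations-large")

query = 'Instituto Multidisciplinar para el Estudio del Medio ‘Ramón Margalef’, Universidad de Alicante, Alicante, Spain'
candidates = [
    'Departamento de Matemática Aplicada, Universidad de Alicante, San Vicente del Raspeig (Alicante), España',
    'Research Institute of Physics and Aerospace Science, University of Vigo, Vigo, Spain',
    'College of Media & Design Hangzhou Dianzi University Hangzhou 310018 China',
]

# Rank all candidates, keep the best one, and return its text along with the score
best = model.rank(query, candidates, top_k=1, return_documents=True)[0]

THRESHOLD = 0.5  # illustrative; tune on held-out affiliation pairs
if best["score"] >= THRESHOLD:
    print(f"Match: {best['text']} (score={best['score']:.4f})")
else:
    print("No matching affiliation found")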

Evaluation

Metrics

Cross Encoder Reranking

Evaluated on the affiliation-val dataset:

Metric    Value
map       0.9880 (-0.0120)
mrr@10    0.9880 (-0.0120)
ndcg@10   0.9933 (-0.0067)
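
These metrics come from a reranking evaluation on the affiliation-val split referenced in the training logs below. The sketch below, which assumes the CrossEncoderRerankingEvaluator API from recent Sentence Transformers releases and uses illustrative samples assembled from the dataset excerpts later in this card, shows how a comparable evaluation can be run on your own affiliation data:

from sentence_transformers import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CrossEncoderRerankingEvaluator

model = CrossEncoder("cometadata/jina-reranker-v2-multilingual-affiliations-large")

# Each sample pairs one affiliation string with known matching and non-matching variants
samples = [
    {
        "query": "Stowers Institute for Medical Research, 64110, Kansas City, Missouri, USA",
        "positive": ["Stowers Institute for Medical Research, 1,000 East 50th Street, Kansas City, MO 64110, USA"],
        "negative": ["Clinical Trials Center Cardiovascular Research Foundation New York City NY USA"],
    },
]

evaluator = CrossEncoderRerankingEvaluator(samples=samples, at_k=10, name="affiliation-val")
results = evaluator(model)
print(results)  # e.g. {'affiliation-val_map': ..., 'affiliation-val_mrr@10': ..., 'affiliation-val_ndcg@10': ...}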

Training Details

Training Dataset

Unnamed Dataset

  • Size: 170,000 training samples
  • Columns: query, document, and label
  • Approximate statistics based on the first 1000 samples:
      query:    string; min 22, mean 89.21, max 209 characters
      document: string; min 25, mean 101.13, max 279 characters
      label:    int; 0: ~50.00%, 1: ~50.00%
  • Samples:
      query: Max-Planck-Institut für Astronomie, Königgstuhl 17, D-69117 Heidelberg, Germany
      document: Max-Planck-Institute for Astronomy, Königstuhl 17, 69117 Heidelberg, Germany e-mail: beuther@mpia.de
      label: 1

      query: Max-Planck-Institut für Astronomie, Königgstuhl 17, D-69117 Heidelberg, Germany
      document: Clinical Trials Center Cardiovascular Research Foundation New York City NY USA
      label: 0

      query: Stowers Institute for Medical Research, 64110, Kansas City, Missouri, USA
      document: Stowers Institute for Medical Research, 1,000 East 50th Street, Kansas City, MO 64110, USA
      label: 1
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": null
    }
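
For reference, below is a minimal sketch of how this loss configuration is typically constructed in Sentence Transformers. The Identity activation and absent pos_weight mirror the parameters listed above; loading the Jina base model with trust_remote_code=True is an assumption based on that model requiring custom code.

import torch

from sentence_transformers import CrossEncoder
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# The base model ships custom modeling code, hence trust_remote_code=True
model = CrossEncoder("jinaai/jina-reranker-v2-base-multilingual", trust_remote_code=True)

# Identity activation before the loss; BCEWithLogitsLoss applies the sigmoid internally.
# pos_weight=None keeps positive and negative pairs equally weighted (labels are ~50/50).
loss = BinaryCrossEntropyLoss(model, activation_fn=torch.nn.Identity(), pos_weight=None)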
    

Evaluation Dataset

Unnamed Dataset

  • Size: 30,000 evaluation samples
  • Columns: query, document, and label
  • Approximate statistics based on the first 1000 samples:
      query:    string; min 28, mean 113.4, max 298 characters
      document: string; min 24, mean 104.23, max 272 characters
      label:    int; 0: ~50.00%, 1: ~50.00%
  • Samples:
      query: Instituto Multidisciplinar para el Estudio del Medio ‘Ramón Margalef’, Universidad de Alicante, Alicante, Spain
      document: Departamento de Matemática Aplicada, Universidad de Alicante, San Vicente del Raspeig (Alicante), España
      label: 1

      query: Instituto Multidisciplinar para el Estudio del Medio ‘Ramón Margalef’, Universidad de Alicante, Alicante, Spain
      document: Research Institute of Physics and Aerospace Science, University of Vigo, Vigo, Spain
      label: 0

      query: Departamento de Patologia Básica, Setor de Ciências Biológicas, Universidade Federal do Paraná, 81531-970 Curitiba, PR, Brasil
      document: Laboratory of Hematology, Department of Medical Pathology, Federal University of Paraná, Curitiba, Brazil
      label: 1
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": null
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 3e-05
  • warmup_ratio: 0.1
  • bf16: True
  • load_best_model_at_end: True
  • push_to_hub: True
  • hub_model_id: cometadata/jina-reranker-v2-multilingual-affiliations-large
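
A hedged sketch of how these non-default values plug into a CrossEncoderTrainer run is shown below; the datasets are tiny illustrative stand-ins for the unnamed 170k/30k pair datasets described above, and the output directory is an assumption.

from datasets import Dataset
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, CrossEncoderTrainingArguments
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

model = CrossEncoder("jinaai/jina-reranker-v2-base-multilingual", trust_remote_code=True)
loss = BinaryCrossEntropyLoss(model)

# Tiny stand-ins with the query/document/label columns described in the dataset sections
train_dataset = Dataset.from_dict({
    "query": ['Stowers Institute for Medical Research, 64110, Kansas City, Missouri, USA'],
    "document": ['Stowers Institute for Medical Research, 1,000 East 50th Street, Kansas City, MO 64110, USA'],
    "label": [1],
})
eval_dataset = train_dataset

args = CrossEncoderTrainingArguments(
    output_dir="jina-reranker-v2-multilingual-affiliations-large",  # illustrative output path
    eval_strategy="steps",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    bf16=True,  # requires bf16-capable hardware
    load_best_model_at_end=True,
    push_to_hub=True,  # drop this and hub_model_id if you do not want to push
    hub_model_id="cometadata/jina-reranker-v2-multilingual-affiliations-large",
)

trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()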

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: True
  • resume_from_checkpoint: None
  • hub_model_id: cometadata/jina-reranker-v2-multilingual-affiliations-large
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch   Step   Training Loss   Validation Loss   affiliation-val_ndcg@10
-1      -1     -               -                 0.9200 (-0.0800)
0.0008  1      0.1129          -                 -
0.0752  100    0.3049          -                 -
0.1505  200    0.1295          -                 -
0.1881  250    -               0.6259            0.9715 (-0.0285)
0.2257  300    0.1076          -                 -
0.3010  400    0.0978          -                 -
0.3762  500    0.1031          0.2822            0.9871 (-0.0129)
0.4515  600    0.0932          -                 -
0.5267  700    0.1015          -                 -
0.5643  750    -               0.2395            0.9890 (-0.0110)
0.6020  800    0.0999          -                 -
0.6772  900    0.1112          -                 -
0.7524  1000   0.1196          0.1980            0.9921 (-0.0079)
0.8277  1100   0.1288          -                 -
0.9029  1200   0.1295          -                 -
0.9406  1250   -               0.1773            0.9929 (-0.0071)
0.9782  1300   0.1338          -                 -
1.0534  1400   0.0585          -                 -
1.1287  1500   0.0295          0.3412            0.9879 (-0.0121)
1.2039  1600   0.0412          -                 -
1.2792  1700   0.0491          -                 -
1.3168  1750   -               0.2622            0.9903 (-0.0097)
1.3544  1800   0.0619          -                 -
1.4296  1900   0.0612          -                 -
1.5049  2000   0.0676          0.2131            0.9919 (-0.0081)
1.5801  2100   0.0730          -                 -
1.6554  2200   0.0801          -                 -
1.6930  2250   -               0.1940            0.9927 (-0.0073)
1.7306  2300   0.0963          -                 -
1.8059  2400   0.1114          -                 -
1.8811  2500   0.1083          0.1773            0.9933 (-0.0067) *
1.9564  2600   0.1203          -                 -
2.0316  2700   0.0841          -                 -
2.0692  2750   -               0.2898            0.9907 (-0.0093)
2.1068  2800   0.0248          -                 -
2.1821  2900   0.0320          -                 -
2.2573  3000   0.0468          0.2455            0.9915 (-0.0085)
2.3326  3100   0.0497          -                 -
2.4078  3200   0.0585          -                 -
2.4454  3250   -               0.2142            0.9921 (-0.0079)
2.4831  3300   0.0653          -                 -
2.5583  3400   0.0701          -                 -
2.6336  3500   0.0758          0.2034            0.9924 (-0.0076)
2.7088  3600   0.0903          -                 -
2.7840  3700   0.1037          -                 -
2.8217  3750   -               0.1984            0.9927 (-0.0073)
2.8593  3800   0.1130          -                 -
2.9345  3900   0.1199          -                 -
-1      -1     -               -                 0.9933 (-0.0067)
  • The row marked with an asterisk (epoch 1.8811, step 2500) denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.12
  • Sentence Transformers: 5.2.0
  • Transformers: 4.57.3
  • PyTorch: 2.9.1+cu128
  • Accelerate: 1.12.0
  • Datasets: 4.4.2
  • Tokenizers: 0.22.1
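
If you want to pin a matching environment, a possible install command using the versions above is shown below; the CUDA-specific PyTorch build (+cu128) may require the appropriate PyTorch index URL.

pip install sentence-transformers==5.2.0 transformers==4.57.3 torch==2.9.1 accelerate==1.12.0 datasets==4.4.2 tokenizers==0.22.1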

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}