Caching image prototype embeddings for image-guided object detection using OWL-ViT

The OWL-ViT model currently supports image-guided one-shot object detection by using reference image embeddings as the input to the classification head instead of the text embedding. This is implemented by the image_guided_detection method.

There are two problems:

  • it doesn’t support passing multiple reference images as input
  • the reference image is passed through the image encoder every time

In practice, I’d like to use the model’s image_guided_detection for inference on a larger dataset, and computing the reference image query embedding for every target image is clearly wasteful, since the query embeddings do not depend on the target image.

  1. Is there a way to cache the query image embeddings?
  2. And is there a way to use multiple query images for one target image?

Motivation

One-shot learning is an extreme case of few-shot learning, and in practice it is usually very hard, if not impossible, to represent a whole class with only one reference image.

Therefore a natural extension is to use multiple prototypical images capturing the detected object in various situations, lighting conditions, etc.
But as of now, the running time of OWL-ViT scales linearly with the number of query images, which makes it impractical for real-world usage.


Did you happen to get a solution or alternative for this? I am trying to do something similar.


Me too, any solutions?


Hmm…?


What the current HF image_guided_detection does (and why it feels limiting)

image_guided_detection(pixel_values, query_pixel_values=...) is implemented as a visual prompt / one-shot mode: it takes a single “query” image (support/prototype) and uses it to create a query embedding that replaces the usual text embedding inside the classification head.

Two important implications show up directly in the Transformers docs:

  • The API expects one query image per target image (“Pass in one query image per target image.”). (Hugging Face)
  • The task guide explicitly states that only a single example image is allowed for image-guided detection (unlike text queries). (Hugging Face)

So: out of the box, Transformers does not expose (a) cached query embeddings, nor (b) multi-query prototypes for one target.
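For reference, here is a minimal sketch of what the per-target loop looks like with the public API (the checkpoint name and thresholds are illustrative values taken from the docs; target_images / query_image are assumed to be PIL images). Note that the single query image is re-preprocessed and re-encoded on every iteration:

```python
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16").eval()

for target_image in target_images:  # PIL images; query_image is a single PIL image
    # The same query image goes through preprocessing + the image encoder every time.
    inputs = processor(images=target_image, query_images=query_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model.image_guided_detection(**inputs)
    results = processor.post_process_image_guided_detection(
        outputs=outputs,
        threshold=0.6,
        nms_threshold=0.3,
        target_sizes=torch.tensor([target_image.size[::-1]]),  # (height, width)
    )
```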

Your instinct is correct: if you are scanning a large dataset with a fixed set of prototypes, repeatedly re-encoding the same query image is wasteful.


1) Can you cache the query image embeddings?

Short practical answer

  • Not with the public image_guided_detection API in OWL-ViT today.
  • Yes in practice if you bypass/extend the pipeline: compute the query prototype embedding(s) once, store them, and reuse them for all target images.

Why it’s not a one-line flag

image_guided_detection is not just “encode query image → use embedding”. In OWL-ViT/OWLv2, the implementation typically includes a heuristic that selects/constructs the query embedding from the query image (e.g., selecting a “best” region/box/patch to represent the object). That heuristic is a known pain point in OWLv2 discussions, and people dig into why the query-embedding selection behaves oddly. (Hugging Face Forums)
There is also a more recent feature request around changing how the visual prompt query embedding is chosen in OWLv2 (because results can look random). (GitHub)

So the “right thing to cache” is not always “global image embedding”; it’s “the embedding the detector actually uses as the query”.

The cleanest route: switch to OWLv2 and pass embeddings explicitly (when possible)

OWLv2 is introduced in Scaling Open-Vocabulary Object Detection (Matthias Minderer et al.), which describes OWLv2 + OWL-ST scaling/self-training. (arXiv)
In Transformers, OWLv2’s forward path is documented as accepting precomputed embeddings via kwargs (e.g., text_embeds, image_embeds, class_embeds), which is exactly what you need to “encode once, reuse forever” for prototypes. (GitHub)

Conceptually:

  • Precompute K prototype vectors once (from K query images).
  • For each target image, run the vision encoder once, then do a cheap similarity/matmul against those K vectors.

This removes the query-side encoder cost from the per-target loop.
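As a minimal sketch of that loop structure: encode_query_prototype and encode_target_patch_embeds below are hypothetical helpers, not part of the transformers API; you would implement them yourself on top of the model internals (or by factoring image_guided_detection apart, as discussed in the next subsection).

```python
import torch

# Hypothetical helpers (NOT in transformers):
#   encode_query_prototype(model, processor, image)     -> (D,) unit-norm prototype vector
#   encode_target_patch_embeds(model, processor, image) -> (P, D) per-patch class embeddings

# 1) Encode the K query images once and cache the result.
prototype_bank = torch.stack(
    [encode_query_prototype(model, processor, img) for img in query_images]
)  # (K, D)

# 2) Per target image: one vision forward pass + one cheap matmul.
for target_image in dataset:
    patch_embeds = encode_target_patch_embeds(model, processor, target_image)  # (P, D)
    scores = patch_embeds @ prototype_bank.T        # (P, K) similarities
    per_patch_best = scores.max(dim=-1).values      # best prototype per patch/box
```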

If you must stay on OWL-ViT

OWL-ViT’s image_guided_detection signature only takes query_pixel_values, not an embedding. (Hugging Face)
So caching requires copying/patching the logic:

  • Factor image_guided_detection into two stages:

    1. compute_query_prototype(query_pixel_values) -> query_embeds
    2. detect_with_query_embeds(pixel_values, query_embeds) -> outputs

You then cache the output of (1).
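A sketch of what the caching side could look like once you have those two stages (compute_query_prototype and detect_with_query_embeds are your own refactor of the internals, not transformers methods):

```python
import torch

class QueryPrototypeCache:
    """Caches the output of the hypothetical compute_query_prototype stage."""

    def __init__(self, model, processor):
        self.model = model
        self.processor = processor
        self._cache = {}

    def get(self, key, query_image):
        if key not in self._cache:
            # Preprocess and encode the query image only on the first lookup.
            query_inputs = self.processor(query_images=query_image, return_tensors="pt")
            with torch.no_grad():
                self._cache[key] = compute_query_prototype(
                    self.model, query_inputs["query_pixel_values"]
                )
        return self._cache[key]

# Per target image, only stage (2) runs:
# outputs = detect_with_query_embeds(model, pixel_values, cache.get("my_object", query_image))
```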

This is exactly the feature request raised by users (same as your post) in both the HF forum and a Transformers GitHub issue; neither thread shows an official built-in solution. (Hugging Face Forums)


2) Can you use multiple query images for one target image?

Out of the box (OWL-ViT image_guided_detection)

No: the HF task guide and the method’s own doc wording are aligned with “single example image”. (Hugging Face)

In practice (what you want to implement)

Yes, and there are two common patterns:

Pattern A — Prototype pooling (fastest, simplest)

Encode each query image to a prototype vector, then combine into a single vector:

  • mean of normalized embeddings
  • trimmed mean / median (more robust to outliers)
  • attention-weighted mean (weights from a small scoring model, or heuristics)

Then you run exactly one “query” per target image.

Pros: constant runtime w.r.t. number of query images once cached
Cons: loses the multi-modal structure of the class (“red car” and “blue car” prototypes get averaged together)
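A minimal pooling sketch, assuming you already have a (K, D) tensor of prototype vectors:

```python
import torch
import torch.nn.functional as F

def pool_prototypes(prototypes: torch.Tensor) -> torch.Tensor:
    """Collapse K prototype vectors (K, D) into one unit-norm query vector."""
    prototypes = F.normalize(prototypes, dim=-1)   # normalize before averaging
    pooled = prototypes.mean(dim=0)                # swap in median / trimmed mean if needed
    return F.normalize(pooled, dim=-1)
```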

Pattern B — Keep multiple prototypes, but score them in one pass

Instead of forcing one prototype, keep K prototypes:

  • Treat them as K “queries/classes”.
  • Compute patch-to-prototype similarity for all K at once (single matmul).

Then aggregate detections:

  • take max score across prototypes per box, or
  • take union of detections from all prototypes + NMS.

Pros: preserves multi-modality; still fast (matmul cost)
Cons: you now have to aggregate outputs carefully

This is the exact scaling improvement you’re after: the expensive part becomes “encode target once”, not “encode query K times”.
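In tensor terms (class_embeds and prototype_bank are assumed to be unit-normalized, with shapes (N, D) for the target image’s per-box embeddings and (K, D) for the cached bank):

```python
import torch

scores_per_proto = class_embeds @ prototype_bank.T    # (N, K) cosine similarities, one matmul
max_scores = scores_per_proto.max(dim=-1).values      # Pattern B max-aggregation per box
best_proto = scores_per_proto.argmax(dim=-1)          # which prototype matched each box
```

The aggregated scores then feed into the NMS step described in the recipe below.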


A practical recipe for your use case (recommended)

Step 0 — Make the “prototype” represent the object, not the whole query image

The biggest pitfall in “image as prompt” is background leakage. Practical fixes:

  • Crop the query image tightly around the object (best).
  • If you have a mask/box, compute a region prototype (pool patch embeddings inside the region).
  • If you don’t, you’re relying on the model’s internal heuristic to guess the region—which is exactly where many “random box” complaints come from. (GitHub)
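As a sketch of the region-prototype idea (patch_embeds and the grid size are assumptions about how you extract per-patch embeddings from the query image; the box is a normalized corner box):

```python
import torch
import torch.nn.functional as F

def region_prototype(patch_embeds: torch.Tensor, box, grid_size: int) -> torch.Tensor:
    """Average the per-patch embeddings whose patch centers fall inside the box.

    patch_embeds: (grid_size * grid_size, D) embeddings from the query image.
    box: normalized (x0, y0, x1, y1) coordinates of the object.
    """
    x0, y0, x1, y1 = box
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_size),
        torch.linspace(0, 1, grid_size),
        indexing="ij",
    )
    inside = ((xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)).flatten()
    pooled = patch_embeds[inside].mean(dim=0)
    return F.normalize(pooled, dim=-1)
```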

Step 1 — Precompute and cache prototypes

Store:

  • float16 / bfloat16 vectors (often enough)
  • normalized vectors (unit length) if your scoring uses cosine similarity
  • multiple prototypes per class if needed
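For example (prototype_bank and class_names are whatever you produced in the earlier sketches):

```python
import torch
import torch.nn.functional as F

# Store unit-norm float16 prototypes once; the query encoder never runs at inference time.
bank = F.normalize(prototype_bank, dim=-1).to(torch.float16)
torch.save({"class_names": class_names, "prototypes": bank}, "prototype_bank.pt")

# At inference time:
state = torch.load("prototype_bank.pt")
prototype_bank = state["prototypes"].float()
```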

Step 2 — Run target images once, score against cached prototypes

Runtime becomes:

  • one vision forward per target image
  • plus a small (num_patches × K) matmul

That scales well for “large dataset, small prototype bank”.

Step 3 — Aggregate multi-prototype outputs

If you keep K prototypes:

  • Score aggregation: score(box) = max_k score_k(box) is a decent baseline.
  • Then apply NMS on final boxes.
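A sketch using torchvision’s NMS (boxes are assumed to be (N, 4) corner-format pixel coordinates, scores_per_proto an (N, K) score matrix from the matmul above):

```python
import torch
from torchvision.ops import nms

def aggregate_and_nms(boxes, scores_per_proto, score_thresh=0.3, iou_thresh=0.5):
    scores = scores_per_proto.max(dim=-1).values   # max over prototypes per box
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)          # class-agnostic NMS
    return boxes[kept], scores[kept]
```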

Similar issues / threads (directly on-point)

  • Transformers GitHub feature request: caching + multiple reference images (closed, but captures the need clearly). (GitHub)
  • Older Transformers issue about “support for image embeddings as input” (same underlying theme: bypass repeated encoder work). (GitHub)
  • OWLv2 “visual prompt query embedding selection” feature request (improving how the query embedding is picked). (GitHub)
  • OWLv2 discussion of the query-embedding heuristic (embed_image_query). (Hugging Face Forums)

Good papers / surveys / projects to orient your approach

Core model papers

  • OWL-ViT: open-vocabulary detection with a ViT + patch-level heads; supports one-shot image-conditioned detection. (Hugging Face)
  • OWLv2 / OWL-ST: scaling recipe + OWLv2 model. (arXiv)

Broader “few-shot detection with modern foundation models”

  • Few-Shot Object Detection with Foundation Models (CVPR 2024): explicitly frames extracting support prototypes + using them for query images (not OWL-ViT-specific, but conceptually close to your “prototype bank” approach). (CVF Open Access)

Curated survey-style resource lists

  • “Awesome Open-Vocabulary Detection and Segmentation” (good jumping-off point for alternative open-vocab detectors and prompt mechanisms). (GitHub)

My recommendation for your exact scenario

If your goal is: a small set of prototype images per class and high-throughput inference over many target images, then:

  1. Do not call image_guided_detection inside the target loop.
    It hard-binds you to repeated query encoding and the single-example API. (Hugging Face)

  2. Build a cached “prototype bank”.
    Prefer object-cropped prototypes (manual boxes, masks, or pre-crop pipeline). This reduces background mismatch and makes averaging/ensembling meaningful.

  3. Use OWLv2 if you can, because the library surface area is more aligned with passing embeddings / altering query selection, and the community is actively discussing improvements to the visual prompt path. (arXiv)

  4. Keep multiple prototypes per class and aggregate scores, rather than forcing one “mean prototype” immediately. Start with max-aggregation + NMS; it’s simple and usually strong.