What the current HF image_guided_detection does (and why it feels limiting)
image_guided_detection(pixel_values, query_pixel_values=...) is implemented as a visual prompt / one-shot mode: it takes a single “query” image (support/prototype) and uses it to create a query embedding that replaces the usual text embedding inside the classification head.
Two important implications show up directly in the Transformers docs:
- The API expects one query image per target image (“Pass in one query image per target image.”). (Hugging Face)
- The task guide explicitly states that only a single example image is allowed for image-guided detection (unlike text queries). (Hugging Face)
So: out of the box, Transformers does not expose (a) cached query embeddings, nor (b) multi-query prototypes for one target.
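For reference, here is a minimal sketch of the documented out-of-the-box usage (one query image per target, re-encoded on every call); the checkpoint name and file paths are placeholders:

```python
# Baseline: one query image per target image, following the documented API.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16").eval()

target_image = Image.open("target.jpg").convert("RGB")  # placeholder paths
query_image = Image.open("query.jpg").convert("RGB")

inputs = processor(images=target_image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    # Re-encodes the query image on every call -- the cost we want to avoid.
    outputs = model.image_guided_detection(**inputs)

target_sizes = torch.tensor([target_image.size[::-1]])  # (height, width)
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
print(results[0]["boxes"], results[0]["scores"])
```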
Your instinct is correct: if you are scanning a large dataset with a fixed set of prototypes, repeatedly re-encoding the same query image is wasteful.
1) Can you cache the query image embeddings?
Short practical answer
- Not with the public image_guided_detection API in OWL-ViT today.
- Yes in practice if you bypass/extend the pipeline: compute the query prototype embedding(s) once, store them, and reuse them for all target images.
Why it’s not a one-line flag
image_guided_detection is not just “encode query image → use embedding”. In OWL-ViT/OWLv2, the implementation typically includes a heuristic that selects/constructs the query embedding from the query image (e.g., picking a “best” region/box/patch to represent the object). That heuristic is a known pain point in OWLv2 discussions, and users have dug into why the query-embedding selection sometimes behaves oddly. (Hugging Face Forums)
There is also a more recent feature request around changing how the visual prompt query embedding is chosen in OWLv2 (because results can look random). (GitHub)
So the “right thing to cache” is not always “global image embedding”; it’s “the embedding the detector actually uses as the query”.
The cleanest route: switch to OWLv2 and pass embeddings explicitly (when possible)
OWLv2 is introduced in Scaling Open-Vocabulary Object Detection (Matthias Minderer et al.), which describes OWLv2 + OWL-ST scaling/self-training. (arXiv)
In Transformers, OWLv2’s forward path is documented as accepting precomputed embeddings via kwargs (e.g., text_embeds, image_embeds, class_embeds), which is exactly what you need to “encode once, reuse forever” for prototypes. (GitHub)
Conceptually:
- Precompute K prototype vectors once (from K query images).
- For each target image, run the vision encoder once, then do a cheap similarity/matmul against those K vectors.
This removes the query-side encoder cost from the per-target loop.
If you must stay on OWL-ViT
OWL-ViT’s image_guided_detection signature only takes query_pixel_values, not an embedding. (Hugging Face)
So caching requires copying/patching the logic: (1) the query-side path, which encodes the query image and runs the internal box-selection heuristic to produce the query embedding, and (2) the target-side path, which encodes each target image and scores it against that embedding.
You then cache the output of (1).
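Here is a minimal sketch of step (1), assuming the private helpers image_embedder and embed_image_query on OwlViTForObjectDetection keep their current names and return values; they are internal and may change between Transformers releases, so treat this as a patch, not a stable API:

```python
# Sketch: compute the query-side embedding once and cache it.
# NOTE: image_embedder / embed_image_query are private helpers of
# OwlViTForObjectDetection; names, signatures and return values may change
# between Transformers versions.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16").eval()

@torch.no_grad()
def compute_query_embedding(query_image: Image.Image) -> torch.Tensor:
    """Step (1): encode the query image and run the internal box-selection
    heuristic; the returned vector is what the detector uses as the 'query'."""
    pixel_values = processor(images=query_image, return_tensors="pt").pixel_values
    feature_map = model.image_embedder(pixel_values=pixel_values)[0]  # (1, h, w, d)
    b, h, w, d = feature_map.shape
    image_feats = feature_map.reshape(b, h * w, d)
    query_embeds, _best_box_idx, _query_boxes = model.embed_image_query(image_feats, feature_map)
    return query_embeds  # cache this (e.g. torch.save) and reuse for every target

prototype = compute_query_embedding(Image.open("query.jpg").convert("RGB"))
```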
This is exactly the feature request raised by users (same as your post) in both the HF forum and a Transformers GitHub issue; neither thread shows an official built-in solution. (Hugging Face Forums)
2) Can you use multiple query images for one target image?
Out of the box (OWL-ViT image_guided_detection)
No: the HF task guide and the method’s own doc wording are aligned with “single example image”. (Hugging Face)
In practice (what you want to implement)
Yes, and there are two common patterns:
Pattern A — Prototype pooling (fastest, simplest)
Encode each query image to a prototype vector, then combine into a single vector (a pooling sketch follows below):
- mean of normalized embeddings
- trimmed mean / median (more robust to outliers)
- attention-weighted mean (weights from a small scoring model, or heuristics)
Then you run exactly one “query” per target image.
Pros: constant runtime w.r.t. number of query images once cached
Cons: loses multi-modal structure (“red car” vs “blue car” prototypes get averaged)
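A minimal pooling sketch; `embeddings` is assumed to be a (K, D) tensor of cached per-query prototype vectors, and the trimming fraction is an arbitrary choice:

```python
# Sketch: pool K per-query embeddings into a single unit-length prototype.
import torch
import torch.nn.functional as F

def pool_prototypes(embeddings: torch.Tensor, method: str = "mean", trim: float = 0.2) -> torch.Tensor:
    """embeddings: (K, D) cached prototype vectors for one class."""
    embeddings = F.normalize(embeddings, dim=-1)
    if method == "mean":
        pooled = embeddings.mean(dim=0)
    elif method == "median":
        pooled = embeddings.median(dim=0).values
    elif method == "trimmed_mean":
        # Drop the `trim` fraction of vectors farthest from the mean direction.
        center = F.normalize(embeddings.mean(dim=0), dim=-1)
        keep = (embeddings @ center).argsort(descending=True)
        keep = keep[: max(1, int(len(embeddings) * (1.0 - trim)))]
        pooled = embeddings[keep].mean(dim=0)
    else:
        raise ValueError(f"unknown pooling method: {method}")
    return F.normalize(pooled, dim=-1)
```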
Pattern B — Keep multiple prototypes, but score them in one pass
Instead of forcing one prototype, keep K prototypes:
- Treat them as K “queries/classes”.
- Compute patch-to-prototype similarity for all K at once (single matmul).
Then aggregate detections:
- take max score across prototypes per box, or
- take union of detections from all prototypes + NMS.
Pros: preserves multi-modality; still fast (matmul cost)
Cons: you now have to aggregate outputs carefully
This is the exact scaling improvement you’re after: the expensive part becomes “encode target once”, not “encode query K times”.
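A minimal sketch of Pattern B’s scoring step, assuming `class_embeds` is the (num_patches, D) per-patch class-embedding tensor of one target image and `prototypes` is the (K, D) cached bank (the tensor layout is an assumption, not a fixed library interface):

```python
# Sketch: score every patch against K prototypes in a single matmul,
# then reduce to one score per patch via max over prototypes.
import torch
import torch.nn.functional as F

def score_against_prototypes(class_embeds: torch.Tensor, prototypes: torch.Tensor):
    """class_embeds: (num_patches, D); prototypes: (K, D)."""
    class_embeds = F.normalize(class_embeds, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    sim = class_embeds @ prototypes.T            # (num_patches, K) cosine similarities
    best_sim, best_proto = sim.max(dim=-1)       # best prototype per patch
    return best_sim, best_proto                  # per-patch score + matching prototype id
```

Note that these are raw cosine similarities; OWL-ViT’s class head additionally applies a learned shift/scale before producing logits, so thresholds from the official post-processing won’t transfer directly.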
A practical recipe for your use case (recommended)
Step 0 — Make the “prototype” represent the object, not the whole query image
The biggest pitfall in “image as prompt” is background leakage. Practical fixes:
- Crop the query image tightly around the object (best).
- If you have a mask/box, compute a region prototype (pool patch embeddings inside the region; sketched after this list).
- If you don’t, you’re relying on the model’s internal heuristic to guess the region—which is exactly where many “random box” complaints come from. (GitHub)
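A minimal region-prototype sketch, assuming a square ViT patch grid and a box given in normalized (x_min, y_min, x_max, y_max) coordinates; the (num_patches, D) layout of `class_embeds` is an assumption:

```python
# Sketch: build a prototype by average-pooling patch embeddings inside a box.
import math
import torch
import torch.nn.functional as F

def region_prototype(class_embeds: torch.Tensor, box: tuple) -> torch.Tensor:
    """class_embeds: (num_patches, D), row-major patch grid; box in [0, 1] xyxy."""
    num_patches, dim = class_embeds.shape
    grid = int(math.sqrt(num_patches))          # assumes a square patch grid
    feats = class_embeds.reshape(grid, grid, dim)
    x0, y0, x1, y1 = box
    r0, r1 = int(y0 * grid), max(int(y0 * grid) + 1, math.ceil(y1 * grid))
    c0, c1 = int(x0 * grid), max(int(x0 * grid) + 1, math.ceil(x1 * grid))
    region = feats[r0:r1, c0:c1].reshape(-1, dim)
    return F.normalize(region.mean(dim=0), dim=-1)
```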
Step 1 — Precompute and cache prototypes
Store (a save/load sketch follows this list):
- float16 / bfloat16 vectors (half precision is usually enough)
- normalized vectors (unit length) if your scoring uses cosine similarity
- multiple prototypes per class if needed
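A minimal persistence sketch; the file name and dict layout are arbitrary choices, not a library format:

```python
# Sketch: persist the prototype bank once, reload it for every inference run.
import torch
import torch.nn.functional as F

def save_prototype_bank(prototypes: dict, path: str = "prototypes.pt") -> None:
    """prototypes: class name -> (K, D) float tensor of prototype vectors."""
    bank = {name: F.normalize(embs.float(), dim=-1).half() for name, embs in prototypes.items()}
    torch.save(bank, path)

def load_prototype_bank(path: str = "prototypes.pt", device: str = "cpu") -> dict:
    bank = torch.load(path, map_location=device)
    # Upcast back to float32 for numerically stable scoring.
    return {name: embs.float() for name, embs in bank.items()}
```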
Step 2 — Run target images once, score against cached prototypes
Runtime becomes:
- one vision forward per target image
- plus a small (num_patches × K) matmul
That scales well for “large dataset, small prototype bank”.
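A per-target loop sketch. It swaps the image-guided path for a plain forward call with a throwaway text query (only to obtain class_embeds and pred_boxes from the output), then scores against the cached bank; the scores are raw cosine similarities rather than the model’s calibrated logits, so the threshold is an assumption you must tune, and "prototypes.pt" refers to the hypothetical bank from the previous sketch:

```python
# Sketch: one vision forward per target image + a (num_patches x K) matmul.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16").eval()

# Cached (K, D) bank for one class; file name and layout from the earlier sketch.
prototypes = F.normalize(torch.load("prototypes.pt")["my_class"].float(), dim=-1)

@torch.no_grad()
def detect(image: Image.Image, threshold: float = 0.8):
    # The text query is a placeholder; we only need class_embeds and pred_boxes.
    inputs = processor(text=[["object"]], images=image, return_tensors="pt")
    outputs = model(**inputs)
    class_embeds = F.normalize(outputs.class_embeds[0], dim=-1)   # (num_patches, D)
    scores = (class_embeds @ prototypes.T).max(dim=-1).values     # max over K prototypes
    keep = scores > threshold                                     # threshold needs tuning
    return outputs.pred_boxes[0][keep], scores[keep]              # normalized cxcywh boxes
```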
Step 3 — Aggregate multi-prototype outputs
If you keep K prototypes:
- Score aggregation: score(box) = max_k score_k(box) is a decent baseline.
- Then apply NMS on final boxes.
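A minimal aggregation sketch using torchvision’s NMS; boxes are assumed to be in OWL-ViT’s normalized (cx, cy, w, h) format and already reduced to one score per box (e.g., via the max over prototypes above):

```python
# Sketch: merge the union of per-prototype detections with NMS.
import torch
from torchvision.ops import box_convert, nms

def aggregate_detections(boxes_cxcywh: torch.Tensor, scores: torch.Tensor, iou_threshold: float = 0.5):
    """boxes_cxcywh: (N, 4) union of kept boxes; scores: (N,) per-box scores."""
    boxes_xyxy = box_convert(boxes_cxcywh, in_fmt="cxcywh", out_fmt="xyxy")
    # For non-square images, rescale to pixel coordinates first for exact IoU.
    keep = nms(boxes_xyxy, scores, iou_threshold)
    return boxes_xyxy[keep], scores[keep]
```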
Similar issues / threads (directly on-point)
- Transformers GitHub feature request: caching + multiple reference images (closed, but captures the need clearly). (GitHub)
- Older Transformers issue about “support for image embeddings as input” (same underlying theme: bypass repeated encoder work). (GitHub)
- OWLv2 “visual prompt query embedding selection” feature request (improving how the query embedding is picked). (GitHub)
- OWLv2 discussion of the query-embedding heuristic (embed_image_query). (Hugging Face Forums)
Good papers / surveys / projects to orient your approach
Core model papers
- OWL-ViT: open-vocabulary detection with a ViT + patch-level heads; supports one-shot image-conditioned detection. (Hugging Face)
- OWLv2 / OWL-ST: scaling recipe + OWLv2 model. (arXiv)
Broader “few-shot detection with modern foundation models”
- Few-Shot Object Detection with Foundation Models (CVPR 2024): explicitly frames extracting support prototypes + using them for query images (not OWL-ViT-specific, but conceptually close to your “prototype bank” approach). (CVF Open Access)
Curated survey-style resource lists
- “Awesome Open-Vocabulary Detection and Segmentation” (good jumping-off point for alternative open-vocab detectors and prompt mechanisms). (GitHub)
My recommendation for your exact scenario
If your goal is a small set of prototype images per class plus high-throughput inference over many target images, then:
- Do not call image_guided_detection inside the target loop. It hard-binds you to repeated query encoding and the single-example API. (Hugging Face)
- Build a cached “prototype bank”. Prefer object-cropped prototypes (manual boxes, masks, or pre-crop pipeline). This reduces background mismatch and makes averaging/ensembling meaningful.
- Use OWLv2 if you can, because the library surface area is more aligned with passing embeddings / altering query selection, and the community is actively discussing improvements to the visual prompt path. (arXiv)
- Keep multiple prototypes per class and aggregate scores, rather than forcing one “mean prototype” immediately. Start with max-aggregation + NMS; it’s simple and usually strong.