Hierarchical Multimodal Frame Selection
for Long Video Question Answering
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost.
We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve.
Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency–accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32–512 frames while requiring roughly 10× fewer FLOPs.
HiMu operates in four stages: it decomposes the query into a hierarchical logic tree, grounds each atomic predicate with a specialized expert, composes signals bottom-up through fuzzy-logic operators, and selects the most informative frames via Peak-And-Spread Selection.
A single LLM call parses the question into a hierarchical logic tree of atomic predicates connected by AND, OR, SEQ, and RIGHT_AFTER operators.
Each predicate is routed to a specialized lightweight expert: CLIP for appearance, YOLO-World for objects, OCR for text, ASR for speech, CLAP for audio.
Expert signals are normalized, temporally smoothed with bandwidth-matched filters, and composed bottom-up via fuzzy-logic operators into a satisfaction curve.
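The composition step can be sketched with standard Gödel fuzzy-logic operators over per-frame score curves. The operator choices below (AND as pointwise min, OR as pointwise max, SEQ as a causal running max of the first operand) are an illustrative sketch, not necessarily HiMu's exact formulation; the toy predicate curves are invented for demonstration.

```python
# Sketch of fuzzy-logic composition over per-frame expert scores in [0, 1].
# AND/OR follow standard Gödel semantics; SEQ here means "b holds at t AND
# a held at some earlier or equal t'", via a causal running max of a.
import numpy as np

def f_and(a, b):
    return np.minimum(a, b)

def f_or(a, b):
    return np.maximum(a, b)

def f_seq(a, b):
    return np.minimum(np.maximum.accumulate(a), b)

# Toy predicate curves over 6 frames (invented for illustration).
dog  = np.array([0.9, 0.2, 0.1, 0.1, 0.1, 0.1])
ball = np.array([0.0, 0.1, 0.1, 0.8, 0.9, 0.2])

sat = f_seq(dog, ball)  # "dog appears, then ball appears"
print(sat)              # → [0.  0.1 0.1 0.8 0.9 0.2]
```

Note how the running max lets the "dog" evidence persist forward in time, so the satisfaction curve peaks only where the ball appears after the dog has been seen.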
Peak-And-Spread Selection identifies the top-K most relevant frames, balancing peak satisfaction with temporal diversity across the video.
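A minimal way to balance peak satisfaction with temporal diversity is greedy peak picking with non-maximum suppression: take the highest-scoring frame, mask a temporal neighborhood around it, and repeat. The suppression radius below is an illustrative knob, not HiMu's exact rule.

```python
# Hypothetical sketch of a peak-plus-spread frame selector: greedily pick
# the highest-satisfaction frame, then suppress its temporal neighborhood
# so subsequent picks spread across the video.
import numpy as np

def peak_and_spread(sat, k, radius=1):
    sat = sat.astype(float).copy()
    picks = []
    for _ in range(k):
        t = int(np.argmax(sat))
        picks.append(t)
        lo, hi = max(0, t - radius), min(len(sat), t + radius + 1)
        sat[lo:hi] = -np.inf  # non-maximum suppression around the peak
    return sorted(picks)

curve = np.array([0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.95, 0.3])
print(peak_and_spread(curve, k=3))  # → [1, 3, 6]
```

Without the suppression step, the top-3 frames would be 6, 1, and 2 — two near-duplicate neighbors — whereas the spread-aware picks cover three distinct regions of the curve.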
HiMu decomposes complex, compositional video questions into structured logic trees. Each leaf is an atomic predicate grounded by a specialized expert, while internal nodes enforce temporal and logical relationships.
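As a concrete illustration, a compositional query might parse into a tree like the one below. The schema (field names, nesting) is an assumption for illustration; only the operator vocabulary (AND, OR, SEQ, RIGHT_AFTER) and the expert names come from the description above.

```python
# Hypothetical logic-tree parse for a compositional audio-visual query.
# Each leaf is an atomic predicate routed to one lightweight expert;
# the internal node enforces a temporal-adjacency relation.
query = "What does the chef say right after the oven timer beeps?"

tree = {
    "op": "RIGHT_AFTER",
    "children": [
        {"predicate": "oven timer beeping sound", "expert": "CLAP"},  # audio event
        {"predicate": "chef speaking",            "expert": "ASR"},   # speech
    ],
}

# Leaves are grounded independently; the RIGHT_AFTER node then rewards
# frames where the second predicate fires just after the first.
print([child["expert"] for child in tree["children"]])
```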
| Method | Model | VMME Short | VMME Med. | VMME Long | VMME Overall | LVB-val | HERBench-Lite |
|---|---|---|---|---|---|---|---|
| Uniform Sampling | Qwen3-VL-8B | 76.34 | 66.31 | 55.58 | 66.36 | 55.74 | 41.70 |
| BOLT | Qwen3-VL-8B | 69.58 | 67.87 | 68.73 | 68.74 | 54.55 | 42.20 |
| T* | Qwen3-VL-8B | 73.66 | 67.39 | 68.12 | 69.77 | 57.49 | 39.10 |
| AKS | Qwen3-VL-8B | 70.05 | 65.10 | 68.73 | 67.98 | 57.14 | 40.25 |
| HiMu (Ours) | Qwen3-VL-8B | 78.55 | 71.00 | 69.90 | 73.22 | 64.19 | 43.22 |
Comparison on Video-MME (Short/Medium/Long splits and Overall), LongVideoBench-val, and HERBench-Lite; all scores are accuracy (%). All methods use K=16 frames with Qwen3-VL-8B.
HiMu with GPT-4o at 16 frames achieves 78.18% on Video-MME, surpassing VideoChat-A1 at 384 frames (77.2%) and VSLS at 32 frames (63.0%). This shows that a small set of precisely localized, compositionally verified frames is more effective than expanding the context window with hundreds of densely sampled frames.
HiMu provides a fully interpretable trace of its selection decisions. Each frame's relevance can be traced back through the logic tree to individual expert scores, revealing why specific frames were selected and which modality contributed most.
@article{benami2026himu,
title = {HiMu: Hierarchical Multimodal Frame Selection
for Long Video Question Answering},
author = {Ben-Ami, Dan and Serussi, Gabriele and
Cohen, Kobi and Baskin, Chaim},
journal = {arXiv preprint arXiv:2603.18558},
year = {2026}
}