HiMu

Hierarchical Multimodal Frame Selection
for Long Video Question Answering

Dan Ben-Ami¹, Gabriele Serussi¹, Kobi Cohen², Chaim Baskin¹
¹INSIGHT Lab, Ben-Gurion University of the Negev, Israel   ²Ben-Gurion University of the Negev, Israel
HiMu bridges the gap between fast but shallow similarity methods and accurate but expensive agent-based approaches through hierarchical neuro-symbolic composition of multimodal experts.
73.22% Video-MME accuracy (Qwen3-VL-8B, 16 frames)
~10× fewer FLOPs vs. agentic methods
Training-free: no finetuning, plug-and-play with any LVLM
16 frames beats competing methods using 32–512 frames

Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost.

We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve.

Evaluations on Video-MME, LongVideoBench, and HERBench-Lite show that HiMu advances the efficiency–accuracy Pareto front: at 16 frames with Qwen3-VL-8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32–512 frames while requiring roughly 10× fewer FLOPs.

Method

HiMu operates in four stages: it decomposes the query into a hierarchical logic tree, grounds each atomic predicate with a specialized expert, composes signals bottom-up through fuzzy-logic operators, and selects the most informative frames via Peak-And-Spread Selection.

HiMu pipeline: Query Decomposition → Expert Grounding → Fuzzy Logic Composition → PASS Frame Selection
01

Query Decomposition

A single LLM call parses the question into a hierarchical logic tree of atomic predicates connected by AND, OR, SEQ, and RIGHT_AFTER operators.
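The paper does not publish its tree schema; a minimal sketch of such a structure, with hypothetical `Leaf`/`Node` classes (the operator vocabulary AND, OR, SEQ, RIGHT_AFTER is from the text), might look like:

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    predicate: str   # atomic predicate text, e.g. "a black dog"
    expert: str      # expert that grounds it, e.g. "CLIP", "OCR", "ASR"

@dataclass
class Node:
    op: str                        # "AND" | "OR" | "SEQ" | "RIGHT_AFTER"
    children: list = field(default_factory=list)

# One possible decomposition of "What does the black dog do in the water?":
tree = Node("AND", [Leaf("a black dog", "CLIP"),
                    Leaf("water", "CLIP")])
```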

02

Expert Grounding

Each predicate is routed to a specialized lightweight expert: CLIP for appearance, YOLO-World for objects, OCR for text, ASR for speech, CLAP for audio.
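A minimal routing step could be a lookup from predicate modality to expert. The expert names come from the text; the modality keys and the `route` helper are illustrative stand-ins for the actual router:

```python
# Expert names are from the paper; the modality keys and `route`
# are illustrative, not the real routing logic.
EXPERTS = {
    "appearance":     "CLIP",
    "object":         "YOLO-World",
    "on_screen_text": "OCR",
    "speech":         "ASR",
    "audio_event":    "CLAP",
}

def route(modality: str) -> str:
    """Return the expert responsible for grounding a predicate."""
    return EXPERTS[modality]
```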

03

Fuzzy Logic Composition

Expert signals are normalized, temporally smoothed with bandwidth-matched filters, and composed bottom-up via fuzzy-logic operators into a satisfaction curve.
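One standard instantiation of fuzzy AND/OR is the Gödel t-norm/t-conorm (elementwise min/max); the moving-average smoothing and the SEQ gating below are simplified assumptions, not the paper's exact operators:

```python
import numpy as np

def smooth(scores, k=5):
    # Moving-average stand-in for bandwidth-matched temporal smoothing.
    return np.convolve(scores, np.ones(k) / k, mode="same")

def fuzzy_and(*curves):
    return np.minimum.reduce(curves)   # Gödel t-norm

def fuzzy_or(*curves):
    return np.maximum.reduce(curves)   # Gödel t-conorm

def fuzzy_seq(a, b):
    # "b happens after a": gate b by the best evidence for a so far.
    return np.minimum(np.maximum.accumulate(a), b)
```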

04

PASS Selection

Peak-And-Spread Selection identifies the top-K most relevant frames, balancing peak satisfaction with temporal diversity across the video.
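The PASS criterion is not specified in detail here; a greedy approximation that trades peak height against temporal spread could look like the following, where the suppression window `min_gap` is an assumed knob:

```python
import numpy as np

def pass_select(curve, k=16, min_gap=8):
    # Greedy sketch of peak-and-spread selection: repeatedly take the
    # highest-scoring remaining frame, then suppress a window around it
    # so the k selections spread across the video.
    curve = np.asarray(curve, dtype=float).copy()
    picks = []
    for _ in range(min(k, len(curve))):
        i = int(np.argmax(curve))
        if curve[i] == -np.inf:
            break                       # everything left is suppressed
        picks.append(i)
        lo, hi = max(0, i - min_gap), min(len(curve), i + min_gap + 1)
        curve[lo:hi] = -np.inf
    return sorted(picks)
```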

Hierarchical Logic Trees

HiMu decomposes complex, compositional video questions into structured logic trees. Each leaf is an atomic predicate grounded by a specialized expert, while internal nodes enforce temporal and logical relationships.

Examples of hierarchical logic trees generated from Video-MME questions
Logic trees generated from Video-MME questions. Leaves are routed to modality-specific experts (CLIP, OVD, OCR, ASR, CLAP), and composed via AND, OR, SEQ operators.
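Bottom-up evaluation of such a tree over per-frame expert scores can be sketched as follows; the tuple-encoded nodes and the min/max (Gödel) operators are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def evaluate(node, leaf_scores):
    """Compose a satisfaction curve bottom-up.

    node: a predicate string (leaf) or a tuple ("AND"|"OR", [children]);
    leaf_scores maps each predicate to its per-frame score array.
    """
    if isinstance(node, str):
        return leaf_scores[node]
    op, children = node
    curves = [evaluate(c, leaf_scores) for c in children]
    reducer = np.minimum.reduce if op == "AND" else np.maximum.reduce
    return reducer(curves)

tree = ("AND", ["black dog", "in water"])
scores = {"black dog": np.array([0.9, 0.2]),
          "in water":  np.array([0.4, 0.8])}
curve = evaluate(tree, scores)   # -> array([0.4, 0.2])
```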

Results

HiMu advances the Pareto front of accuracy vs compute on Video-MME
HiMu advances the efficiency–accuracy Pareto front on Video-MME. It achieves higher accuracy than similarity-based methods while using ~10× fewer FLOPs than agentic approaches.
Method            Model        Short  Med.   Long   Overall  LVB (val)  HERBench-Lite
Uniform Sampling  Qwen3-VL-8B  76.34  66.31  55.58  66.36    55.74      41.70
BOLT              Qwen3-VL-8B  69.58  67.87  68.73  68.74    54.55      42.20
T*                Qwen3-VL-8B  73.66  67.39  68.12  69.77    57.49      39.10
AKS               Qwen3-VL-8B  70.05  65.10  68.73  67.98    57.14      40.25
HiMu (Ours)       Qwen3-VL-8B  78.55  71.00  69.90  73.22    64.19      43.22

Comparison on Video-MME, LongVideoBench (val), and HERBench-Lite. All methods use K=16 frames with Qwen3-VL-8B.

HiMu with GPT-4o at 16 frames achieves 78.18% on Video-MME, surpassing VideoChat-A1 at 384 frames (77.2%) and VSLS at 32 frames (63.0%). Precisely localized, compositionally verified frames thus prove more effective than expanding the context window with hundreds of densely sampled frames.

Interpretability

HiMu provides a fully interpretable trace of its selection decisions. Each frame's relevance can be traced back through the logic tree to individual expert scores, revealing why specific frames were selected and which modality contributed most.

Interpretability heatmap showing per-expert scores across video frames with the hierarchical logic tree structure
Interpretability visualization for the question "What does the black dog do in the water?" Each row shows one expert's per-frame scores as they flow through the logic tree; the selected frames (bottom) correspond to peaks in the composed satisfaction curve.

Citation

@article{benami2026himu,
  title     = {HiMu: Hierarchical Multimodal Frame Selection
               for Long Video Question Answering},
  author    = {Ben-Ami, Dan and Serussi, Gabriele and
               Cohen, Kobi and Baskin, Chaim},
  journal   = {arXiv preprint arXiv:2603.18558},
  year      = {2026}
}