Hierarchical Multimodal Frame Selection
for Long Video Question Answering
Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost.
We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve.
Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency–accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32–512 frames while requiring roughly 10× fewer FLOPs.
HiMu operates in four stages: it decomposes the query into a hierarchical logic tree, grounds each atomic predicate with a specialized expert, composes signals bottom-up through fuzzy-logic operators, and selects the most informative frames via Peak-And-Spread Selection.
A single LLM call parses the question into a hierarchical logic tree of atomic predicates connected by AND, OR, SEQ, and RIGHT_AFTER operators.
Each predicate is routed to a specialized lightweight expert: CLIP for appearance, YOLO-World for objects, OCR for text, ASR for speech, CLAP for audio.
Expert signals are normalized, temporally smoothed with bandwidth-matched filters, and composed bottom-up via fuzzy-logic operators into a satisfaction curve.
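The composition step can be sketched with standard Gödel fuzzy-logic operators over per-frame score curves. The operator choices below (AND as pointwise min, OR as pointwise max, SEQ as a causal running max of the first operand) are an illustrative sketch, not necessarily HiMu's exact formulation; the toy predicate curves are invented for demonstration.

```python
# Sketch of fuzzy-logic composition over per-frame expert scores in [0, 1].
# AND/OR follow standard Gödel semantics; SEQ here means "b holds at t AND
# a held at some earlier or equal t'", via a causal running max of a.
import numpy as np

def f_and(a, b):
    return np.minimum(a, b)

def f_or(a, b):
    return np.maximum(a, b)

def f_seq(a, b):
    return np.minimum(np.maximum.accumulate(a), b)

# Toy predicate curves over 6 frames (invented for illustration).
dog  = np.array([0.9, 0.2, 0.1, 0.1, 0.1, 0.1])
ball = np.array([0.0, 0.1, 0.1, 0.8, 0.9, 0.2])

sat = f_seq(dog, ball)  # "dog appears, then ball appears"
print(sat)              # → [0.  0.1 0.1 0.8 0.9 0.2]
```

Note how the running max lets the "dog" evidence persist forward in time, so the satisfaction curve peaks only where the ball appears after the dog has been seen.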
Peak-And-Spread Selection identifies the top-K most relevant frames, balancing peak satisfaction with temporal diversity across the video.
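A minimal way to balance peak satisfaction with temporal diversity is greedy peak picking with non-maximum suppression: take the highest-scoring frame, mask a temporal neighborhood around it, and repeat. The suppression radius below is an illustrative knob, not HiMu's exact rule.

```python
# Hypothetical sketch of a peak-plus-spread frame selector: greedily pick
# the highest-satisfaction frame, then suppress its temporal neighborhood
# so subsequent picks spread across the video.
import numpy as np

def peak_and_spread(sat, k, radius=1):
    sat = sat.astype(float).copy()
    picks = []
    for _ in range(k):
        t = int(np.argmax(sat))
        picks.append(t)
        lo, hi = max(0, t - radius), min(len(sat), t + radius + 1)
        sat[lo:hi] = -np.inf  # non-maximum suppression around the peak
    return sorted(picks)

curve = np.array([0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.95, 0.3])
print(peak_and_spread(curve, k=3))  # → [1, 3, 6]
```

Without the suppression step, the top-3 frames would be 6, 1, and 2 — two near-duplicate neighbors — whereas the spread-aware picks cover three distinct regions of the curve.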
HiMu decomposes complex, compositional video questions into structured logic trees. Each leaf is an atomic predicate grounded by a specialized expert, while internal nodes enforce temporal and logical relationships.
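As a concrete illustration, a compositional query might parse into a tree like the one below. The schema (field names, nesting) is an assumption for illustration; only the operator vocabulary (AND, OR, SEQ, RIGHT_AFTER) and the expert names come from the description above.

```python
# Hypothetical logic-tree parse for a compositional audio-visual query.
# Each leaf is an atomic predicate routed to one lightweight expert;
# the internal node enforces a temporal-adjacency relation.
query = "What does the chef say right after the oven timer beeps?"

tree = {
    "op": "RIGHT_AFTER",
    "children": [
        {"predicate": "oven timer beeping sound", "expert": "CLAP"},  # audio event
        {"predicate": "chef speaking",            "expert": "ASR"},   # speech
    ],
}

# Leaves are grounded independently; the RIGHT_AFTER node then rewards
# frames where the second predicate fires just after the first.
print([child["expert"] for child in tree["children"]])
```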
| Method | Model | VMME Short | VMME Med. | VMME Long | VMME Overall | LVB-val | HERBench-Lite |
|---|---|---|---|---|---|---|---|
| Uniform Sampling | Qwen3-VL-8B | 76.34 | 66.31 | 55.58 | 66.36 | 55.74 | 41.70 |
| BOLT | Qwen3-VL-8B | 69.58 | 67.87 | 68.73 | 68.74 | 54.55 | 42.20 |
| T* | Qwen3-VL-8B | 73.66 | 67.39 | 68.12 | 69.77 | 57.49 | 39.10 |
| AKS | Qwen3-VL-8B | 70.05 | 65.10 | 68.73 | 67.98 | 57.14 | 40.25 |
| HiMu (Ours) | Qwen3-VL-8B | 78.55 | 71.00 | 69.90 | 73.22 | 64.19 | 43.22 |
Comparison on Video-MME (Short/Medium/Long splits and Overall), LongVideoBench-val, and HERBench-Lite; all scores are accuracy (%). All methods use K=16 frames with Qwen3-VL-8B.
HiMu with GPT-4o at 16 frames achieves 78.18% on Video-MME, surpassing VideoChat-A1 at 384 frames (77.2%) and VSLS at 32 frames (63.0%). This shows that a small set of precisely localized, compositionally verified frames is more effective than expanding the context window with hundreds of densely sampled frames.
HiMu provides a fully interpretable trace of its selection decisions. Each frame's relevance can be traced back through the logic tree to individual expert scores, revealing why specific frames were selected and which modality contributed most.
@article{benami2026himu,
title = {HiMu: Hierarchical Multimodal Frame Selection
for Long Video Question Answering},
author = {Ben-Ami, Dan and Serussi, Gabriele and
Cohen, Kobi and Baskin, Chaim},
journal = {arXiv preprint arXiv:2603.18558},
year = {2026}
}