We've hit a wall on the latency vs. resolution trade-off and I'd love to hear some battle-tested opinions from the CV/embedded community.
The constraint: We need HD resolution to detect small targets at range, but running inference on full HD frames kills our control loop frequency (target: <20 ms glass-to-motor response).
We are debating two architectural paths:
Option A: Static Tiling (SAHI-style). Slice the HD frame into overlapping tiles and run inference on each tile (rough sketch below).
Pro: High detection probability for small objects.
Con: Even with NMS-free architectures, the inference time on the DSP effectively triples. Latency spikes cause our Proportional Navigation guidance to oscillate.
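For concreteness, a minimal sketch of the tiling path (Python; `run_inference` is a hypothetical per-tile detector call, and the tile size/overlap are just illustrative, not our actual numbers):

```python
# Minimal sketch of static tiling, not production code.
# Assumes a 1920x1080 numpy frame and a hypothetical
# run_inference(crop) -> [(x, y, w, h, score), ...] in crop coordinates.

def tile_starts(length, tile, step):
    # Regular grid of tile origins, plus one extra so the last tile touches the edge.
    starts = list(range(0, max(length - tile, 0) + 1, step))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

def detect_tiled(frame, run_inference, tile=640, overlap=0.2):
    # Slice the HD frame into overlapping tiles, run the detector per tile,
    # and shift each box back into full-frame coordinates.
    h, w = frame.shape[:2]
    step = int(tile * (1 - overlap))
    detections = []
    for y0 in tile_starts(h, tile, step):
        for x0 in tile_starts(w, tile, step):
            crop = frame[y0:y0 + tile, x0:x0 + tile]
            for (x, y, bw, bh, score) in run_inference(crop):
                detections.append((x + x0, y + y0, bw, bh, score))
    return detections  # cross-tile merging / NMS would go here
```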
Option B: Dynamic ROI ("The Sniper Approach"). Run a low-res global search (320x320) at high FPS. Once a target is found, lock a dynamic high-res Region of Interest (ROI) from the raw camera stream and run inference only on that crop (rough sketch below).
Pro: Extremely fast. Keeps the loop tight.
Con: Single Point of Failure. If the tracker (Kalman Filter) loses the crop due to abrupt ego-motion, we are blind until global search re-acquires. In a terminal phase intercept, that’s a miss.
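And a minimal sketch of the ROI loop (Python; `detect_lowres` / `detect_crop` are hypothetical detector calls returning a target center or None, and the alpha-beta predictor is a simplified stand-in for the Kalman filter):

```python
# Minimal sketch of the dynamic-ROI loop, not production code.
import numpy as np

class AlphaBetaTracker:
    # Constant-velocity predictor for the ROI center (stand-in for a KF).
    def __init__(self, alpha=0.85, beta=0.005):
        self.alpha, self.beta = alpha, beta
        self.pos = None                     # np.array([cx, cy]) once initialised
        self.vel = np.zeros(2)

    def predict(self, dt):
        if self.pos is not None:
            self.pos = self.pos + self.vel * dt
        return self.pos

    def update(self, measurement, dt):
        z = np.asarray(measurement, dtype=float)
        if self.pos is None:
            self.pos = z
            return
        residual = z - self.pos
        self.pos = self.pos + self.alpha * residual
        self.vel = self.vel + (self.beta / dt) * residual

def roi_around(center, frame_shape, size=512):
    # Clamp a size x size crop around the predicted center to the frame bounds.
    h, w = frame_shape[:2]
    x0 = int(min(max(center[0] - size // 2, 0), w - size))
    y0 = int(min(max(center[1] - size // 2, 0), h - size))
    return x0, y0, size

def step(frame_hd, tracker, dt, misses, detect_lowres, detect_crop, max_misses=5):
    # One tick: try the predicted high-res ROI first; after too many misses
    # (or before first lock), fall back to the cheap low-res global search.
    if tracker.pos is not None and misses < max_misses:
        tracker.predict(dt)
        x0, y0, size = roi_around(tracker.pos, frame_hd.shape)
        hit = detect_crop(frame_hd[y0:y0 + size, x0:x0 + size])
        if hit is not None:
            tracker.update((hit[0] + x0, hit[1] + y0), dt)
            return tracker.pos, 0
        return tracker.pos, misses + 1      # blind tick: coast on the prediction
    hit = detect_lowres(frame_hd)           # e.g. frame resized to 320x320 internally
    if hit is not None:
        tracker.update(hit, dt)
        return tracker.pos, 0
    return None, misses + 1
```

The painful knob is max_misses: how long to coast on the prediction before paying for a full re-acquire.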
Has anyone here successfully implemented robust Dynamic ROI on edge silicon (Jetson/Hexagon DSP) for erratic targets? Are we over-engineering this, or is full-frame HD inference simply dead on arrival for real-time guidance?
Any pointers to papers or repos are appreciated.
PS: If you live for these kinds of problems (and enjoy solving them in Munich), we are looking for a Founding Engineer to own this entire pipeline. Email in profile.