Abstract: The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods rely heavily on separately pre-trained feature extractors for visual and ...
Abstract: Accurate and efficient multi-task perception remains a core challenge in autonomous driving, particularly under real-world constraints such as limited computational resources and dynamic ...