The 1st Workshop on Computer Vision with Humans in the Loop
@ CVPR 2024, June 18 (invited talks only)
Overview
The ultimate goal of computer vision is to enable computers to perceive the world like humans. In the past two decades, computer vision has
made phenomenal progress and has even reached or surpassed human performance on a few tasks. However, on many more tasks, computer vision still falls short of
human vision, with many scattered issues that are difficult to unify and many long-tailed applications that require either special architectures
or tedious data effort.
While the research community continues to pursue foundation models, we want to emphasize the importance of keeping humans in the loop
when solving computer vision problems. Many research works have addressed vision problems from this perspective.
Early works include Lazy Snapping [SIGGRAPH 2004], which leverages user clicks and scribbles to make interactive segmentation
possible. Nowadays, mouse clicks and scribbles have become indispensable interactive operations for object segmentation,
matting, and many other image and video editing and annotation problems, as in Implicit PointRend [CVPR 2022],
FocalClick [CVPR 2022], SAM [ICCV 2023], and Click-Pose [ICCV 2023].
More recently, thanks to advances in language
understanding, a new trend has emerged: leveraging language or visual prompts to extend a vision model to cover long-tail concepts,
as in CLIP [ICML 2021], GLIP [CVPR 2022], Grounding DINO [arXiv 2023], SEEM [NeurIPS 2023], and Matting Anything [arXiv 2023].
Moreover, for real-world applications and from a system perspective, computer vision systems need to be self-aware, meaning that
they need to know when they do not know. Since the ultimate objective of visual perception is to facilitate downstream
decisions, such self-awareness is crucial: it endows a system with the capability to actively query humans for input or
to prompt for human control (as in the Tesla Autopilot scenario). Research in this direction ranges from early work on active learning
for more efficient data labeling [ICCV 2009], to solutions for open-set recognition [T-PAMI 2013],
to more recent efforts in uncertainty modeling for deep learning, dubbed evidential deep learning [NeurIPS 2018].
Considering both its importance and the recent research trend, we propose to organize a workshop entitled “Computer Vision with Humans in the Loop”
to bring together researchers, practitioners, and enthusiasts to explore and discuss the evolving role of human feedback in solving computer vision problems.