The 1st Workshop on Computer Vision with Humans in the Loop
@ CVPR 2024, June 18
The ultimate goal of computer vision is to enable computers to perceive the world as humans do. In the past two decades, computer vision has made phenomenal progress and has even reached or surpassed human performance on a few tasks. However, on many more tasks, computer vision still falls short of human vision, with many scattered issues that are difficult to unify and many long-tailed applications that require either specialized architectures or tedious data-collection effort.
While the research community is still pursuing foundation models, we want to emphasize the importance of having humans in the loop to solve computer vision problems. We have observed many research works addressing vision problems from this perspective. Early works include Lazy Snapping [SIGGRAPH 2004], which leverages user clicks and scribbles to make interactive segmentation possible. Nowadays, mouse clicks and scribbles have become indispensable interactive operations for object segmentation, matting, and many other image and video editing and annotation problems, as in Implicit PointRend [CVPR 2022], FocalClick [CVPR 2022], SAM [ICCV 2023], and Click-Pose [ICCV 2023]. More recently, thanks to advances in language understanding, a new trend has emerged that leverages language or visual prompts to extend a vision model to cover long-tail concepts, such as CLIP [ICML 2021], GLIP [CVPR 2022], Grounding DINO [arXiv 2023], SEEM [NeurIPS 2023], and Matting Anything [arXiv 2023].
Moreover, for real-world applications and from a system perspective, computer vision systems need to be self-aware, meaning that the systems need to know when they do not know. Since the ultimate objective of visual perception is to facilitate downstream decisions, such self-awareness is very important, as it endows systems with the capability to actively query humans for input or to prompt for human control (as in the Tesla Autopilot scenario). Research in this direction ranges from early work on active learning for more efficient data labeling [ICCV 2009], to solutions for open-set recognition [T-PAMI 2013], to more recent efforts in uncertainty modeling for deep learning, dubbed evidential deep learning [NeurIPS 2018].
Considering both its importance and the recent research trend, we propose to organize a workshop entitled “Computer Vision with Humans in the Loop” to bring together researchers, practitioners, and enthusiasts to explore and discuss the evolving role of human feedback in solving computer vision problems.