The 1st Workshop on Computer Vision with Humans in the Loop
@ CVPR 2024, June 18

Overview

The ultimate goal of computer vision is to enable computers to perceive the world as humans do. Over the past two decades, computer vision has made phenomenal progress and has even surpassed human performance on a few tasks. On many more tasks, however, computer vision still falls short of human vision: many scattered issues are difficult to unify, and many long-tailed applications require either special architectures or tedious data effort.

While the research community continues to pursue foundation models, we want to emphasize the importance of having humans in the loop to solve computer vision problems. Many research works have addressed vision problems from this perspective. Early works include Lazy Snapping [SIGGRAPH 2004], which leverages user clicks and scribbles to make interactive segmentation possible. Today, mouse clicks and scribbles have become indispensable interactive operations for object segmentation, matting, and many other image and video editing and annotation problems, as in Implicit PointRend [CVPR 2022], Focal Click [CVPR 2022], SAM [ICCV 2023], and Click Pose [ICCV 2023]. More recently, thanks to advances in language understanding, a new trend has emerged of leveraging language or visual prompts to extend a vision model to cover long-tail concepts, as in CLIP [ICML 2021], GLIP [CVPR 2022], Grounding DINO [arXiv 2023], SEEM [NeurIPS 2023], and Matting Anything [arXiv 2023].

Moreover, for real-world applications and from a system perspective, computer vision systems need to be self-aware, meaning that they need to know when they do not know. Since the ultimate objective of visual perception is to facilitate downstream decisions, such self-awareness is very important: it endows the systems with the capability to actively query humans for input or to actively prompt for human control (as in the Tesla Autopilot scenario). Early research works include active learning for more efficient data labeling [ICCV 2009], followed by efforts advocating solutions to open-set recognition [T-PAMI 2013] and, more recently, uncertainty modeling in deep learning, known as evidential deep learning [NeurIPS 2018].

Considering both its importance and the recent research trend, we propose to organize a workshop entitled “Computer Vision with Humans in the Loop” to bring together researchers, practitioners, and enthusiasts to explore and discuss the evolving role of human feedback in solving computer vision problems.


Invited Speakers



Jianfeng Gao
Microsoft Research



Jason Corso
University of Michigan



Chao-Yuan Wu
Meta



Olga Russakovsky
Princeton University



Jingyi Yu
ShanghaiTech University



James Hays
Georgia Institute of Technology



Michael Rubinstein (tentative)
Google



Serge Belongie (tentative)
University of Copenhagen



Bastian Leibe (tentative)
RWTH Aachen University



Ranjay Krishna
University of Washington



Danna Gurari
University of Colorado Boulder


Workshop Organizers



Lei Zhang
IDEA Research, China



Gang Hua
Wormpex AI Research, USA



Nicu Sebe
University of Trento, Italy



Kristen Grauman
UT Austin, USA



Yasuyuki Matsushita
Osaka University, Japan



Aniruddha Kembhavi
Allen Institute for AI, USA



Ailing Zeng
IDEA Research, China



Jianwei Yang
Microsoft Research, USA



Xi Yin
Meta, USA



Heung-Yeung Shum
HKUST, Hong Kong