The 1st Workshop on Computer Vision with Humans in the Loop
@ CVPR 2024, June 18 (invited talks only)


The ultimate goal of computer vision is to enable computers to perceive the world like humans. In the past two decades, computer vision has made phenomenal progress and has even surpassed human performance on a few tasks. However, on many more tasks, computer vision still falls short of human vision, with many scattered issues that are difficult to unify and many long-tailed applications that require either specialized architectures or tedious data effort.

While the research community is still pursuing foundation models, we want to emphasize the importance of having humans in the loop to solve computer vision problems. We have observed many research works addressing vision problems from this perspective. Early works include Lazy Snapping [SIGGRAPH 2004], which leverages user clicks and scribbles to make interactive segmentation possible. Nowadays, mouse clicks and scribbles have become indispensable interactive operations for object segmentation, matting, and many other image and video editing and annotation problems, as in Implicit PointRend [CVPR 2022], FocalClick [CVPR 2022], SAM [ICCV 2023], and Click-Pose [ICCV 2023]. More recently, thanks to advances in language understanding, a new trend has emerged of leveraging language or visual prompts to extend a vision model to long-tail concepts, as in CLIP [ICML 2021], GLIP [CVPR 2022], Grounding DINO [arXiv 2023], SEEM [NeurIPS 2023], and Matting Anything [arXiv 2023].

Moreover, for real-world applications and from a systems perspective, computer vision systems need to be self-aware, meaning that they need to know when they do not know. Since the ultimate objective of visual perception is to facilitate downstream decisions, such self-awareness is essential: it endows a system with the capability to actively query humans for input or to prompt for human control (as in the Tesla Autopilot scenario). Research in this direction ranges from early work on active learning for more efficient data labeling [ICCV 2009], to solutions for open-set recognition [T-PAMI 2013], to more recent efforts on uncertainty modeling in deep learning, dubbed evidential deep learning [NeurIPS 2018].

Considering both its importance and the recent research trend, we propose to organize a workshop entitled “Computer Vision with Humans in the Loop” to bring together researchers, practitioners, and enthusiasts to explore and discuss the evolving role of human feedback in solving computer vision problems.

Invited Speakers

Jianfeng Gao
Microsoft Research

Jason Corso
University of Michigan

Chao-Yuan Wu

Olga Russakovsky
Princeton University

Jingyi Yu
ShanghaiTech University

James Hays
Georgia Institute of Technology

Michael Rubinstein (tentative)

Serge Belongie (tentative)
University of Copenhagen

Bastian Leibe (tentative)
RWTH Aachen University

Ranjay Krishna
University of Washington

Danna Gurari
University of Colorado Boulder

Workshop Organizers

Lei Zhang
IDEA Research, China

Gang Hua
Wormpex AI Research, USA

Nicu Sebe
University of Trento, Italy

Kristen Grauman
UT Austin, USA

Yasuyuki Matsushita
Osaka University, Japan

Aniruddha Kembhavi
Allen Institute for AI, USA

Ailing Zeng
IDEA Research, China

Jianwei Yang
Microsoft Research, USA

Xi Yin
Meta, USA

Heung-Yeung Shum
HKUST, Hong Kong