CVPR 2024 Workshop on
Computer Vision with Humans in the Loop (CVHL)

Time and Venue

Date:
Tuesday, June 18, 2024

Time:
8:30 AM - 5:15 PM

Room:
Summit 329

Venue:
Seattle Convention Center, Seattle, USA


Overview

The ultimate goal of computer vision is to enable computers to perceive the world like humans. In the past two decades, computer vision has made phenomenal progress and has even surpassed human performance on a few tasks. However, on many more tasks, computer vision still falls short of human vision, with many scattered issues that are difficult to unify and many long-tailed applications that require either specialized architectures or tedious data effort.

While the research community is still pursuing foundation models, we want to emphasize the importance of having humans in the loop to solve computer vision problems. We have observed many research works addressing vision problems from this perspective. Early works include Lazy Snapping [SIGGRAPH 2004], which leverages user clicks and scribbles to make interactive segmentation possible. Nowadays, mouse clicks and scribbles have become indispensable interactive operations for object segmentation, matting, and many other image and video editing and annotation problems, such as Implicit PointRend [CVPR 2022], Focal Click [CVPR 2022], SAM [ICCV 2023], and Click Pose [ICCV 2023]. More recently, thanks to advances in language understanding, a new trend has emerged to leverage language or visual prompts to extend a vision model to cover long-tailed concepts, such as CLIP [ICML 2021], GLIP [CVPR 2022], Grounding DINO [arXiv 2023], SEEM [NeurIPS 2023], and Matting Anything [arXiv 2023].
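To make click-based prompting concrete, here is a minimal sketch of single-click segmentation in the style of SAM, using the open-source segment-anything package; the image path and checkpoint filename below are placeholders rather than files provided by this workshop.

    # Minimal sketch: segment an object from one foreground click (SAM-style point prompt).
    # "example.jpg" and "sam_vit_b.pth" are placeholder paths.
    import numpy as np
    import cv2
    from segment_anything import sam_model_registry, SamPredictor

    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    # One foreground click (label 1) stands in for the user's mouse interaction.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    best_mask = masks[np.argmax(scores)]  # boolean mask with the highest predicted quality

Additional clicks, including negative clicks with label 0, can be appended to point_coords and point_labels to refine the mask interactively.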

Moreover, for real-world applications and from a system perspective, computer vision systems need to be self-aware, meaning that they need to know when they do not know. Since the ultimate objective of visual perception is to facilitate downstream decisions, such self-awareness is very important, as it endows a system with the capability to actively query humans for input or actively prompt for human control (as in the Tesla Autopilot scenario). Research along this line ranges from early work on active learning for more efficient data labeling [ICCV 2009], to efforts advocating solutions to open-set recognition [T-PAMI 2013], to more recent work on uncertainty modeling in deep learning, dubbed evidential deep learning [NeurIPS 2018].
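As a toy illustration of "knowing when it does not know," the sketch below assumes the Dirichlet-based formulation popularized as evidential deep learning: a network outputs non-negative per-class evidence, from which both class probabilities and a scalar uncertainty follow. The deferral threshold is an arbitrary example value, not a recommendation.

    # Sketch of evidential-deep-learning-style uncertainty (Dirichlet / subjective logic view).
    import numpy as np

    def evidential_prediction(evidence: np.ndarray):
        """evidence: non-negative per-class outputs, e.g. relu(logits), shape (K,)."""
        K = evidence.shape[0]
        alpha = evidence + 1.0      # Dirichlet concentration parameters
        S = alpha.sum()             # total Dirichlet strength
        prob = alpha / S            # expected class probabilities
        uncertainty = K / S         # close to 1.0 when the model has seen little evidence
        return prob, uncertainty

    prob, u = evidential_prediction(np.array([0.1, 0.3, 0.2]))  # weak evidence overall
    if u > 0.5:                     # example threshold; tuned per application
        print("Low confidence: ask a human annotator or hand control to the driver.")
    else:
        print("Predicted class:", int(np.argmax(prob)))

A system wired this way can route only its uncertain cases to people, which is precisely the human-in-the-loop behavior this workshop focuses on.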

Considering both its importance and the recent research trend, we organize this workshop, entitled “Computer Vision with Humans in the Loop,” to bring together researchers, practitioners, and enthusiasts to explore and discuss the evolving role of human feedback in solving computer vision problems.


Tentative Schedule (June 18th, Tuesday)

(All talks are scheduled for 30 minutes including Q&A)

8:30 am - 8:35 am PT
Lei Zhang
Opening Remarks
Session Chair: Dongdong Chen
8:35 am - 9:05 am PT
Talk 1: Serge Belongie
Searching for Structure in Unfalsifiable Claims
9:05 am - 9:35 am PT
Talk 2: Jingyi Yu
The Ways We Perceive 3D and How They Affect AIGC
9:35 am - 10:05 am PT
Talk 3: Jianfeng Gao
From LLMs to Multi-Modal Agents
10:05 am - 10:45 am PT
Coffee Break
10:45 am - 11:15 am PT
Talk 4: James Hays
Humans in the Loop in Personalization and Geolocalization
11:15 am - 12:00 pm PT
Panel 1: Serge Belongie, Jingyi Yu, Jianfeng Gao, James Hays
Moderator: Dongdong Chen
AIGC: Hallucination vs. Intelligence
12:00 pm - 1:30 pm PT
Lunch Time
Session Chair: Lei Zhang
1:30 pm - 2:00 pm PT
Talk 5: Christoph Feichtenhofer
How to collect large-scale image-language data from the web
2:00 pm - 2:30 pm PT
Talk 6: Danna Gurari
Predicting When to Engage Humans to Efficiently Collect High-Quality Image and Video Annotations
2:30 pm - 3:00 pm PT
Talk 7: Jason Corso
Hazy Oracles in Human+AI Collaboration
3:00 pm - 3:45 pm PT
Coffee Break
3:45 pm - 4:15 pm PT
Talk 8: Ranjay Krishna
Humans AI-in-the-loop
4:15 pm - 5:15 pm PT
Panel 2: Christoph Feichtenhofer, Danna Gurari, Jason Corso, Ranjay Krishna
Moderator: Lei Zhang
Collaborative Vision: The Synergy of Human Insight and AI Sight

Talk Details

Serge Belongie

Talk Title:
Searching for Structure in Unfalsifiable Claims
Abstract:
While advances in automated fact-checking are critical in the fight against the spread of misinformation in social media, we argue that more attention is needed in the domain of unfalsifiable claims. In this talk, we outline some promising directions for identifying the prevailing narratives in shared content (image & text) and explore how the associated learned representations can be used to identify misinformation campaigns and sources of polarization.
Bio:
Serge Belongie is a professor of Computer Science at the University of Copenhagen, where he also serves as the head of the Pioneer Centre for Artificial Intelligence (P1). Previously, he was a professor of Computer Science at Cornell University, an Associate Dean at Cornell Tech, and a member of the Visiting Faculty program at Google. His research interests include Computer Vision, Machine Learning, Augmented Reality, and Human-in-the-Loop Computing. He is also a co-founder of several companies including Digital Persona and Anchovi Labs. He is a recipient of the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review “Innovators Under 35” Award, and the Helmholtz Prize for fundamental contributions in Computer Vision. He is a member of the Royal Danish Academy of Sciences and Letters and serves on the board of the European Laboratory for Learning and Intelligent Systems (ELLIS).

Jingyi Yu

Talk Title:
The Ways We Perceive 3D and How They Affect AIGC
Abstract:
We humans perceive the 3D world via visual cues (e.g., stereo parallax, motion parallax, shading) and scene understanding (e.g., shapes, materials). For decades, 3D vision has largely aligned with these mechanisms and sought to explicitly recover depth maps, normal maps, albedo, etc. In contrast, recent neural rendering (NR) techniques, ranging from NeRF to 3DGS, employ "ancient" representations (volumes, point clouds, tensors, etc.) and have achieved unprecedented visual fidelity. In this talk, I discuss how imperfect or even poor 3D geometry induced or used by NR manages to achieve high rendering quality, by drawing an analogy to the visual fixation mechanism in human vision. These 3D representations have further led to drastically different 3D generation methods, each with unique benefits and limitations. Finally, I discuss how these insights may lead to future 3D representations suitable for machine vision in the era of embodied AI.
Bio:
Jingyi Yu is an OSA Fellow, an IEEE Fellow, and an ACM Distinguished Scientist. He received his B.S. with honors from Caltech in 2000 in Computer Science and Applied Mathematics and his Ph.D. from MIT in EECS in 2005. He is currently the Vice Provost of ShanghaiTech University and the Executive Dean of its School of Information Science and Technology. Dr. Yu has worked extensively on computational imaging, computer vision, computer graphics, and bioinformatics. He has received the Magnolia Memorial Award, the NSF CAREER Award, and the Air Force Young Investigator Award. He holds over 10 PCT patents on AI-driven computational imaging solutions, many of which have been widely deployed in smart cities, digital humans, human-computer interaction, etc. He has served as an Associate Editor of IEEE TPAMI, IEEE TIP, and Elsevier CVIU, as well as program chair of several top conferences, including ICCP 2016, ICPR 2020, WACV 2021, IEEE CVPR 2021, and ICCV 2025. He is also a member of the World Economic Forum’s Global Future Council, serving as a Curator of the Metaverse Transformation Map.

Jianfeng Gao

Talk Title:
From LLMs to Multi-Modal Agents
Abstract:
In this talk, I will start with a review of the success of LLMs and then discuss how we build LLM-powered multi-modal agents. I will discuss the challenges of deploying large foundation models in real-world applications, such as modeling cost, hallucination, and self-improvement, and present ongoing research on addressing these challenges.
Bio:
Jianfeng Gao is a Distinguished Scientist & Vice President at Microsoft and an IEEE Fellow, ACM Fellow, and AAIA Fellow. He leads the Deep Learning Group at Microsoft Research. The group’s mission is to advance the state of the art in deep learning and its application to natural language and image understanding, and to make progress on conversational models and methods.

James Hays

Talk Title:
Humans in the Loop in Personalization and Geolocalization
Bio:
James Hays has been an associate professor in the School of Interactive Computing at the Georgia Institute of Technology since 2015. Previously, he was the Manning Assistant Professor of Computer Science at Brown University. He is also the director of perception research at Overland AI, a startup focused on off-road autonomy, and was a principal scientist at the self-driving vehicle startup Argo AI from 2017 to 2022. He was a postdoc at the Massachusetts Institute of Technology, received his Ph.D. from Carnegie Mellon University, and received his B.S. from the Georgia Institute of Technology. His research interests span computer vision, robotics, and machine learning. His research often involves finding new data sources to exploit (e.g., geotagged imagery, thermal imagery) or creating new datasets where none existed (e.g., human sketches, HD maps). He is the recipient of the NSF CAREER Award, the Sloan Fellowship, and the PAMI Mark Everingham Prize.

Christoph Feichtenhofer

Talk Title:
How to collect large-scale image-language data from the web
Bio:
Christoph Feichtenhofer is a Research Scientist Manager at Meta AI (FAIR). He received the BSc, MSc and PhD degrees (all with distinction) in computer science from TU Graz in 2011, 2013 and 2017, and spent time as a visiting researcher at York University, Toronto as well as the University of Oxford. He is a recipient of the PAMI Young Researcher Award, the DOC Fellowship of the Austrian Academy of Sciences, and the Award of Excellence for outstanding doctoral theses in Austria. His main areas of research include the development of effective representations for image and video understanding.

Danna Gurari

Talk Title:
Predicting When to Engage Humans to Efficiently Collect High-Quality Image and Video Annotations
Bio:
Danna Gurari is an Assistant Professor and Director of the Image and Video Computing group in the Computer Science Department at the University of Colorado Boulder. Her research interests span computer vision, machine learning, human computation, crowdsourcing, human-computer partnerships, accessibility, and (bio)medical image analysis. Her group focuses on creating computing systems that enable and accelerate the analysis of visual information. She and her work have been recognized with research awards from WACV, CHI, CSCW, ASIS&T, HCOMP GroupSight, MICCAI IMIC, and AAPM. Gurari's research has been supported by the National Science Foundation, the Silicon Valley Community Foundation's Chan Zuckerberg Initiative, Microsoft, Adobe, and Amazon.

Jason Corso

Talk Title:
Hazy Oracles in Human+AI Collaboration
Abstract:
This talk explores the evolving dynamics of human+AI collaboration, focusing on the concept of the human as a "hazy oracle" rather than an infallible source. It outlines the journey of integrating AI systems more deeply into practical applications through human+AI cooperation, discussing the potential value and challenges. The discussion includes the modeling of interaction errors and the strategic choices between immediate AI inference or seeking additional human input, supported by results from a user study on optimizing these collaborations.
Bio:
Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / CSO of the AI startup Voxel51. He received his PhD and MSE degrees from The Johns Hopkins University in 2005 and 2002, respectively, and his BS degree with honors from Loyola College in Maryland in 2000, all in Computer Science. He is the recipient of a University of Michigan EECS Outstanding Achievement Award (2018), a Google Faculty Research Award (2015), an Army Research Office Young Investigator Award (2010), a National Science Foundation CAREER Award (2009), a SUNY Buffalo Young Investigator Award (2011), and a Link Foundation Fellowship in Advanced Simulation and Training (2003), and he was a member of the 2009 DARPA Computer Science Study Group. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, and MAA, and a senior member of the IEEE.

Ranjay Krishna

Talk Title:
Humans AI-in-the-loop
Bio:
Ranjay Krishna is an Assistant Professor at the Paul G. Allen School of Computer Science & Engineering. His research lies at the intersection of computer vision and human-computer interaction. This research has received best paper awards, outstanding paper awards, and oral presentations at CVPR, ACL, CSCW, NeurIPS, UIST, and ECCV, and has been reported on by Science, Forbes, the Wall Street Journal, and PBS NOVA. His research has been supported by Google, Amazon, Cisco, Toyota Research Institute, NSF, ONR, and Yahoo. He holds bachelor's degrees in Electrical & Computer Engineering and in Computer Science from Cornell University, and a master's degree and a Ph.D. in Computer Science from Stanford University.


Invited Speakers



Jianfeng Gao
Microsoft Research



Jason Corso
University of Michigan



Christoph Feichtenhofer
Meta



Jingyi Yu
ShanghaiTech University



James Hays
Georgia Institute of Technology



Serge Belongie
University of Copenhagen



Ranjay Krishna
University of Washington



Danna Gurari
University of Colorado Boulder


Workshop Organizers



Lei Zhang
IDEA Research, China



Gang Hua
Wormpex AI Research, USA



Nicu Sebe
University of Trento, Italy



Kristen Grauman
UT Austin, USA



Yasuyuki Matsushita
Osaka University, Japan



Aniruddha Kembhavi
Allen Institute for AI, USA



Ailing Zeng
IDEA Research, China



Jianwei Yang
Microsoft Research, USA



Xi Yin
Meta, USA



Dongdong Chen
Microsoft, USA



Heung-Yeung Shum
HKUST, Hong Kong