awesome-cvpr-2024

github.com/harpreetsahota204/awesome-cvpr-2024 ↗

🤩 An AWESOME Curated List of Papers, Workshops, Datasets, and Challenges from CVPR 2024

146

GitHub Stars

Curated Resources

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me 👁️💬 vision-language resources from awesome-cvpr-2024"

Installation instructions →

What's inside

👁️💬 Vision-Language

A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
CLIP is a powerful vision-language model for visual recognition. However, fine-tuning it for small downstream tasks with limited labeled samples is challenging. Efficient transfer learning (ETL) methods adapt VLMs with few parameters, but require careful per-task hyperparameter tuning using large validation sets. To overcome this, the authors propose CLAP, a principled approach that adapts linear probing for few-shot learning. CLAP consistently outperforms ETL methods, providing an efficient and robust approach for few-shot adaptation of large vision-language models in realistic settings where hyperparameter tuning with large validation sets is not feasible.
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Alpha-CLIP is an improved version of the CLIP model that focuses on specific regions of interest in images through an auxiliary alpha channel. It can enhance CLIP in different image-related tasks, including 2D and 3D image generation, captioning, and detection. Alpha-CLIP preserves CLIP's visual recognition ability and boosts zero-shot classification accuracy by 4.1% when using foreground masks.
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
CLOVA is a system that leverages large language models (LLMs) to generate programs that can accomplish various visual tasks using off-the-shelf visual tools. To overcome the limitation of fixed tools, CLOVA has a closed-loop framework that includes an inference phase, reflection phase, and learning phase. It also uses a multimodal global-local reflection scheme and three flexible methods to collect real-time training data. CLOVA's learning capability enables it to adapt to new environments, resulting in a 5-20% better performance on VQA, multiple-image reasoning, knowledge tagging, and image editing tasks.
Convolutional Prompting meets Language Models for Continual Learning
The paper introduces ConvPrompt, a novel approach for continual learning in vision transformers. ConvPrompt leverages convolutional prompts and large language models to maintain layer-wise shared embeddings and improve knowledge sharing across tasks. The method improves state-of-the-art by around 3% with significantly fewer parameters. In summary, ConvPrompt is an efficient and effective prompt-based continual learning approach that adapts the model capacity based on task similarity.
Describing Differences in Image Sets with Natural Language
Improved Visual Grounding through Self-Consistent Explanations
This paper presents a strategy called SelfEQ. The aim of SelfEQ is to improve the ability of vision-and-language models to locate specific objects in an image. The proposed strategy involves adding paraphrases generated by a large language model to existing text-image datasets. The model is then fine-tuned to ensure that a phrase and its paraphrase map to the same region in the image. This promotes self-consistency in visual explanations, expands the model's vocabulary, and enhances the quality of object locations highlighted by gradient-based visual explanation methods like GradCAM.

🏆 Challenges

Agriculture-Vision Prize Challenge
The Agriculture-Vision Prize Challenge 2024 encourages the development of algorithms for recognizing agricultural patterns from aerial images and to promote sustainable agriculture practices. Semi-supervised learning techniques will be used to merge two datasets and assess model performance. Prizes are $2,500 for 1st place, $1,500 for 2nd place, and $1,000 for 3rd place.
Building3D Challenge
This challenge utilizes the Building3D dataset, an urban-scale publicly available dataset with over 160,000 buildings from 16 cities in Estonia. Participants must develop algorithms that take point clouds as input and generate wireframe models.
Chalearn Face Anti-spoofing Workshop
Spoofing clues resulting from physical presentation attacks are caused by color distortion, screen moire patterns, and production traces. Forgery clues resulting from digital editing attacks are changes in pixel values. The fifth competition aims to explore common characteristics of these attack clues and promote unified detection algorithms. We have a Unified physical-digital Attack dataset, called UniAttackData, with 1,800 participations, 2 physical and 12 digital attacks, and 29,706 videos.
DataCV Challenge
The DataCV Challenge searches training sets for various targets in object detection. The datasets for the challenge consist of a data source pool, combining multiple existing detection datasets, and a newly introduced target dataset with diverse detection environments recorded across 100 countries. Test set A is publicly available on Github, while test set B is reserved for determining challenge awards. An evaluation server is provided for calculating test accuracy. Ethical considerations have been followed by blurring human faces and vehicle license plates to ensure individual privacy and validating copyright before distributing the datasets.
Grocery Vision
The GroceryVision Dataset is part of the RetailVision Workshop Challenge at CVPR 2024. It has two tracks that use real-world retail data collected in typical grocery store environments. Track 1 focuses on Video and Spatial Temporal Action Localization (TAL and STAL). Participants are provided with 73,683 image-annotation pairs for training, and their performance is evaluated based on frame-mAP for TAL and tube-mAP for STAL. Track 2 is the Multi-modal Product Retrieval (MPR) challenge. Participants must design methods to accurately retrieve product identity by measuring similarity between images and descriptions.
Pixel-level Video Understanding in the Wild
The PVUW challenge includes four tracks: Video Semantic Segmentation (VSS), Video Panoptic Segmentation (VPS), Complex Video Object Segmentation, and Motion Expression guided Video Segmentation[1]. The two new tracks, based on the MOSE and MeViS datasets, aim to foster the development of more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios.

🧩 Segmentation

Amodal Ground Truth and Completion in the Wild
The paper introduces amodal image segmentation which predicts masks for entire objects, including occluded parts. Previous methods used manual annotation, but the authors use 3D data to construct the MP3D-Amodal dataset with authentic amodal ground truth masks. Two architecture variants are explored: a two-stage OccAmodal model and a one-stage SDAmodal model. Their method achieves state-of-the-art performance on amodal segmentation datasets, including COCOA and the new MP3D-Amodal dataset.

📦 3D Vision

ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
ASH generates real-time photorealistic renderings of animatable human avatars using Gaussian splats attached to a deformable mesh template. The skeletal motion is encoded using pose-dependent normal maps, and the dynamic Gaussian parameters are learned using 2D convolutional architectures. This approach surpasses existing real-time human avatar rendering methods and represents a significant step towards producing real-time, high-fidelity, controllable human avatars.
Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes
This paper introduces a new method for generating precise 3D shapes from abstract freehand sketches, without the need for paired sketch-3D data. The approach uses a part-level modeling and alignment framework, which enables sketch modeling and in-position editing. By operating in a low-dimensional implicit latent space and using diffusion models, the approach significantly reduces computational demands and processing time. Overall, the method offers a novel solution for enabling accurate 3D generation from abstract sketches.
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
GaussianAvatars is a new technique for creating customizable photorealistic head avatars using a dynamic 3D representation based on 3D Gaussian splats. This approach allows for precise animation control while maintaining photorealistic rendering. The technique has shown impressive animation capabilities in challenging scenarios, such as reenactments from a driving video, where it outperforms existing techniques by a significant margin.
Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
Open3DSG predicts open-vocabulary 3D scene graphs from point clouds, combining vision-language and large language models. Key ideas include constructing a 3D graph with a GNN, aligning features with CLIP, and using an LLM. It allows querying arbitrary objects and relationships at inference time, and enables open-vocabulary prediction not limited to fixed labels.

📊 Datasets/Benchmarks

Benchmarking and Evaluating Large Video Generation Models
The paper proposes a comprehensive evaluation framework for large video generation models that have grown rapidly. Existing academic metrics are inadequate for evaluating these models trained on massive datasets. The proposed evaluation pipeline comprises prompt curation, objective evaluation, subjective studies, and opinion alignment. The models are evaluated based on 17 objective metrics covering visual quality, content quality, motion quality, and text-caption alignment. Additionally, it provides a comparison table of various video generation models across different metrics and capabilities.
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
ImageNet-D is a new benchmark for evaluating neural network robustness in visual perception tasks. It generates synthetic images with diverse backgrounds, textures, and materials, making it more challenging than other synthetic datasets. Key features include diversified image generation, high visual fidelity, and significant accuracy reduction of various vision models. The benchmark is created by combining object categories and refining through human verification. ImageNet-D is effective in evaluating neural network robustness, as accuracy on it improves with accuracy on ImageNet.
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
The LaMPilot dataset consists of 4,900 human-annotated traffic scenes, each with an instruction (I), an initial state (b), and a set of goal state criteria (G). The dataset is classified by maneuver and scenario types and is divided into training, validation, and testing sets.
MAPLM: A Real-World Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding
The dataset contains 3D point cloud Bird's Eye View and high-resolution panoramic images of various traffic scenarios. It also includes detailed annotations at the feature, lane, and road levels. The dataset is designed for a Q&A task, where models will be evaluated based on their ability to answer questions about the traffic scenes such as the number of lanes, presence of intersections, and data quality.
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
The Polaris dataset, used to train the model, contains 131,020 human judgments from 550 evaluators on the appropriateness of image captions. The dataset is much larger than existing ones and is capable of training image captioning metrics. The captions in Polaris are more diverse, collected from humans and generated by 10 modern image captioning models. This demonstrates the effectiveness and robustness of Polos compared to previous metrics.
SoccerNet Game State Reconstruction
SoccerNet Game State Reconstruction (GSR) is a novel computer vision task involving the tracking and identification of sports players from a single moving camera to construct a video game-like minimap, without any specific hardware worn by the players. SoccerNet-GSR, the released dataset, includes 200 clips with 9.37M pitch localization annotations and 2.36M athlete positions on the pitch with their role, team & jersey number. Furthermore, a new performance metric 'GS-HOTA' is introduced to evaluate GSR methods.

🛠️ Workshop

Computer Vision for Mixed Reality
VR has the potential to revolutionize our interactions. Passthrough techniques like Apple Vision Pro and Quest-3 allow for deeply immersive mixed reality experiences. We focus on capturing real environments with cameras and using AI to augment them with virtual objects. Our call for papers invites research on novel methods for Mixed Reality. Topics include real-time view synthesis, scene understanding, 3D capture, and more.
Dataset Distillation
Full-day workshop on June 17. The workshop will explore the potential of Dataset Distillation (DD) in computer vision applications like face recognition, object detection, image segmentation, and video understanding. DD has the potential to reduce training costs, make AI eco-friendly, and enable research groups with limited resources to engage in state-of-the-art research. The workshop will also cover related topics such as active learning, few-shot learning, generative models, and learning from synthetic data.
Gaze Estimation and Prediction in the Wild
Morning of June 18th. The workshop will cover gaze-based interaction techniques, eye tracking technologies, applications of gaze interaction in various domains, and methodological considerations in gaze-based research. The main objective is to enhance the field of gaze interaction by providing a platform for researchers and practitioners to present their work, exchange ideas, and explore future directions.
Large Scale Holistic Video Understanding
The main objective of the workshop is to establish a video benchmark integrating joint recognition of all the semantic concepts, as a single class label per task is often not sufficient to describe the holistic content of a video. The planned panel discussion with world’s leading experts on this problem will be a fruitful input and source of ideas for all participants. The community is invited to help to extend the HVU dataset that will spur research in video understanding as a comprehensive, multi-faceted problem.
Multimodal Algorithmic Reasoning
Morning of June 17. The Multimodal Algorithmic Reasoning (MAR) 2024 workshop at CVPR 2024 aims to bring together researchers working on neural algorithmic learning, multimodal reasoning, and cognitive models of intelligence1. The workshop will focus on the emerging topic of multimodal algorithmic reasoning, where agents automatically deduce new algorithms for solving real-world tasks, and will also encourage the vision community to build neural networks with human-like intelligence abilities
Representation Learning with Very Limited Images
Afternoon of June 18th. This workshop focuses on developing visual and multi-modal models with limited data resources. It aims to bring together diverse communities that work on approaches such as self-supervised learning with a single image or synthetic pre-training with generated images. The workshop's organizers include researchers from various institutions.

🧨Diffusion

DemoFusion: Democratising High-Resolution Image Generation With No $$$
DemoFusion is an extension that enables the generation of high-res images through an accessible and efficient inference procedure. It uses global-local denoising paths and introduces three techniques for coherent high-res generation: progressive upscaling, skip residual, and dilated sampling. DemoFusion unlocks the potential in existing open-source text-to-image models without additional training or prohibitive costs, democratizing high-res image synthesis.
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
DragDiffusion is a novel method for interactive point-based image editing that enhances the applicability and versatility of the DragGAN framework by extending it to diffusion models. It optimizes the latent of a single diffusion step and introduces techniques to preserve the identity of the original image. The authors present DragBench, the first benchmark dataset for evaluating interactive point-based image editing methods. Experiments demonstrate the effectiveness of DragDiffusion compared to DragGAN, and an ablation study explores key factors.
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
FaceTalk is a novel method to generate 3D motion sequences of talking human heads from audio signals. It employs neural parametric head models with speech signals and a new latent diffusion model. The approach denoises Gaussian noise sequences iteratively and extracts mesh sequences using marching cubes from the frozen NPHM model. FaceTalk outperforms existing methods by 75% in perceptual user study evaluations and produces visually natural motion with diverse facial expressions and styles.
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
RAVE is a fast and innovative method for zero-shot video editing that uses pre-trained text-to-image diffusion models. It preserves the original motion and structure of the input video while producing high-quality, temporally consistent edited videos. RAVE edits videos 25% faster than existing methods by efficiently leveraging spatio-temporal interactions between frames. It outperforms existing methods across diverse editing scenarios and requires no extra training or manual inputs. However, there are some limitations such as flickering issues for extreme shape edits in very long videos and fine detail flickering. Try the demo here: https://huggingface.co/spaces/ozgurkara/RAVE
Relightful Harmonization: Lighting-aware Portrait Background Replacement
The paper presents Relightful Harmonization, a technique for harmonizing portrait lighting with a new background image. The method encodes lighting information from the target background image and aligns it with features from panoramic environment maps. Relightful Harmonization outperforms existing benchmarks in visual fidelity and lighting coherence. The technique only requires an arbitrary background image during inference and expands the training data using a novel data simulation pipeline. This approach enables realistic, lighting-aware portrait background replacement using just a single target background image, without requiring HDR environment maps.
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors
SceneTex generates high-quality indoor scene textures using depth-to-image diffusion priors. Key features include optimization in RGB space, multiresolution texture field, and cross-attention decoder for global style consistency. Experiments show it outperforms prior methods, but limitations include occasional artifacts and inability to handle complex geometry.

🧨 Diffusion

Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features
Diff3F is a feature descriptor for untextured 3D shapes. It computes 3D semantic features using pre-trained 2D diffusion models, rendering depth and normal maps from multiple views, and lifting the 2D diffusion features back to the 3D surface. This produces semantic descriptors on the 3D shape without requiring additional training data or part segmentation.
One-step Diffusion with Distribution Matching Distillation
Distribution Matching Distillation (DMD accelerates multi-step diffusion models into a one-step generator without compromising image quality. DMD matches the distribution of the original diffusion model by minimizing KL divergence and using two score functions - one for the actual data distribution and one for the generated distribution. A regression loss matches the large-scale structure of the multi-step diffusion outputs.

Showing a sample of 61 resources. View the full list on GitHub →