Context Awesome

awesome_think_with_images

github.com/zhaochen0110/awesome_think_with_images

Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.

GitHub Stars: 1.5k
Curated Resources: 117
Categories: 5
Last Refreshed: 18 hours ago


Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me the Benchmarks for Thinking with Images resources from awesome_think_with_images"

Installation instructions →

What's inside

📊 Evaluation & Benchmarks

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models ➤ Benchmarks for Thinking with Images
ARC Prize 2024: Technical Report ➤ Benchmarks for Thinking with Images
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps ➤ Benchmarks for Thinking with Images
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models ➤ Benchmarks for Thinking with Images
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models ➤ Benchmarks for Thinking with Images
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation ➤ Benchmarks for Thinking with Images

🛠️ Stage 1: Tool-Driven Visual Exploration

Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO ➤ RL-Based Approaches
Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL ➤ RL-Based Approaches
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models ➤ Prompt-Based Approaches
CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation ➤ SFT-Based Approaches
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations ➤ SFT-Based Approaches
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning ➤ RL-Based Approaches

💻 Stage 2: Programmatic Visual Manipulation

Advancing vision-language models in front-end development via data synthesis ➤ SFT-Based Approaches
CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers? ➤ Prompt-Based Approaches
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning ➤ SFT-Based Approaches
Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving ➤ Prompt-Based Approaches
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning ➤ SFT-Based Approaches
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks ➤ Prompt-Based Approaches

🔔 News

🎨 Stage 3: Intrinsic Visual Imagination

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset ➤ SFT-Based Approaches
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step ➤ RL-Based Approaches
Chameleon: Mixed-Modal Early-Fusion Foundation Models ➤ SFT-Based Approaches
ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning ➤ RL-Based Approaches
CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models ➤ SFT-Based Approaches
Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO ➤ RL-Based Approaches

Showing a sample of 117 resources. View the full list on GitHub →