awesome_think_with_images
github.com/zhaochen0110/awesome_think_with_images
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
GitHub Stars: 1.5k
Curated Resources: 117
Categories: 5
Last Refreshed: 23 hours ago
🔔 News | 🛠️ Stage 1: Tool-Driven Visual Exploration | 💻 Stage 2: Programmatic Visual Manipulation | 🎨 Stage 3: Intrinsic Visual Imagination | 📊 Evaluation & Benchmarks
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me benchmarks for thinking with images resources from awesome_think_with_images"
What's inside
📊 Evaluation & Benchmarks
- A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models ➤ Benchmarks for Thinking with Images
- ARC Prize 2024: Technical Report ➤ Benchmarks for Thinking with Images
- Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps ➤ Benchmarks for Thinking with Images
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models ➤ Benchmarks for Thinking with Images
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models ➤ Benchmarks for Thinking with Images
- CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation ➤ Benchmarks for Thinking with Images
🛠️ Stage 1: Tool-Driven Visual Exploration
- Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO ➤ RL-Based Approaches
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL ➤ RL-Based Approaches
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models ➤ Prompt-Based Approaches
- CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation ➤ SFT-Based Approaches
- CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations ➤ SFT-Based Approaches
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning ➤ RL-Based Approaches
💻 Stage 2: Programmatic Visual Manipulation
- Advancing vision-language models in front-end development via data synthesis ➤ SFT-Based Approaches
- CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers? ➤ Prompt-Based Approaches
- COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning ➤ SFT-Based Approaches
- Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving ➤ Prompt-Based Approaches
- MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning ➤ SFT-Based Approaches
- MMFactory: A Universal Solution Search Engine for Vision-Language Tasks ➤ Prompt-Based Approaches
🎨 Stage 3: Intrinsic Visual Imagination
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset ➤ SFT-Based Approaches
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step ➤ RL-Based Approaches
- Chameleon: Mixed-Modal Early-Fusion Foundation Models ➤ SFT-Based Approaches
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning ➤ RL-Based Approaches
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models ➤ SFT-Based Approaches
- Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO ➤ RL-Based Approaches
Showing a sample of 117 resources. View the full list on GitHub →