awesome-mcot
github.com/yaotingwangofficial/awesome-mcot
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
GitHub Stars: 1k
Curated Resources: 236
Categories: 21
Last Refreshed: 16 min ago
- Tab-1: Datasets for MCoT Training with Rationale.
- Tab-2: Benchmarks for MCoT Evaluation without Rationale.
- Tab-3: Benchmarks for MCoT Evaluation with Rationale.
- MCoT Reasoning Over Image
- MCoT Reasoning Over Video
- MCoT Reasoning Over 3D
- MCoT Reasoning Over Audio and Speech
- MCoT Reasoning Over Table and Chart
- Cross-modal CoT Reasoning
- Rationale Construction
- Structural Reasoning
- Information Enhancing
- Objective Granularity
- Multimodal Rationale
- Test-time Scaling
- Embodied AI
- Agentic System
- Autonomous Driving
- Medical and Healthcare
- Social and Human
- Multimodal Generation
What's inside
Multimodal Generation
- 3D-PreMise: Can Large Language Models Generate 3D Shapes with Sharp Features and Parametric Control?
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
- From System 1 to System 2: A Survey of Reasoning Large Language Models (Survey)
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
- L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects
MCoT Reasoning Over 3D
- 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
- CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
- Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models
- Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning
- SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Tab-2: Benchmarks for MCoT Evaluation without Rationale.
- AgentClinic (2024)
- AVHBench (2024)
- AV-Odyssey (2024)
- AVTrustBench (2025)
- HallusionBench (2024)
- MathVerse (2024)
Embodied AI
- AgentVLN: Towards Agentic Vision-and-Language Navigation
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
- Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
- ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
MCoT Reasoning Over Video
- Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts
- AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
- CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
- Following Clues, Approaching the Truth: Explainable Micro-Video Rumor Detection via Chain-of-Thought Reasoning
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning
- Hallucination Mitigation Prompts Long-term Video Understanding
MCoT Reasoning Over Image
- Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
- Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models
- Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning
- CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
- CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting
Tab-1: Datasets for MCoT Training with Rationale.
- A-OKVQA (2022)
- EgoCoT (2023)
- EMMA-X (2024)
- LLaVA-CoT-100k (2024)
- M3CoT (2024)
- MAmmoTH-VL (2024)
Rationale Construction
- A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
- Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
- Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data
- Multimodal Chain-of-Thought Reasoning in Language Models
- MultiModal Tree of Thoughts
Showing a sample of 236 resources. View the full list on GitHub →