awesome-multimodal-large-language-models
github.com/bradyfu/awesome-multimodal-large-language-models
Latest Advances on Multimodal Large Language Models
What's inside
Multimodal Instruction Tuning (& Latest Works)
- 3D-LLM: Injecting the 3D World into Large Language Models
arXiv
- Addendum to GPT-4o System Card: Native image generation
OpenAI
- ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
arXiv
- An Embodied Generalist Agent in 3D World
arXiv
- An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models
arXiv
- An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
arXiv
LLM-Aided Visual Reasoning
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
arXiv
- AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
arXiv
- AVIS: Autonomous Visual Information Seeking with Large Language Models
arXiv
- Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
arXiv
- Caption Anything: Interactive Image Description with Diverse Multimodal Controls
arXiv
- Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
arXiv
Evaluation
- A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
arXiv
- A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging
arXiv
- An Early Evaluation of GPT-4V(ision)
arXiv
- BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
arXiv
- Benchmarking Large Multimodal Models against Common Corruptions
NAACL
- Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
arXiv
Multimodal Hallucination
- AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
arXiv
- Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation
arXiv
- Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
ICLR
- An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
arXiv
- A Survey on Hallucination in Large Vision-Language Models
arXiv
- Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
arXiv
Multimodal RLHF
Datasets of Multimodal Instruction Tuning
- ALLaVA-4V
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
- BuboGPT
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
- CAP2QA
Visually Dehallucinative Instruction Generation
- cc-sbu-align
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- ChartLlama
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
- ComVint
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Multimodal In-Context Learning
- An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
AAAI
- Can MLLMs Perform Text-to-Image In-Context Learning?
arXiv
- Exploring Diverse In-Context Configurations for Image Captioning
NeurIPS
- Flamingo: a Visual Language Model for Few-Shot Learning
NeurIPS
- Generative Multimodal Models are In-Context Learners
CVPR
- Hijacking Context in Large Multi-modal Models
arXiv
Benchmarks for Evaluation
- BenchLMM
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
- Bingo
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
- Charting-New-Territories
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
- CharXiv
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
- CMMMU
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
- CoBSAT
Can MLLMs Perform Text-to-Image In-Context Learning?
Showing a sample of 548 resources. The full list is on GitHub.