Context Awesome

awesome-multimodal-large-language-models

github.com/bradyfu/awesome-multimodal-large-language-models

✨✨ Latest Advances on Multimodal Large Language Models

GitHub Stars: 18k
Curated Resources: 553
Categories: 15
Last Refreshed: 23 hours ago

✨ Highlights of NJU-MiG
Multimodal Instruction Tuning (& Latest Works)
Multimodal Hallucination
Multimodal In-Context Learning
Multimodal Chain-of-Thought
LLM-Aided Visual Reasoning
Foundation Models
Evaluation
Multimodal RLHF
Others
Datasets of Multimodal Instruction Tuning
Datasets of In-Context Learning
Datasets of Multimodal Chain-of-Thought
Datasets of Multimodal RLHF
Benchmarks for Evaluation

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me multimodal instruction tuning (& latest works) resources from awesome-multimodal-large-language-models"

What's inside

Multimodal Instruction Tuning (& Latest Works)

LLM-Aided Visual Reasoning

Evaluation

Multimodal Hallucination

Multimodal RLHF

Datasets of Multimodal Instruction Tuning

ALLaVA-4V
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model
BuboGPT
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
CAP2QA
Visually Dehallucinative Instruction Generation
cc-sbu-align
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
ChartLlama
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
ComVint
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Multimodal In-Context Learning

Benchmarks for Evaluation

BenchLMM
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Bingo
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
Charting-New-Territories
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
CharXiv
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
CMMMU
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
CoBSAT
Can MLLMs Perform Text-to-Image In-Context Learning?

Showing a sample of 553 resources. The full list is available on GitHub.