awesome-document-understanding
github.com/harrytea/awesome-document-understanding
Document Artificial Intelligence
What's inside
📑 Document Understanding
- A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
24.7.2 | arXiv | Code
- A Simple yet Effective Layout Token in Large Language Models for Document Understanding
25.3.24
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
23.8.19 | AAAI24 | Code
- DiT: Self-supervised Pre-training for Document Image Transformer
22.03.04 | ACM MM22 | Code
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
24.10.4 | arXiv | Code
- DocLLM: A layout-aware generative language model for multimodal document understanding
23.12.31 | arXiv
🎬 Video LLM
- Artemis: Towards Referential Understanding in Complex Videos
24.6.1 | arXiv | Code
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
23.11.14 | arXiv | Code
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
23.12.04 | CVPR24 | Code
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
23.11.27 | arXiv | Code
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
23.06.05 | arXiv | Code
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
23.11.16 | arXiv | Code
🔮 MLLM
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
23.01.30 | arXiv | Code
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
24.6.24
- DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
24.2.22 | arXiv | Code
- FastVLM: Efficient Vision Encoding for Vision Language Models
24.12.17
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
24.03.05 | arXiv | Code
- Flamingo: a Visual Language Model for Few-Shot Learning
22.11.15 | NeurIPS22 | Code
🎯 Grounded MLLM
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
23.07.17 | arXiv | Code
- Ferret: Refer and Ground Anything Anywhere at Any Granularity
23.10.11 | arXiv | Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
24.04.08 | arXiv | Code
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
24.04.11 | arXiv | Code
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
24.4.19 | ECCV24 | Code
- GroundingGPT: Language Enhanced Multi-modal Grounding Model
24.03.05 | arXiv | Code
🏆 Milestone
- InternLM2 Technical Report
24.3.26
- InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
23.6.3
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
24.7.3
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
25.8.25
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
25.4.14
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
23.8.23
Showing a sample of 133 resources. View the full list on GitHub.