awesome-nlp

A curated list of resources dedicated to Natural Language Processing (NLP)

19k GitHub Stars

656 Curated Resources


ACL Anthology
canonical archive of papers from ACL, EMNLP, NAACL, EACL, COLING, and related venues.
ACL Rolling Review
the rolling review process feeding ACL-affiliated venues.

A collection of Natural Language Processing (NLP) Ruby libraries, tools and software
AllenNLP
An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.
Amazon Comprehend [Services]
NLP and ML suite covering common tasks such as NER, tagging, and sentiment analysis.
Anafora [Annotation Tools]
Annotation Lab [Annotation Tools]
Free, end-to-end, no-code platform for text annotation and DL model training/tuning. Out-of-the-box support for Spark NLP models for Named Entity Recognition, Classification, Relation Extraction, and Assertion Status. Unlimited users, teams, projects, and documents. Not FOSS.
Argilla [Annotation Tools]
open-source platform for collecting human feedback, building NLP and LLM datasets, and curating preference data.

Adapting LLMs for Document-Level MT [Machine Translation]
LLMs for context-aware translation.
Adapting LLMs for Minimal-edit GEC [Information Extraction Beyond NER]
decoder-only LLMs with a novel error-rate adaptation schedule set new SOTA on BEA-test grammatical error correction.
Atlas [Question Answering and Reading Comprehension]
retrieval-augmented LM for few-shot QA.
Attention Is All You Need [Machine Translation]
introduced the Transformer architecture; reset the field.
Benchmarking LLMs for News Summarization [Summarization]
LLMs vs fine-tuned summarizers.
BERTopic [Topic Modeling]
clustering-based topic modeling on top of contextual embeddings; common modern default.

Advanced Natural Language Processing [Videos and Online Courses]
CS 685, UMass Amherst CS
Advanced NLP with spaCy [Reading Content]
Free online course covering text processing, large-scale data analysis, processing pipelines, and training neural network models for custom NLP tasks.
AI Playbook [Reading Content]
The a16z AI playbook; a useful link to forward to your managers or to draw on for presentations.
Applied Natural Language Processing [Videos and Online Courses]
arXiv: Natural Language Processing (Almost) from Scratch [Reading Content]
Cohere LLM University [Videos and Online Courses]
free course on LLMs, embeddings, semantic search, and NLP applications.

AfriqueLLM [Multilingual and Cross-Lingual Models]
suite of open LLMs (4B-14B) continued-pretrained on 26B tokens across 20 African languages with a comprehensive empirical study of data mixing.
Alignment Faking in Large Language Models [Bias, Fairness, Safety in NLP]
models strategically complying during training.
Apple Intelligence Foundation Language Models [Efficient and Small Language Models]
on-device 3B model using KV-cache sharing and 2-bit QAT for 37.5% cache memory reduction without accuracy loss.
A Primer in BERTology [Probing and Interpretability]
what BERT learns about language.
Atomic Calibration [Factuality, Hallucination, Calibration]
claim-level calibration analysis for long-form generation; models are substantially worse-calibrated on extended outputs than on single claims.
AWQ [Efficient and Small Language Models]
activation-aware weight quantization.

AI4Bharat IndicNLP Suite [Libraries and Tooling]
tools, datasets, and models across 22 Indic languages.
Airavata [Models and Embeddings]
instruction-tuned Hindi LLM.
Albertina [Models]
encoder-only Portuguese LMs for both PT-PT and PT-BR.
ALLaM [Models and Embeddings]
Arabic-first foundation models.
Alpino [NLP in Dutch]
dependency parser for Dutch (also does POS tagging and lemmatization).
AraBERT [Models and Embeddings]
Arabic BERT family.

Common Corpus
2T-token open-license multilingual corpus.
CulturaX
6.3T tokens across 167 languages.
Dolma
3T-token open pretraining corpus with documented filtering pipeline.
FineWeb / FineWeb-Edu
15T-token cleaned web corpus; FineWeb-Edu filters for educational quality.
gensim-data
data repository for pretrained NLP models and NLP corpora.

Showing a sample of 656 resources; the full list is on GitHub.