awesome-nlp-polish

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish. Models, tools, datasets.

307 GitHub Stars

Curated Resources

Allegro HerBERT (Polish Transformer models)
Polish BERT model trained on Polish corpora using only the MLM objective with dynamic masking of whole words.
BPEmb: Subword Embeddings, includes Polish (Other models)
easy to use with
Common Crawl (Other models)
train on:
ELMo embeddings (Other models)
A model of ELMo embeddings for the Polish language trained on large textual corpora (KGR10).
FastText KGR10 Polish model binary (Other models)
IPIPAN Word2vec Polish models (Other models)
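
The HerBERT entry above mentions an MLM objective with dynamic masking of whole words. As a rough illustration of that idea only, the sketch below masks all subword pieces of a chosen word together and draws a fresh mask on every call. The `##` continuation prefix, the `[MASK]` string, the 15% rate, and the helper names are assumptions made for this example; this is not HerBERT's actual tokenizer or training code, and it omits the random-replacement variants used in full MLM recipes.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, for illustration only


def group_whole_words(tokens):
    """Group subword indices into whole words.

    Assumes a WordPiece-style convention where a leading '##'
    marks a piece that continues the previous word.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def mask_whole_words(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) with whole words masked together.

    'Dynamic' here means the mask is re-sampled on every call, so the
    same sentence is masked differently each time it is seen.
    """
    rng = rng or random
    groups = group_whole_words(tokens)
    n_to_mask = max(1, round(len(groups) * mask_prob))
    chosen = rng.sample(groups, n_to_mask)

    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position not predicted
    for group in chosen:
        for i in group:
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


# Hypothetical subword split of a short Polish sentence.
example_tokens = ["war", "##szawa", "jest", "stol", "##icą", "polski"]
print(mask_whole_words(example_tokens, mask_prob=0.5))
print(mask_whole_words(example_tokens, mask_prob=0.5))  # likely a different mask
```

Masking at the word level means the model cannot recover a hidden piece simply by looking at its unmasked neighbours within the same word.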

Benchmarks of some Polish NLP tools
Single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization, etc.
https://github.com/sdadas/polish-nlp-resources
Polish Sentence Evaluation
Polish Word Embeddings Review
Evaluation of Polish word embeddings (word2vec, fastText, etc.) prepared by various research groups. Evaluation is done on the word analogy task.
TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE
complete user guide for training a RoBERTa model for Polish using Hugging Face Transformers
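
The Polish Word Embeddings Review entry above evaluates embeddings on a word analogy task. A minimal sketch of how such a task is commonly scored follows, assuming the usual vector-offset approach (answer to "a is to b as c is to ?" is the word closest to b - a + c by cosine similarity). The tiny three-dimensional vectors and the function names are invented for illustration and do not come from any of the listed models or from the review itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c.

    The three query words are excluded from the candidates.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy embeddings (made up): dimensions loosely mean royal, male, female.
emb = {
    "król": [1.0, 1.0, 0.0],
    "królowa": [1.0, 0.0, 1.0],
    "mężczyzna": [0.0, 1.0, 0.0],
    "kobieta": [0.0, 0.0, 1.0],
    "dom": [0.1, 0.9, 0.1],
}

questions = [
    ("mężczyzna", "król", "kobieta", "królowa"),
    ("król", "królowa", "mężczyzna", "kobieta"),
]

print(solve_analogy(emb, "mężczyzna", "król", "kobieta"))
print(analogy_accuracy(emb, questions))
```

With real embeddings, `emb` would be loaded from one of the listed models and `questions` from an analogy test set; both are outside the scope of this sketch.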

Clean Polish OSCAR (Raw texts)
Preprocessed Polish OSCAR corpus with foreign (non-Polish) sentences and invalid Polish sentences (e.g. enumerations) removed. Corpus preprocessed by @Ermlab.
Ermlab Opineo dataset (Task oriented datasets)
http://zil.ipipan.waw.pl/HateSpeech (Task oriented datasets)
NKJP (Task oriented datasets)
National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for
Opus - the open parallel corpus (Raw texts)
you can select languages and download only the Polish file
OSCAR or Open Super-large Crawled ALMAnaCH coRpus (Raw texts)
is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Contains 109GB or 49GB of Polish text.

Duckling
library for parsing text into structured data with support for Polish
KRNNT Polish morphological tagger
KRNNT is a morphological tagger for Polish based on recurrent neural networks
Morfeusz
morphological analyzer. See also
Morfologik
dictionary-based morphological analyzer
Polish abbreviations for NLTK sentence tokenizer
spaCy for Polish
extends spaCy, a popular production-ready NLP library, to fully support the Polish language.

Showing a sample of 45 resources.