awesome-nlp-polish
github.com/ksopyla/awesome-nlp-polish
A curated list of resources dedicated to Natural Language Processing (NLP) in Polish. Models, tools, datasets.
What's inside
Models and Embeddings
- Allegro HerBERT (Polish Transformer models)
Polish BERT model trained on Polish Corpora using only MLM objective with dynamic masking of whole words.
- BPEmb: Subword Embeddings, includes Polish (Other models)
easy to use with
- Common Crawl (Other models)
train on:
- ELMo embeddings (Other models)
An ELMo embedding model for Polish, trained on large textual corpora (KGR10).
- FastText KGR10 Polish model binary (Other models)
- IPIPAN Word2vec Polish models (Other models)
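The HerBERT entry above mentions an MLM objective with dynamic masking of whole words. A minimal sketch of that idea follows, under stated assumptions: it uses a WordPiece-style `##` continuation prefix (an illustrative convention, not necessarily the tokenizer HerBERT uses), it always substitutes `[MASK]` (real MLM setups usually mix in random and unchanged tokens), and the function names are hypothetical. "Dynamic" here means the mask is re-sampled every time a sentence is seen instead of being fixed once during preprocessing.

```python
import random

MASK = "[MASK]"  # placeholder mask token, an assumption for this sketch


def group_whole_words(tokens):
    """Group subword indices into whole words.

    Assumes a WordPiece-style convention: a piece starting with '##'
    continues the previous word.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def mask_whole_words(tokens, mask_prob=0.15, rng=None):
    """Mask whole words: every subword piece of a chosen word is masked together.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    groups = group_whole_words(tokens)
    n_to_mask = max(1, round(len(groups) * mask_prob))
    chosen = rng.sample(groups, n_to_mask)

    masked = list(tokens)
    labels = [None] * len(tokens)
    for group in chosen:
        for i in group:
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


if __name__ == "__main__":
    # Hand-written toy segmentation, not the output of a real tokenizer.
    tokens = ["war", "##sza", "##wa", "jest", "stolic", "##ą", "polski"]
    # Dynamic masking: a fresh mask is drawn on each pass over the data.
    for epoch in range(3):
        masked, _ = mask_whole_words(tokens, mask_prob=0.3)
        print(epoch, masked)
```

Masking by word group rather than by individual piece keeps the model from recovering a masked piece trivially from its unmasked neighbours within the same word.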
Papers, articles, blog posts
- Benchmarks of some Polish NLP tools
Single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization, etc.
- https://github.com/sdadas/polish-nlp-resources
- Polish Sentence Evaluation
- Polish Word Embeddings Review
Evaluation of Polish word embeddings (word2vec, fastText, etc.) prepared by various research groups. The evaluation uses the word analogy task.
- TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE
Complete guide to training a RoBERTa model for Polish with Hugging Face Transformers.
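The Polish Word Embeddings Review entry above evaluates embeddings with the word analogy task. As a rough illustration of how such an evaluation is commonly set up (not necessarily the exact protocol used in that review), the sketch below answers "a is to b as c is to ?" by taking the word whose vector is closest, by cosine similarity, to b - a + c. The function names and the two-dimensional vectors are made up for illustration; real evaluations load trained embeddings and a full analogy question set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c.

    The three query words are excluded from the candidates, as is usual
    for this task.
    """
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


if __name__ == "__main__":
    # Hand-made toy vectors (axis 0: royalty, axis 1: gender), not real embeddings.
    toy_vectors = {
        "król": [1.0, 1.0],
        "królowa": [1.0, -1.0],
        "mężczyzna": [0.0, 1.0],
        "kobieta": [0.0, -1.0],
        "dziecko": [0.2, 0.0],
    }
    toy_questions = [
        ("mężczyzna", "kobieta", "król", "królowa"),
        ("król", "królowa", "mężczyzna", "kobieta"),
    ]
    print(analogy_accuracy(toy_vectors, toy_questions))
```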
Polish text datasets
- Clean Polish OSCAR (Raw texts)
Preprocessed Polish OSCAR corpus with foreign (non-Polish) sentences and invalid Polish sentences (e.g. enums) removed; corpus preprocessed by @Ermlab.
- Ermlab Opineo dataset (Task-oriented datasets)
- http://zil.ipipan.waw.pl/HateSpeech (Task-oriented datasets)
- NKJP (Task-oriented datasets)
National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for
- OPUS - the open parallel corpus (Raw texts)
45.9M sentences, 287.1M Polish tokens; collection of translated movie subtitles from
- OSCAR or Open Super-large Crawled ALMAnaCH coRpus (Raw texts)
A huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Contains 109GB or 49GB of Polish text.
Language processing tools and libraries
- Duckling
library for parsing text into structured data with support for Polish
- KRNNT Polish morphological tagger
KRNNT is a morphological tagger for Polish based on recurrent neural networks
- Morfeusz
morphological analyzer. See also
- Morfologik
dictionary-based morphological analyzer
- Polish abbreviations for NLTK sentence tokenizer
- spaCy for Polish
Extends spaCy, a popular production-ready NLP library, to fully support the Polish language.
Showing a sample of 45 resources. View the full list on GitHub →