
A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.

GitHub Stars: 307
Curated Resources: 45
Categories: 4
Last Refreshed: 5 hours ago

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me Polish transformer model resources from awesome-nlp-polish"

Installation instructions →

What's inside

Models and Embeddings

Papers, articles, blog posts

Polish text datasets

  • Clean Polish OSCAR (Raw texts)

    Preprocessed Polish OSCAR corpus with foreign (non-Polish) sentences and invalid Polish sentences (e.g. enums) removed. Preprocessed by @Ermlab.

  • Ermlab Opineo dataset (Task-oriented datasets)

  • http://zil.ipipan.waw.pl/HateSpeech (Task-oriented datasets)

  • NKJP (Task-oriented datasets)

    National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for

  • Opus - the open parallel corpus (Raw texts)

    45.9M sentences, 287.1M Polish tokens; a collection of translated movie subtitles from

  • OSCAR or Open Super-large Crawled ALMAnaCH coRpus (Raw texts)

    A huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Contains 109GB or 49GB of Polish text.

Language processing tools and libraries

Showing a sample of 45 resources. View the full list on GitHub →