A curated list of resources for the conservation, development, and documentation of low resource (human) languages.

GitHub Stars: 446
Curated Resources: 476
Categories: 23
Last Refreshed: 4 hours ago
Generic Repositories, Keyboard Layout Configuration Helpers, Annotation, Format Specifications, i18n-related Repositories, Audio automation, Text-to-Speech (TTS), Automatic Speech Recognition (ASR), Text automation, Experimentation, Flashcards, Natural language generation, Computing systems, Android Applications, Chrome Extensions, FieldDB, Academic Research Paper-Specific Repositories, Example Repositories, Fonts, Corpora, Organizations, Tutorials, Language Specific Projects

What's inside

Generic Repositories

  • 4lang (Software)

    Concept dictionary using Eilenberg machines.

  • accentuate.us (Software)

  • alignment-with-openfst (Software)

    This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, dependency parsing.

  • Apertium (Software)

  • Apertium

    A free/open-source machine translation platform, initially aimed at related-language pairs but expanded to deal with more divergent language pairs (Wikipedia-like army of other MT linguists). Wikipedia has a

  • ark-tweet-nlp (Software)

    CMU ARK Twitter Part-of-Speech Tagger (

Organizations

  • 7000 Languages (Other OSS Organizations)

    Creates free online language learning courses and materials in partnership with Indigenous, minority, and refugee communities.

  • African Languages Lab (Other OSS Organizations)

    Develops enterprise-grade language AI models (including Mansa LLM) supporting 30+ African languages for translation, transcription, and NLP.

  • AI4Bharat (On GitHub)

    Open-source datasets, tools, and models for Indian languages from IIT Madras, including IndicTrans2 (translation), Indic-TTS, IndicLID (language identification), and IndicVoices.

  • batumi (On GitHub)

    Speech recognition and natural language processing for low-resource languages.

  • BloomBooks (On GitHub)

  • cmusphinx (On GitHub)

    Mirror of the SourceForge repositories

Language Specific Projects

Annotation

  • AGTK

    AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. (Original project is on SourceForge:

  • Annotation page

    Ethnographic tools for annotation.

  • brat

    brat rapid annotation tool (brat) for online text annotation.

  • brendano/gfl_syntax

    Graph Fragment Language for Easy Syntactic Annotation.

  • CLAM

    Quickly and transparently transforms command-line NLP tools into RESTful webservices with an interface for human end-users.

  • eopas

    ETHNOER Online Presentation and Annotation System.

Android Applications

FieldDB

  • AndroidLanguageLearningClientForFieldDB-sikuli (FieldDB Webservices/Components/Plugins)

    Sikuli tests for AndroidLanguageLearningClientForFieldDB.

  • AuthenticationWebService (FieldDB Webservices/Components/Plugins)

    A node.js web service which manages users, corpora creation, and authentication.

  • bower-fielddb (FieldDB Webservices/Components/Plugins)

    A bower repository which hosts fielddb core components; install with: bower install fielddb --save

  • bower-fielddb-angular (FieldDB Webservices/Components/Plugins)

    A bower repository which hosts fielddb-angular components; install with: bower install fielddb-angular --save

  • FieldDB

    An offline/online field database which adapts to its user's terminology and I-Language, with plugins for data automation routines at each stage from primary data collection through cleaning to publication and archival.

  • FieldDBActivityFeed (FieldDB Webservices/Components/Plugins)

    A fielddb activity feed widget which can be embedded in other codebases, websites, etc.

Flashcards

  • Anki

    Anki is a program to make and share flashcard decks (including audio) for any language or writing system.

  • awesome-anki

    A curated list of awesome Anki add-ons, decks and resources.

Audio automation

  • arctic-prompts

    Generate prompts PDF for CMU ARCTIC dataset.

  • Audacity

    Free, open source, cross-platform software for recording and editing sounds.

  • AudioWebService

    A simple Node.js server which accepts audio uploads and runs them through Praat.

  • AuToBI

    Automatic prosodic annotation tool written in Java.

  • BashScriptsForPhonetics

    (

  • CMU Sphinx

    Open source toolkit for speech recognition. PocketSphinx, SphinxTrain, Sphinx4, and sphinxbase.

Showing a sample of 476 resources. View the full list on GitHub →