awesome-bioie
github.com/caufieldjh/awesome-bioie
🧫 A curated list of resources relevant to doing Biomedical Information Extraction (including BioNLP)
What's inside
Journals and Events
- ACM-BCB [Conferences and Other Events]
The ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. Held annually since 2010.
- BIBM [Conferences and Other Events]
The IEEE International Conference on Bioinformatics and Biomedicine.
- BioASQ [Challenges]
Challenges on biomedical semantic indexing and question answering. Challenges and workshops held annually since 2013.
- BioCreAtIvE workshop [Challenges]
These workshops have been organized since 2004, with BioCreative VI happening February 2017 and the
- Database [Journals]
Its subtitle is "The Journal of Biological Databases and Curation". Open access.
- eHealth-KD [Challenges]
Challenges for encouraging "development of software technologies to automatically extract a large variety of knowledge from eHealth documents written in the Spanish Language". Previously held as part of
Datasets
- AIMed [Protein-protein Interaction Annotated Corpora]
225 MEDLINE abstracts annotated for PPI.
- BioC-BioGRID [Protein-protein Interaction Annotated Corpora]
120 full-text articles annotated for PPI and genetic interactions. Used in the BioCreative V BioC task.
- BioCreAtIvE 2 [Annotated Text Data]
15,000 sentences (10,000 training and 5,000 test, different from the first corpus) annotated for protein and gene names. 542 abstracts linked to EntrezGene identifiers. A variety of research articles annotated for features of protein–protein interactions.
- BioCreAtIvE V CDR Task Corpus (BC5CDR) [Annotated Text Data]
1,500 articles (title and abstract) published in 2014 or later, annotated for 4,409 chemicals, 5,818 diseases, and 3,116 chemical–disease interactions. Requires registration.
- BioCreative VI CHEMPROT Corpus [Annotated Text Data]
>2,400 articles annotated with chemical-protein interactions of a variety of relation types. Requires registration.
- BioInfer [Protein-protein Interaction Annotated Corpora]
1,100 sentences from biomedical research abstracts annotated for relationships (including PPI), named entities, and syntactic dependencies.
Techniques and Models
- Alsentzer et al Clinical BERT [BERT models]
- BioASQword2vec [Text Embeddings]
Word embeddings derived from biomedical text (>10 million PubMed abstracts) using the popular
- BioBERT [BERT models]
A PubMed and PubMed Central-trained version of the
- BioGPT [GPT-2 models]
A GPT-2 model pre-trained on 15 million PubMed abstracts, along with fine-tuned versions for several biomedical tasks.
- BioWordVec [Text Embeddings]
Word embeddings derived from biomedical text (>27 million PubMed titles and abstracts), including a subword embedding model based on MeSH.
- BlueBERT [BERT models]
A BERT model pre-trained on PubMed text and MIMIC-III notes.
Organizations
Tools, Platforms, and Services
- Anafora [Annotation Tools]
An annotation tool with adjudication and progress tracking features.
- brat [Annotation Tools]
The brat rapid annotation tool. Supports producing text annotations visually, through the browser. Not subject specific; appropriate for many annotation projects. Visualization is based on that of the
- CLAMP
A natural language processing toolkit intended for use with the text in clinical reports. Check out their
- cTAKES
A system for processing the text in electronic medical records. Widely used and open source.
- DeepPhe
A system for processing documents describing cancer presentations. Based on cTAKES (see above).
- DNorm
A method for disease normalization, i.e., linking mentions of disease names and acronyms to unique concept identifiers. The downloadable version includes the NCBI Disease Corpus and BC5CDR (see Annotated Text Data).
Research Overviews
- Assessing the research landscape and clinical utility of large language models: a scoping review [LLMs in Biomedical IE]
A high-level review of LLM applications in medicine as of March 2024.
- Awesome AI-based Protein Design [Pre-LLM Overviews]
This is a collection of research papers for AI-based protein design.
- Biomedical Informatics on the Cloud: A Treasure Hunt for Advancing Cardiovascular Medicine [Pre-LLM Overviews]
An overview of how BioIE and bioinformatics workflows can be applied to questions in cardiovascular health and medicine research.
- Capturing the Patient's Perspective: a Review of Advances in Natural Language Processing of Health-Related Text [Pre-LLM Overviews]
A 2017 review of natural language processing methods applied to information extraction in health records and social media text. An important note from this review: "One of the main challenges in the field is the availability of data that can be shared and which can be used by the community to push the development of methods based on comparable and reproducible studies".
- Clinical information extraction applications: A literature review [Pre-LLM Overviews]
A review of clinical IE papers published as of September 2016. From the Mayo Clinic group (see below).
- Ethical and regulatory challenges of large language models in medicine [LLMs in Biomedical IE]
A review of ethical issues arising from applications of LLMs in biomedicine.
Data Models
- Biolink
A data model of biological entities. Provided as a
- BioUML
An architecture for biomedical data analysis, integration, and visualization. Conceptually based on the visual modeling language
- OMOP Common Data Model
A standard for observational healthcare data.
- unmiri-ngs-fhir-schema
Apache-2.0 JSON Schema (Draft 2020-12) API contract for cross-vendor somatic NGS interpretation output (Foundation Medicine, Tempus, Caris, Guardant), aligned with the HL7 FHIR Genomics IG. A standards-aligned target representation for biomedical information-extraction pipelines that parse oncology lab reports.
Tutorials
- Biomedical Literature Mining [Pre-LLM Guides, Lectures, and Courses]
A (non-free) volume of Methods in Molecular Biology from 2014. Chapters cover introductory principles in text mining, applications in the biological sciences, and potential for use in clinical or medical safety scenarios.
- Coursera - Foundations of mining non-structured medical data [Pre-LLM Guides, Lectures, and Courses]
About three hours' worth of video lectures on working with medical data of various types and structures, including text and image data. Appears fairly high-level and intended for beginners.
- Getting Started in Text Mining [Pre-LLM Guides, Lectures, and Courses]
A brief introduction to bio-text mining from Cohen and Hunter. More than ten years old but still quite relevant. See also an
- JensenLab text mining exercises [Pre-LLM Guides, Lectures, and Courses]
- VIB text mining and curation training [Pre-LLM Guides, Lectures, and Courses]
This training workshop happened in 2013 but the slides are still online.
Showing a sample of 114 resources. View the full list on GitHub.