awesome-document-similarity
github.com/malteos/awesome-document-similarity
A curated list of resources on document similarity measures (papers, tutorials, code, ...)
GitHub Stars: 255
Curated Resources: 81
Categories: 8
Last Refreshed: 5 hours ago
Motivation | Document Representations | Similarity / Distance Measures | Benchmarks & Datasets | Performance measures | Surveys | Tutorials | See also
What's inside
Benchmarks & Datasets
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
- CSFCube -- A Test Collection of Computer Science Research Articles for Faceted Query by Example
- Paper
- SciDocs - The Dataset Evaluation Suite for SPECTER (for classification, citation prediction, user activity, recommendation)
- STSbenchmark
Document Representations
- A Simple but Tough-to-Beat Baseline for Sentence Embeddings [From word to sentence level]
- BERT-AL: BERT for Arbitrarily Long Document Understanding [BERT and other Transformer Language Models]
- Blockwise Self-Attention for Long Document Understanding [BERT and other Transformer Language Models]
- Blogpost [BERT and other Transformer Language Models]
- Code [BERT and other Transformer Language Models]
- Easy-to-use interface to fine-tuned BERT models for computing semantic similarity [BERT and other Transformer Language Models]
See also
- Awesome Network Embedding
- Awesome Neural Models for Semantic Match
- Awesome Sentence Embeddings
- Charu C. Aggarwal. Content-Based Recommender Systems
- Michael J. Pazzani, Daniel Billsus. Content-Based Recommendation Systems
- Sentence Similarity Calculator (ELMo, BERT and Universal Sentence Encoder, and different similarity measures)
Motivation
- Bär, D., Zesch, T., & Gurevych, I. (2011). A reflective view on text similarity. International Conference Recent Advances in Natural Language Processing, RANLP, (September), 515–520. [Similarity concepts]
- Bär, D., Zesch, T., & Gurevych, I. (2015). Composing Measures for Computing Text Similarity. Technical Report TUD-CS-2015-0017, 1–30. [Similarity concepts]
- Medin, D. L., Goldstone, R. L., & Gentner, D. (1993). Respects for Similarity. Psychological Review, 100(2), 254–278. [Similarity concepts]
- Nguyen, D., Trieschnigg, D., & Theune, M. (2014). Using crowdsourcing to investigate perception of narrative similarity. CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management, 321–330. [Similarity concepts]
- Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327. [Similarity concepts]
Performance measures
Similarity / Distance Measures
- Feature-wise transformations (Distill) [Text matching]
- FiLM [Text matching]
- https://doi.org/10.1109/CVPR.2014.180 [Text matching]
- https://www.reddit.com/r/MachineLearning/comments/e525c6/d_what_beats_concatenation/ [Text matching]
- Jiang, J. et al. 2019. Semantic Text Matching for Long-Form Documents. The World Wide Web Conference on - WWW ’19 (New York, New York, USA, 2019), 795–806. [Siamese Networks]
- Liu, B. et al. 2018. Matching Article Pairs with Graphical Decomposition and Convolutions. (Feb. 2018). [Siamese Networks]
Showing a sample of 81 resources. View the full list on GitHub →