Amharic Passage Retrieval Dataset
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/kidist-amde/amharic-ir-benchmarks
Description:
This dataset is built from a preprocessed Amharic news text classification dataset. It contains approximately 45,000 query-passage pairs generated from 50,706 Amharic news articles spanning six domains. The training and test splits are stratified so that the six news domains are represented in balance in both. Because explicit relevance judgments are unavailable, heuristic supervision is used instead. The target task is passage retrieval.
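The stratified split mentioned in the description can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`query`, `passage`, `domain`), the placeholder domain labels, the 20% test fraction, and the helper name `stratified_split` are all assumptions made for illustration, since the listing does not state them.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Split records so each value of `key` keeps about the same share in train and test.

    Hypothetical helper: the field name and test fraction are illustrative assumptions.
    """
    # Group records by their stratum (here, the news domain).
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[key]].append(rec)

    rng = random.Random(seed)
    train, test = [], []
    # Split each stratum independently, so the domain proportions carry over.
    for group in sorted(by_group):
        items = list(by_group[group])
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy data: six placeholder domain labels (not the dataset's real labels), 10 records each.
domains = ["domain_a", "domain_b", "domain_c", "domain_d", "domain_e", "domain_f"]
records = [
    {"query": f"q{i}", "passage": f"p{i}", "domain": d}
    for i, d in enumerate(domains * 10)
]
train, test = stratified_split(records, key="domain", test_fraction=0.2)
```

Splitting each domain separately, rather than shuffling the whole collection once, is what keeps a smaller domain from being under-represented in the test set by chance.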
Provided by:
Research team from the paper



