
NAMAA-Space/Arabic-Triplet-With-Multi-Negatives

Hugging Face · Updated 2024-11-21 · Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/NAMAA-Space/Arabic-Triplet-With-Multi-Negatives
Resource description:
---
license: apache-2.0
task_categories:
- feature-extraction
- sentence-similarity
language:
- ar
size_categories:
- 10K<n<100K
---

# Arabic Triplet with Multi Negatives

## Dataset Summary

This dataset is a modified version of the Arabic subset of the [Mr. TyDi dataset](https://huggingface.co/datasets/castorini/mr-tydi), tailored for retrieval and re-ranking tasks. The original dataset has been restructured by splitting the negative passages into separate fields (`negative1`, `negative2`, ..., `negativeN`) for each query. This makes it easier to select, sample, or drop individual negatives when training and evaluating retrieval and re-ranking models.

The dataset retains the original intent of Mr. TyDi, focusing on monolingual retrieval for the Arabic language, while offering a structure that is easier to use in ranking-based learning tasks.

## Dataset Structure

The dataset includes a train split only. Each query is paired with a set of positive passages and up to 30 individually enumerated negative passages.

### Example Data

#### Train Set

```json
{
  "query_id": "1",
  "query": "متى تم تطوير نظرية الحقل الكمي؟",
  "positive_passages": [
    { "text": "بدأت نظرية الحقل الكمي بشكل طبيعي بدراسة التفاعلات الكهرومغناطيسية ..." }
  ],
  "negative1": { "text": "تم تنفيذ النهج مؤخرًا ليشمل نسخة جبرية من الحقل الكمي ..." },
  "negative2": { "text": "تتناول هذه المقالة الخلفية التاريخية لتطوير نظرية الحقل ..." },
  ...
}
```

### Language Coverage

The dataset focuses exclusively on the **Arabic** subset of Mr. TyDi.

### Loading the Dataset

You can load the dataset using the **datasets** library from Hugging Face:

```python
from datasets import load_dataset

dataset = load_dataset('NAMAA-Space/Arabic-Triplet-With-Multi-Negatives')
dataset
```

### Dataset Usage

The new format facilitates training retrieval and re-ranking models by providing explicit negative passage fields. This structure simplifies the handling of negative examples during model training and evaluation.

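The explicit negative fields can be flattened into (query, positive, negative) triplets for contrastive training. The following is a minimal sketch, not the dataset authors' own code. It assumes each record follows the shape of the example above: `positive_passages` is a list of objects with a `text` key, and each `negativeN` field is either such an object or empty when a query has fewer negatives. The helper names `collect_negatives` and `to_triplets` and the placeholder strings in `sample` are hypothetical and chosen for illustration only.

```python
import re

# Matches field names such as "negative1", "negative2", ..., "negative30".
NEGATIVE_FIELD = re.compile(r"^negative(\d+)$")


def _text(passage):
    """Return the passage text for a {"text": ...} dict or a plain string."""
    if isinstance(passage, dict):
        return passage.get("text")
    return passage


def collect_negatives(record):
    """Gather the non-empty negativeN texts of one record, ordered by N."""
    numbered = []
    for key, value in record.items():
        match = NEGATIVE_FIELD.match(key)
        if match is None:
            continue
        text = _text(value)
        if text:  # skip missing or empty negatives
            numbered.append((int(match.group(1)), text))
    return [text for _, text in sorted(numbered)]


def to_triplets(record):
    """Expand one record into (query, positive, negative) triplets."""
    positives = [_text(p) for p in record.get("positive_passages", [])]
    negatives = collect_negatives(record)
    return [
        (record["query"], pos, neg)
        for pos in positives
        if pos
        for neg in negatives
    ]


# Made-up record that mirrors the assumed field layout (not real dataset rows).
sample = {
    "query_id": "1",
    "query": "example query",
    "positive_passages": [{"text": "relevant passage"}],
    "negative1": {"text": "irrelevant passage 1"},
    "negative2": {"text": "irrelevant passage 2"},
    "negative10": {"text": "irrelevant passage 10"},
    "negative3": None,  # assumed representation of an absent negative
}

triplets = to_triplets(sample)
```

Sorting by the numeric suffix rather than by field name keeps `negative10` after `negative2`. If the field shapes in the loaded dataset differ from this assumption, only `_text` should need adjusting.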
### Citation Information

If you use this dataset in your research, please cite the original Mr. TyDi paper and this dataset as follows:

```
@article{mrtydi,
  title={{Mr. TyDi}: A Multi-lingual Benchmark for Dense Retrieval},
  author={Xinyu Zhang and Xueguang Ma and Peng Shi and Jimmy Lin},
  year={2021},
  journal={arXiv:2108.08787},
}

@dataset{Namaa,
  title={Arabic Triplet With Multi Negatives},
  author={Omer Nacar},
  year={2024},
  note={Hugging Face Dataset Repository}
}
```

Provided by:
NAMAA-Space