smartcat/ms_marco_sr

Name: smartcat/ms_marco_sr
Creator: smartcat
Published: 2024-10-03 13:22:19
License: MIT (per the dataset card metadata)

Source: Hugging Face. Updated: 2024-10-03. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/smartcat/ms_marco_sr


Resource description:

---
license: mit
task_categories:
- question-answering
language:
- en
- sr
pretty_name: MS MARCO SR
size_categories:
- 10K<n<100K
---

# Dataset Card for Serbian MS MARCO (Subset)

## Dataset Description

- **Repository:** [ms_marco_sr](https://huggingface.co/datasets/smartcat/serbian-msmarco-subset)
- **Point of Contact:** SmartCat.io

### Dataset Summary

This dataset is a Serbian translation of the first 8,000 examples from Microsoft's MS MARCO (Machine Reading Comprehension) dataset. It contains pairs of questions and human-generated answers, automatically translated from English to Serbian. The dataset is designed for evaluating embedding models on Question Answering (QA) and Information Retrieval (IR) tasks in the Serbian language.

The original MS MARCO dataset can be retrieved from: https://huggingface.co/datasets/microsoft/ms_marco

### Supported Tasks and Leaderboards

- **Question Answering**: The dataset can be used to evaluate models' ability to answer questions in Serbian based on given passages.
- **Information Retrieval**: It can also be used to assess models' performance in retrieving relevant information from a corpus of Serbian text.

### Languages

The dataset is in Serbian (sr).

## Dataset Structure

### Data Instances

Each instance in the dataset contains:

- `id`: The original MS MARCO question ID
- `query`: The question translated to Serbian
- `answer`: The human-generated answer translated to Serbian

### Data Fields

- `id`: string
- `query`: string
- `answer`: string

### Data Splits

The dataset consists of 8,000 examples from the original MS MARCO dataset. There are no predefined train/validation/test splits.

## Dataset Creation

### Curation Rationale

This dataset was created to provide a resource for evaluating NLP models on Serbian language tasks, particularly in the domains of question answering and information retrieval.
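As a rough illustration of the retrieval-style evaluation this dataset is meant for, the sketch below scores how often each query's own answer is ranked first among all answers. It assumes that answer `i` is the only relevant item for query `i` and that embeddings are plain lists of floats; the function names and the toy 2-D vectors are hypothetical stand-ins for the output of a real embedding model, not part of the dataset or of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_metrics(query_vecs, answer_vecs, k=1):
    """Return (recall@k, MRR), assuming answer i is the only relevant item for query i."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for i, q in enumerate(query_vecs):
        scores = [cosine(q, a) for a in answer_vecs]
        # Rank answer indices by descending similarity to the query.
        ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        rank = ranking.index(i) + 1
        hits += rank <= k
        reciprocal_rank_sum += 1.0 / rank
    n = len(query_vecs)
    return hits / n, reciprocal_rank_sum / n


# Toy vectors, made up for illustration; real use would embed `query` and `answer` texts.
toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_answers = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]
recall_at_1, mrr = retrieval_metrics(toy_queries, toy_answers, k=1)
print(f"recall@1={recall_at_1:.3f}  MRR={mrr:.3f}")
```

With 8,000 pairs, the pairwise loop above would be slow; a vectorized similarity matrix would be the more practical choice, but the metric definitions stay the same.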
### Source Data

#### Initial Data Collection and Normalization

The source data is derived from the MS MARCO dataset, which contains around 1 million pairs of real Bing questions and human-generated answers.

#### Who are the source language producers?

The original questions were posed by real users on the Bing search engine. The answers were generated by human annotators.

### Annotations

#### Annotation process

The original English dataset was automatically translated to Serbian using the GPT-3.5-Turbo-0125 model.

#### Who are the annotators?

The translation was performed automatically by an AI model, without human intervention.

### Personal and Sensitive Information

The dataset may contain personal information present in the original MS MARCO dataset. Users should be aware of this and handle the data accordingly.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset contributes to the development of NLP technologies for the Serbian language, potentially improving access to information and language technologies for Serbian speakers.

### Discussion of Biases

The dataset may inherit biases present in the original MS MARCO dataset. Additionally, the automatic translation process may introduce its own biases or errors.

### Other Known Limitations

- The quality of the Serbian translations has not been manually verified and may contain errors.
- The dataset is limited to the first 8,000 examples of MS MARCO, which may not be fully representative of the entire dataset.
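Because the translations have not been manually verified, one option is to pull a small, reproducible sample of rows for human spot-checking. The sketch below is one way to do that, assuming the rows are available as a list of dicts with the `id`, `query`, and `answer` fields; the helper name `sample_for_review` and the placeholder rows are hypothetical.

```python
import random


def sample_for_review(records, n=50, seed=0):
    """Return up to n records, chosen reproducibly, for manual inspection."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(records, min(n, len(records)))


# Placeholder rows (made up) that mimic the card's three string fields.
rows = [
    {"id": str(i), "query": f"pitanje {i}", "answer": f"odgovor {i}"}
    for i in range(200)
]
review_batch = sample_for_review(rows, n=5, seed=42)
for row in review_batch:
    print(row["id"], "|", row["query"], "|", row["answer"])
```

Fixing the seed means two reviewers can draw the same batch independently and compare notes.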
## Additional Information

### Dataset Curators

SmartCat.io

### Licensing Information

MIT, as declared in the dataset card metadata.

### Citation Information

If you use this dataset, please cite both the original MS MARCO dataset and this Serbian translation:

```
@article{nguyen2016ms,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}

@misc{serbian-msmarco-subset,
  title={Serbian MS MARCO Subset},
  author={[Smartcatio]},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/your-username/serbian-msmarco-subset}}
}
```

### Contributions

Thanks to Microsoft for creating the original MS MARCO dataset.

## Loading the Dataset

Here's a Python code example to load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub (dataset ID as given in this listing).
dataset = load_dataset("smartcat/ms_marco_sr")

# With no predefined splits, rows are typically exposed under the default "train" split.
for example in dataset["train"]:
    print(f"ID: {example['id']}")
    print(f"Query: {example['query']}")
    print(f"Answer: {example['answer']}")
    print("---")
```
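Since the card defines no train/validation/test splits, anyone who needs a held-out portion has to create it. A minimal sketch of one approach follows: hashing each example's `id` so that the assignment is stable across runs and machines. The function name `assign_split`, the split labels, and the 20% held-out fraction are assumptions for illustration, not part of the dataset.

```python
import hashlib


def assign_split(example_id, eval_fraction=0.2):
    """Deterministically map an example ID to "eval" or "train"."""
    digest = hashlib.sha256(str(example_id).encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "eval" if bucket < eval_fraction else "train"


# Hypothetical IDs; real use would pass each row's `id` field.
example_ids = [str(i) for i in range(10)]
for example_id in example_ids:
    print(example_id, assign_split(example_id))
```

Unlike a seeded shuffle, this assignment does not depend on row order, so the same ID lands in the same split even if the dataset is reloaded or reordered.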
