
data-is-better-together/MPEP_ARABIC

Hugging Face | Updated 2024-07-18 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/data-is-better-together/MPEP_ARABIC
Resource description:
---
size_categories: n<1K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for MPEP_ARABIC

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file named `argilla.yaml` that conforms to the Argilla dataset format. This file is used to configure the dataset when calling the `FeedbackDataset.from_huggingface` method in Argilla.

* Dataset records in a format compatible with HuggingFace `datasets`. These records are loaded automatically by `FeedbackDataset.from_huggingface` and can also be loaded independently with the `datasets` library via `load_dataset`.

* The [annotation guidelines](#annotation-guidelines) that were used for building and curating the dataset, if they have been defined in Argilla.

### Load with Argilla

To load with Argilla, install Argilla with `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("DIBT/MPEP_ARABIC")
```

### Load with `datasets`

To load this dataset with `datasets`, install `datasets` with `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("DIBT/MPEP_ARABIC")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset), so it can be used for different NLP tasks, depending on the configuration. The dataset structure is described in the [Dataset Structure section](#dataset-structure).

There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment, only text fields are supported. These are the fields that annotators read in order to answer the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| source | Source | text | True | True |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| target | Target | text | True | Translate the text. | N/A |

The **suggestions** are human or machine generated recommendations for each question, provided to assist the annotator during the annotation process. They are always linked to an existing question and are named by appending "-suggestion" and "-suggestion-metadata" to the question name; these columns hold the value(s) of the suggestion and its metadata, respectively. The possible values are therefore the same as in the table above, with only the column names changed by these suffixes.

The **metadata** is a dictionary that can be used to provide additional information about the dataset record. This can be useful to give the annotators additional context, or to describe the record itself, for example with a link to its original source, the author, or the date. The metadata is always optional, and can be linked to the `metadata_properties` defined in the dataset configuration file `argilla.yaml`.

| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |

The **guidelines** are optional as well, and are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": null,
    "fields": {
        "source": "If a recipe calls for 2 1/2 cups of sugar and you want to make a half portion of it, calculate the exact amount of sugar needed."
    },
    "metadata": {
        "evolved_from": null,
        "kind": "synthetic",
        "source": "argilla/distilabel-reasoning-prompts"
    },
    "responses": [
        {
            "status": "submitted",
            "user_id": "6e3edb87-0ccc-47ef-bd61-3ed0e68b20de",
            "values": {
                "target": {
                    "value": "إذا كانت الوصفة تتطلب كوبين ونصف من السكر وتريد تحضير نصف هذه الكمية، فاحسب كمية السكر المطلوبة بالضبط."
                }
            }
        }
    ],
    "suggestions": [
        {
            "agent": null,
            "question_name": "target",
            "score": null,
            "type": null,
            "value": "إذا كانت الوصفة تتطلب كوبين ونصف من السكر وتريد تحضير نصف الكمية، فاحسب الكمية الدقيقة من السكر المطلوبة."
        }
    ],
    "vectors": {}
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "external_id": null,
    "metadata": "{\"source\": \"argilla/distilabel-reasoning-prompts\", \"kind\": \"synthetic\", \"evolved_from\": null}",
    "source": "If a recipe calls for 2 1/2 cups of sugar and you want to make a half portion of it, calculate the exact amount of sugar needed.",
    "target": [
        {
            "status": "submitted",
            "user_id": "6e3edb87-0ccc-47ef-bd61-3ed0e68b20de",
            "value": "إذا كانت الوصفة تتطلب كوبين ونصف من السكر وتريد تحضير نصف هذه الكمية، فاحسب كمية السكر المطلوبة بالضبط."
        }
    ],
    "target-suggestion": "إذا كانت الوصفة تتطلب كوبين ونصف من السكر وتريد تحضير نصف الكمية، فاحسب الكمية الدقيقة من السكر المطلوبة.",
    "target-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    }
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment, only text fields are supported. These are the fields that annotators read in order to answer the questions.

    * **source** is of type `text`.

* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`.

    * **target** is of type `text`, with the description "Translate the text.".

* **Suggestions:** As of Argilla 1.13.0, suggestions are included to assist the annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself, but also the metadata linked to it, if applicable.

    * (optional) **target-suggestion** is of type `text`.

Two further optional fields are available:

* **metadata:** This is an optional field that can be used to provide additional information about the dataset record. This can be useful to give the annotators additional context, or to describe the record itself, for example with a link to its original source, the author, or the date. The metadata can be linked to the `metadata_properties` defined in the dataset configuration file `argilla.yaml`.

* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

This is a translation dataset that contains texts. Please translate the text in the text field.

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
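As a usage note on the `datasets` record layout described under Data Instances and Data Fields: `metadata` arrives as a JSON-encoded string and `target` as a list of annotator responses, so a little unpacking is needed before the rows can be used as translation pairs. The following is a minimal sketch, not part of the original card. It assumes every row has exactly the layout of the example record; `extract_translation` is a hypothetical helper name, and falling back to the suggestion when no response has been submitted is an illustrative design choice rather than something the dataset prescribes.

```python
import json

# Sample row copied from the `datasets` example in "Data Instances".
record = {
    "external_id": None,
    "metadata": '{"source": "argilla/distilabel-reasoning-prompts", '
                '"kind": "synthetic", "evolved_from": null}',
    "source": "If a recipe calls for 2 1/2 cups of sugar and you want to make "
              "a half portion of it, calculate the exact amount of sugar needed.",
    "target": [
        {
            "status": "submitted",
            "user_id": "6e3edb87-0ccc-47ef-bd61-3ed0e68b20de",
            "value": "إذا كانت الوصفة تتطلب كوبين ونصف من السكر وتريد تحضير "
                     "نصف هذه الكمية، فاحسب كمية السكر المطلوبة بالضبط.",
        }
    ],
    "target-suggestion": "إذا كانت الوصفة تتطلب كوبين ونصف من السكر وتريد "
                         "تحضير نصف الكمية، فاحسب الكمية الدقيقة من السكر المطلوبة.",
    "target-suggestion-metadata": {"agent": None, "score": None, "type": None},
}


def extract_translation(rec):
    """Hypothetical helper: turn one row into a flat source/translation pair."""
    # `metadata` is stored as a JSON string in the `datasets` layout.
    metadata = json.loads(rec["metadata"]) if rec.get("metadata") else {}

    # Keep only responses that annotators actually submitted.
    submitted = [
        response["value"]
        for response in (rec.get("target") or [])
        if response.get("status") == "submitted"
    ]

    # Assumption: prefer a human response, otherwise use the suggestion.
    translation = submitted[0] if submitted else rec.get("target-suggestion")

    return {
        "source": rec["source"],
        "translation": translation,
        "human_reviewed": bool(submitted),
        "prompt_origin": metadata.get("source"),
    }


pair = extract_translation(record)
```

The same helper could be applied to each row of the `train` split after `load_dataset`, provided the loaded rows really match this layout; it is worth inspecting one row first, since nested columns can be exposed differently depending on how the features are declared.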

Provider:
data-is-better-together