aaronemmanuel/fgan-annotate-dataset

Name: aaronemmanuel/fgan-annotate-dataset
Creator: aaronemmanuel
Published: 2024-01-15 13:42:02
License: 暂无描述

Hugging Face2024-01-15 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/aaronemmanuel/fgan-annotate-dataset

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集是通过Argilla创建的，包含与HuggingFace `datasets`库兼容的记录。数据集的主要内容包括背景、提示和响应等文本字段，以及用于注释的问题和建议。数据集的结构包括字段、问题、建议、元数据和指南。数据集的加载可以通过Argilla或`datasets`库进行。

提供机构：

aaronemmanuel

原始信息汇总

数据集卡片 for fgan-annotate-dataset

数据集描述

数据集概述

该数据集包含：

符合 Argilla 数据集格式的配置文件 argilla.yaml。该配置文件将在使用 Argilla 的 FeedbackDataset.from_huggingface 方法时用于配置数据集。
兼容 HuggingFace datasets 格式的数据记录。这些记录在使用 FeedbackDataset.from_huggingface 时会自动加载，也可以通过 datasets 库的 load_dataset 方法独立加载。
用于构建和整理数据集的标注指南（如果在 Argilla 中定义）。

加载方式

使用 Argilla 加载

安装 Argilla 后，使用以下代码加载数据集：

python import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("aaronemmanuel/fgan-annotate-dataset")

使用 `datasets` 库加载

安装 datasets 库后，使用以下代码加载数据集：

python from datasets import load_dataset

ds = load_dataset("aaronemmanuel/fgan-annotate-dataset")

支持的任务和排行榜

该数据集可以包含多个字段、问题和响应，因此可以用于不同的 NLP 任务，具体取决于配置。数据集结构在数据集结构部分中描述。

该数据集没有关联的排行榜。

语言

[更多信息需要]

数据集结构

数据在 Argilla 中

数据集在 Argilla 中包含以下内容：字段、问题、建议、元数据、向量和指南。

字段

字段是数据集记录本身，目前仅支持文本字段。这些字段将用于提供对问题的响应。

字段名称	标题	类型	必需	Markdown
background	Background	text	True	False
prompt	Prompt	text	True	False
response	Final Response	text	True	False

问题

问题是将向标注者提出的问题。它们可以是不同类型，如评分、文本、标签选择、多标签选择或排序。

问题名称	标题	类型	必需	描述	值/标签
response_correction	Response_correction	text	True	N/A	N/A

建议

建议是人为或机器生成的推荐，用于在标注过程中协助标注者。这些建议总是与现有问题相关联，并在名称中附加“-suggestion”和“-suggestion-metadata”，分别包含建议的值及其元数据。

元数据

元数据是一个字典，用于提供有关数据集记录的额外信息。这可以用于向标注者提供额外上下文，或提供有关数据集记录本身的额外信息。元数据总是可选的，并且可以与 argilla.yaml 中定义的 metadata_properties 相关联。

指南

指南是可选的，只是一个纯字符串，用于向标注者提供指示。请参阅标注指南部分。

数据实例

在 Argilla 中的数据集实例示例如下：

json { "external_id": null, "fields": { "background": "Background: ITU has published Use cases for Autonomous Networks. ITU focus group on autonomous networks studies the use cases and ITU-T SG13 has published many use cases on autonomous networks. These use cases are categorised into two main categories depending on whether they are related to application of autonomous networks or related to the core concepts of autonomous networks.", "prompt": "u003chumanu003e: who publishes use cases for autonomous networks?", "response": "u003cbotu003e: ITU publishes use cases for autonomous networks based on the work of ITU focus group on autonomous networks and ITU-T SG13." }, "metadata": {}, "responses": [], "suggestions": [], "vectors": {} }

在 HuggingFace datasets 中的相同记录示例如下：

json { "background": "Background: ITU has published Use cases for Autonomous Networks. ITU focus group on autonomous networks studies the use cases and ITU-T SG13 has published many use cases on autonomous networks. These use cases are categorised into two main categories depending on whether they are related to application of autonomous networks or related to the core concepts of autonomous networks.", "external_id": null, "metadata": "{}", "prompt": "u003chumanu003e: who publishes use cases for autonomous networks?", "response": "u003cbotu003e: ITU publishes use cases for autonomous networks based on the work of ITU focus group on autonomous networks and ITU-T SG13.", "response_correction": [], "response_correction-suggestion": null, "response_correction-suggestion-metadata": { "agent": null, "score": null, "type": null } }

数据字段

数据字段包括以下内容：

字段：这些是数据集记录本身，目前仅支持文本字段。这些字段将用于提供对问题的响应。
- background 类型为 text。
- prompt 类型为 text。
- response 类型为 text。
问题：这些问题将向标注者提出。它们可以是不同类型，如 RatingQuestion、TextQuestion、LabelQuestion、MultiLabelQuestion 和 RankingQuestion。
- response_correction 类型为 text。
建议：从 Argilla 1.13.0 开始，建议已包含在内，以向标注者提供建议，以便在标注过程中轻松或协助。建议与现有问题相关联，总是可选的，并且不仅包含建议本身，还包含与之相关的元数据（如果适用）。
- （可选）response_correction-suggestion 类型为 text。

此外，还有两个可选字段：

元数据：这是一个可选字段，用于提供有关数据集记录的额外信息。这可以用于向标注者提供额外上下文，或提供有关数据集记录本身的额外信息。元数据总是可选的，并且可以与 argilla.yaml 中定义的 metadata_properties 相关联。
external_id：这是一个可选字段，用于为数据集记录提供外部 ID。这可以用于将数据集记录与外部资源（如数据库或文件）相关联。

数据分割

数据集包含一个分割，即 train。

5,000+

优质数据集

54 个

任务类型

进入经典数据集