
frank098/new_questions

Hugging Face | Updated 2023-07-19 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/frank098/new_questions
Resource description:
---
size_categories: n<1K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for new_questions

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.cfg`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.

* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.

* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, you'll just need to install Argilla as `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("frank098/new_questions")
```

### Load with `datasets`

To load this dataset with `datasets`, you'll just need to install `datasets` as `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("frank098/new_questions")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/guides/llms/conceptual_guides/data_model.html) so it can be used for different NLP tasks, depending on the configuration.
The dataset structure is described in the [Dataset Structure section](#dataset-structure).

There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment only text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| category | Task category | TextField | True | False |
| context | Context | TextField | True | False |
| template_1 | Template 1 | TextField | True | False |
| example_1 | Example | TextField | True | False |
| template_2 | Template 2 | TextField | True | False |
| example_2 | Example | TextField | True | False |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| new-instruction | First instruction | TextQuestion | True | Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here. | N/A |
| new-input | Second instruction: | TextQuestion | True | Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. If an input is not needed, leave this empty. | N/A |

Finally, the **guidelines** are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": null,
    "fields": {
        "category": "Closed questions",
        "context": "Closed questions are designed to elicit a specific yes or no answer. They are used to gather factual information or confirm specific attributes or actions related to the topic.",
        "example_1": "Is JunOS compatible with Juniper Routers?",
        "example_2": "Does JunOS support MPLS (Multiprotocol Label Switching) technology?",
        "template_1": "Is [topic] [attribute]?",
        "template_2": "Does [subject] [action] [object]?"
    },
    "metadata": null,
    "responses": [
        {
            "status": "submitted",
            "user_id": "7a7c4dd3-769c-4a22-9658-581e4aee3577",
            "values": {
                "new-input": {
                    "value": "Hello?"
                },
                "new-instruction": {
                    "value": "Hello?"
                }
            }
        }
    ]
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "category": "Closed questions",
    "context": "Closed questions are designed to elicit a specific yes or no answer. They are used to gather factual information or confirm specific attributes or actions related to the topic.",
    "example_1": "Is JunOS compatible with Juniper Routers?",
    "example_2": "Does JunOS support MPLS (Multiprotocol Label Switching) technology?",
    "external_id": null,
    "metadata": null,
    "new-input": {
        "status": [
            "submitted"
        ],
        "user_id": [
            "7a7c4dd3-769c-4a22-9658-581e4aee3577"
        ],
        "value": [
            "Hello?"
        ]
    },
    "new-instruction": {
        "status": [
            "submitted"
        ],
        "user_id": [
            "7a7c4dd3-769c-4a22-9658-581e4aee3577"
        ],
        "value": [
            "Hello?"
        ]
    },
    "template_1": "Is [topic] [attribute]?",
    "template_2": "Does [subject] [action] [object]?"
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment only text fields are supported. These are the ones that will be used to provide responses to the questions.

    * **category** is of type `TextField`.
    * **context** is of type `TextField`.
    * **template_1** is of type `TextField`.
    * **example_1** is of type `TextField`.
    * **template_2** is of type `TextField`.
    * **example_2** is of type `TextField`.

* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

    * **new-instruction** is of type `TextQuestion`, and description "Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here.".
    * **new-input** is of type `TextQuestion`, and description "Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. If an input is not needed, leave this empty.".

Additionally, there is one more field, which is optional:

* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

To create a varied range of questions, we collect 14 distinct question types. For each question, we create 2 example templates and provide an example for each template. Your task is to generate 2 fresh instructions that can be used with the given templates. If you come up with different ideas, feel free to deviate from the templates. Once you have generated both questions, click submit.

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provider:
frank098
Summary of original information

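Dataset Card for new_questions

Dataset Description

Dataset Summary

This dataset contains:

  • A configuration file named argilla.cfg that conforms to the Argilla dataset format and is used to configure the dataset when calling the FeedbackDataset.from_huggingface method.
  • Dataset records in a format compatible with HuggingFace datasets. These records are loaded automatically when using FeedbackDataset.from_huggingface and can also be loaded independently with the datasets library.
  • The annotation guidelines used for building and curating the dataset, if they have been defined in Argilla.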

Loading the dataset

Load with Argilla

Install Argilla:

pip install argilla --upgrade

Load the dataset:

import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("frank098/new_questions")

Load with datasets

Install datasets:

pip install datasets --upgrade

Load the dataset:

from datasets import load_dataset

ds = load_dataset("frank098/new_questions")

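Supported Tasks and Leaderboards

This dataset can contain multiple fields, questions, and responses, so it can be used for different NLP tasks, depending on the configuration. The dataset structure is described in the Dataset Structure section.

There are no leaderboards associated with this dataset.

Languages

[More Information Needed]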

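Dataset Structure

Data in Argilla

The dataset is created in Argilla with: fields, questions, and guidelines.

Fields

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| category | Task category | TextField | True | False |
| context | Context | TextField | True | False |
| template_1 | Template 1 | TextField | True | False |
| example_1 | Example | TextField | True | False |
| template_2 | Template 2 | TextField | True | False |
| example_2 | Example | TextField | True | False |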

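Questions

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| new-instruction | First instruction | TextQuestion | True | Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here. | N/A |
| new-input | Second instruction: | TextQuestion | True | Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. If an input is not needed, leave this empty. | N/A |

Guidelines

The guidelines are a plain string used to provide instructions to the annotators. See the annotation guidelines section.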

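Data Instances

Example of a dataset instance in Argilla

{
  "external_id": null,
  "fields": {
    "category": "Closed questions",
    "context": "Closed questions are designed to elicit a specific yes or no answer. They are used to gather factual information or confirm specific attributes or actions related to the topic.",
    "example_1": "Is JunOS compatible with Juniper Routers?",
    "example_2": "Does JunOS support MPLS (Multiprotocol Label Switching) technology?",
    "template_1": "Is [topic] [attribute]?",
    "template_2": "Does [subject] [action] [object]?"
  },
  "metadata": null,
  "responses": [
    {
      "status": "submitted",
      "user_id": "7a7c4dd3-769c-4a22-9658-581e4aee3577",
      "values": {
        "new-input": { "value": "Hello?" },
        "new-instruction": { "value": "Hello?" }
      }
    }
  ]
}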
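
Each record pairs a template with an example built from it. As a rough illustration of how the bracketed placeholders relate to the examples, the sketch below fills a template from a dictionary. The `fill_template` helper and the placeholder values are hypothetical and chosen only to reproduce the two examples in the record; nothing in the dataset or in Argilla is known to perform this substitution.

```python
import re


def fill_template(template: str, values: dict) -> str:
    """Replace each [placeholder] with its value; unknown placeholders are left untouched."""
    return re.sub(
        r"\[([^\[\]]+)\]",
        lambda match: values.get(match.group(1), match.group(0)),
        template,
    )


# Placeholder values are assumptions picked to match example_1 and example_2.
example_1 = fill_template(
    "Is [topic] [attribute]?",
    {"topic": "JunOS", "attribute": "compatible with Juniper Routers"},
)
example_2 = fill_template(
    "Does [subject] [action] [object]?",
    {
        "subject": "JunOS",
        "action": "support",
        "object": "MPLS (Multiprotocol Label Switching) technology",
    },
)
print(example_1)
print(example_2)
```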

Example of the same instance in HuggingFace datasets

{
  "category": "Closed questions",
  "context": "Closed questions are designed to elicit a specific yes or no answer. They are used to gather factual information or confirm specific attributes or actions related to the topic.",
  "example_1": "Is JunOS compatible with Juniper Routers?",
  "example_2": "Does JunOS support MPLS (Multiprotocol Label Switching) technology?",
  "external_id": null,
  "metadata": null,
  "new-input": {
    "status": [ "submitted" ],
    "user_id": [ "7a7c4dd3-769c-4a22-9658-581e4aee3577" ],
    "value": [ "Hello?" ]
  },
  "new-instruction": {
    "status": [ "submitted" ],
    "user_id": [ "7a7c4dd3-769c-4a22-9658-581e4aee3577" ],
    "value": [ "Hello?" ]
  },
  "template_1": "Is [topic] [attribute]?",
  "template_2": "Does [subject] [action] [object]?"
}
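
Comparing the two instances suggests how the layouts relate: the entries under `fields` become top-level columns, and each question becomes a column holding parallel lists of `status`, `user_id`, and `value`, with one entry per response. The sketch below reproduces that mapping in plain Python. It is inferred only from the two examples shown here; the `flatten_record` helper is hypothetical, and the conversion Argilla actually performs may differ, for instance in how it treats unanswered questions.

```python
def flatten_record(record: dict, question_names=("new-instruction", "new-input")) -> dict:
    """Map an Argilla-style record to the flat layout shown for `datasets`."""
    # Field values become top-level columns.
    flat = dict(record["fields"])
    flat["external_id"] = record.get("external_id")
    flat["metadata"] = record.get("metadata")

    # Each question becomes a column of parallel lists, one entry per response.
    for name in question_names:
        flat[name] = {"status": [], "user_id": [], "value": []}

    for response in record.get("responses") or []:
        for name in question_names:
            answer = response.get("values", {}).get(name)
            if answer is None:
                continue  # Assumption: questions without an answer are skipped.
            flat[name]["status"].append(response["status"])
            flat[name]["user_id"].append(response["user_id"])
            flat[name]["value"].append(answer["value"])
    return flat


# A shortened version of the instance above (only one field kept).
record = {
    "external_id": None,
    "fields": {"category": "Closed questions"},
    "metadata": None,
    "responses": [
        {
            "status": "submitted",
            "user_id": "7a7c4dd3-769c-4a22-9658-581e4aee3577",
            "values": {
                "new-input": {"value": "Hello?"},
                "new-instruction": {"value": "Hello?"},
            },
        }
    ],
}
flat = flatten_record(record)
print(flat["new-input"])
```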

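Data Fields

The data fields are:

  • Fields: These are the dataset records themselves; for the moment only text fields are supported. These are the ones that will be used to provide responses to the questions.

    • category is of type TextField
    • context is of type TextField
    • template_1 is of type TextField
    • example_1 is of type TextField
    • template_2 is of type TextField
    • example_2 is of type TextField
  • Questions: These are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

    • new-instruction is of type TextQuestion, with the description "Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here."
    • new-input is of type TextQuestion, with the description "Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. If an input is not needed, leave this empty."

Additionally, there is one optional field:

  • external_id: An optional field that can be used to provide an external ID for the dataset record. This can be useful for linking the record to an external resource, such as a database or a file.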
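
The field and question names listed above can serve as a lightweight sanity check on a record before further processing. The following is a minimal sketch in plain Python, assuming the Argilla-style record layout shown under Data Instances; `missing_fields` and `unknown_questions` are hypothetical helpers, not part of Argilla or `datasets`.

```python
from typing import List

# Names copied from the field and question lists above.
REQUIRED_FIELDS = (
    "category", "context", "template_1", "example_1", "template_2", "example_2",
)
QUESTION_NAMES = ("new-instruction", "new-input")


def missing_fields(record: dict) -> List[str]:
    """Return required field names that are absent or not strings."""
    fields = record.get("fields", {})
    return [name for name in REQUIRED_FIELDS if not isinstance(fields.get(name), str)]


def unknown_questions(record: dict) -> List[str]:
    """Return response keys that match neither question name."""
    unknown = []
    for response in record.get("responses") or []:
        for name in response.get("values", {}):
            if name not in QUESTION_NAMES and name not in unknown:
                unknown.append(name)
    return unknown
```

A record that passes both checks returns two empty lists; `external_id` is deliberately not checked because it is optional.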

Data Splits

The dataset contains a single split, which is train.
