
do11/test

Source: Hugging Face. Updated 2023-08-01, indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/do11/test
Resource description:
---
size_categories: 10K<n<100K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for test

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.cfg`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, you'll just need to install Argilla as `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("do11/test")
```

### Load with `datasets`

To load this dataset with `datasets`, you'll just need to install `datasets` as `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("do11/test")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/guides/llms/conceptual_guides/data_model.html) so it can be used for different NLP tasks, depending on the configuration.
The dataset structure is described in the [Dataset Structure section](#dataset-structure). There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| category | Task category | TextField | True | False |
| instruction | Instruction | TextField | True | False |
| context | Input | TextField | True | False |
| response | Response | TextField | True | False |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| new-instruction | Final instruction: | TextQuestion | True | Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here. | N/A |
| new-input | Final input: | TextQuestion | True | Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. If an input is not needed, leave this empty. | N/A |
| new-response | Final response: | TextQuestion | True | Write the final version of the response, making sure that it matches the task category and makes sense for the instruction (and input) provided. If the original response is ok, copy and paste it here. | N/A |

Finally, the **guidelines** are just a plain string that can be used to provide instructions to the annotators.
Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": "0",
    "fields": {
        "category": "closed_qa",
        "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\u0027s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
        "instruction": "When did Virgin Australia start operating?",
        "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route."
    },
    "metadata": null,
    "responses": []
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "category": "closed_qa",
    "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\u0027s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
    "external_id": "0",
    "instruction": "When did Virgin Australia start operating?",
    "metadata": null,
    "new-input": null,
    "new-instruction": null,
    "new-response": null,
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route."
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.
    * **category** is of type `TextField`.
    * **instruction** is of type `TextField`.
    * (optional) **context** is of type `TextField`.
    * **response** is of type `TextField`.
* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as rating, text, single choice, or multiple choice.
    * **new-instruction** is of type `TextQuestion`, and description "Write the final version of the instruction, making sure that it matches the task category. If the original instruction is ok, copy and paste it here.".
    * (optional) **new-input** is of type `TextQuestion`, and description "Write the final version of the input, making sure that it makes sense with the task category. If the original input is ok, copy and paste it here. If an input is not needed, leave this empty.".
    * **new-response** is of type `TextQuestion`, and description "Write the final version of the response, making sure that it matches the task category and makes sense for the instruction (and input) provided. If the original response is ok, copy and paste it here.".

Additionally, there is one more optional field:

* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?
[More Information Needed]

### Annotations

#### Annotation guidelines

In this dataset, you will find a collection of records that show a category, an instruction, an input and a response to that instruction. The aim of the project is to correct the instructions, input and responses to make sure they are of the highest quality and that they match the task category that they belong to. All three texts should be clear and include real information. In addition, the response should be as complete but concise as possible.

To curate the dataset, you will need to provide an answer to the following text fields:

1 - Final instruction: The final version of the instruction field. You may copy it using the copy icon in the instruction field. Leave it as it is if it's ok or apply any necessary corrections. Remember to change the instruction if it doesn't represent well the task category of the record.

2 - Final input: The final version of the input field. You may copy it using the copy icon in the input field. Leave it as it is if it's ok or apply any necessary corrections. If the task category and instruction don't need an input to be completed, leave this question blank.

3 - Final response: The final version of the response field. You may copy it using the copy icon in the response field. Leave it as it is if it's ok or apply any necessary corrections. Check that the response makes sense given all the fields above.

You will need to provide at least an instruction and a response for all records. If you are not sure about a record and you prefer not to provide a response, click Discard.

#### Annotation process

[More Information Needed]

#### Who are the annotators?
[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provider:
do11
Summary of Original Information

Dataset Overview

Basic Information

  • Dataset name: test
  • Size category: 10K<n<100K
  • Tags:
    • rlfh
    • argilla
    • human-feedback

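Dataset Description

Dataset Summary

  • Configuration file: argilla.cfg, which conforms to the Argilla dataset format.
  • Record format: compatible with HuggingFace datasets.
  • Annotation guidelines: if defined, they can be found in the annotation guidelines section.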

Loading

  • Load with Argilla: `import argilla as rg`, then `ds = rg.FeedbackDataset.from_huggingface("do11/test")`

  • Load with the datasets library: `from datasets import load_dataset`, then `ds = load_dataset("do11/test")`

Supported Tasks and Languages

  • Tasks: supports multiple NLP tasks, depending on the configuration.
  • Languages: [More Information Needed]

Dataset Structure

Data Fields

  • Fields:
    • category (TextField)
    • instruction (TextField)
    • context (TextField)
    • response (TextField)
  • Questions:
    • new-instruction (TextQuestion)
    • new-input (TextQuestion)
    • new-response (TextQuestion)
  • External ID: optional, used to link a record to an external resource.
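
The column groups listed above can be made concrete with a small sketch. It assumes a row in the flat `datasets` layout shown in the dataset card; `split_row`, `FIELD_NAMES`, and `QUESTION_NAMES` are hypothetical names introduced here for illustration and are not part of Argilla or `datasets`.

```python
# Column groups copied from the field and question lists above.
FIELD_NAMES = ("category", "instruction", "context", "response")
QUESTION_NAMES = ("new-instruction", "new-input", "new-response")


def split_row(row):
    """Split one flat row into (fields, answers, extras).

    Hypothetical helper: it only groups columns by name and does not
    validate their contents.
    """
    fields = {name: row.get(name) for name in FIELD_NAMES}
    answers = {name: row.get(name) for name in QUESTION_NAMES}
    known = set(FIELD_NAMES) | set(QUESTION_NAMES)
    # Anything that is neither a field nor a question, e.g. external_id.
    extras = {key: value for key, value in row.items() if key not in known}
    return fields, answers, extras


row = {
    "category": "closed_qa",
    "context": "...",
    "instruction": "...",
    "response": "...",
    "external_id": "0",
    "metadata": None,
    "new-input": None,
    "new-instruction": None,
    "new-response": None,
}
fields, answers, extras = split_row(row)
print(sorted(extras))  # ['external_id', 'metadata']
```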

Data Instances

  • Example: `{ "external_id": "0", "fields": { "category": "closed_qa", "context": "...", "instruction": "...", "response": "..." }, "metadata": null, "responses": [] }`
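
As a rough illustration of how the nested Argilla record above relates to the flat `datasets` row in the dataset card, the sketch below promotes each entry of `fields` to a top-level column and adds one null column per question. The mapping is inferred only from the two examples in the card; `flatten_record` is a hypothetical helper, not an Argilla function, and records that already carry responses are deliberately not handled because the card does not show their layout.

```python
QUESTION_NAMES = ("new-instruction", "new-input", "new-response")


def flatten_record(record, question_names=QUESTION_NAMES):
    """Turn an Argilla-style record with no responses into a flat row."""
    if record.get("responses"):
        # The card only shows an empty "responses" list, so the layout of
        # submitted answers is left out of this sketch on purpose.
        raise ValueError("records with responses are outside this sketch")
    row = dict(record["fields"])  # promote every field to a top-level column
    row["external_id"] = record.get("external_id")
    row["metadata"] = record.get("metadata")
    for name in question_names:
        row[name] = None  # unanswered questions appear as null columns
    return row


record = {
    "external_id": "0",
    "fields": {
        "category": "closed_qa",
        "context": "...",
        "instruction": "...",
        "response": "...",
    },
    "metadata": None,
    "responses": [],
}
row = flatten_record(record)
print(sorted(row))
```

The resulting keys match the nine columns of the flat example in the card.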

Data Splits

  • Splits: contains only the train split.

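Dataset Creation

Annotation Guidelines

  • Purpose: correct the instructions, inputs, and responses so that they are of high quality and match the task category.
  • Requirements:
    • Instructions, inputs, and responses should be clear and contain real information.
    • Responses should be complete yet concise.
  • Annotation steps:
    • Correct the instruction.
    • Correct the input (optional).
    • Correct the response.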
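
The guidelines in the dataset card require at least an instruction and a response for every record, while the input may be left blank. A minimal sketch of that completeness rule follows; `is_complete_annotation` is a hypothetical helper, and it assumes the annotator's answers arrive as a plain dict keyed by the question names, which the card does not specify.

```python
def is_complete_annotation(answers):
    """Return True when both the instruction and the response are non-empty."""

    def filled(name):
        value = answers.get(name)
        return isinstance(value, str) and value.strip() != ""

    # "new-input" is not checked: the guidelines allow leaving it blank.
    return filled("new-instruction") and filled("new-response")


complete = {
    "new-instruction": "When did Virgin Australia start operating?",
    "new-input": "",
    "new-response": "On 31 August 2000, as Virgin Blue.",
}
blank_response = {"new-instruction": "Summarize the text.", "new-response": "   "}

print(is_complete_annotation(complete))        # True
print(is_complete_annotation(blank_response))  # False
```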

Additional Information

  • Dataset curators, licensing, citation information, and contributions: [More Information Needed]