frascuchon/oasst_response_quality

Name: frascuchon/oasst_response_quality
Creator: frascuchon
Published: 2023-11-10 13:59:55
License: No description available

Hugging Face | Updated: 2023-11-10 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/frascuchon/oasst_response_quality


Resource description:

---
size_categories: 1K<n<10K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for oasst_response_quality

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.yaml`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, you'll just need to install Argilla as `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("frascuchon/oasst_response_quality")
```

### Load with `datasets`

To load this dataset with `datasets`, you'll just need to install `datasets` as `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("frascuchon/oasst_response_quality")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset) so it can be used for different NLP tasks, depending on the configuration. The dataset structure is described in the [Dataset Structure section](#dataset-structure).

There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment only text fields are supported. These are the inputs that annotators read in order to answer the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| prompt | Prompt | FieldTypes.text | True | True |
| response | Response | FieldTypes.text | True | True |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| relevant | Is the response relevant for the given prompt? | QuestionTypes.label_selection | True | N/A | ['Yes', 'No'] |
| content_class | Does the response include any of the following? | QuestionTypes.multi_label_selection | False | N/A | ['hate', 'inappropriate', 'not_english', 'pii', 'sexual', 'untruthful', 'violent'] |
| rating | Rate the quality of the response: | QuestionTypes.rating | True | Rate the quality of the response based on its truthfulness and helpfulness. 1 is very bad and 10 is very good. | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| corrected-text | Provide a correction to the response: | QuestionTypes.text | False | If the rating provided is below 4, please write a new version of the response. | N/A |

The **suggestions** are human or machine generated recommendations for each question that assist the annotator during the annotation process. They are always linked to an existing question and are named by appending "-suggestion" and "-suggestion-metadata" to the question name; these columns contain the value/s of the suggestion and its metadata, respectively. The possible values are therefore the same as in the table above.

The **metadata** is a dictionary that can be used to provide additional information about the dataset record. It can give annotators extra context or describe the record itself, for example with a link to the original source, the author, or the date. The metadata is always optional, and can be potentially linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.
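As a rough illustration of the suggestion naming convention described above, the following sketch derives the two suggestion column names for each question. The helper name `suggestion_columns` is hypothetical and not part of Argilla; only the question names and the two suffixes come from this card.

```python
# Question names taken from the questions table in this card.
QUESTION_NAMES = ["relevant", "content_class", "rating", "corrected-text"]


def suggestion_columns(question_name):
    """Return the (value, metadata) column names for a question's suggestion.

    Hypothetical helper: it only applies the "-suggestion" and
    "-suggestion-metadata" suffixes described in the card.
    """
    return (
        f"{question_name}-suggestion",
        f"{question_name}-suggestion-metadata",
    )


if __name__ == "__main__":
    for name in QUESTION_NAMES:
        print(suggestion_columns(name))
```

These derived names match the keys that appear in the `datasets`-format record shown under Data Instances, such as `rating-suggestion` and `rating-suggestion-metadata`.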
**✨ NEW** The **vectors** are columns that each contain a floating-point vector, constrained to the dimensions pre-defined in the **vectors_settings** when configuring the vectors within the dataset itself; the vectors are always 1-dimensional. The **vectors** are optional and identified by the pre-defined vector name in the dataset configuration file in `argilla.yaml`.

| Vector Name | Title | Dimensions |
|-------------|-------|------------|
| prompt | Prompt | [1, 2] |

| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |

The **guidelines** are optional as well, and are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": null,
    "fields": {
        "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
        "response": "Sure! Let\u0027s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn\u0027t be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points."
    },
    "metadata": {},
    "responses": [],
    "suggestions": [],
    "vectors": {
        "prompt": [
            1,
            2
        ]
    }
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "content_class": [],
    "content_class-suggestion": null,
    "content_class-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "corrected-text": [],
    "corrected-text-suggestion": null,
    "corrected-text-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "external_id": null,
    "metadata": "{}",
    "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
    "rating": [],
    "rating-suggestion": null,
    "rating-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "relevant": [],
    "relevant-suggestion": null,
    "relevant-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "response": "Sure! Let\u0027s say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn\u0027t be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points.",
    "vectors": {
        "prompt": [
            1.0,
            2.0
        ]
    }
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment only text fields are supported. These are the inputs that annotators read in order to answer the questions.
    * **prompt** is of type `FieldTypes.text`.
    * **response** is of type `FieldTypes.text`.

* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`.
    * **relevant** is of type `QuestionTypes.label_selection` with the following allowed values ['Yes', 'No'].
    * (optional) **content_class** is of type `QuestionTypes.multi_label_selection` with the following allowed values ['hate', 'inappropriate', 'not_english', 'pii', 'sexual', 'untruthful', 'violent'].
    * **rating** is of type `QuestionTypes.rating` with the following allowed values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], and description "Rate the quality of the response based on its truthfulness and helpfulness. 1 is very bad and 10 is very good.".
    * (optional) **corrected-text** is of type `QuestionTypes.text`, and description "If the rating provided is below 4, please write a new version of the response.".

* **Suggestions:** As of Argilla 1.13.0, suggestions are included to assist the annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself, but also the metadata linked to it, if applicable.
    * (optional) **relevant-suggestion** is of type `QuestionTypes.label_selection` with the following allowed values ['Yes', 'No'].
    * (optional) **content_class-suggestion** is of type `QuestionTypes.multi_label_selection` with the following allowed values ['hate', 'inappropriate', 'not_english', 'pii', 'sexual', 'untruthful', 'violent'].
    * (optional) **rating-suggestion** is of type `QuestionTypes.rating` with the following allowed values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
    * (optional) **corrected-text-suggestion** is of type `QuestionTypes.text`.

* **✨ NEW** **Vectors**: As of Argilla 1.19.0, vectors are included to support similarity search, so that similar records can be explored through vector search powered by the configured search engine. The vectors are not shown in the UI; they are uploaded and used internally. They are always optional and must have the dimensions previously defined in their settings.
    * (optional) **prompt** is of type `float32` and has a dimension of (1, `2`).

Additionally, we also have two more fields that are optional and are the following:

* **metadata:** This is an optional field that can be used to provide additional information about the dataset record. It can give annotators extra context or describe the record itself, for example with a link to the original source, the author, or the date. The metadata is always optional, and can be potentially linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.
* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

Answer the questions to assess the quality of the response given by the chat assistant.

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]

Provider:

frascuchon

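Summary of Original Information

Dataset Card for oasst_response_quality

Dataset Description

Dataset Summary

This dataset contains:

A configuration file, argilla.yaml, that conforms to the Argilla dataset format and is used to configure the dataset when calling FeedbackDataset.from_huggingface.
Dataset records compatible with HuggingFace datasets. These records are loaded automatically when using FeedbackDataset.from_huggingface and can also be loaded independently with the datasets library.
The annotation guidelines used to build and curate the dataset, if they have been defined in Argilla.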

Loading

Load with Argilla

Install Argilla:

pip install argilla --upgrade

Load the dataset:

import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("frascuchon/oasst_response_quality")

Load with `datasets`

Install datasets:

pip install datasets --upgrade

Load the dataset:

from datasets import load_dataset

ds = load_dataset("frascuchon/oasst_response_quality")

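Supported Tasks and Leaderboards

This dataset can contain multiple fields, questions, and responses, so it can be used for different NLP tasks depending on the configuration. The dataset structure is described in the Dataset Structure section.

There are no leaderboards associated with this dataset.

Languages

[More Information Needed]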

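Dataset Structure

Data in Argilla

In Argilla, the dataset consists of the following parts: fields, questions, suggestions, metadata, vectors, and guidelines.

Fields

Field Name	Title	Type	Required	Markdown
prompt	Prompt	FieldTypes.text	True	True
response	Response	FieldTypes.text	True	True

Questions

Question Name	Title	Type	Required	Description	Values/Labels
relevant	Is the response relevant for the given prompt?	QuestionTypes.label_selection	True	N/A	[Yes, No]
content_class	Does the response include any of the following?	QuestionTypes.multi_label_selection	False	N/A	[hate, inappropriate, not_english, pii, sexual, untruthful, violent]
rating	Rate the quality of the response:	QuestionTypes.rating	True	Rate the quality of the response based on its truthfulness and helpfulness. 1 is very bad and 10 is very good.	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
corrected-text	Provide a correction to the response:	QuestionTypes.text	False	If the rating provided is below 4, please write a new version of the response.	N/A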
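The question definitions in the table above can be read as a small validation schema. The sketch below checks one annotator's answers against the required flags and allowed values listed there. The `QUESTIONS` dictionary layout and the `validate_answers` helper are assumptions made for illustration; they are not Argilla's own data model or API.

```python
# Question definitions copied from the questions table. The dict layout is an
# assumption made for this sketch, not Argilla's internal schema.
QUESTIONS = {
    "relevant": {"required": True, "allowed": {"Yes", "No"}, "multi": False},
    "content_class": {
        "required": False,
        "allowed": {"hate", "inappropriate", "not_english", "pii",
                    "sexual", "untruthful", "violent"},
        "multi": True,
    },
    "rating": {"required": True, "allowed": set(range(1, 11)), "multi": False},
    # Free-text question: no fixed set of allowed values.
    "corrected-text": {"required": False, "allowed": None, "multi": False},
}


def validate_answers(answers):
    """Return a list of problems; an empty list means the answers are valid."""
    problems = []
    for name, spec in QUESTIONS.items():
        if name not in answers:
            if spec["required"]:
                problems.append(f"missing required question: {name}")
            continue
        if spec["allowed"] is None:
            continue  # free text, nothing to check against
        value = answers[name]
        values = list(value) if spec["multi"] else [value]
        for item in values:
            if item not in spec["allowed"]:
                problems.append(f"invalid value for {name}: {item!r}")
    for name in answers:
        if name not in QUESTIONS:
            problems.append(f"unknown question: {name}")
    return problems


if __name__ == "__main__":
    print(validate_answers({"relevant": "Yes", "rating": 8}))
    print(validate_answers({"relevant": "Maybe"}))
```

A multi-label answer such as `content_class` is expected as a list of labels here; the other questions take a single value.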

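Suggestions

Suggestions are human or machine generated recommendations linked to the existing questions; they assist annotators during the annotation process.

Metadata

The metadata is a dictionary that provides additional information about a dataset record.

Vectors

The vectors are columns that contain floating-point values; their dimensions are pre-defined when the dataset is configured.

Vector Name	Title	Dimensions
prompt	Prompt	[1, 2]

Guidelines

The guidelines are optional and provide instructions to the annotators.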

Data Instances

A data instance in Argilla

    {
        "external_id": null,
        "fields": {
            "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
            "response": "Sure! Let's say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn't be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points."
        },
        "metadata": {},
        "responses": [],
        "suggestions": [],
        "vectors": { "prompt": [ 1, 2 ] }
    }

A data instance in HuggingFace `datasets`

    {
        "content_class": [],
        "content_class-suggestion": null,
        "content_class-suggestion-metadata": { "agent": null, "score": null, "type": null },
        "corrected-text": [],
        "corrected-text-suggestion": null,
        "corrected-text-suggestion-metadata": { "agent": null, "score": null, "type": null },
        "external_id": null,
        "metadata": "{}",
        "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
        "rating": [],
        "rating-suggestion": null,
        "rating-suggestion-metadata": { "agent": null, "score": null, "type": null },
        "relevant": [],
        "relevant-suggestion": null,
        "relevant-suggestion-metadata": { "agent": null, "score": null, "type": null },
        "response": "Sure! Let's say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn't be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points.",
        "vectors": { "prompt": [ 1.0, 2.0 ] }
    }
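In the HuggingFace `datasets` layout, `metadata` is stored as a JSON string and each question column holds a list of responses. The following is a minimal sketch of decoding such a flat row into something easier to work with. The `decode_row` helper and the `example_row` stand-in are hypothetical; the placeholder text values replace the long prompt and response of the real record.

```python
import json

QUESTION_NAMES = ("relevant", "content_class", "rating", "corrected-text")


def decode_row(row):
    """Turn a flat `datasets` row into a small, easier-to-use dict."""
    return {
        "prompt": row["prompt"],
        "response": row["response"],
        # `metadata` is stored as a JSON string ("{}" in the example record).
        "metadata": json.loads(row["metadata"]) if row.get("metadata") else {},
        # Each question column is a list of responses; empty means unanswered.
        "answered": [q for q in QUESTION_NAMES if row.get(q)],
        "prompt_vector": (row.get("vectors") or {}).get("prompt"),
    }


# Shortened stand-in for the example record; the text values are placeholders.
example_row = {
    "prompt": "placeholder prompt",
    "response": "placeholder response",
    "metadata": "{}",
    "relevant": [],
    "content_class": [],
    "rating": [],
    "corrected-text": [],
    "vectors": {"prompt": [1.0, 2.0]},
}

decoded = decode_row(example_row)
print(decoded["answered"], decoded["prompt_vector"])
```

For the example record, no question has been answered yet, so the `answered` list is empty.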

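Data Fields

The dataset fields include:

Fields: the dataset records themselves; currently only text fields are supported.
- prompt is of type FieldTypes.text.
- response is of type FieldTypes.text.
Questions: the questions asked to the annotators. They can be of different types, such as RatingQuestion, TextQuestion, LabelQuestion, MultiLabelQuestion, and RankingQuestion.
- relevant is of type QuestionTypes.label_selection, with allowed values [Yes, No].
- content_class (optional) is of type QuestionTypes.multi_label_selection, with allowed values [hate, inappropriate, not_english, pii, sexual, untruthful, violent].
- rating is of type QuestionTypes.rating, with allowed values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and description "Rate the quality of the response based on its truthfulness and helpfulness. 1 is very bad and 10 is very good.".
- corrected-text (optional) is of type QuestionTypes.text, with description "If the rating provided is below 4, please write a new version of the response.".
Suggestions: as of Argilla 1.13.0, suggestions are included to assist annotators during the annotation process.
- relevant-suggestion (optional) is of type QuestionTypes.label_selection, with allowed values [Yes, No].
- content_class-suggestion (optional) is of type QuestionTypes.multi_label_selection, with allowed values [hate, inappropriate, not_english, pii, sexual, untruthful, violent].
- rating-suggestion (optional) is of type QuestionTypes.rating, with allowed values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
- corrected-text-suggestion (optional) is of type QuestionTypes.text.
Vectors: as of Argilla 1.19.0, vectors are included to support similarity search based on vector search.
- prompt (optional) is of type float32 and has a dimension of (1, 2).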
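Since vectors must match the dimensions defined in the dataset settings, a quick shape check before uploading records can catch mismatches early. The sketch below assumes that the dimension "(1, 2)" means a single flat vector of length 2, which is consistent with the example value `[1.0, 2.0]` but is still an interpretation. The `is_valid_prompt_vector` helper is hypothetical and not an Argilla function.

```python
# Assumption: "(1, 2)" is read as one flat vector holding 2 numbers.
EXPECTED_DIMENSIONS = 2


def is_valid_prompt_vector(vector, dimensions=EXPECTED_DIMENSIONS):
    """Check that `vector` is a flat sequence of `dimensions` numbers."""
    if not isinstance(vector, (list, tuple)) or len(vector) != dimensions:
        return False
    # Booleans are technically ints in Python, so exclude them explicitly.
    return all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector
    )


if __name__ == "__main__":
    print(is_valid_prompt_vector([1.0, 2.0]))    # flat, length 2
    print(is_valid_prompt_vector([[1.0, 2.0]]))  # nested, so rejected
```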

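In addition, there are two optional fields:

metadata: provides additional information about the dataset record.
external_id: provides an external ID for the dataset record.

Data Splits

The dataset contains a single split, train.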
