nataliaElv/oasst_quality_with_suggestions

Name: nataliaElv/oasst_quality_with_suggestions
Creator: nataliaElv
Published: 2023-11-02 12:58:21
License: No description available

Hugging Face | Updated 2023-11-02 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/nataliaElv/oasst_quality_with_suggestions

Resource overview:

This dataset was created with Argilla. It includes a configuration file, `argilla.yaml`, and records compatible with the HuggingFace `datasets` library. Its structure covers fields, questions, suggestions, metadata, and guidelines. The fields are of text type; question types include rating and multi-label selection, among others; suggestions are provided to assist the annotation process. The dataset has a single training split.

Provider:

nataliaElv

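Original information summary

Dataset Card for oasst_quality_with_suggestions

Dataset description

Dataset summary:
- Includes a configuration file, `argilla.yaml`, that conforms to the Argilla dataset format.
- Records are in a format compatible with HuggingFace `datasets`.
- Includes the annotation guidelines used to build and curate the dataset, if they were defined in Argilla.

Loading the dataset

Loading with Argilla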

    import argilla as rg

    # Load the dataset from the Hugging Face Hub as an Argilla FeedbackDataset.
    ds = rg.FeedbackDataset.from_huggingface("nataliaElv/oasst_quality_with_suggestions")

Loading with `datasets`

    from datasets import load_dataset

    # Download the dataset from the Hugging Face Hub.
    ds = load_dataset("nataliaElv/oasst_quality_with_suggestions")

Supported tasks and leaderboards

The dataset can be used for different NLP tasks, depending on its configuration.
There are no leaderboards associated with this dataset.

Dataset structure

Data in Argilla

Fields:
- prompt: type text, required.
- response: type text, required.
Questions:
- quality: type rating, required; allowed values are [1, 2, 3, 4, 5].
- inappropriate-content: type multi_label_selection, required; allowed values are [Hate speech, Sexual content, Violent content, Untruthful information].
- corrected-response: type text, optional.
Suggestions:
- quality-suggestion: type rating, optional; allowed values are [1, 2, 3, 4, 5].
- inappropriate-content-suggestion: type multi_label_selection, optional; allowed values are [Hate speech, Sexual content, Violent content, Untruthful information].
- corrected-response-suggestion: type text, optional.
Metadata:
- Optional; provides additional information about a record.
external_id:
- Optional; provides an external ID for a record.
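The question definitions above can be read as a small validation schema. The following is a minimal sketch, assuming an annotator response is represented as a plain dict keyed by question name; `validate_response` and that dict layout are hypothetical and are not Argilla's own response objects or API.

```python
from typing import Any

# Allowed values copied from the question definitions in this card.
QUALITY_VALUES = {1, 2, 3, 4, 5}
INAPPROPRIATE_LABELS = {
    "Hate speech",
    "Sexual content",
    "Violent content",
    "Untruthful information",
}


def validate_response(values: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one (hypothetical) response dict."""
    errors: list[str] = []

    # quality: required rating, must be an integer from 1 to 5.
    quality = values.get("quality")
    if quality is None:
        errors.append("quality is required")
    elif (
        isinstance(quality, bool)
        or not isinstance(quality, int)
        or quality not in QUALITY_VALUES
    ):
        errors.append(f"quality must be one of {sorted(QUALITY_VALUES)}")

    # inappropriate-content: required multi-label selection.
    # Assumption: an empty selection is accepted; the card does not say.
    labels = values.get("inappropriate-content")
    if labels is None:
        errors.append("inappropriate-content is required")
    else:
        unknown = sorted(set(labels) - INAPPROPRIATE_LABELS)
        if unknown:
            errors.append(f"unknown labels: {unknown}")

    # corrected-response: optional free text.
    corrected = values.get("corrected-response")
    if corrected is not None and not isinstance(corrected, str):
        errors.append("corrected-response must be text")

    return errors
```

For example, `validate_response({"quality": 4, "inappropriate-content": ["Hate speech"]})` returns an empty list, while a missing `quality` key is reported as an error.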

Data instances

A data instance in Argilla:

    {
        "external_id": null,
        "fields": {
            "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
            "response": "Sure! Let's say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn't be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points."
        },
        "metadata": {
            "hate_speech": 0.18384182587122588,
            "sexual_content": 0.1985823986360316,
            "untruthful_information": 0.22438455309200378,
            "violent_content": 0.15704218457210023
        },
        "responses": [],
        "suggestions": [],
        "vectors": {}
    }

The same data instance in HuggingFace `datasets`:

    {
        "corrected-response": [],
        "corrected-response-suggestion": null,
        "corrected-response-suggestion-metadata": {
            "agent": null,
            "score": null,
            "type": null
        },
        "external_id": null,
        "inappropriate-content": [],
        "inappropriate-content-suggestion": null,
        "inappropriate-content-suggestion-metadata": {
            "agent": null,
            "score": null,
            "type": null
        },
        "metadata": "{\"hate_speech\": 0.18384182587122588, \"sexual_content\": 0.1985823986360316, \"untruthful_information\": 0.22438455309200378, \"violent_content\": 0.15704218457210023}",
        "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
        "quality": [],
        "quality-suggestion": null,
        "quality-suggestion-metadata": {
            "agent": null,
            "score": null,
            "type": null
        },
        "response": "Sure! Let's say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn't be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points."
    }
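The two instances differ mainly in layout: the `datasets` row is flat and stores `metadata` as a JSON-encoded string, while the Argilla record nests the texts under `fields` and keeps `metadata` as an object. A minimal sketch of regrouping one flat row, assuming rows are plain dicts; `to_argilla_style` is a hypothetical helper and the sample row is abbreviated for illustration.

```python
import json
from typing import Any

# Field names taken from the dataset structure described in this card.
FIELD_NAMES = ("prompt", "response")


def to_argilla_style(row: dict[str, Any]) -> dict[str, Any]:
    """Regroup a flat `datasets` row into the nested Argilla-style layout."""
    raw_metadata = row.get("metadata")
    # In the flat export, metadata is a JSON string; decode it back to a dict.
    metadata = json.loads(raw_metadata) if raw_metadata else {}
    return {
        "external_id": row.get("external_id"),
        "fields": {name: row[name] for name in FIELD_NAMES},
        "metadata": metadata,
    }


# Abbreviated, illustrative row (texts shortened; only two metadata scores kept).
row = {
    "external_id": None,
    "prompt": "Can you explain contrastive learning ...?",
    "response": "Sure! ...",
    "metadata": '{"hate_speech": 0.18384182587122588, "violent_content": 0.15704218457210023}',
}

record = to_argilla_style(row)
print(sorted(record["metadata"]))
```

Responses, suggestions, and vectors are left out of this sketch; they would need their own mapping from the flat question columns.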

Data fields

Fields:
- prompt: type text.
- response: type text.
Questions:
- quality: type rating; allowed values are [1, 2, 3, 4, 5].
- inappropriate-content: type multi_label_selection; allowed values are [Hate speech, Sexual content, Violent content, Untruthful information].
- corrected-response: type text, optional.
Suggestions:
- quality-suggestion: type rating, optional; allowed values are [1, 2, 3, 4, 5].
- inappropriate-content-suggestion: type multi_label_selection, optional; allowed values are [Hate speech, Sexual content, Violent content, Untruthful information].
- corrected-response-suggestion: type text, optional.
Metadata:
- Optional; provides additional information about a record.
external_id:
- Optional; provides an external ID for a record.
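Because every suggestion is optional, code that consumes the flat `datasets` rows has to skip the empty ones. In the flat instance shown earlier, each `*-suggestion` column has a companion `*-suggestion-metadata` column with `agent`, `score`, and `type` keys. A minimal sketch of collecting only the suggestions that are set, assuming rows are plain dicts; `present_suggestions` is a hypothetical helper name.

```python
from typing import Any

# Suggestion columns as listed in this card.
SUGGESTION_COLUMNS = (
    "quality-suggestion",
    "inappropriate-content-suggestion",
    "corrected-response-suggestion",
)


def present_suggestions(row: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Collect the suggestions that are actually set, keyed by question name."""
    found: dict[str, dict[str, Any]] = {}
    for column in SUGGESTION_COLUMNS:
        value = row.get(column)
        if value is None:
            continue  # suggestions are optional; skip empty ones
        # "quality-suggestion" -> "quality"
        question = column[: -len("-suggestion")]
        meta = row.get(f"{column}-metadata") or {}
        found[question] = {
            "value": value,
            "agent": meta.get("agent"),
            "score": meta.get("score"),
            "type": meta.get("type"),
        }
    return found
```

For the example record in this card, where all three suggestions are `null`, the function returns an empty dict.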

Data splits

The dataset contains a single split, `train`.
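Since `train` is the only split, it can be requested directly when loading. A minimal sketch, assuming the `datasets` library is installed and the Hugging Face Hub is reachable; `load_train_split` is a hypothetical helper name and is not part of the dataset card.

```python
def load_train_split():
    """Download the dataset from the Hugging Face Hub and return its train split."""
    # Imported lazily so that defining this helper needs neither network
    # access nor an installed `datasets` package.
    from datasets import load_dataset

    # With `split="train"`, load_dataset returns a single Dataset
    # rather than a DatasetDict keyed by split name.
    return load_dataset("nataliaElv/oasst_quality_with_suggestions", split="train")
```

Calling `load_dataset` without the `split` argument returns a `DatasetDict` instead, in which case the same data is reached with `ds["train"]`.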
