nataliaElv/oasst_quality_with_suggestions
Dataset Card for oasst_quality_with_suggestions
Dataset Description
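- Dataset summary:
  - Includes a configuration file, argilla.yaml, that conforms to the Argilla dataset format.
  - Dataset records are stored in a format compatible with HuggingFace datasets.
  - Includes the annotation guidelines used to build and curate the dataset (if they have been defined in Argilla).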
Loading the Dataset
Load with Argilla
```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("nataliaElv/oasst_quality_with_suggestions")
```
Load with datasets
```python
from datasets import load_dataset

ds = load_dataset("nataliaElv/oasst_quality_with_suggestions")
```
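Supported Tasks and Leaderboards
- The dataset can be used for different NLP tasks, depending on its configuration.
- There are no leaderboards associated with this dataset.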
Dataset Structure
Data structure in Argilla
- Fields:
  - prompt: type text, required.
  - response: type text, required.
- Questions:
  - quality: type rating, required, allowed values [1, 2, 3, 4, 5].
  - inappropriate-content: type multi_label_selection, required, allowed values [Hate speech, Sexual content, Violent content, Untruthful information].
  - corrected-response: type text, optional.
- Suggestions:
  - quality-suggestion: type rating, optional, allowed values [1, 2, 3, 4, 5].
  - inappropriate-content-suggestion: type multi_label_selection, optional, allowed values [Hate speech, Sexual content, Violent content, Untruthful information].
  - corrected-response-suggestion: type text, optional.
- Metadata:
  - Optional. Provides additional information about a record.
- external_id:
  - Optional. Provides an external ID for a record.
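The allowed values in the Questions list above can be checked mechanically before a response is submitted. The following is a minimal sketch using only the standard library, not Argilla's own validation: the function name `validate_response` and the flat `{question_name: value}` input shape are assumptions made for illustration, and a missing required answer is assumed to be represented as `None` or an absent key.

```python
from typing import Any, Dict, List

# Allowed values copied from the Questions list above.
QUALITY_VALUES = {1, 2, 3, 4, 5}
INAPPROPRIATE_LABELS = {
    "Hate speech",
    "Sexual content",
    "Violent content",
    "Untruthful information",
}


def validate_response(values: Dict[str, Any]) -> List[str]:
    """Return the problems found in one response (empty list if it is valid).

    Hypothetical helper: `values` maps question names to answers.
    """
    problems: List[str] = []

    # quality: required rating, must be one of 1..5
    quality = values.get("quality")
    if quality is None:
        problems.append("quality is required")
    elif quality not in QUALITY_VALUES:
        problems.append(f"quality must be one of {sorted(QUALITY_VALUES)}")

    # inappropriate-content: required multi-label selection
    labels = values.get("inappropriate-content")
    if labels is None:
        problems.append("inappropriate-content is required")
    else:
        unknown = sorted(set(labels) - INAPPROPRIATE_LABELS)
        if unknown:
            problems.append(f"unknown labels: {unknown}")

    # corrected-response: optional free text
    corrected = values.get("corrected-response")
    if corrected is not None and not isinstance(corrected, str):
        problems.append("corrected-response must be text")

    return problems
```

For example, `validate_response({"quality": 4, "inappropriate-content": ["Hate speech"]})` returns an empty list, while a quality of 7 or a label outside the four listed ones each adds one entry to the result.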
Data Instances
- A data instance in Argilla:
```json
{
    "external_id": null,
    "fields": {
        "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
        "response": "Sure! Let's say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn't be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points."
    },
    "metadata": {
        "hate_speech": 0.18384182587122588,
        "sexual_content": 0.1985823986360316,
        "untruthful_information": 0.22438455309200378,
        "violent_content": 0.15704218457210023
    },
    "responses": [],
    "suggestions": [],
    "vectors": {}
}
```
- A data instance in HuggingFace datasets:
```json
{
    "corrected-response": [],
    "corrected-response-suggestion": null,
    "corrected-response-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "external_id": null,
    "inappropriate-content": [],
    "inappropriate-content-suggestion": null,
    "inappropriate-content-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "metadata": "{\"hate_speech\": 0.18384182587122588, \"sexual_content\": 0.1985823986360316, \"untruthful_information\": 0.22438455309200378, \"violent_content\": 0.15704218457210023}",
    "prompt": "Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?",
    "quality": [],
    "quality-suggestion": null,
    "quality-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "response": "Sure! Let's say you want to build a model which can distinguish between images of cats and dogs. You gather your dataset, consisting of many cat and dog pictures. Then you put them through a neural net of your choice, which produces some representation for each image, a sequence of numbers like [0.123, 0.045, 0.334, ...]. The problem is, if your model is unfamiliar with cat and dog images, these representations will be quite random. At one time a cat and a dog picture could have very similar representations (their numbers would be close to each other), while at others two cat images may be represented far apart. In simple terms, the model wouldn't be able to tell cats and dogs apart. This is where contrastive learning comes in.\n\nThe point of contrastive learning is to take pairs of samples (in this case images of cats and dogs), then train the model to \"pull\" representations of similar pairs (cat-cat or dog-dog) closer to each other and \"push\" representations of different pairs (cat-dog) apart. After doing this for a sufficient number of steps, your model will be able to produce unique, reliable representations for cats and dogs, in essence tell them apart.\n\nThis method is not limited to images, you can typically use it with any dataset that has similar and dissimilar data points."
}
```
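In the datasets representation, `metadata` is stored as a JSON-encoded string rather than as a nested object, so it has to be parsed before its values can be used. The sketch below shows one way to do that with the standard library; the `record` dict is a trimmed copy of the instance above, and the helper name `decode_metadata` is hypothetical.

```python
import json

# Trimmed copy of the instance above: only the `metadata` column is kept.
record = {
    "metadata": (
        '{"hate_speech": 0.18384182587122588, '
        '"sexual_content": 0.1985823986360316, '
        '"untruthful_information": 0.22438455309200378, '
        '"violent_content": 0.15704218457210023}'
    ),
}


def decode_metadata(row: dict) -> dict:
    """Parse the JSON-encoded `metadata` column; return {} if it is missing or empty."""
    raw = row.get("metadata")
    return json.loads(raw) if raw else {}


values = decode_metadata(record)

# Pick the metadata key with the largest value for this record.
top_key = max(values, key=values.get)
print(top_key, values[top_key])
```

The same helper could be applied to every row of a loaded split, for instance through `Dataset.map`, if the parsed values are needed as separate columns.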
Data Fields
- Fields:
  - prompt: type text.
  - response: type text.
- Questions:
  - quality: type rating, allowed values [1, 2, 3, 4, 5].
  - inappropriate-content: type multi_label_selection, allowed values [Hate speech, Sexual content, Violent content, Untruthful information].
  - corrected-response: type text, optional.
- Suggestions:
  - quality-suggestion: type rating, optional, allowed values [1, 2, 3, 4, 5].
  - inappropriate-content-suggestion: type multi_label_selection, optional, allowed values [Hate speech, Sexual content, Violent content, Untruthful information].
  - corrected-response-suggestion: type text, optional.
- Metadata:
  - Optional. Provides additional information about a record.
- external_id:
  - Optional. Provides an external ID for a record.
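The datasets instance shown under Data Instances suggests a naming pattern for the flat columns: each field keeps its name, and each question contributes three columns (`<name>`, `<name>-suggestion`, `<name>-suggestion-metadata`), alongside `external_id` and `metadata`. The sketch below derives the column list from that pattern. The pattern is inferred from the single example record, not from a documented guarantee, and `expected_columns` is a hypothetical helper.

```python
from typing import List

FIELDS = ["prompt", "response"]
QUESTIONS = ["quality", "inappropriate-content", "corrected-response"]


def expected_columns(fields: List[str], questions: List[str]) -> List[str]:
    """Build the flat column names implied by the example record (assumed pattern)."""
    columns = list(fields)
    for name in questions:
        # One column for responses, one for the suggestion, one for its metadata.
        columns += [name, f"{name}-suggestion", f"{name}-suggestion-metadata"]
    columns += ["external_id", "metadata"]
    return sorted(columns)


for column in expected_columns(FIELDS, QUESTIONS):
    print(column)
```

Comparing this list against the `column_names` of a loaded split is a quick way to notice if the dataset configuration has changed.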
Data Splits
- The dataset contains a single split, train.



