wyzard-ai/Jayesh2732
Source: Hugging Face. Updated 2024-11-22; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/wyzard-ai/Jayesh2732
---
size_categories: n<1K
tags:
- rlfh
- argilla
- human-feedback
---
# Dataset Card for Jayesh2732
This dataset was created with [Argilla](https://github.com/argilla-io/argilla). It can be loaded into your Argilla server as explained in [Using this dataset with Argilla](#using-this-dataset-with-argilla), or used directly with the `datasets` library as explained in [Using this dataset with `datasets`](#using-this-dataset-with-datasets).
## Using this dataset with Argilla
To load the dataset with Argilla, install or upgrade Argilla with `pip install argilla --upgrade`, then run the following code:
```python
import argilla as rg
ds = rg.Dataset.from_hub("wyzard-ai/Jayesh2732", settings="auto")
```
This loads the settings and records from the dataset repository and pushes them to your Argilla server for exploration and annotation.
## Using this dataset with `datasets`
To load the records of this dataset with `datasets`, install or upgrade the library with `pip install datasets --upgrade`, then run the following code:
```python
from datasets import load_dataset
ds = load_dataset("wyzard-ai/Jayesh2732")
```
This loads only the records of the dataset, not the Argilla settings.
## Dataset Structure
This dataset repo contains:
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `rg.Dataset.from_hub` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.
* A dataset configuration folder conforming to the Argilla dataset format in `.argilla`.
The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.
### Fields
The **fields** are the features or text of a dataset's records. For example, the 'text' column of a text classification dataset or the 'prompt' column of an instruction-following dataset.
| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| instruction | User instruction | text | True | True |
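The `instruction` field stores a conversation as Markdown, with each turn introduced by a bold role marker such as `**user**:` or `**assistant**:` (see the data instance in this card). The following is a minimal sketch of splitting such a string into turns. The helper name `split_turns` is hypothetical, and it assumes that only the roles `user` and `assistant` occur and that each marker starts a line; neither is confirmed by the card.

```python
import re

# Assumption: each turn starts at the beginning of a line with "**user**:" or "**assistant**:".
_ROLE_MARKER = re.compile(r"^\*\*(user|assistant)\*\*:\s*", re.MULTILINE)


def split_turns(instruction: str) -> list[tuple[str, str]]:
    """Split a Markdown conversation into (role, text) pairs (hypothetical helper)."""
    matches = list(_ROLE_MARKER.finditer(instruction))
    turns = []
    for i, match in enumerate(matches):
        # A turn's text runs from the end of its marker to the start of the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(instruction)
        turns.append((match.group(1), instruction[match.end():end].strip()))
    return turns


# Shortened version of the example record's instruction field.
example = "**user**: hi\n**assistant**: Hello Jayesh! How can I assist you today?"
for role, text in split_turns(example):
    print(f"{role}: {text}")
```

Text that precedes the first marker is ignored by this sketch, which may or may not be appropriate for other records.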
### Questions
The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.
| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| relevance_score | How Relevant is the conversation based upon expert. Is the conversation highly curated for you or not. Please don't judge accuracy. | rating | True | N/A | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| accuracy_score | How accurate is the conversation based upon persona | rating | True | if | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| clarity_score | How clear is the conversation based upon persona | rating | True | Is the LLM getting confused | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| actionable_score | How actionable is the conversation based upon persona | rating | True | Is the LLM response to actionable for example, it shows comparison card on the right question. | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| engagement_score | How engaging is the conversation based upon persona | rating | True | Are there a lot of question that are being shown if yes, high score else low score | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| completeness_score | is the conversation complete based upon persona | rating | True | is the conversation complete based upon persona, not leaving any key aspect out | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
| feedback | feedback | text | True | What do you think can be improved in the given conversation. How good was the conversation as per you? | N/A |
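Read as a schema, the table above describes six required rating questions with integer values from 1 to 10 and one required free-text `feedback` question. The sketch below checks a plain dictionary of answers against that schema. The function `validate_answers` and the flat dictionary layout are assumptions made for illustration; they do not reflect how Argilla stores responses internally.

```python
RATING_QUESTIONS = [
    "relevance_score",
    "accuracy_score",
    "clarity_score",
    "actionable_score",
    "engagement_score",
    "completeness_score",
]
ALLOWED_RATINGS = range(1, 11)  # 1 through 10 inclusive, per the table


def validate_answers(answers: dict) -> list[str]:
    """Return a list of problems; an empty list means the answers fit the schema."""
    problems = []
    for name in RATING_QUESTIONS:
        value = answers.get(name)
        if value is None:
            problems.append(f"{name}: missing (required)")
        elif isinstance(value, bool) or not isinstance(value, int) or value not in ALLOWED_RATINGS:
            problems.append(f"{name}: {value!r} is not an integer from 1 to 10")
    feedback = answers.get("feedback")
    if not isinstance(feedback, str) or not feedback.strip():
        problems.append("feedback: missing or empty (required)")
    return problems


# Made-up answers for illustration only; they are not taken from the dataset.
answers = {name: 7 for name in RATING_QUESTIONS}
answers["feedback"] = "Clear, but the assistant asked too few follow-up questions."
print(validate_answers(answers))
print(validate_answers({**answers, "clarity_score": 11}))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a submission at once.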
### Metadata
The **metadata** is a dictionary that can be used to provide additional information about the dataset record.
| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |
| conv_id | Conversation ID | | - | True |
| turn | Conversation Turn | | 0 - 100 | True |
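Because every record carries a `conv_id` and a `turn`, records from the same conversation can be regrouped and put back in order. The sketch below does this over flat dictionaries of the kind the `datasets` library returns. The helper `group_by_conversation` is hypothetical and the sample records are made up; only the key names and the 0 - 100 range come from the table above.

```python
from collections import defaultdict


def group_by_conversation(records):
    """Group flat records by conv_id and sort each group by turn (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        turn = record["turn"]
        # The metadata table lists 0 - 100 as the value range for turn.
        if not 0 <= turn <= 100:
            raise ValueError(f"turn {turn} is outside the documented range 0 - 100")
        groups[record["conv_id"]].append(record)
    return {
        conv_id: sorted(items, key=lambda r: r["turn"])
        for conv_id, items in groups.items()
    }


# Made-up records for illustration; only the key names come from the dataset.
records = [
    {"conv_id": "a", "turn": 1, "instruction": "second turn"},
    {"conv_id": "b", "turn": 0, "instruction": "other conversation"},
    {"conv_id": "a", "turn": 0, "instruction": "first turn"},
]
grouped = group_by_conversation(records)
print([r["instruction"] for r in grouped["a"]])
```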
### Data Instances
An example of a dataset instance in Argilla looks as follows:
```json
{
"_server_id": "63d40792-3def-4435-a591-af4506143733",
"fields": {
"instruction": "**user**: hi\n**assistant**: Hello Jayesh! How can I assist you today? Are you looking for insights on development software or perhaps some advice on the latest programming tools?"
},
"id": "e47d050a-0605-4511-8c25-b802c6fce8e8",
"metadata": {
"conv_id": "9999eb70-c3c7-4ff5-b533-db0b7b7ba963",
"turn": 0
},
"responses": {},
"status": "pending",
"suggestions": {},
"vectors": {}
}
```
The same record in HuggingFace `datasets` looks as follows:
```json
{
"_server_id": "63d40792-3def-4435-a591-af4506143733",
"conv_id": "9999eb70-c3c7-4ff5-b533-db0b7b7ba963",
"id": "e47d050a-0605-4511-8c25-b802c6fce8e8",
"instruction": "**user**: hi\n**assistant**: Hello Jayesh! How can I assist you today? Are you looking for insights on development software or perhaps some advice on the latest programming tools?",
"status": "pending",
"turn": 0
}
```
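Comparing the two instances above, the `datasets` view looks like the Argilla record with the contents of `fields` and `metadata` lifted to the top level. The sketch below reproduces that mapping for this example. The helper `flatten_record` is hypothetical, and it simply drops `responses`, `suggestions`, and `vectors`: they are empty here, and the card does not show how non-empty ones are exported.

```python
def flatten_record(record: dict) -> dict:
    """Lift fields and metadata to the top level (hypothetical helper)."""
    flat = {
        "_server_id": record["_server_id"],
        "id": record["id"],
        "status": record["status"],
    }
    flat.update(record.get("fields", {}))
    flat.update(record.get("metadata", {}))
    # Assumption: responses, suggestions and vectors are skipped because they are
    # empty in the example; the card does not show how non-empty ones are exported.
    return flat


argilla_record = {
    "_server_id": "63d40792-3def-4435-a591-af4506143733",
    "fields": {"instruction": "**user**: hi\n**assistant**: Hello Jayesh!"},  # shortened
    "id": "e47d050a-0605-4511-8c25-b802c6fce8e8",
    "metadata": {"conv_id": "9999eb70-c3c7-4ff5-b533-db0b7b7ba963", "turn": 0},
    "responses": {},
    "status": "pending",
    "suggestions": {},
    "vectors": {},
}
print(sorted(flatten_record(argilla_record)))
```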
### Data Splits
The dataset contains a single split, which is `train`.
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation guidelines
Review the user interactions with the chatbot.
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]
Provider: wyzard-ai