hassanjbara/LONG
Source: Hugging Face. Updated 2024-05-03; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/hassanjbara/LONG
Resource description:
---
language:
- en
license: mit
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- text2text-generation
pretty_name: LONG context queries dataset
dataset_info:
  features:
  - name: query
    dtype: string
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 68127488
    num_examples: 25973
  download_size: 37373894
  dataset_size: 68127488
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
A dataset of prompts that elicit long responses from language models, built from other datasets after heavy filtering.
It includes over 25k prompts (25,973 examples in the `train` split) that elicit long answers, making it useful for benchmarking or training on long-context responses.
Each prompt is paired with a response generated by [Llama-3-8b-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
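
As a minimal sketch of how the records might be loaded and inspected, the snippet below uses the Hugging Face `datasets` library with the repository id `hassanjbara/LONG` and the `train` split and `query`/`response` columns listed in the metadata. The helper names `load_long` and `mean_response_words` are hypothetical and are not part of the repository.

```python
from typing import Iterable, Mapping


def load_long(split: str = "train"):
    """Load the dataset from the Hugging Face Hub (requires network access)."""
    # Imported lazily so the helper below works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("hassanjbara/LONG", split=split)


def mean_response_words(rows: Iterable[Mapping[str, str]]) -> float:
    """Average whitespace-delimited word count of the `response` field.

    Hypothetical helper for a quick sanity check; returns 0.0 for no rows.
    """
    counts = [len(row["response"].split()) for row in rows]
    return sum(counts) / len(counts) if counts else 0.0
```

With network access, `ds = load_long()` followed by `mean_response_words(ds.select(range(100)))` would give a rough sense of response length on the first 100 examples.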
Main criteria we are aiming for with this dataset:
* Only English.
* Only creative writing prompts or similar (no coding or math).
* Prompts can't be answered adequately in fewer than 100 words.
* Responses are rated well by feedback/reward models.
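
The length criterion above could be approximated with a simple word-count check. This is only an illustrative sketch: the function names and the use of response length as a proxy are assumptions, and the language and reward-model checks are not covered. The actual filtering logic lives in the `scripts` folder of the repository.

```python
def word_count(text: str) -> int:
    """Count whitespace-delimited words in a string."""
    return len(text.split())


def keep_example(query: str, response: str, min_words: int = 100) -> bool:
    """Hypothetical length filter, not the author's script.

    Assumption: a prompt "needs" a long answer if the paired response
    has at least `min_words` words. Empty prompts are rejected.
    """
    if not query.strip():
        return False
    return word_count(response) >= min_words
```

For example, `keep_example("Write a short story about a lighthouse.", response)` returns `True` only when `response` contains at least 100 words.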
For the script used to generate the dataset, please see the `scripts` folder in the repository.

Datasets used:
* [LDJnr/Pure-Dove](https://huggingface.co/datasets/LDJnr/Pure-Dove)
* [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned)
* [Ghostbuster-prompts](https://huggingface.co/datasets/hassanjbara/ghostbuster-prompts)
Provider: hassanjbara
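Summary of original information
Dataset overview
Basic information
- Language: English
- License: MIT
- Size: 10K<n<100K
- Task categories:
  - Text generation
  - Text-to-text generation
- Pretty name: LONG context queries dataset
Dataset details
- Features:
  - query: string
  - response: string
- Splits:
  - Train:
    - Bytes: 68127488
    - Examples: 25973
- Download size: 37373894
- Dataset size: 68127488
Dataset purpose
- A dataset for generating long responses from language models, built from other datasets after heavy filtering.
- Contains over 25k prompts intended to elicit long answers; suitable for benchmarking or training on long-context responses.
- Each prompt comes with a response generated by Llama-3-8b-Instruct.
Dataset criteria
- English only
- Only creative writing prompts or similar (no coding or math)
- Prompts cannot be answered adequately in fewer than 100 words
- Responses are rated well by feedback/reward models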



