hassanjbara/LONG

Name: hassanjbara/LONG
Creator: hassanjbara
Published: 2024-05-03 18:05:08
License: MIT

Hugging Face · Updated 2024-05-03 · Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/hassanjbara/LONG

Resource description:

---
language:
- en
license: mit
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- text2text-generation
pretty_name: LONG context queries dataset
dataset_info:
  features:
  - name: query
    dtype: string
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 68127488
    num_examples: 25973
  download_size: 37373894
  dataset_size: 68127488
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

A dataset for eliciting long responses from language models, built from other datasets after heavy filtering. This dataset is high quality and includes over 25k prompts that elicit long answers, making it useful for benchmarking or training on long context responses. Furthermore, the dataset contains responses generated by [Llama-3-8b-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to each prompt.

Main criteria we are aiming for with this dataset:

* Only English.
* Only creative writing prompts or similar (no coding or math).
* Prompts can't be answered adequately in fewer than 100 words.
* Responses are rated well by feedback/reward models.

For the script used to generate the dataset, please see the `scripts` folder in the repository.

Datasets used:

* [LDJnr/Pure-Dove](https://huggingface.co/datasets/LDJnr/Pure-Dove)
* [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned)
* [Ghostbuster-prompts](https://huggingface.co/datasets/hassanjbara/ghostbuster-prompts)
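As a minimal sketch of how the card's single `train` split could be read, the snippet below uses `load_dataset` from the Hugging Face `datasets` library with the repository id `hassanjbara/LONG` and the `query` / `response` columns listed in the metadata. The function names `load_long` and `summarize` are hypothetical helpers introduced here for illustration; they are not part of the repository.

```python
def load_long(split="train"):
    """Download the dataset from the Hugging Face Hub and return one split."""
    # Imported lazily so the helper below stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("hassanjbara/LONG", split=split)


def summarize(record, width=60):
    """Hypothetical helper: shorten a record's two fields for display."""

    def clip(text):
        text = " ".join(text.split())  # collapse whitespace and newlines
        return text if len(text) <= width else text[: width - 3] + "..."

    return {"query": clip(record["query"]), "response": clip(record["response"])}
```

Calling `load_long()` needs network access and the `datasets` package; `summarize(load_long()[0])` would then give a compact view of the first query/response pair.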

Provided by:

hassanjbara

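Summary of original information

Dataset overview

Basic information

Language: English
License: MIT
Size: 10K<n<100K
Task categories:
- Text generation
- Text-to-text generation
Pretty name: LONG context queries dataset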

Dataset details

Features:
- query: string
- response: string
Splits:
- Train:
  - Size in bytes: 68127488
  - Number of examples: 25973
Download size: 37373894 bytes
Dataset size: 68127488 bytes
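The size figures above can be turned into two quick sanity numbers: the average size of one query/response pair and the ratio of download size to dataset size. The sketch below only does arithmetic on the values stated in the card; the function name `card_stats` is a hypothetical helper.

```python
# Values copied from the dataset card metadata (all sizes in bytes).
NUM_BYTES = 68_127_488
NUM_EXAMPLES = 25_973
DOWNLOAD_SIZE = 37_373_894


def card_stats(num_bytes=NUM_BYTES, num_examples=NUM_EXAMPLES,
               download_size=DOWNLOAD_SIZE):
    """Return the average bytes per example and the download/dataset size ratio."""
    return {
        "avg_bytes_per_example": num_bytes / num_examples,
        "download_ratio": download_size / num_bytes,
    }
```

With the card's numbers this works out to roughly 2.6 KB per example, and the download is a little over half the size of the dataset once loaded.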

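Dataset purpose

A dataset for eliciting long responses from language models, built from other datasets after heavy filtering.
It contains over 25k prompts designed to elicit long answers, making it useful for benchmarking or training on long context responses.
Each prompt comes with a response generated by Llama-3-8b-Instruct.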

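Dataset criteria

English only
Only creative writing prompts or similar (no coding or math)
Prompts cannot be answered adequately in fewer than 100 words
Responses are rated well by feedback/reward models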
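To make the criteria concrete, here is a rough sketch of what a per-example filter along these lines might look like. It is not the author's script (that lives in the repository's `scripts` folder). Every name and threshold here is an assumption: the ASCII-ratio check is a crude stand-in for real language detection, the 100-word rule is approximated by counting words in the response, `reward_score` is assumed to come from some external reward model, and the creative-writing criterion is not modelled at all.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def looks_english(text, min_ascii_ratio=0.95):
    """Crude stand-in for language detection: share of ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


def keep_example(query, response, reward_score, min_words=100, min_reward=0.0):
    """Hypothetical filter combining three of the four stated criteria."""
    return (
        looks_english(query)
        and looks_english(response)
        and word_count(response) >= min_words  # proxy for "needs a long answer"
        and reward_score >= min_reward         # assumed reward-model threshold
    )
```

A real pipeline would likely swap the ASCII heuristic for a proper language identifier and add a topic classifier to exclude coding and math prompts.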

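Dataset sources

LDJnr/Pure-Dove
argilla/ultrafeedback-binarized-preferences-cleaned
Ghostbuster-prompts (hassanjbara/ghostbuster-prompts)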
