xzuyn/tv-alpaca

Name: xzuyn/tv-alpaca
Creator: xzuyn
Published: 2023-08-03 14:07:46
License: No description available

Source: Hugging Face. Updated 2023-08-03. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/xzuyn/tv-alpaca

Resource description:

---
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- conversational
---

13,217 samples, each split to at most 512 LLaMa tokens. Each output was trimmed to the nearest `\n` at 512 or fewer tokens, since KoboldCPP's chat format always formats turns with names, like this: `\nJerry: Hello. I'm Jerry\nBob: Hi. I'm Bob.`. Since there are 980 episodes and each needs to be chunked into 512 tokens, there end up being 13,217 splits. That's ~13.49 splits per episode.

TV Shows Added:
- 114 - Futurama
- 568 - The Simpsons
- 24 - Seinfeld
- 274 - South Park

Plans:
- I need to augment the instructions so that they are far more varied and natural (for example, some capitalized, some spelled slightly wrong, some with odd wording).
- Possibly make each sample overlap by a small amount, like 10-35%, which might help the model understand context better.

Original Datasets:
- [Futurama](https://www.kaggle.com/datasets/josephvm/futurama-seasons-16-transcripts?select=only_spoken_text.csv)
- [The Simpsons](https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset?resource=download&select=simpsons_script_lines.csv)
- [Seinfeld](https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles?select=scripts.csv)
- [South Park](https://www.kaggle.com/datasets/thedevastator/south-park-scripts-dataset?select=All-seasons.csv)
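
The instruction-augmentation idea listed under Plans has not been implemented in the dataset as described, so the following is only a rough sketch of what it might look like. The function name `augment_instruction`, the set of perturbations, and the sample instruction string are all hypothetical and are not taken from the dataset.

```python
import random


def augment_instruction(text, rng):
    """Return a randomly perturbed copy of an instruction string.

    Hypothetical helper: picks one of a few simple perturbations
    (lowercase, uppercase, swap two adjacent characters, or no change).
    """
    choice = rng.choice(["lower", "upper", "swap", "none"])
    if choice == "lower":
        return text.lower()
    if choice == "upper":
        return text.upper()
    if choice == "swap" and len(text) > 1:
        # Simulate a small typo by swapping two neighbouring characters.
        i = rng.randrange(len(text) - 1)
        chars = list(text)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)
    return text


# Hypothetical instruction text, used only for illustration.
instruction = "Continue the episode transcript."
rng = random.Random(0)  # seeded so runs are repeatable
variants = [augment_instruction(instruction, rng) for _ in range(5)]
```

Every variant keeps the same characters as the original up to case and order, so the meaning of the instruction stays recoverable while its surface form varies.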

Provided by:

xzuyn

Summary of original information

Dataset overview

Language

English (en)

Dataset size

Size: 10K<n<100K

Task categories

Text generation
Conversational

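Data processing

The dataset contains 13,217 segments, each at most 512 LLaMa tokens long.
Each output segment is trimmed back to the nearest newline (`\n`) so that it does not exceed 512 tokens.
There are 980 episodes in total, so each episode is split into about 13.49 segments on average.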
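
The chunking described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: `count_tokens` is a stand-in that counts whitespace-separated words, where the real pipeline would presumably use a LLaMA tokenizer, and `chunk_by_lines` is a hypothetical name. Summing per-line counts is also only an approximation of tokenizing the whole chunk at once.

```python
def count_tokens(text):
    # Stand-in token counter (assumption): whitespace-separated words.
    # Swap in a real LLaMA tokenizer to reproduce the 512-token budget.
    return len(text.split())


def chunk_by_lines(transcript, max_tokens=512, count=count_tokens):
    """Split a transcript into chunks that end on a newline boundary
    and stay within max_tokens according to `count`."""
    chunks, current, used = [], [], 0
    for line in transcript.split("\n"):
        n = count(line)
        # Close the current chunk if adding this line would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += n
    if current:
        chunks.append("\n".join(current))
    return chunks


transcript = "Jerry: Hello. I'm Jerry\nBob: Hi. I'm Bob.\nJerry: Nice to meet you."
chunks = chunk_by_lines(transcript, max_tokens=8)
```

Because chunks only ever break between lines, no speaker turn is cut in half. A single line longer than the budget would still become its own oversized chunk in this sketch.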

TV shows included

Futurama - 114 episodes
The Simpsons - 568 episodes
Seinfeld - 24 episodes
South Park - 274 episodes
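
As a quick sanity check, the per-show counts and the reported totals can be verified with a few lines of arithmetic. The variable names are illustrative only.

```python
# Episode counts as listed for each show.
episodes = {
    "Futurama": 114,
    "The Simpsons": 568,
    "Seinfeld": 24,
    "South Park": 274,
}

total_episodes = sum(episodes.values())
total_splits = 13_217  # number of samples reported for the dataset

# Average number of splits per episode, rounded to two decimals.
avg_splits = round(total_splits / total_episodes, 2)
print(total_episodes, avg_splits)  # prints: 980 13.49
```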

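Future plans

Make the instructions more varied and natural, for example by using different capitalization, slight misspellings, or odd wording.
Possibly make each sample partially overlap its neighbor by 10-35%, which might help the model understand context better.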
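
The overlap idea is only a plan, so the sketch below is an assumption about how it might work rather than anything the dataset does today. For simplicity it measures windows in lines instead of tokens, and `overlapping_windows` is a hypothetical name.

```python
def overlapping_windows(lines, window, overlap):
    """Return windows of `window` lines where each window shares roughly
    `overlap` (a fraction in [0, 1)) of its lines with the previous one."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # Advance by the non-overlapping part of the window, at least one line.
    step = max(1, round(window * (1 - overlap)))
    windows = []
    for start in range(0, len(lines), step):
        windows.append(lines[start:start + window])
        if start + window >= len(lines):
            break  # the last window already reaches the end
    return windows


lines = [f"line {i}" for i in range(10)]
windows = overlapping_windows(lines, window=4, overlap=0.25)
```

With a window of 4 lines and 25% overlap, each window starts 3 lines after the previous one, so consecutive windows share exactly one line.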

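Original dataset sources

Futurama: https://www.kaggle.com/datasets/josephvm/futurama-seasons-16-transcripts?select=only_spoken_text.csv
The Simpsons: https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset?resource=download&select=simpsons_script_lines.csv
Seinfeld: https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles?select=scripts.csv
South Park: https://www.kaggle.com/datasets/thedevastator/south-park-scripts-dataset?select=All-seasons.csv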