xzuyn/tv-alpaca
Source: Hugging Face. Updated 2023-08-03; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/xzuyn/tv-alpaca
Description:
---
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- conversational
---
13,217 samples, each split to fit within 512 LLaMA tokens. Each output was trimmed back to the nearest `\n` at 512 tokens or fewer, since KoboldCPP's chat format always formats each turn with the speaker's name on a new line, like this:
`\nJerry: Hello. I'm Jerry\nBob: Hi. I'm Bob.`
There are 980 episodes in total, and chunking them into pieces of at most 512 tokens yields 13,217 splits. That's ~13.49 splits per episode.
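A rough sketch of the chunking described above, assuming a greedy line-by-line packer. The names `chunk_transcript` and `whitespace_token_count` are hypothetical, and the word-count function is only a stand-in for the LLaMA tokenizer; the actual script used to build the dataset is not shown in this card.

```python
from typing import Callable, Iterable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in counter (NOT the LLaMA tokenizer): counts whitespace-separated words."""
    return len(text.split())


def chunk_transcript(
    lines: Iterable[str],
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Pack whole dialogue lines into chunks of at most `max_tokens` tokens.

    Chunks only ever break at a newline, so no "Name: text" turn is cut in half.
    A single line longer than `max_tokens` is emitted as its own oversize chunk.
    With a real tokenizer, summing per-line counts is only an approximation of
    the token count of the joined text.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for line in lines:
        n = count_tokens(line)
        # Close the current chunk if this line would push it over the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    demo = ["Jerry: Hello. I'm Jerry", "Bob: Hi. I'm Bob."]
    for chunk in chunk_transcript(demo, max_tokens=5):
        print(repr(chunk))
```

Swapping in a real tokenizer would only require passing a different `count_tokens` callable; the packing logic stays the same.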
TV Shows Added (number of episodes, then show):
- 114 - Futurama
- 568 - The Simpsons
- 24 - Seinfeld
- 274 - South Park
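As a quick sanity check, the per-show episode counts listed above can be summed and compared against the 13,217 splits. The variable names here are illustrative only.

```python
# Episode counts per show, as listed in this card.
episodes = {
    "Futurama": 114,
    "The Simpsons": 568,
    "Seinfeld": 24,
    "South Park": 274,
}

total_splits = 13_217
total_episodes = sum(episodes.values())
splits_per_episode = total_splits / total_episodes

print(total_episodes)                # 980
print(round(splits_per_episode, 2))  # 13.49
```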
Plans:
- I need to augment the instructions so that they are much more varied and natural (for example, some capitalized differently, some spelled slightly wrong, some with odd wording, etc.).
- Possibly make each sample overlap its neighbor by a small amount, around 10-35%, which might help the model understand context better.
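The overlap idea in the plans above could look something like the following. This is only a sketch under assumptions: `chunk_with_overlap` is a hypothetical helper, the overlap is measured as a fraction of the token budget, and the word-count function again stands in for the LLaMA tokenizer. Nothing here reflects an implementation the author has published.

```python
from typing import Callable, Iterable, List, Tuple


def whitespace_token_count(text: str) -> int:
    """Stand-in counter (NOT the LLaMA tokenizer): counts whitespace-separated words."""
    return len(text.split())


def chunk_with_overlap(
    lines: Iterable[str],
    max_tokens: int = 512,
    overlap: float = 0.2,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedy newline-aligned chunking where each chunk repeats the tail of the previous one.

    Up to `overlap * max_tokens` tokens of trailing whole lines are carried
    into the next chunk. A line longer than `max_tokens` becomes its own chunk.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    budget = int(max_tokens * overlap)
    chunks: List[str] = []
    current: List[Tuple[str, int]] = []  # (line, token count)

    for line in lines:
        n = count_tokens(line)
        if current and sum(t for _, t in current) + n > max_tokens:
            chunks.append("\n".join(text for text, _ in current))
            # Carry over as many trailing lines as fit in the overlap budget.
            carried: List[Tuple[str, int]] = []
            carried_tokens = 0
            for text, t in reversed(current):
                if carried_tokens + t > budget:
                    break
                carried.insert(0, (text, t))
                carried_tokens += t
            current = carried
            # Drop the oldest carried lines if the new line still does not fit.
            while current and sum(t for _, t in current) + n > max_tokens:
                current.pop(0)
        current.append((line, n))

    if current:
        chunks.append("\n".join(text for text, _ in current))
    return chunks


if __name__ == "__main__":
    demo = ["a", "b", "c", "d", "e", "f"]
    print(chunk_with_overlap(demo, max_tokens=4, overlap=0.5))
```

Carrying whole lines keeps every chunk aligned to speaker turns, at the cost of the real overlap being somewhat below the requested fraction.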
Original Datasets:
- [Futurama](https://www.kaggle.com/datasets/josephvm/futurama-seasons-16-transcripts?select=only_spoken_text.csv)
- [The Simpsons](https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset?resource=download&select=simpsons_script_lines.csv)
- [Seinfeld](https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles?select=scripts.csv)
- [South Park](https://www.kaggle.com/datasets/thedevastator/south-park-scripts-dataset?select=All-seasons.csv)
Provider:
xzuyn
Summary of Original Information
Dataset Overview
Language
- English (en)
Dataset Size
- Size: 10K<n<100K
Task Categories
- Text generation
- Conversational
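Data Processing
- The dataset contains 13,217 samples, each split to fit within 512 LLaMA tokens.
- Each output is trimmed to the nearest newline (`\n`), with a length of at most 512 tokens.
- There are 980 episodes in total, averaging about 13.49 samples per episode.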
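TV Shows Included
- Futurama - 114 episodes
- The Simpsons - 568 episodes
- Seinfeld - 24 episodes
- South Park - 274 episodes
Future Plans
- Make the instructions more varied and natural, for example through different capitalization, misspellings, or unusual wording.
- Possibly make adjacent samples overlap by 10-35% to help with understanding context.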



