52AI/TinyStoriesZh|语言模型数据集|儿童故事数据集
收藏数据集概述
数据来源
- 数据集 TinyStories 是由研究者使用 GPT3.5 和 GPT4 生成的关于小故事的场景数据。
- 故事难度限制在 3~4岁小朋友能理解。
数据处理
- 这份中文数据是通过 翻译器 将英文故事数据翻译而成。
示例内容
-
英文故事示例:
Lily and Ben are friends. They like to play in the park. One day, they see a big tree with a swing. Lily wants to try the swing. She runs to the tree and climbs on the swing. "Push me, Ben!" she says. Ben pushes her gently. Lily feels happy. She swings higher and higher. She laughs and shouts. Ben watches Lily. He thinks she is cute. He wants to swing too. He waits for Lily to stop. But Lily does not stop. She swings faster and faster. She is having too much fun. "Can I swing too, Lily?" Ben asks. Lily does not hear him. She is too busy swinging. Ben feels sad. He walks away. Lily swings so high that she loses her grip. She falls off the swing. She lands on the ground. She hurts her foot. She cries. "Ow, ow, ow!" she says. She looks for Ben. She wants him to help her. But Ben is not there. He is gone. Lily feels sorry. She wishes she had shared the swing with Ben. She wishes he was there to hug her. She limps to the tree. She sees something hanging from a branch. It is Bens hat. He left it for her. Lily smiles. She thinks Ben is nice. She puts on his hat. She hopes he will come back. She wants to say sorry. She wants to be friends again.
-
中文翻译示例:
莉莉和本是朋友。他们喜欢在公园里玩。有一天,他们看到一棵有秋千的大树。莉莉想尝试秋千。她跑到树旁,爬上秋千。 “推我吧,本!”她说。本轻轻地推了她一下。莉莉感觉很幸福。她荡得越来越高。她又笑又叫。 本看着莉莉。他觉得她很可爱。他也想摇摆。他等着莉莉停下来。但莉莉并没有停下来。她摆动得越来越快。她玩得太开心了。 “我也可以荡秋千吗,莉莉?”本问。莉莉没有听见他的话。她正忙着荡秋千。本感到难过。他走开了。 莉莉荡得太高,以至于她失去了抓力。她从秋千上摔下来。她降落在地上。她的脚受伤了。她哭了。 “呜呜呜!”她说。她寻找本。她想要他帮助她。但本不在那儿。他已经去了。 莉莉感到抱歉。她希望自己能和本一起荡秋千。她希望他能在那里拥抱她。她一瘸一拐地走向树。她看到树枝上挂着什么东西。这是本的帽子。他留给她了。 莉莉微笑着。她认为本很好。她戴上他的帽子。她希望他能回来。她想说对不起。她想再次成为朋友。

豆瓣数据集
该数据集通过爬虫技术从豆瓣网站获取了48223条电影数据,并与movielens ml-latest数据集通过共同的imdb字段进行交集处理,最终得到15752条共同数据。数据存储格式为JSON,支持导入到MongoDB或其他数据库使用。
github 收录
FER2013
FER2013数据集是一个广泛用于面部表情识别领域的数据集,包含28,709个训练样本和7,178个测试样本。图像属性为48x48像素,标签包括愤怒、厌恶、恐惧、快乐、悲伤、惊讶和中性。
github 收录
YOLO Drone Detection Dataset
为了促进无人机检测模型的开发和评估,我们引入了一个新颖且全面的数据集,专门为训练和测试无人机检测算法而设计。该数据集来源于Kaggle上的公开数据集,包含在各种环境和摄像机视角下捕获的多样化的带注释图像。数据集包括无人机实例以及其他常见对象,以实现强大的检测和分类。
github 收录
PCLT20K
PCLT20K数据集是由湖南大学等机构创建的一个大规模PET-CT肺癌肿瘤分割数据集,包含来自605名患者的21,930对PET-CT图像,所有图像都带有高质量的像素级肿瘤区域标注。该数据集旨在促进医学图像分割研究,特别是在PET-CT图像中肺癌肿瘤的分割任务。
arXiv 收录
Adult Census Income dataset
该数据集由UCI机器学习库提供,包含个人的 demographic 信息及其收入水平。
github 收录