SkySyrup/tinystories_german
Hugging Face · updated 2024-02-23 · indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/SkySyrup/tinystories_german
Resource description:
---
language:
- de
license: apache-2.0
---
### What have you done
this dataset is a German interpretation of the roneneldan/TinyStories dataset
that dataset is amazing, and I wanted to make a German version to experiment with the bilinguality of tiny language models (more coming on that soon!!!) (i wrote a paper :D)
this is the result of a bunch of work and months of screwing around
it was made with basically 0 budget
----
argos-translate contains 200k TinyStories translated with OpenNMT
german_GEMINI_async-combined contains about 180k synthetically generated German stories, made with some extremely generous token donations from google (potentially unwilling)
v2_GERMAN.txt contains about 80k stories generated with leo-hessenai-7b-chat
v4_GERMAN.txt contains about 50-60k stories generated with leo-mistral-hessenai-7b-chat
----
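
if you just want to poke at one of the `.txt` files after downloading it, here is a minimal sketch for reading it story by story. It assumes the stories are separated by a `<|endoftext|>` marker, the way the original TinyStories text dumps are; that is an assumption about these files, not something stated on this card, so check a file first and pass a different `separator` if needed. The names `iter_stories` and `count_stories` are made up for this example.

```python
from pathlib import Path
from typing import Iterator

# ASSUMPTION: stories are separated by this marker, as in the original
# TinyStories text dumps. Inspect the actual file and adjust if it differs.
ASSUMED_SEPARATOR = "<|endoftext|>"


def iter_stories(path: str, separator: str = ASSUMED_SEPARATOR) -> Iterator[str]:
    """Yield each non-empty story from a plain-text file."""
    # Reads the whole file into memory, which is fine for a quick look.
    text = Path(path).read_text(encoding="utf-8")
    for chunk in text.split(separator):
        story = chunk.strip()
        if story:
            yield story


def count_stories(path: str, separator: str = ASSUMED_SEPARATOR) -> int:
    """Count stories, e.g. to sanity-check the rough sizes listed above."""
    return sum(1 for _ in iter_stories(path, separator))
```

for example, `count_stories("v2_GERMAN.txt")` should land somewhere near 80k if the separator assumption holds for that file.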
there were some smaller datasets that weren't uploaded because they sucked due to the models used to generate them
the models trained with these datasets and a custom tokenizer show a lot of promise, they aren't quite at the level of the english stories but they're pretty good by my standards
it took me about 5 months of on-and-off generating and working to create these datasets
## Dataset usage
do whatever you want with it, i don't like copyright
but if you build something cool maybe say it contains / was based on this dataset and link to it pretty please
## Limitations and biases
seriously? this dataset was made with models not intended for nsfw/rp and also not prompted in such a way
get off my back ai safety people
Provided by:
SkySyrup
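Summary of original information
Dataset overview
This dataset is a German version of the roneneldan/TinyStories dataset, intended for exploring the bilingual abilities of small language models. It took several months to make, with essentially no budget.
Dataset composition
argos-translate contains 200k TinyStories translated with OpenNMT. german_GEMINI_async-combined contains about 180k synthetically generated German stories, made with generous token donations from Google (potentially unwilling). v2_GERMAN.txt contains about 80k stories generated with leo-hessenai-7b-chat. v4_GERMAN.txt contains about 50-60k stories generated with leo-mistral-hessenai-7b-chat.
Dataset creation
Some smaller datasets were not uploaded because their quality was poor, mainly due to the models used to generate them. Models trained on these datasets with a custom tokenizer show a lot of promise; they are not yet at the level of the English stories, but they are pretty good by the author's standards.
Dataset usage
Use of the dataset is unrestricted; the author does not like copyright. If you build something interesting with it, the author asks that you mention the dataset and link to it.
Limitations and biases
The dataset was generated with models not intended for NSFW/RP, and they were not prompted in that way.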



