1024m/mMGTD-Corpus
Hugging Face | Updated 2024-07-22 | Indexed 2024-07-22
Download link:
https://hf-mirror.com/datasets/1024m/mMGTD-Corpus
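Resource description:
The dataset is intended as a solid foundation for machine-generated text detection and other linguistic tasks. It is planned to contain 5 million annotated samples in more than 100 languages, generated with 10+ popular open-source and proprietary LLMs. The sample set for each language-LLM pair contains 10,000 samples: 1,000 written entirely by humans, 1,000 generated entirely by machine, and 8,000 partially machine-generated. The dataset currently supports 23 languages and is to be expanded to 102. It is licensed under cc-by-nc-nd-4.0, for non-commercial use.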
This dataset card aims to be a good foundation for machine-generated text portion detection and other linguistic tasks. When ready, it will consist of 5M annotated samples from over 100 languages, generated using 10+ popular LLMs, both open-source and proprietary, with 10,000 samples from each language-LLM pair and twice as many for English texts. Each 10,000-sample set will consist of 10% (1,000) completely human-written texts, 10% (1,000) completely machine-generated texts, and the remaining 80% (8,000) partially machine-generated texts.
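The per-pair composition described above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the label names and the `split_counts` helper are hypothetical and do not come from the dataset itself, and applying the same 10/10/80 proportions to the doubled English set is an assumption the card does not confirm.

```python
from fractions import Fraction

# Proportions stated in the dataset card; the label names are hypothetical.
PROPORTIONS = {
    "human_written": Fraction(1, 10),
    "machine_generated": Fraction(1, 10),
    "partially_machine_generated": Fraction(8, 10),
}


def split_counts(total: int = 10_000) -> dict[str, int]:
    """Return the number of samples per category for one language-LLM pair."""
    if total % 10 != 0:
        raise ValueError("total must be divisible by 10 for an exact split")
    # Exact rational arithmetic avoids floating-point rounding in the counts.
    return {label: int(total * share) for label, share in PROPORTIONS.items()}


# One language-LLM pair, as described in the card.
pair_counts = split_counts()

# English is described as having twice as many samples; keeping the same
# proportions for that larger set is an assumption made here.
english_counts = split_counts(2 * 10_000)

print(pair_counts)
print(english_counts)
```

Using exact fractions keeps the three counts summing to the set size without rounding.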
Provider:
1024m



