tafseer-nayeem/KidLM-corpus

Name: tafseer-nayeem/KidLM-corpus
Creator: tafseer-nayeem
Published: 2024-11-10 06:40:24
License: CC BY-NC-SA 4.0

Source: Hugging Face
Updated: 2024-11-10
Indexed: 2024-12-14

Download link:

https://hf-mirror.com/datasets/tafseer-nayeem/KidLM-corpus
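
As a rough illustration of how the corpus might be fetched, the sketch below builds the dataset page URL from the repository identifier and wraps a call to `load_dataset` from the Hugging Face `datasets` library. The helper names `dataset_page_url` and `load_kidlm` are hypothetical, introduced here for illustration only. It assumes the `datasets` package is installed and that the mirror honors the `HF_ENDPOINT` environment variable; no split or configuration name is assumed, since the listing does not state one.

```python
import os

REPO_ID = "tafseer-nayeem/KidLM-corpus"  # dataset identifier from this listing
MIRROR = "https://hf-mirror.com"         # mirror host taken from the download link


def dataset_page_url(repo_id: str, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a given endpoint (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def load_kidlm(use_mirror: bool = True):
    """Load the corpus with the `datasets` library (hypothetical helper).

    Assumption: HF_ENDPOINT must be set before `datasets` / `huggingface_hub`
    are first imported in the process, otherwise the default endpoint is used.
    If the repository defines several configurations, `load_dataset` may also
    need a configuration name, which this listing does not specify.
    """
    if use_mirror:
        # Only set the mirror if the caller has not chosen an endpoint already.
        os.environ.setdefault("HF_ENDPOINT", MIRROR)
    from datasets import load_dataset  # lazy import so the env var takes effect
    return load_dataset(REPO_ID)


if __name__ == "__main__":
    # Prints the page URL only; call load_kidlm() to download the data.
    print(dataset_page_url(REPO_ID))
```

Calling `load_kidlm()` downloads the data over the network, so it is kept out of the default run path. Note the CC BY-NC-SA 4.0 terms described in the overview before using the corpus.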

Resource overview:

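The KidLM corpus contains high-quality, child-appropriate content written specifically for children and occasionally by them. All content has been carefully reviewed and validated by website editors or moderators to ensure suitability and to remove inappropriate content or sensationalist reporting. Our user-centric data collection pipeline applies quality filtering and is comprehensive, diverse, and carefully tailored to developing language models for young audiences. For more details, see our EMNLP 2024 paper.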

The KidLM corpus is a dataset for developing language models aimed at young audiences. It consists of high-quality, child-appropriate content, written for children and occasionally by them, that website editors or moderators have reviewed and validated to ensure suitability and remove inappropriate material. The content spans genres such as science, sports, and history, and was collected from 21 sources worldwide. The dataset also includes information on data verification, filtering, and ethical considerations, with emphasis on privacy and the avoidance of personally identifying information. It is licensed under CC BY-NC-SA 4.0, which restricts its use to non-commercial research purposes.

Provided by:

tafseer-nayeem
