DataProvenanceInitiative/commercial_licenses_and_terms
Hugging Face | Updated 2024-09-09 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/DataProvenanceInitiative/commercial_licenses_and_terms
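Resource summary:
This dataset addresses the lack of transparency and understanding around the data used to train current language models, particularly detailed information about data licenses, sources, and provenance. It comprises multiple sub-datasets, each with a detailed description and a source link. Its main purpose is to provide the most detailed and reliable metadata on data licenses, sources, and provenance, together with fine-grained characteristics such as language, text domain, topic, usage, collection time, and task composition. By providing this information, the dataset aims to support more informed and responsible data-centric development of future language models.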
This dataset is part of the Data Provenance Initiative, which includes extensive metadata about data licenses, sources, and provenance, as well as detailed characteristics such as language, text domains, topics, usage, collection time, and task compositions. The dataset aims to enhance transparency and understanding in the use of large language models by providing detailed metadata about the datasets used in their training. The dataset includes nearly 40 popular instruction tuning collections and provides tools for downloading, filtering, and examining the training data. The README also includes a table listing the constituent data collections, their descriptions, and sources, as well as information about the languages included in the dataset. The dataset is intended to empower more informed and responsible data-centric development of future language models.
Provider:
DataProvenanceInitiative



