PleIAs/Serbian-PD

Name: PleIAs/Serbian-PD
Creator: PleIAs
Published: 2024-07-29 19:10:08
License: No description provided

Source: Hugging Face. Updated: 2024-07-29. Indexed: 2024-04-21.

Download link:

https://hf-mirror.com/datasets/PleIAs/Serbian-PD

Summary:

The Serbian-Public Domain (Serbian-PD) dataset is a large collection that aims to aggregate all Serbian-language monographs and periodicals in the public domain. As of March 2024, it is the largest open corpus in Serbian. The collection contains 1,405 titles totaling 156,712,807 words, recovered from multiple sources, including the Internet Archive and various European national libraries and cultural heritage institutions. Each Parquet file contains the full text of 2,000 randomly selected books. The dataset is built according to EU public domain criteria, and an expansion to works from the late 19th and early 20th centuries is planned. Its goal is to provide open text for training large language models (LLMs); it can be used without restriction for model training and republication.

Provider:

PleIAs

Summary of original information

Dataset overview

Dataset name

Serbian-Public Domain, or Serbian-PD

Dataset description

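Size: 1,405 titles totaling 156,712,807 words.
Sources: the Internet Archive and several European national libraries and cultural heritage institutions.
File format: each Parquet file contains the full text of 2,000 randomly selected books.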
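
As a rough sanity check on the figures above, the arithmetic can be sketched in a few lines. This assumes that "titles" and "books" count the same unit, which the listing does not confirm; all variable names are illustrative.

```python
import math

# Figures quoted in the dataset description.
TITLES = 1_405
TOTAL_WORDS = 156_712_807
BOOKS_PER_FILE = 2_000

# Mean length of a title in words.
avg_words_per_title = TOTAL_WORDS / TITLES

# Number of Parquet files needed if every title is one "book"
# (an assumption; the listing does not say how the two units relate).
files_needed = math.ceil(TITLES / BOOKS_PER_FILE)

print(f"Average words per title: {avg_words_per_title:,.0f}")
print(f"Files needed at {BOOKS_PER_FILE:,} books per file: {files_needed}")
```

Under that assumption, the whole collection would fit in a single file at the stated file size, so the 2,000-book figure is best read as an upper bound per file.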

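Dataset composition

Selection criteria: follows the definition of public domain works used by the European Union and Berne Convention countries, that is, publications whose authors died more than 70 years ago.
Current limitation: as of March 2024, only publications dated before 1884 are included.
Future plans: extend coverage to publications from the late 19th and early 20th centuries, after verifying their public domain status.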
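
The two rules above (author deceased for more than 70 years, publication before 1884) can be expressed as a minimal sketch. This is not the maintainers' actual pipeline: the `Work` record, its field names, and both helper functions are hypothetical, introduced only to illustrate the criteria.

```python
from dataclasses import dataclass
from typing import Optional

PROTECTION_YEARS = 70      # "author died more than 70 years ago"
PUBLICATION_CUTOFF = 1884  # current release: publications before 1884 only


@dataclass
class Work:
    """Hypothetical metadata record; field names are assumptions."""
    title: str
    publication_year: int
    author_death_year: Optional[int] = None  # None when unknown


def is_public_domain(work: Work, current_year: int) -> bool:
    """True if the author died more than 70 years before current_year.

    Works with an unknown death year are treated conservatively
    as not verified.
    """
    if work.author_death_year is None:
        return False
    return current_year - work.author_death_year > PROTECTION_YEARS


def in_current_release(work: Work) -> bool:
    """True if the work falls under the pre-1884 publication cutoff."""
    return work.publication_year < PUBLICATION_CUTOFF


# Usage with made-up example records.
early = Work("Example A", publication_year=1850, author_death_year=1900)
later = Work("Example B", publication_year=1905, author_death_year=1954)
print(is_public_domain(early, 2024), in_current_release(early))
print(is_public_domain(later, 2024), in_current_release(later))
```

Separating the two checks mirrors the planned expansion: later works would pass the cutoff once it is raised, but would still need their public domain status verified individually.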

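Dataset uses

Primary purpose: training large language models. The corpus supports unrestricted model training and republication, to promote reproducible research.
Rationale for creation:
- Scientific: addresses the closed nature of training corpora in AI research.
- Legal: adapts to the AI Act's requirements for copyright compliance of pretraining corpora.
- Cultural: strengthens the representation of the European Union's linguistic diversity.
- Economic: reduces economic dependence on data-rich actors and encourages innovative uses.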