Russian-PD
ModelScope community · updated 2025-12-05 · indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/PleIAs/Russian-PD
Description:
# 🇷🇺 Russian Public Domain 🇷🇺
**Russian Public Domain** or **Russian-PD** is a large collection that aims to aggregate all Russian monographs and periodicals in the public domain.
## Dataset summary
The collection contains 8525 titles making up 995,163,165 words recovered from the Internet Archive. Each parquet file has the full text of 2,000 books selected at random.
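As a rough illustration of how title and word totals like those above could be tallied, here is a minimal sketch. The card does not state how words were counted, so whitespace splitting is an assumption, and the `title` and `text` field names are hypothetical rather than the dataset's documented schema.

```python
def count_words(text: str) -> int:
    # Assumption: a "word" is any whitespace-delimited token.
    return len(text.split())


def corpus_stats(records):
    """Return (number_of_titles, total_words) for an iterable of records.

    Each record is assumed to be a dict with a hypothetical "text" field
    holding the full text of one title.
    """
    titles = 0
    words = 0
    for record in records:
        titles += 1
        words += count_words(record["text"])
    return titles, words


# Tiny in-memory sample standing in for rows read from one parquet file.
sample = [
    {"title": "Пример первый", "text": "Это короткий пример текста"},
    {"title": "Пример второй", "text": "Ещё один пример"},
]

print(corpus_stats(sample))
```

In practice the records would come from the parquet files rather than an in-memory list, and a different tokenization would give a different word total.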
## Curation method
The composition of the dataset adheres to the criteria for public domain works in the Russian Federation: any publication whose author has been dead for more than 70 years.
As of March 2024, to limit rights verification, we have retained exclusively titles published prior to 1884.
The corpus will be expanded at a later stage to encompass late 19th century and early 20th century publications, after checking for public domain validity.
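The two rules described in this section can be sketched as simple predicates. This is only an illustration under stated assumptions: the function names and the year-based metadata are hypothetical, and the arithmetic ignores the exact legal start date of the copyright term.

```python
CUTOFF_YEAR = 1884    # from the card: only titles published prior to 1884 are retained
PD_TERM_YEARS = 70    # from the card: author dead for more than 70 years


def published_before_cutoff(publication_year: int, cutoff: int = CUTOFF_YEAR) -> bool:
    # Conservative proxy used to limit rights verification.
    return publication_year < cutoff


def author_term_expired(death_year: int, as_of_year: int,
                        term: int = PD_TERM_YEARS) -> bool:
    # Simplified check: more than `term` years have passed since the author's death.
    return as_of_year - death_year > term


print(published_before_cutoff(1883))     # title from 1883 is retained
print(published_before_cutoff(1884))     # title from 1884 is not
print(author_term_expired(1950, 2024))   # 74 years since death
```

The publication-date cutoff works as a shortcut because it avoids checking each author's death date individually.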
## Uses
The collection aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.
There are several rationales for creating this collection:
* **Scientific**: We observe that the closure of training corpora represents a major barrier to AI research. Large language models face a real crisis of reproducibility.
* **Legal**: With the adoption of the AI Act and its obligations regarding copyright compliance for pretraining corpora, the European AI ecosystem will have to change its provenance practices.
* **Cultural**: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
* **Economic**: Today, value capture is concentrated on players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependencies on dominant actors.
## License
The entire collection is in the public domain in all regions. This means that the patrimonial rights of every individual or collective rights holder have expired.
## Future work
This dataset is not a one-time work but will continue to evolve significantly in three directions:
* Expansion of the dataset to late 19th and early 20th century works and its further enhancement with currently unexploited collections coming from European patrimonial data repositories.
* Correction of computer-generated errors in the text. All the texts have been transcribed automatically using Optical Character Recognition (OCR) software. The original files have been digitized over a long time period (since the mid-2000s) and some documents contain recognition errors. Future versions will strive either to re-OCR the original text or to use experimental LLMs for partial OCR correction.
* Enhancement of the structure/editorial presentation of the original text. Some parts of the original documents are likely unwanted for large scale analysis or model training (header, page count…). Additionally, some advanced document structures like tables or multi-column layout are unlikely to be well-formatted.
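As one small example of the kind of cleanup mentioned in the last item, the sketch below drops lines that consist only of a short number, which is how page numbers often appear in OCR output. It is a hypothetical heuristic, not the maintainers' pipeline, and it would also drop a standalone year such as `1883` on its own line.

```python
import re

# Assumption: a page-number line is 1 to 4 digits with optional surrounding whitespace.
PAGE_NUMBER_LINE = re.compile(r"^\s*\d{1,4}\s*$")


def drop_page_number_lines(text: str) -> str:
    """Remove lines that contain nothing but a short number."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)


page = "Глава I\n 12 \nТекст 1883 года"
print(drop_page_number_lines(page))
```

Running headers, tables, and multi-column layouts need more than a line-level filter, so this covers only the simplest case.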
## Acknowledgements
The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).
Corpus collection was greatly facilitated by the insights and cooperation of the open science LLM community (Occiglot, Eleuther AI, Allen AI).
<div style="text-align: center;">
<img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
Provider:
maas
Created:
2025-06-19



