Darknecrocities/Greek-PD
Hugging Face · Updated 2025-12-19 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Darknecrocities/Greek-PD
Description:
Greek Public Domain (Greek-PD) is a large collection that aims to aggregate all Greek monographs and periodicals in the public domain. As of March 2024, it is the largest open Greek corpus. The collection contains 1,405 titles totaling 156,712,807 words, recovered from multiple sources, including the Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file has the full text of 2,000 books selected at random. The composition of the dataset adheres to the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. As of March 2024, to limit rights verification, we have retained only titles published before 1884. The corpus aims to expand the availability of open works for training Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes. The entire collection is in the public domain in all regions, meaning that the patrimonial rights of every individual or collective rights holder have expired. Future work includes expanding the dataset to works from the late 19th and early 20th centuries, correcting computer-generated (OCR) errors in the text, and improving the structure and editorial presentation of the original text.
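The selection rule described above (author dead for more than 70 years, and only titles published before 1884) can be sketched as a small filter. This is an illustrative sketch only, not the maintainers' actual pipeline: the `Title` record, its field names, and the `is_retained` helper are hypothetical and do not reflect the dataset's real schema.

```python
from dataclasses import dataclass
from typing import Optional

PUBLICATION_CUTOFF = 1884  # from the description: only titles published before 1884
TERM_AFTER_DEATH = 70      # from the description: author dead for more than 70 years


@dataclass
class Title:
    # Hypothetical record layout, not the dataset's actual schema.
    name: str
    publication_year: int
    author_death_year: Optional[int] = None  # None when the death year is unknown


def is_retained(title: Title, reference_year: int = 2024) -> bool:
    """Return True if a title passes the rules as stated in the description."""
    # Conservative publication cutoff, used to limit rights verification.
    if title.publication_year >= PUBLICATION_CUTOFF:
        return False
    # When the death year is known, also require more than 70 years since death.
    if title.author_death_year is not None:
        return reference_year - title.author_death_year > TERM_AFTER_DEATH
    return True
```

The publication cutoff acts as a cheap proxy: a title published before 1884 is very likely to have an author who died more than 70 years ago, so per-author verification can be skipped in most cases.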
Provider:
Darknecrocities



