
mpilhlt/salamanca-abbr

Source: Hugging Face. Updated 2026-04-06. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mpilhlt/salamanca-abbr
Description:
---
license: cc-by-4.0
doi: 10.57967/hf/8278
language:
- la
- es
tags:
- history
- humanities
- early-modern
- historical-text
task_categories:
- text-generation
- token-classification
pretty_name: Salamanca Abbreviation and Hyphenation Dataset
size_categories:
- 1M<n<10M
---

# Salamanca Abbreviation and Hyphenation Dataset

This is a dataset created from manually edited and curated digital edition texts of the so-called School of Salamanca, a group of 16th- and 17th-century theologians and jurists. The digital editions can be studied at the [School of Salamanca Website](https://salamanca.school/), together with a dictionary of the political-juridical language these authors were using and contributing to shape.

The corpus contains printed texts of various genres (academic summae, in some cases an author's collected works, as well as pragmatic booklets for merchants or confessors) in Latin and Spanish, but all the texts are concerned with law, politics, and ethics.

The pipeline extracting the dataset from the TEI XML sources as they have been prepared in the project is documented in the [SvSal-PoCo repository](https://github.com/digicademy/svsal-poco) at GitHub, more specifically in the [data/prepare_data subfolder](https://github.com/digicademy/svsal-poco/tree/main/data/prepare_data).

The creation of the dataset happened in the course of an experiment aiming to establish machine learning tools to aid the project's editors in their work, i.e. detecting cases where a word has been broken to straddle two lines without this being indicated by a hyphenation dash, and expanding abbreviations (also at times straddling two or even three lines - yes, these exist). The experiment's pipeline code and tools can be accessed at the GitHub repository, too.
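To make the unmarked-hyphenation task described above concrete, here is a minimal Python sketch of a naive lexicon-lookup baseline. It is not the project's method (the experiment uses machine learning tools documented in the GitHub repository) and it does not rely on the dataset's actual schema. The function name `find_unmarked_breaks`, the toy lexicon, and the sample lines are all hypothetical and exist only for illustration.

```python
PUNCT = ".,;:!?"


def find_unmarked_breaks(lines, lexicon):
    """Return (line_index, joined_word) pairs where a line end may split a word.

    Hypothetical baseline: a break is flagged when the last token of a line
    and the first token of the next line form a known word once joined,
    while at least one of the two halves is not a known word by itself.
    """
    candidates = []
    for i in range(len(lines) - 1):
        left = lines[i].split()
        right = lines[i + 1].split()
        if not left or not right:
            continue
        tail = left[-1]
        head = right[0].strip(PUNCT)
        # Skip breaks already marked with a hyphenation dash, and line ends
        # closed by punctuation, which cannot be mid-word.
        if not head or tail.endswith("-") or tail[-1] in PUNCT:
            continue
        joined = (tail + head).lower()
        both_halves_known = tail.lower() in lexicon and head.lower() in lexicon
        if joined in lexicon and not both_halves_known:
            candidates.append((i, joined))
    return candidates


# Toy lexicon and sample lines, invented for this example.
toy_lexicon = {"lex", "est", "ordinatio", "rationis", "et", "iustitia", "regni"}
sample = ["lex est ordinatio rationis et iusti", "tia regni"]
print(find_unmarked_breaks(sample, toy_lexicon))
```

A lookup like this cannot decide cases where both halves are themselves valid words, and it depends entirely on the coverage of the lexicon, which is one reason a learned model may be preferable for early modern Latin and Spanish print.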

Provider:
mpilhlt