mpilhlt/salamanca-abbr
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mpilhlt/salamanca-abbr
Resource description:
---
license: cc-by-4.0
doi: 10.57967/hf/8278
language:
- la
- es
tags:
- history
- humanities
- early-modern
- historical-text
task_categories:
- text-generation
- token-classification
pretty_name: Salamanca Abbreviation and Hyphenation Dataset
size_categories:
- 1M<n<10M
---
# Salamanca Abbreviation and Hyphenation Dataset
This dataset was created from the manually edited and curated digital
edition texts of the so-called School of Salamanca, a group of
16th- and 17th-century theologians and jurists. The digital editions
can be studied at the
[School of Salamanca website](https://salamanca.school/), together
with a dictionary of the political-juridical language that these
authors used and helped to shape.
The corpus contains printed texts of various genres (academic summae,
in some cases an author's collected works, as well as pragmatic
booklets for merchants or confessors) in Latin and Spanish, but all
the texts are concerned with law, politics, and ethics.
The pipeline that extracts the dataset from the project's TEI XML
sources is documented in the
[SvSal-PoCo repository](https://github.com/digicademy/svsal-poco) on
GitHub, specifically in the
[data/prepare_data subfolder](https://github.com/digicademy/svsal-poco/tree/main/data/prepare_data).
The dataset was created in the course of an experiment that aims to
establish machine learning tools to aid the project's editors in two
tasks: detecting cases where a word has been broken across two lines
without a hyphen marking the break, and expanding abbreviations
(which at times straddle two or even three lines; yes, these exist).
The experiment's pipeline code and tools are available in the same
GitHub repository.
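To make the first task concrete, here is a minimal sketch of what
"detecting an unmarked line break" means. It is not the project's
method: the experiment uses machine learning, whereas this sketch
swaps in a naive lexicon lookup. The function name
`find_unmarked_breaks`, the `TOY_LEXICON` word list, and the sample
lines are all hypothetical and are not taken from the dataset.

```python
from typing import List, Set, Tuple

# Hypothetical toy lexicon; a real system would need a trained model
# or a far larger word list for early modern Latin and Spanish.
TOY_LEXICON: Set[str] = {"iustitia", "lex", "est", "de", "et", "iure"}


def find_unmarked_breaks(lines: List[str], lexicon: Set[str]) -> List[Tuple[int, str]]:
    """Return (line_index, joined_word) for each suspected unmarked break.

    A break is suspected between line i and line i+1 when the last token
    of line i carries no hyphen, the concatenation of that token with the
    first token of line i+1 is a known word, and the two fragments are
    not both known words on their own.
    """
    hits = []
    for i in range(len(lines) - 1):
        left = lines[i].split()
        right = lines[i + 1].split()
        if not left or not right:
            continue  # empty line: nothing to join
        head, tail = left[-1].lower(), right[0].lower()
        if head.endswith("-"):
            continue  # the break is already marked by a hyphen
        joined = head + tail
        if joined in lexicon and not (head in lexicon and tail in lexicon):
            hits.append((i, joined))
    return hits


# Made-up sample: "iustitia" is split across lines without a hyphen.
print(find_unmarked_breaks(["de iusti", "tia et iure"], TOY_LEXICON))
# prints [(0, 'iustitia')]
```

A lookup like this fails on spelling variants, abbreviated forms, and
fragments that happen to be words themselves, which suggests why a
learned model is the more plausible tool for the editors' use case.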
Provider:
mpilhlt



