
biglam/spanish_golden_age_sonnets

Source: Hugging Face. Updated 2022-08-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/biglam/spanish_golden_age_sonnets
Resource description:
---
annotations_creators: []
language:
- es
language_creators: []
license:
- cc-by-nc-4.0
multilinguality:
- monolingual
pretty_name: Spanish Golden-Age Sonnets
size_categories: []
source_datasets: []
tags: []
task_categories: []
task_ids: []
---

[![DOI](https://zenodo.org/badge/46981468.svg)](https://zenodo.org/badge/latestdoi/46981468)

# Corpus of Spanish Golden-Age Sonnets

## Introduction

This corpus comprises sonnets written in Spanish between the 16th and 17th centuries.

This corpus is a dataset saved in .csv, from a previous one in .xml. All the information of the original dataset can be consulted in [its original repository](https://github.com/bncolorado/CorpusSonetosSigloDeOro).

Each sonnet has been annotated in accordance with the TEI standard. Besides the header and structural information, each sonnet includes the formal representation of each verse’s particular **metrical pattern**.

The pattern consists of a sequence of unstressed syllables (represented by the "-" sign) and stressed syllables ("+" sign). Thus, each verse’s metrical pattern is represented as follows:

"---+---+-+-"

Each line in the metric_pattern codifies a line in the sonnet_text column.
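
The encoding described above can be unpacked with a few lines of Python. This is only a minimal sketch: it assumes that `sonnet_text` and `metric_pattern` are newline-separated strings with exactly one pattern per verse, as the introduction states. The function names `stressed_positions` and `pair_verses` are hypothetical helpers, not part of the corpus or of any library.

```python
from typing import List, Tuple


def stressed_positions(pattern: str) -> List[int]:
    """Return the 1-based syllable positions marked as stressed ('+')."""
    return [i for i, mark in enumerate(pattern.strip(), start=1) if mark == "+"]


def pair_verses(sonnet_text: str, metric_pattern: str) -> List[Tuple[str, str]]:
    """Pair each verse with its pattern (assumes one pattern line per verse)."""
    verses = sonnet_text.split("\n")
    patterns = metric_pattern.split("\n")
    if len(verses) != len(patterns):
        raise ValueError("verse and pattern line counts differ")
    return list(zip(verses, patterns))


# The example pattern quoted in the introduction.
example = "---+---+-+-"
print(len(example), stressed_positions(example))  # 11 [4, 8, 10]
```

Raising an error on mismatched line counts is a design choice for this sketch; a real pipeline might prefer to log and skip such rows instead.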

## Column description

- 'author' (string): Author of the sonnet described
- 'sonnet_title' (string): Sonnet title
- 'sonnet_text' (string): Full text of the specific sonnet, divided by lines ('\n')
- 'metric_pattern' (string): Full metric pattern of the sonnet, in text, with TEI standard, divided by lines ('\n')
- 'reference_id' (int): Id of the original XML file where the sonnet is extracted
- 'publisher' (string): Name of the publisher
- 'editor' (string): Name of the editor
- 'research_author' (string): Name of the principal research author
- 'metrical_patterns_annotator' (string): Name of the annotation's checker
- 'research_group' (string): Name of the research group that processed the sonnet

## Poets

With the purpose of having a corpus as representative as possible, every author from the 16th and 17th centuries with more than 10 digitalized and available sonnets has been included.

All texts have been taken from the [Biblioteca Virtual Miguel de Cervantes](http://www.cervantesvirtual.com/).

Currently, the corpus comprises more than 5,000 sonnets (more than 71,000 verses).

## Annotation

The metrical pattern annotation has been carried out in a semi-automatic way. Firstly, all sonnets have been processed by an automatic metrical scansion system which assigns a distinct metrical pattern to each verse. Secondly, a part of the corpus has been manually checked and errors have been corrected.

Currently the corpus is going through the manual validation phase, and each sonnet includes information about whether it has already been manually checked or not.

## How to cite this corpus

If you would like to cite this corpus for academic research purposes, please use this reference:

Navarro-Colorado, Borja; Ribes Lafoz, María, and Sánchez, Noelia (2015) "Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation" 10th edition of the Language Resources and Evaluation Conference 2016 Portorož, Slovenia. ([PDF](http://www.dlsi.ua.es/~borja/navarro2016_MetricalPatternsBank.pdf))

## Further Information

This corpus is part of the [ADSO project](https://adsoen.wordpress.com/), developed at the [University of Alicante](http://www.ua.es) and funded by [Fundación BBVA](http://www.fbbva.es/TLFU/tlfu/ing/home/index.jsp).

If you require further information about the metrical annotation, please consult the [Annotation Guide](https://github.com/bncolorado/CorpusSonetosSigloDeOro/blob/master/GuiaAnotacionMetrica.pdf) (in Spanish) or the following papers:

- Navarro-Colorado, Borja; Ribes-Lafoz, María and Sánchez, Noelia (2016) "Metrical Annotation of a Large Corpus of Spanish Sonnets: Representation, Scansion and Evaluation" [Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)](http://www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf) Portorož, Slovenia.
- Navarro-Colorado, Borja (2015) "A computational linguistic approach to Spanish Golden Age Sonnets: metrical and semantic aspects" [Computational Linguistics for Literature NAACL 2015](https://sites.google.com/site/clfl2015/), Denver (Co), USA ([PDF](https://aclweb.org/anthology/W/W15/W15-0712.pdf)).

## License

The metrical annotation of this corpus is licensed under a Creative Commons Attribution-Non Commercial 4.0 International License.

About the texts, "this digital object is protected by copyright and/or related rights. This digital object is accessible without charge, but its use is subject to the licensing conditions set by the organization giving access to it. Further information available at http://www.cervantesvirtual.com/marco-legal/ ".
Provided by:
biglam
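Summary of Original Information

Spanish Golden-Age Sonnets: Dataset Overview

Basic Information

  • Name: Spanish Golden-Age Sonnets
  • Language: Spanish
  • License: Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)
  • Multilinguality: monolingual
  • Dataset size: more than 5,000 sonnets (more than 71,000 verses)

Dataset Contents

  • Period covered: 16th to 17th centuries
  • Data source: all texts come from the Biblioteca Virtual Miguel de Cervantes
  • Data format: .csv, converted from the original .xml
  • Dataset structure:
    • Column description:
      • author (string): author of the sonnet
      • sonnet_title (string): title of the sonnet
      • sonnet_text (string): full text of the sonnet, divided by lines
      • metric_pattern (string): full metrical pattern of the sonnet, divided by lines
      • reference_id (int): ID of the original XML file from which the sonnet was extracted
      • publisher (string): name of the publisher
      • editor (string): name of the editor
      • research_author (string): name of the principal research author
      • metrical_patterns_annotator (string): name of the person who checked the metrical annotation
      • research_group (string): name of the research group that processed the sonnet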
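
The column layout listed above could be read with Python's standard `csv` module roughly as follows. This is a sketch under stated assumptions: it assumes the CSV has a header row using exactly these column names, and the sample row is made-up placeholder data, not content from the corpus. `COLUMNS` and `read_rows` are hypothetical names introduced for illustration.

```python
import csv
import io
from typing import Dict, List, TextIO, Union

# Column names as listed in the dataset description.
COLUMNS = [
    "author", "sonnet_title", "sonnet_text", "metric_pattern", "reference_id",
    "publisher", "editor", "research_author",
    "metrical_patterns_annotator", "research_group",
]


def read_rows(handle: TextIO) -> List[Dict[str, Union[str, int]]]:
    """Read rows keyed by column name, converting reference_id to int."""
    rows = []
    for record in csv.DictReader(handle):
        row: Dict[str, Union[str, int]] = {name: record[name] for name in COLUMNS}
        row["reference_id"] = int(record["reference_id"])
        rows.append(row)
    return rows


# Build a tiny in-memory CSV with placeholder values (NOT real corpus data).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
sample = {name: "placeholder" for name in COLUMNS}
sample.update({
    "sonnet_text": "first verse\nsecond verse",
    "metric_pattern": "-+-+\n+--+",
    "reference_id": "1",
})
writer.writerow(sample)

rows = read_rows(io.StringIO(buffer.getvalue(), newline=""))
print(rows[0]["reference_id"], rows[0]["sonnet_text"].split("\n"))
```

Opening the data with `newline=""` matters here because `sonnet_text` and `metric_pattern` contain embedded line breaks inside quoted fields.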

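Dataset Features

  • Metrical pattern: each sonnet includes a textual representation of its metrical pattern, following the TEI standard, as a sequence of unstressed syllables ("-") and stressed syllables ("+").
  • Annotation method: the metrical patterns were annotated semi-automatically. All sonnets were first processed by an automatic metrical scansion system, and part of the dataset was then manually checked and corrected.

Citation

  • Citation format:
    • Navarro-Colorado, Borja; Ribes Lafoz, María, and Sánchez, Noelia (2015) "Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation" 10th edition of the Language Resources and Evaluation Conference 2016 Portorož, Slovenia.

Copyright and License

  • Metrical annotation: licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0).
  • Text copyright: the texts are protected by copyright and related rights; their use is subject to the licensing conditions set by the organization giving access to them.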