
Opera Graeca Adnotata (OGA)

Source: arXiv. Updated 2024-04-01. Indexed 2024-06-21.
Download link:
https://zenodo.org/records/8158675
Overview:
Opera Graeca Adnotata (OGA) is a large, open-access, multilayer corpus of Ancient Greek developed by the Institute of Computer Science at Leipzig University. It comprises 1,687 literary works and more than 34 million tokens, drawn from the PerseusDL and OpenGreekAndLatin GitHub repositories and covering texts from 800 BCE to 250 CE. The corpus is enriched with seven annotation layers: tokenization, sentence segmentation, lemmatization, morphology, dependency relations, dependency functions, and CTS citations. Tokenization, sentence segmentation, and CTS citation were produced mainly with rule-based algorithms, while the morphosyntactic annotation relies on the COMBO parser. OGA is intended to support the digitization, analysis, and study of Ancient Greek texts across disciplines such as linguistics, history, and philology.
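To make the layered design more concrete, the following is a minimal Python sketch of how one token might carry all seven layers, together with a small parser for CTS URNs of the form `urn:cts:namespace:textgroup.work.version:passage`. It is not the OGA file format or any official API: the names `CtsUrn`, `parse_cts_urn`, and `AnnotatedToken`, the field layout, the function label, and the example URN are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    """Components of a CTS URN (hypothetical helper type, not part of OGA)."""
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its components; raise ValueError if malformed."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_component = parts[2], parts[3]
    # The passage reference is optional and follows the fourth colon.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    work_parts = work_component.split(".")
    if not namespace or not work_parts[0] or len(work_parts) > 3:
        raise ValueError(f"unsupported work component: {work_component!r}")
    # Pad so that missing work/version levels become None.
    textgroup, work, version = (work_parts + [None, None])[:3]
    return CtsUrn(namespace, textgroup, work, version, passage)


@dataclass(frozen=True)
class AnnotatedToken:
    """One token with the seven layers (illustrative layout only)."""
    form: str          # tokenization layer
    sentence_id: int   # sentence segmentation layer
    lemma: str         # lemmatization layer
    morphology: str    # morphological layer
    head: int          # dependency relation (index of the governing token)
    function: str      # dependency function label
    citation: CtsUrn   # CTS citation layer


# Illustrative example: the first word of the Iliad, annotated by hand.
token = AnnotatedToken(
    form="μῆνιν",
    sentence_id=1,
    lemma="μῆνις",
    morphology="noun, accusative singular feminine",
    head=2,            # governed by the second token, ἄειδε
    function="OBJ",    # assumed label; actual tagsets may differ
    citation=parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1"),
)
```

Keeping the citation as a parsed object rather than a raw string makes it straightforward to group tokens by work or by passage, which is the main purpose of a citation layer.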
Provided by:
Institute of Computer Science, Leipzig University
Created:
2024-04-01