Harrington Yowlumne Narrative Corpus

Name: Harrington Yowlumne Narrative Corpus
Creator: Western Institute for Endangered Language Documentation
Published: 2022-05-19 18:52:10
License: No description available

Source: arXiv. Updated: 2022-05-19. Indexed: 2024-07-30.

Download link:

http://corpus.ap-southeast-2.elasticbeanstalk.com/ywlcorpus/

Overview:

The Harrington Yowlumne Narrative Corpus is a dataset of 20 narrative texts created by the Western Institute for Endangered Language Documentation. The texts originate from the Tejoneño Yowlumne community of Kern County, California, and date from 1910 to 1925. The corpus contains 57,136 transcribed characters and 10,719 gold-standard text-normalized words, and it includes part-of-speech tags. Text normalization combined the Levenshtein distance algorithm with manual verification. The dataset is intended to help address the scarcity of data for low-resource languages in natural language processing. Because of the language's complex morphophonology, which has drawn considerable attention from theoretical phonologists, the corpus is particularly well suited to testing new phonological theories.
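The description mentions Levenshtein distance paired with manual validation but does not say how the two were combined. The following is only a minimal sketch of one plausible workflow, not the corpus authors' method: compute the edit distance between a transcribed token and each entry in a reference word list, then propose the nearest entry for a human to confirm. The function names, the `max_distance` threshold, and the placeholder strings are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest_normalization(token, lexicon, max_distance=2):
    """Return the closest lexicon entry, or None if none is near enough.

    Hypothetical helper: the threshold and the alphabetical tie-break are
    illustrative choices, not details taken from the corpus.
    """
    best = min(lexicon, key=lambda w: (levenshtein(token, w), w), default=None)
    if best is None or levenshtein(token, best) > max_distance:
        return None  # leave the token for fully manual normalization
    return best


if __name__ == "__main__":
    # Placeholder English strings, not Yowlumne data.
    print(levenshtein("kitten", "sitting"))  # 3
    print(suggest_normalization("kiten", ["kitten", "mitten", "sitting"]))
```

In a workflow like this, the suggestion would only be a candidate: the manual validation step named in the description decides whether it is accepted.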

Provider:

Western Institute for Endangered Language Documentation

Created:

2021-02-01
