DGB1-2023corpus_dict - Corpus and dictionary files for 2023
Source: Research Data Australia (indexed 2025-01-18)
Download link:
https://researchdata.edu.au/null/3451449
Description:
A compiled Matukar Panau corpus of 150,740 words, including words in context, speaker metadata, file metadata and, where available, parsing, glossing and translations. A subset of this corpus is included in a separate file as a morpheme corpus, with parsing and glossing of 20,359 morphemes. Most files have been standardized for spelling. The spelling standardization script package for ELAN was developed by Jake Farrell, AI Specialist at Appen, for use by CoEDL researchers. A lexicon from ELAN in XML format is included. An annotation guideline for clause chains is also included; the clause chain annotations are in tiers with the ELAN type "chain". Language as given:
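The description says that clause chain annotations live in tiers with the ELAN type "chain". As a minimal sketch of how such tiers might be read, the Python below parses an ELAN annotation document (EAF, which is XML) and collects the annotation values from every tier whose LINGUISTIC_TYPE_REF is "chain". The sample document, tier IDs and annotation text are hypothetical and are not taken from the corpus; the assumption that the "chain" type appears as a tier's LINGUISTIC_TYPE_REF follows the usual EAF layout and has not been checked against these files.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature EAF document for illustration only;
# tier IDs and annotation values are invented, not from the corpus.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="chain@A" LINGUISTIC_TYPE_REF="chain">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example chain label</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="words@A" LINGUISTIC_TYPE_REF="words">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>example word</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def chain_annotations(root, tier_type="chain"):
    """Return (tier_id, annotation_value) pairs for tiers of the given type."""
    results = []
    for tier in root.iter("TIER"):
        # Skip tiers whose linguistic type is not the one requested.
        if tier.get("LINGUISTIC_TYPE_REF") != tier_type:
            continue
        # Both alignable and reference annotations carry an ANNOTATION_VALUE.
        for value in tier.iter("ANNOTATION_VALUE"):
            results.append((tier.get("TIER_ID"), (value.text or "").strip()))
    return results


root = ET.fromstring(SAMPLE_EAF)
rows = chain_annotations(root)
for tier_id, text in rows:
    print(tier_id, text)
```

To apply the same function to a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE_EAF)`, where `path` points to one of the downloaded .eaf files.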
Provider:
not available



