GINCO Training Dataset
Source: arXiv. Updated: 2022-01-11. Indexed: 2024-06-21.
Download link:
http://hdl.handle.net/11356/1467
Description:
The GINCO Training Dataset is a training dataset for automatic genre recognition, created by the Jožef Stefan Institute in Slovenia. It comprises 1,125 Slovenian web documents totaling approximately 650,000 words. The documents were manually annotated with a new annotation scheme designed to improve label clarity and inter-annotator agreement. The dataset retains challenges typical of web data, such as machine-translated content and encoding errors, so classifiers can be evaluated under realistic conditions. Its primary purpose is genre recognition for web documents. It also supports in-depth analysis of the quality and composition of newly released corpora, and is intended to benefit language technology applications such as part-of-speech tagging, zero-shot dependency parsing, automatic summarization, and machine translation.
Provider:
Jožef Stefan Institute, Slovenia
Created:
2022-01-11



