Blog and Q&A Site Normalization Corpus (BQNC)

Source: arXiv. Updated: 2021-04-08. Indexed: 2024-06-21.
Download link:
https://github.com/shigashiyama/jlexnorm
Overview:
The Blog and Q&A Site Normalization Corpus (BQNC) is a publicly available corpus of Japanese user-generated text (UGT), created jointly by the National Institute of Information and Communications Technology and the Nara Institute of Science and Technology. It contains 929 sentences annotated with morphological and normalization information, along with category labels for frequently occurring UGT-specific phenomena. BQNC is intended as a challenging benchmark for evaluating and comparing systems for morphological analysis (MA) and lexical normalization (LN) of UGT. The corpus was built following standards compatible with existing representative corpora, and UGT-specific problems were evaluated in detail. Its intended use is MA and LN research on UGT, with the aim of addressing the weak performance of existing systems on non-general words and non-standard forms.
Provided by:
National Institute of Information and Communications Technology
Created:
2021-04-08