Blog and Q&A Site Normalization Corpus (BQNC)

Name: Blog and Q&A Site Normalization Corpus (BQNC)
Creator: National Institute of Information and Communications Technology
Published: 2021-04-08 13:53:46
License: not specified

Source: arXiv. Updated: 2021-04-08. Indexed: 2024-06-21.

Download link:

https://github.com/shigashiyama/jlexnorm


Description:

The Blog and Q&A Site Normalization Corpus (BQNC) is a publicly available corpus of Japanese user-generated text (UGT), created jointly by the National Institute of Information and Communications Technology and the Nara Institute of Science and Technology. It contains 929 sentences annotated with morphological and normalization information, along with category labels for frequently occurring UGT-specific phenomena. BQNC is intended as a challenging benchmark for evaluating and comparing morphological analysis (MA) and lexical normalization (LN) systems on UGT. The corpus was built to standards compatible with existing representative corpora, and its construction included a detailed evaluation of UGT-specific problems. Its intended use is MA and LN research on UGT, in particular addressing the weak performance of existing systems on non-general words and non-standard forms.

Provider:

National Institute of Information and Communications Technology

Created:

2021-04-08
