READIN
Source: arXiv. Updated: 2023-05-25. Indexed: 2024-06-21.
Download link:
https://github.com/thunlp/READIN
Resource overview:
READIN is a Chinese multi-task benchmark dataset created by the Natural Language Processing Group at Tsinghua University to simulate the diversity and noise of real-world user input. It covers four distinct tasks, and annotators re-entered the original test data using two common Chinese input methods: pinyin input and speech input. The creation process deliberately varied how the data was entered: annotators were instructed to use different Input Method Editors (IMEs) to introduce keyboard noise, and speakers from different dialect groups were recruited to introduce speech noise. The tasks span areas including semantic parsing and machine reading comprehension, and the benchmark is intended for evaluating model performance and robustness under real-world input noise.
Provided by:
Natural Language Processing Group, Tsinghua University
Created:
2023-02-15
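
The overview describes READIN as a way to evaluate robustness to input noise. One common way to express that is to score a model on the original test inputs and on their re-entered noisy counterparts, then report the difference. The sketch below is only an illustration under stated assumptions: the function names `accuracy` and `robustness_gap`, the toy labels, and the choice of exact-match accuracy as the metric are all hypothetical and are not taken from the READIN repository, whose file format and official metrics should be checked before use.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Exact-match accuracy (an assumed metric, not necessarily READIN's)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def robustness_gap(
    clean_predictions: Sequence[str],
    noisy_predictions: Sequence[str],
    labels: Sequence[str],
) -> Dict[str, float]:
    """Compare scores on clean inputs and on their noisy re-entered versions.

    Both prediction lists are assumed to be aligned with the same labels,
    since the noisy examples are re-entries of the original test data.
    """
    clean = accuracy(clean_predictions, labels)
    noisy = accuracy(noisy_predictions, labels)
    return {"clean": clean, "noisy": noisy, "drop": clean - noisy}


if __name__ == "__main__":
    # Toy, made-up values purely for illustration.
    labels = ["A", "B", "A", "B"]
    clean_predictions = ["A", "B", "A", "B"]
    noisy_predictions = ["A", "B", "B", "B"]
    print(robustness_gap(clean_predictions, noisy_predictions, labels))
```

A larger `drop` would indicate that the model is more sensitive to the keyboard or speech noise; the same comparison can be run separately for each noise type and each task.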



