larrylawl/multilexnorm
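Dataset Overview
Dataset Name
- MultiLexNorm dataset
Dataset Summary
- The MultiLexNorm dataset is the Hugging Face version of the shared-task dataset for multilingual lexical normalization. It provides a public multilingual lexical normalization benchmark covering 13 language variants and proposes a unified evaluation setup with both intrinsic and extrinsic evaluation.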
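To make the intrinsic side of the evaluation concrete, here is a minimal sketch of word-level scoring for a normalization system. It assumes token-aligned input (one output token per raw token) and uses the error reduction rate (ERR), a metric commonly used for lexical normalization; this card does not name the intrinsic metric, so treating ERR as the one used here is an assumption. The function names and the toy sentence are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def word_accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of aligned tokens whose predicted form equals the gold form."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def error_reduction_rate(
    raw: Sequence[str], pred: Sequence[str], gold: Sequence[str]
) -> float:
    """ERR: accuracy gain over a leave-as-is baseline, scaled by the room left.

    Assumed definition: (acc_system - acc_baseline) / (1 - acc_baseline),
    where the baseline simply copies the raw tokens unchanged.
    """
    baseline = word_accuracy(raw, gold)
    system = word_accuracy(pred, gold)
    if baseline == 1.0:
        raise ValueError("no token needs normalization; ERR is undefined")
    return (system - baseline) / (1.0 - baseline)


# Hypothetical toy example, not drawn from the dataset.
raw = ["u", "r", "so", "gr8"]
gold = ["you", "are", "so", "great"]
pred = ["you", "r", "so", "great"]

baseline_acc = word_accuracy(raw, gold)
system_acc = word_accuracy(pred, gold)
err = error_reduction_rate(raw, pred, gold)
print(f"baseline={baseline_acc:.2f} system={system_acc:.2f} ERR={err:.2f}")
```

Under this definition, an ERR of 0 means the system is no better than leaving the text untouched, and a negative value means it introduced more errors than it fixed.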
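Supported Tasks and Leaderboards
- Task type: text generation
- Evaluation metrics: a-LAS, a-UAS, a-POS (extrinsic metrics from dependency parsing and part-of-speech tagging, adapted to account for alignment discrepancies)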
Languages
- Supported languages: English (en), Danish (da), German (de), Spanish (es), Croatian (hr), Italian (it), Dutch (nl), Slovenian (sl), Serbian (sr), Turkish (tr), Indonesian (id)
Dataset Structure
- Data instances: not specified
- Data fields: not specified
- Data splits: not specified
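Dataset Creation
- Data collection and normalization: data is provided for 13 language variants
- Source language producers: not specified
- Annotation process: not specified
- Annotators: not specified
- Personal and sensitive information: not specified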
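Considerations for Using the Data
- Social impact of the dataset: not specified
- Discussion of biases: not specified
- Other known limitations: not specified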
Additional Information
- Dataset curators: not specified
- Licensing information: CC-BY-4.0
- Citation information:
@inproceedings{van-der-goot-etal-2021-multilexnorm,
    title = "{M}ulti{L}ex{N}orm: A Shared Task on Multilingual Lexical Normalization",
    author = {van der Goot, Rob and Ramponi, Alan and Zubiaga, Arkaitz and Plank, Barbara and Muller, Benjamin and San Vicente Roncal, I{\~n}aki and Ljube{\v{s}}i{\'c}, Nikola and {\c{C}}etino{\u{g}}lu, {\"O}zlem and Mahendra, Rahmad and {\c{C}}olako{\u{g}}lu, Talha and Baldwin, Timothy and Caselli, Tommaso and Sidorenko, Wladimir},
    booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)",
    month = nov,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wnut-1.55",
    doi = "10.18653/v1/2021.wnut-1.55",
    pages = "493--509",
    abstract = "Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MultiLexNorm shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 13 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.",
}
Contributors
- Contributor: @larrylawl



