SEACrowd/massive
收藏Massive 数据集概述
基本信息
- 名称: Massive
- 许可证: Creative Commons Attribution 4.0 (cc-by-4.0)
- 语言:
- ind
- jav
- khm
- zlm
- mya
- tha
- tgl
- vie
- 任务类别:
- 意图分类 (Intent Classification)
- 槽填充 (Slot Filling)
- 标签:
- 意图分类
- 槽填充
数据集描述
- 内容: Massive 数据集是一个多语言的 Amazon SLURP 资源包,用于槽填充、意图分类和虚拟助手评估。该数据集包含 100 万条现实、并行、标记的虚拟助手话语,涵盖 18 个领域、60 个意图和 55 个槽。
- 语言多样性: 该数据集由专业翻译人员将仅限英语的 SLURP 数据集本地化为 50 种类型学上多样化的语言,包括 8 种本地语言和 2 种主要在东南亚使用的其他语言。
支持任务
- 意图分类
- 槽填充
数据集版本
- 源版本: 1.1.0
- SEACrowd 版本: 2024.06.20
引用
@misc{fitzgerald2022massive, title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages}, author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan}, year={2022}, eprint={2204.08582}, archivePrefix={arXiv}, primaryClass={cs.CL} } @inproceedings{bastianelli-etal-2020-slurp, title = "{SLURP}: A Spoken Language Understanding Resource Package", author = "Bastianelli, Emanuele and Vanzo, Andrea and Swietojanski, Pawel and Rieser, Verena", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2020.emnlp-main.588", doi = "10.18653/v1/2020.emnlp-main.588", pages = "7252--7262", abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp." }
@article{lovenia2024seacrowd, title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages}, author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya}, year={2024}, eprint={2406.10118}, journal={arXiv preprint arXiv: 2406.10118} }



