SEACrowd/massive

Name: SEACrowd/massive
Creator: SEACrowd
Published: 2024-06-24 13:32:03
License: Creative Commons Attribution 4.0 (cc-by-4.0)

Source: Hugging Face. Updated 2024-06-24; indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/SEACrowd/massive

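Summary:

The MASSIVE dataset is a multilingual Amazon SLU resource package for slot filling, intent classification, and virtual assistant evaluation. It contains 1 million realistic virtual assistant utterances spanning 18 domains, 60 intents, and 55 slots. MASSIVE was created by having professional translators localize the English-only SLURP dataset into 50 languages, including 8 languages native to Southeast Asia and 2 other languages.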

Provider:

SEACrowd

Summary of original information

Massive dataset overview

Basic information

Name: Massive
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Languages:
- ind
- jav
- khm
- zlm
- mya
- tha
- tgl
- vie
Task categories:
- Intent Classification
- Slot Filling
Tags:
- Intent Classification
- Slot Filling
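
The language entries above are ISO 639-3 codes. As a reading aid, the following minimal sketch maps each code to its English name. The locale tags in the second position are an assumption based on the naming used by the upstream MASSIVE release and are not stated on this card, and `describe` is a hypothetical helper.

```python
# ISO 639-3 codes exactly as listed on the card, with their English names.
# ASSUMPTION: the locale tags (e.g. "id-ID") follow the upstream MASSIVE
# naming; they do not appear on this card and should be verified before use.
SEA_LANGUAGES = {
    "ind": ("Indonesian", "id-ID"),
    "jav": ("Javanese", "jv-ID"),
    "khm": ("Khmer", "km-KH"),
    "zlm": ("Malay", "ms-MY"),
    "mya": ("Burmese", "my-MM"),
    "tha": ("Thai", "th-TH"),
    "tgl": ("Tagalog", "tl-PH"),
    "vie": ("Vietnamese", "vi-VN"),
}


def describe(code: str) -> str:
    """Return a one-line description for a language code (hypothetical helper)."""
    name, locale = SEA_LANGUAGES[code]
    return f"{code}: {name} ({locale})"


if __name__ == "__main__":
    for code in SEA_LANGUAGES:
        print(describe(code))
```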

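Dataset description

Content: The Massive dataset is a multilingual Amazon SLU resource package for slot filling, intent classification, and virtual assistant evaluation. It contains 1 million realistic, parallel, labeled virtual assistant utterances spanning 18 domains, 60 intents, and 55 slots.
Language diversity: The dataset was created by having professional translators localize the English-only SLURP dataset into 50 typologically diverse languages, including 8 native languages and 2 other languages mostly spoken in Southeast Asia.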

Supported tasks

Intent classification
Slot filling
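
To make the two tasks concrete, here is a minimal sketch of how one labeled utterance could be turned into training targets: the intent is a single label for the whole utterance, while slot filling assigns a BIO tag to each token. The record layout (`utt`, `annot_utt`, `intent`), the bracketed `[slot : value]` annotation style, and the example values are assumptions for illustration and are not specified on this card; `to_bio` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# ASSUMPTION: slots are marked inline as "[slot_name : slot value]".
SLOT_PATTERN = re.compile(r"\[\s*([^\[\]:]+?)\s*:\s*([^\[\]]+?)\s*\]")


def to_bio(annotated: str) -> Tuple[List[str], List[str]]:
    """Convert a bracket-annotated utterance into (tokens, BIO tags)."""
    tokens: List[str] = []
    tags: List[str] = []
    pos = 0
    for match in SLOT_PATTERN.finditer(annotated):
        # Words between the previous slot and this one belong to no slot.
        for word in annotated[pos:match.start()].split():
            tokens.append(word)
            tags.append("O")
        slot, value = match.group(1), match.group(2)
        # First word of a slot gets B-, the remaining words get I-.
        for i, word in enumerate(value.split()):
            tokens.append(word)
            tags.append(("B-" if i == 0 else "I-") + slot)
        pos = match.end()
    # Trailing words after the last slot.
    for word in annotated[pos:].split():
        tokens.append(word)
        tags.append("O")
    return tokens, tags


# Hypothetical record; field names and values are illustrative only.
example = {
    "utt": "wake me up at nine am on friday",
    "annot_utt": "wake me up at [time : nine am] on [date : friday]",
    "intent": "alarm_set",
}

tokens, tags = to_bio(example["annot_utt"])

if __name__ == "__main__":
    print("intent:", example["intent"])
    for token, tag in zip(tokens, tags):
        print(f"{token}\t{tag}")
```

An intent classifier would be trained on `utt` with `intent` as the label, and a slot-filling model on the token and tag sequences produced here.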

Dataset version

Source version: 1.1.0
SEACrowd version: 2024.06.20

Citation

@misc{fitzgerald2022massive,
  title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
  author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
  year={2022},
  eprint={2204.08582},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@inproceedings{bastianelli-etal-2020-slurp,
  title = "{SLURP}: A Spoken Language Understanding Resource Package",
  author = "Bastianelli, Emanuele and Vanzo, Andrea and Swietojanski, Pawel and Rieser, Verena",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.emnlp-main.588",
  doi = "10.18653/v1/2020.emnlp-main.588",
  pages = "7252--7262",
  abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}

@article{lovenia2024seacrowd,
  title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
  author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
  year={2024},
  eprint={2406.10118},
  journal={arXiv preprint arXiv:2406.10118}
}
