PierreColombo/miam

Name: PierreColombo/miam
Creator: PierreColombo
Published: 2024-01-18 11:09:00
License: no description provided

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/PierreColombo/miam


Overview:

MIAM is a multilingual dialogue act annotation dataset covering English, French, German, Italian, and Spanish. It ships as five configurations (dihana, ilisten, loria, maptask, vm2), each with its own dialogue act label set and record layout. It is intended for training, evaluating, and analyzing natural language understanding systems, particularly for spoken language. Records carry fields such as speaker, utterance, dialogue act label, dialogue ID, and file ID, and each configuration provides training, validation, and test splits.

Provided by:

PierreColombo

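Summary of Original Information

Dataset Card for MIAM

Dataset Description

Dataset Summary

The Multilingual dIalogAct benchMark (MIAM) is a collection of resources for training, evaluating, and analyzing natural language understanding systems designed specifically for spoken language. The datasets are in English, French, German, Italian, and Spanish, and cover a range of domains including spontaneous speech, scripted scenarios, and joint task completion. All datasets contain dialogue act labels.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English, French, German, Italian, Spanish.

Dataset Structure

Data Instances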

Dihana Corpus

For the dihana configuration, one example from the dataset is:

    {
      "Speaker": "U",
      "Utterance": "Hola , quería obtener el horario para ir a Valencia",
      "Dialogue_Act": 9,  # Pregunta (Request)
      "Dialogue_ID": "0",
      "File_ID": "B209_BA5c3"
    }

iLISTEN Corpus

For the ilisten configuration, one example from the dataset is:

    {
      "Speaker": "T_11_U11",
      "Utterance": "ok, grazie per le informazioni",
      "Dialogue_Act": 6,  # KIND-ATTITUDE_SMALL-TALK
      "Dialogue_ID": "0"
    }

LORIA Corpus

For the loria configuration, one example from the dataset is:

    {
      "Speaker": "Samir",
      "Utterance": "Merci de votre visite, bonne chance, et à la prochaine !",
      "Dialogue_Act": 21,  # quit
      "Dialogue_ID": "5",
      "File_ID": "Dial_20111128_113927"
    }

HCRC MapTask Corpus

For the maptask configuration, one example from the dataset is:

    {
      "Speaker": "f",
      "Utterance": "is it underneath the rope bridge or to the left",
      "Dialogue_Act": 6,  # query_w
      "Dialogue_ID": "0",
      "File_ID": "q4ec1"
    }

VERBMOBIL

For the vm2 configuration, one example from the dataset is:

    {
      "Utterance": "ja was sind viereinhalb Stunden Bahngerüttel gegen siebzig Minuten Turbulenzen im Flugzeug",
      "Dialogue_Act": "INFORM",
      "Speaker": "A",
      "Dialogue_ID": "66"
    }
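Each record is a single utterance, so rebuilding a conversation means grouping records by Dialogue_ID. The following is a minimal sketch, assuming the records arrive as plain dictionaries in spoken order and that Dialogue_ID alone identifies a conversation within one configuration and split (the card does not confirm either point). The helper name `group_by_dialogue` and the second sample record are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_dialogue(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Collect utterance records into dialogues keyed by Dialogue_ID.

    Assumption: Dialogue_ID alone identifies a conversation within one
    configuration and split, and records are already in spoken order.
    """
    dialogues: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        dialogues[record["Dialogue_ID"]].append(record)
    return dict(dialogues)


# The first record is the maptask example from the card; the second is made up.
sample = [
    {"Speaker": "f", "Utterance": "is it underneath the rope bridge or to the left",
     "Dialogue_Act": 6, "Dialogue_ID": "0", "File_ID": "q4ec1"},
    {"Speaker": "g", "Utterance": "to the left",
     "Dialogue_Act": 10, "Dialogue_ID": "0", "File_ID": "q4ec1"},
]
dialogues = group_by_dialogue(sample)
print(len(dialogues["0"]))
```

If Dialogue_ID turns out to repeat across source files, keying on the pair (File_ID, Dialogue_ID) would be the safer choice for the configurations that provide File_ID.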

Data Fields

For the dihana configuration, the fields are:

Speaker: identifier of the speaker, as a string.
Utterance: the utterance, as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of Afirmacion (0) [Feedback_positive], Apertura (1) [Opening], Cierre (2) [Closing], Confirmacion (3) [Acknowledge], Espera (4) [Hold], Indefinida (5) [Undefined], Negacion (6) [Feedback_negative], No_entendido (7) [Request_clarify], Nueva_consulta (8) [New_request], Pregunta (9) [Request], or Respuesta (10) [Reply].
Dialogue_ID: identifier of the dialogue, as a string.
File_ID: identifier of the source file, as a string.
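The integer ids in Dialogue_Act only become readable once they are mapped back to label names. A small sketch of that lookup for dihana, using the names and English glosses exactly as listed above; the function name `decode_dihana` is hypothetical and not part of the dataset.

```python
# Label names in id order, copied from the dihana field description.
DIHANA_LABELS = [
    "Afirmacion", "Apertura", "Cierre", "Confirmacion", "Espera",
    "Indefinida", "Negacion", "No_entendido", "Nueva_consulta",
    "Pregunta", "Respuesta",
]
# English glosses given in square brackets in the same description.
DIHANA_GLOSSES = [
    "Feedback_positive", "Opening", "Closing", "Acknowledge", "Hold",
    "Undefined", "Feedback_negative", "Request_clarify", "New_request",
    "Request", "Reply",
]


def decode_dihana(label_id: int) -> str:
    """Return 'Name (Gloss)' for a dihana label id, or raise ValueError."""
    if not 0 <= label_id < len(DIHANA_LABELS):
        raise ValueError(f"unknown dihana label id: {label_id}")
    return f"{DIHANA_LABELS[label_id]} ({DIHANA_GLOSSES[label_id]})"


# The dihana example record from the card.
example = {
    "Speaker": "U",
    "Utterance": "Hola , quería obtener el horario para ir a Valencia",
    "Dialogue_Act": 9,
    "Dialogue_ID": "0",
    "File_ID": "B209_BA5c3",
}
print(decode_dihana(example["Dialogue_Act"]))  # Pregunta (Request)
```

The same pattern applies to the other configurations by swapping in their label lists.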

For the ilisten configuration, the fields are:

Speaker: identifier of the speaker, as a string.
Utterance: the utterance, as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of AGREE (0), ANSWER (1), CLOSING (2), ENCOURAGE-SORRY (3), GENERIC-ANSWER (4), INFO-REQUEST (5), KIND-ATTITUDE_SMALL-TALK (6), OFFER-GIVE-INFO (7), OPENING (8), PERSUASION-SUGGEST (9), QUESTION (10), REJECT (11), SOLICITATION-REQ_CLARIFICATION (12), STATEMENT (13), or TALK-ABOUT-SELF (14).
Dialogue_ID: identifier of the dialogue, as a string.

For the loria configuration, the fields are:

Speaker: identifier of the speaker, as a string.
Utterance: the utterance, as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of ack (0), ask (1), find_mold (2), find_plans (3), first_step (4), greet (5), help (6), inform (7), inform_engine (8), inform_job (9), inform_material_space (10), informer_conditioner (11), informer_decoration (12), informer_elcomps (13), informer_end_manufacturing (14), kindAtt (15), manufacturing_reqs (16), next_step (17), no (18), other (19), quality_control (20), quit (21), reqRep (22), security_policies (23), staff_enterprise (24), staff_job (25), studies_enterprise (26), studies_job (27), todo_failure (28), todo_irreparable (29), or yes (30).
Dialogue_ID: identifier of the dialogue, as a string.
File_ID: identifier of the source file, as a string.

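For the maptask configuration, the fields are:

Speaker: identifier of the speaker, as a string.
Utterance: the utterance, as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of acknowledge (0), align (1), check (2), clarify (3), explain (4), instruct (5), query_w (6), query_yn (7), ready (8), reply_n (9), reply_w (10), or reply_y (11).
Dialogue_ID: identifier of the dialogue, as a string.
File_ID: identifier of the source file, as a string.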

For the vm2 configuration, the fields are:

Utterance: the utterance, as a string.
Dialogue_Act: dialogue act label of the utterance. It can be one of ACCEPT (0), BACKCHANNEL (1), BYE (2), CLARIFY (3), CLOSE (4), COMMIT (5), CONFIRM (6), DEFER (7), DELIBERATE (8), DEVIATE_SCENARIO (9), EXCLUDE (10), EXPLAINED_REJECT (11), FEEDBACK (12), FEEDBACK_NEGATIVE (13), FEEDBACK_POSITIVE (14), GIVE_REASON (15), GREET (16), INFORM (17), INIT (18), INTRODUCE (19), NOT_CLASSIFIABLE (20), OFFER (21), POLITENESS_FORMULA (22), REJECT (23), REQUEST (24), REQUEST_CLARIFY (25), REQUEST_COMMENT (26), REQUEST_COMMIT (27), REQUEST_SUGGEST (28), SUGGEST (29), or THANK (30).
Speaker: the speaker, as a string.
Dialogue_ID: identifier of the dialogue, as a string.
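Because the configurations differ in whether they carry File_ID and in how many labels they define, a per-configuration record check can catch mix-ups early. The sketch below encodes the field sets and label counts from the descriptions above; `check_record` is a hypothetical helper, and it range-checks only integer labels, since the vm2 example in the card shows the label as a string ("INFORM") rather than an id.

```python
from typing import Dict, List, Set

COMMON: Set[str] = {"Speaker", "Utterance", "Dialogue_Act", "Dialogue_ID"}

# Field sets per configuration, as listed in the field descriptions.
CONFIG_FIELDS: Dict[str, Set[str]] = {
    "dihana": COMMON | {"File_ID"},
    "ilisten": COMMON,
    "loria": COMMON | {"File_ID"},
    "maptask": COMMON | {"File_ID"},
    "vm2": COMMON,
}

# Number of dialogue act labels per configuration (ids run from 0).
NUM_LABELS: Dict[str, int] = {
    "dihana": 11, "ilisten": 15, "loria": 31, "maptask": 12, "vm2": 31,
}


def check_record(config: str, record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    expected = CONFIG_FIELDS[config]
    present = set(record)
    problems = [f"missing field: {name}" for name in sorted(expected - present)]
    problems += [f"unexpected field: {name}" for name in sorted(present - expected)]
    label = record.get("Dialogue_Act")
    # Only integer labels are range-checked; string labels pass through.
    if isinstance(label, int) and not 0 <= label < NUM_LABELS[config]:
        problems.append(f"label id out of range: {label}")
    return problems


# The loria example record from the card.
loria_example = {
    "Speaker": "Samir",
    "Utterance": "Merci de votre visite, bonne chance, et à la prochaine !",
    "Dialogue_Act": 21,
    "Dialogue_ID": "5",
    "File_ID": "Dial_20111128_113927",
}
print(check_record("loria", loria_example))  # []
```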

Data Splits

Dataset name	Train	Validation	Test
dihana	19063	2123	2361
ilisten	1986	230	971
loria	8465	942	1047
maptask	25382	5221	5335
vm2	25060	2860	2855
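The split sizes in the table can double as a sanity check after loading. The sketch below stores the table and computes each split's share; the loader assumes the Hugging Face `datasets` library is installed, that the repository id `PierreColombo/miam` loads with `load_dataset` using the configuration name, and that the splits are named train, validation, and test. None of this is confirmed by the card, and depending on the installed library version, loading may need extra arguments. `split_fractions` and `load_and_check` are hypothetical helper names.

```python
from typing import Dict

# Split sizes copied from the table above.
EXPECTED_SPLITS: Dict[str, Dict[str, int]] = {
    "dihana": {"train": 19063, "validation": 2123, "test": 2361},
    "ilisten": {"train": 1986, "validation": 230, "test": 971},
    "loria": {"train": 8465, "validation": 942, "test": 1047},
    "maptask": {"train": 25382, "validation": 5221, "test": 5335},
    "vm2": {"train": 25060, "validation": 2860, "test": 2855},
}


def split_fractions(config: str) -> Dict[str, float]:
    """Share of each split in the configuration's total utterance count."""
    sizes = EXPECTED_SPLITS[config]
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_and_check(config: str, repo_id: str = "PierreColombo/miam") -> Dict[str, bool]:
    """Load one configuration and compare split sizes with the table.

    Assumption: the repository loads through datasets.load_dataset and
    exposes splits named train, validation, and test.
    """
    if config not in EXPECTED_SPLITS:
        raise ValueError(f"unknown configuration: {config}")
    from datasets import load_dataset  # imported lazily; needs network access

    dataset = load_dataset(repo_id, config)
    return {
        split: len(dataset[split]) == expected
        for split, expected in EXPECTED_SPLITS[config].items()
    }


for name in EXPECTED_SPLITS:
    print(name, {k: round(v, 3) for k, v in split_fractions(name).items()})
```

The fractions show that the splits are not uniform across configurations: ilisten reserves roughly 30% of its utterances for testing, while dihana, loria, and vm2 keep their test sets near 10%.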

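Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]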

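Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]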

Additional Information

Dataset Curators

Anonymous.

Licensing Information

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License.

Citation Information

    @inproceedings{colombo-etal-2021-code,
        title = "Code-switched inspired losses for spoken dialog representations",
        author = "Colombo, Pierre and Chapuis, Emile and Labeau, Matthieu and Clavel, Chlo{\'e}",
        booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
        month = nov,
        year = "2021",
        address = "Online and Punta Cana, Dominican Republic",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2021.emnlp-main.656",
        doi = "10.18653/v1/2021.emnlp-main.656",
        pages = "8320--8337",
        abstract = "Spoken dialogue systems need to be able to handle both multiple languages and multilinguality inside a conversation (\textit{e.g} in case of code-switching). In this work, we introduce new pretraining losses tailored to learn generic multilingual spoken dialogue representations. The goal of these losses is to expose the model to code-switched language. In order to scale up training, we automatically build a pretraining corpus composed of multilingual conversations in five different languages (French, Italian, English, German and Spanish) from OpenSubtitles, a huge multilingual corpus composed of 24.3G tokens. We test the generic representations on MIAM, a new benchmark composed of five dialogue act corpora on the same aforementioned languages as well as on two novel multilingual tasks (\textit{i.e} multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new losses achieve a better performance in both monolingual and multilingual settings.",
    }

Contributions

Thanks to @eusip and @PierreColombo for adding this dataset.
