PierreColombo/miam
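Dataset Card for MIAM
Dataset Description
Dataset Summary
Multilingual dIalogAct benchMark (MIAM) is a collection of resources for training, evaluating, and analyzing natural language understanding systems designed specifically for spoken language. The datasets are in English, French, German, Italian, and Spanish, and cover a range of domains including spontaneous speech, scripted scenarios, and joint task completion. All datasets contain dialogue act labels.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
English, French, German, Italian, Spanish.
Dataset Structure
Data Instances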
Dihana Corpus
For the dihana configuration, one example from the dataset is:
```json
{
"Speaker": "U",
"Utterance": "Hola , quería obtener el horario para ir a Valencia",
"Dialogue_Act": 9, # Pregunta (Request)
"Dialogue_ID": "0",
"File_ID": "B209_BA5c3"
}
```
iLISTEN Corpus
For the ilisten configuration, one example from the dataset is:
```json
{
"Speaker": "T_11_U11",
"Utterance": "ok, grazie per le informazioni",
"Dialogue_Act": 6, # KIND-ATTITUDE_SMALL-TALK
"Dialogue_ID": "0"
}
```
LORIA Corpus
For the loria configuration, one example from the dataset is:
```json
{
"Speaker": "Samir",
"Utterance": "Merci de votre visite, bonne chance, et à la prochaine !",
"Dialogue_Act": 21, # quit
"Dialogue_ID": "5",
"File_ID": "Dial_20111128_113927"
}
```
HCRC MapTask Corpus
For the maptask configuration, one example from the dataset is:
```json
{
"Speaker": "f",
"Utterance": "is it underneath the rope bridge or to the left",
"Dialogue_Act": 6, # query_w
"Dialogue_ID": "0",
"File_ID": "q4ec1"
}
```
VERBMOBIL
For the vm2 configuration, one example from the dataset is:
```json
{
"Utterance": "ja was sind viereinhalb Stunden Bahngerüttel gegen siebzig Minuten Turbulenzen im Flugzeug",
"Dialogue_Act": "INFORM",
"Speaker": "A",
"Dialogue_ID": "66"
}
```
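Each instance above is a single utterance, so a dialogue has to be reassembled from rows that share a Dialogue_ID. The following is a minimal sketch of that grouping, not part of the dataset's own tooling. It assumes that rows arrive in conversational order and that Dialogue_ID is unique within the rows being grouped; neither is stated in this card. The function name `group_by_dialogue` and the second and third toy rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Toy rows in the shape shown above. The first row is the dihana example
# from this card; the other two are hypothetical placeholders.
rows = [
    {"Speaker": "U",
     "Utterance": "Hola , quería obtener el horario para ir a Valencia",
     "Dialogue_Act": 9, "Dialogue_ID": "0", "File_ID": "B209_BA5c3"},
    {"Speaker": "M", "Utterance": "(hypothetical system reply)",
     "Dialogue_Act": 10, "Dialogue_ID": "0", "File_ID": "B209_BA5c3"},
    {"Speaker": "U", "Utterance": "(hypothetical utterance, other dialogue)",
     "Dialogue_Act": 1, "Dialogue_ID": "1", "File_ID": "B209_BA5c3"},
]


def group_by_dialogue(rows):
    """Collect (speaker, utterance, act) turns per Dialogue_ID, keeping row order."""
    dialogues = defaultdict(list)
    for row in rows:
        dialogues[row["Dialogue_ID"]].append(
            (row["Speaker"], row["Utterance"], row["Dialogue_Act"])
        )
    return dict(dialogues)


dialogues = group_by_dialogue(rows)
```

If Dialogue_ID values turn out to repeat across source files, grouping on the pair (File_ID, Dialogue_ID) would be the safer key for the configurations that provide File_ID.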
Data Fields
For the dihana configuration, the fields are:
- Speaker: identifier of the speaker, as a string.
- Utterance: the utterance, as a string.
- Dialogue_Act: dialogue act label of the utterance. It can be one of Afirmacion (0) [Feedback_positive], Apertura (1) [Opening], Cierre (2) [Closing], Confirmacion (3) [Acknowledge], Espera (4) [Hold], Indefinida (5) [Undefined], Negacion (6) [Feedback_negative], No_entendido (7) [Request_clarify], Nueva_consulta (8) [New_request], Pregunta (9) [Request] or Respuesta (10) [Reply].
- Dialogue_ID: identifier of the dialogue, as a string.
- File_ID: identifier of the source file, as a string.
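The integer-to-label mapping for dihana listed above can be written down as a small lookup table. This is a sketch built only from the label list in this card; the names `DIHANA_LABELS`, `DIHANA_GLOSSES`, and `decode_dihana` are hypothetical. If the data is loaded through the Hugging Face `datasets` library, the `ClassLabel` feature may already expose an equivalent mapping, which would be preferable to a hand-written table.

```python
# Spanish label names, indexed by the integer id given in this card.
DIHANA_LABELS = [
    "Afirmacion", "Apertura", "Cierre", "Confirmacion", "Espera",
    "Indefinida", "Negacion", "No_entendido", "Nueva_consulta",
    "Pregunta", "Respuesta",
]

# English glosses shown in square brackets in the field description.
DIHANA_GLOSSES = [
    "Feedback_positive", "Opening", "Closing", "Acknowledge", "Hold",
    "Undefined", "Feedback_negative", "Request_clarify", "New_request",
    "Request", "Reply",
]


def decode_dihana(label_id):
    """Return (label name, English gloss) for a dihana Dialogue_Act id."""
    if not 0 <= label_id < len(DIHANA_LABELS):
        raise ValueError(f"unknown dihana label id: {label_id}")
    return DIHANA_LABELS[label_id], DIHANA_GLOSSES[label_id]


# The dihana example in this card carries Dialogue_Act 9.
name, gloss = decode_dihana(9)
```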
For the ilisten configuration, the fields are:
- Speaker: identifier of the speaker, as a string.
- Utterance: the utterance, as a string.
- Dialogue_Act: dialogue act label of the utterance. It can be one of AGREE (0), ANSWER (1), CLOSING (2), ENCOURAGE-SORRY (3), GENERIC-ANSWER (4), INFO-REQUEST (5), KIND-ATTITUDE_SMALL-TALK (6), OFFER-GIVE-INFO (7), OPENING (8), PERSUASION-SUGGEST (9), QUESTION (10), REJECT (11), SOLICITATION-REQ_CLARIFICATION (12), STATEMENT (13) or TALK-ABOUT-SELF (14).
- Dialogue_ID: identifier of the dialogue, as a string.
For the loria configuration, the fields are:
- Speaker: identifier of the speaker, as a string.
- Utterance: the utterance, as a string.
- Dialogue_Act: dialogue act label of the utterance. It can be one of ack (0), ask (1), find_mold (2), find_plans (3), first_step (4), greet (5), help (6), inform (7), inform_engine (8), inform_job (9), inform_material_space (10), informer_conditioner (11), informer_decoration (12), informer_elcomps (13), informer_end_manufacturing (14), kindAtt (15), manufacturing_reqs (16), next_step (17), no (18), other (19), quality_control (20), quit (21), reqRep (22), security_policies (23), staff_enterprise (24), staff_job (25), studies_enterprise (26), studies_job (27), todo_failure (28), todo_irreparable (29) or yes (30).
- Dialogue_ID: identifier of the dialogue, as a string.
- File_ID: identifier of the source file, as a string.
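For the maptask configuration, the fields are:
- Speaker: identifier of the speaker, as a string.
- Utterance: the utterance, as a string.
- Dialogue_Act: dialogue act label of the utterance. It can be one of acknowledge (0), align (1), check (2), clarify (3), explain (4), instruct (5), query_w (6), query_yn (7), ready (8), reply_n (9), reply_w (10) or reply_y (11).
- Dialogue_ID: identifier of the dialogue, as a string.
- File_ID: identifier of the source file, as a string.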
For the vm2 configuration, the fields are:
- Utterance: the utterance, as a string.
- Dialogue_Act: dialogue act label of the utterance. It can be one of ACCEPT (0), BACKCHANNEL (1), BYE (2), CLARIFY (3), CLOSE (4), COMMIT (5), CONFIRM (6), DEFER (7), DELIBERATE (8), DEVIATE_SCENARIO (9), EXCLUDE (10), EXPLAINED_REJECT (11), FEEDBACK (12), FEEDBACK_NEGATIVE (13), FEEDBACK_POSITIVE (14), GIVE_REASON (15), GREET (16), INFORM (17), INIT (18), INTRODUCE (19), NOT_CLASSIFIABLE (20), OFFER (21), POLITENESS_FORMULA (22), REJECT (23), REQUEST (24), REQUEST_CLARIFY (25), REQUEST_COMMENT (26), REQUEST_COMMIT (27), REQUEST_SUGGEST (28), SUGGEST (29) or THANK (30).
- Speaker: the speaker, as a string.
- Dialogue_ID: identifier of the dialogue, as a string.
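The field lists above differ between configurations: dihana, loria, and maptask include File_ID, while ilisten and vm2 do not. Code that handles several configurations at once may want to check rows against the expected field set. The following is a rough sketch of such a check; `EXPECTED_FIELDS` and `missing_fields` are hypothetical names, and the field sets are transcribed from this card rather than read from the loaded data.

```python
COMMON = {"Speaker", "Utterance", "Dialogue_Act", "Dialogue_ID"}

# Field sets per configuration, as described in this card.
EXPECTED_FIELDS = {
    "dihana": COMMON | {"File_ID"},
    "ilisten": COMMON,
    "loria": COMMON | {"File_ID"},
    "maptask": COMMON | {"File_ID"},
    "vm2": COMMON,
}


def missing_fields(config, row):
    """Return the sorted names of expected fields that are absent from row."""
    return sorted(EXPECTED_FIELDS[config] - set(row))


# The ilisten example from this card.
ilisten_row = {
    "Speaker": "T_11_U11",
    "Utterance": "ok, grazie per le informazioni",
    "Dialogue_Act": 6,
    "Dialogue_ID": "0",
}

ok = missing_fields("ilisten", ilisten_row)      # nothing missing
gap = missing_fields("dihana", ilisten_row)      # File_ID is missing
```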
Data Splits
| Dataset name | Train | Valid | Test |
|---|---|---|---|
| dihana | 19063 | 2123 | 2361 |
| ilisten | 1986 | 230 | 971 |
| loria | 8465 | 942 | 1047 |
| maptask | 25382 | 5221 | 5335 |
| vm2 | 25060 | 2860 | 2855 |
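The split sizes in the table can be turned into totals and proportions with a few lines of arithmetic. The sketch below uses only the counts from the table; `SPLITS` and `split_shares` are hypothetical names.

```python
# (train, valid, test) utterance counts copied from the table above.
SPLITS = {
    "dihana": (19063, 2123, 2361),
    "ilisten": (1986, 230, 971),
    "loria": (8465, 942, 1047),
    "maptask": (25382, 5221, 5335),
    "vm2": (25060, 2860, 2855),
}


def split_shares(config):
    """Return (total, (train, valid, test) fractions) for one configuration."""
    counts = SPLITS[config]
    total = sum(counts)
    return total, tuple(n / total for n in counts)


for config in SPLITS:
    total, (train, valid, test) = split_shares(config)
    print(f"{config:8s} total={total:6d} "
          f"train={train:.1%} valid={valid:.1%} test={test:.1%}")
```

The proportions are not uniform across configurations. For instance, ilisten holds out 971 of its 3187 utterances for testing, a noticeably larger share than the other configurations.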
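Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
Anonymous.
Licensing Information
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License.
Citation Information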
```
@inproceedings{colombo-etal-2021-code,
    title = "Code-switched inspired losses for spoken dialog representations",
    author = "Colombo, Pierre and
      Chapuis, Emile and
      Labeau, Matthieu and
      Clavel, Chlo{\'e}",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.656",
    doi = "10.18653/v1/2021.emnlp-main.656",
    pages = "8320--8337",
    abstract = "Spoken dialogue systems need to be able to handle both multiple languages and multilinguality inside a conversation (\textit{e.g} in case of code-switching). In this work, we introduce new pretraining losses tailored to learn generic multilingual spoken dialogue representations. The goal of these losses is to expose the model to code-switched language. In order to scale up training, we automatically build a pretraining corpus composed of multilingual conversations in five different languages (French, Italian, English, German and Spanish) from OpenSubtitles, a huge multilingual corpus composed of 24.3G tokens. We test the generic representations on MIAM, a new benchmark composed of five dialogue act corpora on the same aforementioned languages as well as on two novel multilingual tasks (\textit{i.e} multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new losses achieve a better performance in both monolingual and multilingual settings.",
}
```
Contributions
Thanks to @eusip and @PierreColombo for adding this dataset.



