Vietnamese Dialogue Rerank Dataset

Name: Vietnamese Dialogue Rerank Dataset
Creator: maas
Published: 2026-05-15 14:28:07
License: not specified

Source: ModelScope community. Updated: 2026-05-15. Indexed: 2024-05-15.

Download link:

https://modelscope.cn/datasets/DAMO_ConvAI/ViDoc2BotRerank


Resource description:

### Clone with HTTP

```bash
git clone https://www.modelscope.cn/datasets/AronXiang/ViDoc2BotRetrieval.git
```

# Document-grounded dialogue

Goal-oriented document-grounded dialogue (DGD) systems enable end users to interactively query domain-specific information based on given documents. Querying document knowledge through conversational systems continues to attract attention from both the research and industrial communities for a variety of applications. Previous work addressed English and Chinese document-grounded dialogue systems, leaving other languages less well explored. As a result, large communities of users are denied access to automated services and information. We aim to extend this effort by introducing the Third ACL DialDoc Workshop shared task, which involves documents and dialogues in diverse languages. We present this multilingual DGD challenge to encourage researchers to explore effective solutions for (1) transferring a DGD model from a high-resource language to a low-resource language; (2) developing a DGD model that is capable of providing multilingual responses given multilingual documents.

### Description

Specifically, we provide 797 dialogues in Vietnamese (3,446 turns), 816 dialogues in French (3,510 turns), and a corpus of 17,272 paragraphs, where each dialogue turn is grounded in a paragraph from the corpus. We also organize the currently available Chinese and English document-grounded dialogue data. We hope that participants can leverage linguistic similarities to improve their models' performance in Vietnamese and French: for example, a large number of Vietnamese words are derived from Chinese, and English and French both belong to the Indo-European language family. The task objective is to rerank relevant paragraphs from the corpus based on the dialogue history and to generate a response.

To address this task, we provide a baseline model consisting of three modules: a retrieval module that retrieves the top-K relevant paragraphs from the corpus based on the dialogue history, a rerank module that ranks the top-N most relevant paragraphs, and a generation module that concatenates them with the dialogue history to generate a response. **This project contains the Vietnamese data for fine-tuning the rerank module.**

### Dataset Format

Each record contains three attributes: query, positive, and negative.

The query is a concatenation of the conversation history in reverse order, with the last turn marked as `<last_turn>` and the remaining turns marked with `<user>` for user input and `<agent>` for system output. For example:

'<last_turn> Ai đã giới thiệu một trong những hệ thống phúc lợi đầu tiên cho giai cấp công nhân vào năm 1883? <agent> Đại khủng hoảng, khi các biện pháp cứu trợ khẩn cấp đã được giới thiệu dưới thời Tổng thống Franklin D. Roosevelt. <user> Khi nào Hoa Kỳ có một hệ thống phúc lợi xã hội có tổ chức?'

"Positive" refers to the positive sample, that is, the passage labeled as the grounding target of the dialogue turn. The passage text is followed by its titles, concatenated in reverse order and separated by `//`. For example:

'Otto von Bismarck, Thủ tướng Đức, giới thiệu một trong những hệ thống phúc lợi đầu tiên cho các tầng lớp lao động vào năm 1883. // Lịch sử[sửa | sửa mã nguồn] // An sinh xã hội – Wikipedia tiếng Việt // vi-SocialSecurity'

"Negative" refers to the negative sample, which is obtained by retrieving the passage with the highest BM25 score for the dialogue history, excluding the positive sample. The format is the same as for positive samples.

### Loading the dataset

Detailed instructions for loading and using the dataset through the MaaS/Dataset SDK, for example with code samples, are to be provided here.
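The query and passage formats described above can be unpacked with a few lines of Python. The following is a minimal sketch, not part of the dataset's tooling: the helper names `parse_query` and `parse_passage` are hypothetical, and splitting on `" // "` (with surrounding spaces) is an assumption based on the example shown in the card.

```python
import re
from typing import Dict, List, Tuple

# Role markers named in the dataset card.
ROLE_PATTERN = re.compile(r"<(last_turn|user|agent)>")


def parse_query(query: str) -> List[Tuple[str, str]]:
    """Split a query string into (role, utterance) pairs, oldest turn first."""
    parts = ROLE_PATTERN.split(query)
    # With one capture group, re.split yields [prefix, role, text, role, text, ...].
    roles, texts = parts[1::2], parts[2::2]
    turns = [(role, text.strip()) for role, text in zip(roles, texts)]
    # The stored order is newest-first, so reverse it to read chronologically.
    return turns[::-1]


def parse_passage(passage: str) -> Dict[str, object]:
    """Split a positive/negative string into its text and trailing titles.

    Assumption: fields are separated by ' // ' as in the card's example.
    A passage whose body itself contains ' // ' would be split incorrectly.
    """
    fields = [field.strip() for field in passage.split(" // ")]
    return {"text": fields[0], "titles": fields[1:]}


if __name__ == "__main__":
    sample_query = (
        "<last_turn> Ai đã giới thiệu một trong những hệ thống phúc lợi đầu tiên "
        "cho giai cấp công nhân vào năm 1883? "
        "<agent> Đại khủng hoảng, khi các biện pháp cứu trợ khẩn cấp đã được giới "
        "thiệu dưới thời Tổng thống Franklin D. Roosevelt. "
        "<user> Khi nào Hoa Kỳ có một hệ thống phúc lợi xã hội có tổ chức?"
    )
    for role, utterance in parse_query(sample_query):
        print(role, "->", utterance)
```

Reversing the turns is only a convenience for reading or inspection; a reranker fine-tuned on this data would presumably consume the query string in its stored newest-first form.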


Provider:

maas

Created:

2023-02-17

Collected summary

Dataset introduction

Background and challenges

Background overview

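This dataset focuses on Vietnamese document-grounded dialogue. It provides 797 dialogues and a corpus of 17,272 paragraphs for training a rerank module that improves paragraph retrieval based on the dialogue history. Each record consists of a query, a positive sample, and a negative sample, and the dataset is intended to support the development of multilingual dialogue systems.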
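One common way to use query/positive/negative records for reranker training is to flatten each record into labeled query-passage pairs. The sketch below assumes a pointwise binary setup (label 1 for the positive, 0 for the negative). The baseline's actual training objective is not stated here, and the helper name `to_labelled_pairs` and the placeholder record are hypothetical.

```python
from typing import Dict, List, Tuple


def to_labelled_pairs(record: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Turn one record into (query, passage, label) rows.

    Label 1 marks the grounding passage and label 0 the BM25 hard negative.
    """
    return [
        (record["query"], record["positive"], 1),
        (record["query"], record["negative"], 0),
    ]


if __name__ == "__main__":
    # Placeholder record; real records carry Vietnamese text in these fields.
    record = {
        "query": "<last_turn> question text",
        "positive": "grounding passage // section // page // doc-id",
        "negative": "hard negative passage // section // page // doc-id",
    }
    for query, passage, label in to_labelled_pairs(record):
        print(label, "|", passage)
```

A pairwise or contrastive objective would use the same three fields but keep the positive and negative together per query instead of splitting them into separate rows.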

The content above was collected and summarized by 遇见数据集.