smalltalk

Hugging Face2025-04-21 更新2025-04-22 收录

下载链接：

https://huggingface.co/datasets/SmallDoge/smalltalk

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集包含多个配置，每个配置都有其特定的名称、特征、分割方式、下载大小和数据集大小。每个配置都包含一个名为'messages'的特征，其中包含'content'和'role'作为字符串数据类型。此外，README还指定了每个分割（测试和训练）的字节数和示例数量，并提供了下载数据文件的路径。

This dataset contains multiple configurations, each with its specific name, features, splitting method, download size, and dataset size. Each configuration includes a feature named 'messages' that contains 'content' and 'role' with string data types. Additionally, the README specifies the byte count and sample count for each split (test and training), and provides the path for downloading the data files.

创建时间：

2025-04-20

原始信息汇总

数据集概述

基本信息

数据集名称：SmallDoge/smalltalk
包含多个子数据集，涵盖多种语言和主题

子数据集列表

英语子数据集

apigen-80k-en
- 特征：messages (content, role)
- 数据量：train(83,144), test(4,377)
- 大小：188MB(train), 9.5MB(test)
chinese_traditional-en
- 特征：messages (content, role)
- 数据量：train(1,093), test(10)
- 大小：943KB(train), 9KB(test)
coig_pc-en
- 特征：messages (content, role)
- 数据量：train(2,962), test(10)
- 大小：4.2MB(train), 5KB(test)
douban-en
- 特征：messages (content, role)
- 数据量：train(2,874), test(10)
- 大小：8MB(train), 19KB(test)
everyday-conversations-en
- 特征：full_topic, messages (content, role)
- 数据量：train(2,260), test(119)
- 大小：1.9MB(train), 100KB(test)
exam-en
- 特征：messages (content, role)
- 数据量：train(4,808), test(10)
- 大小：7.5MB(train), 11KB(test)
explore-instruct-rewriting-en
- 特征：messages (content, role)
- 数据量：train(30,400), test(1,600)
- 大小：12.6MB(train), 665KB(test)
finance-en
- 特征：messages (content, role)
- 数据量：train(9,159), test(10)
- 大小：70MB(train), 78KB(test)
human_value-en
- 特征：messages (content, role)
- 数据量：train(996), test(10)
- 大小：890KB(train), 8KB(test)
logi_qa-en
- 特征：messages (content, role)
- 数据量：train(409), test(10)
- 大小：572KB(train), 11KB(test)
longalign-en
- 特征：tokens, messages (content, role)
- 数据量：train(3,547), test(187)
- 大小：139MB(train), 7.5MB(test)
metamathqa-50k-en
- 特征：type, messages (content, role)
- 数据量：train(47,500), test(2,500)
- 大小：36MB(train), 1.8MB(test)
numina-cot-100k-en
- 特征：source, messages (content, role)
- 数据量：train(106,147), test(5,587)
- 大小：156MB(train), 8.3MB(test)
ruozhiba-en
- 特征：messages (content, role)
- 数据量：train(230), test(10)
- 大小：242KB(train), 11KB(test)
segmentfault-en
- 特征：messages (content, role)
- 数据量：train(447), test(10)
- 大小：843KB(train), 12KB(test)
self-oss-instruct-en
- 特征：messages (content, role)
- 数据量：train(48,127), test(2,534)
- 大小：63MB(train), 3.3MB(test)
systemchats-30k-en
- 特征：messages (content, role)
- 数据量：train(34,133), test(1,797)
- 大小：89MB(train), 4.7MB(test)
wiki-en
- 特征：messages (content, role)
- 数据量：train(9,778), test(10)
- 大小：35MB(train), 53KB(test)
wikihow-en
- 特征：messages (content, role)
- 数据量：train(1,163), test(10)
- 大小：9.3MB(train), 93KB(test)
xhs-en
- 特征：messages (content, role)
- 数据量：train(1,265), test(10)
- 大小：4.2MB(train), 33KB(test)
zhihu-en
- 特征：messages (content, role)
- 数据量：train(5,353), test(10)
- 大小：16MB(train), 29KB(test)

中文子数据集

apigen-80k-zh
- 特征：messages (content, role)
- 数据量：test(4,363)
- 大小：9MB(test)
chinese_traditional-zh
- 特征：messages (content, role)
- 数据量：train(1,101), test(10)
- 大小：625KB(train), 5KB(test)
coig_pc-zh
- 特征：messages (content, role)
- 数据量：train(2,990), test(10)
- 大小：3.8MB(train), 13KB(test)
douban-zh
- 特征：messages (content, role)
- 数据量：train(3,076), test(10)
- 大小：5MB(train), 16KB(test)
everyday-conversations-zh
- 特征：messages (content, role)
- 数据量：train(2,258), test(119)
- 大小：1.8MB(train), 98KB(test)
exam-zh
- 特征：messages (content, role)
- 数据量：train(4,846), test(10)
- 大小：5.2MB(train), 10KB(test)
explore-instruct-rewriting-zh
- 特征：messages (content, role)
- 数据量：train(30,378), test(1,598)
- 大小：10MB(train), 527KB(test)
finance-zh
- 特征：messages (content, role)
- 数据量：train(11,278), test(10)
- 大小：67MB(train), 59KB(test)
human_value-zh
- 特征：messages (content, role)
- 数据量：train(997), test(10)
- 大小：706KB(train), 7KB(test)
logi_qa-zh
- 特征：messages (content, role)
- 数据量：train(411), test(10)
- 大小：457KB(train), 11KB(test)
longalign-zh
- 特征：messages (content, role)
- 数据量：train(3,062), test(183)
- 大小：37MB(train), 1.3MB(test)
metamathqa-50k-zh
- 特征：messages (content, role)
- 数据量：train(47,500), test(2,500)
- 大小：33MB(train), 1.7MB(test)
numina-cot-100k-zh
- 特征：messages (content, role)
- 数据量：train(105,453), test(5,585)
- 大小：142MB(train), 7.6MB(test)
ruozhiba-zh
- 特征：messages (content, role)
- 数据量：train(230), test(10)
- 大小：199KB(train), 8KB(test)
segmentfault-zh
- 特征：messages (content, role)
- 数据量：train(448), test(10)
- 大小：742KB(train), 16KB(test)
self-oss-instruct-zh
- 特征：messages (content, role)
- 数据量：train(42,075), test(2,262)
- 大小：52MB(train), 2.7MB(test)
systemchats-30k-zh
- 特征：messages (content, role)
- 数据量：train(34,034), test(1,789)
- 大小：101MB(train), 4.7MB(test)
wiki-zh
- 特征：messages (content, role)
- 数据量：train(10,593), test(10)
- 大小：26MB(train), 25KB(test)
wikihow-zh
- 特征：messages (content, role)
- 数据量：train(1,475), test(10)
- 大小：10MB(train), 74KB(test)
xhs-zh
- 特征：messages (content, role)
- 数据量：train(1,498), test(10)
- 大小：2.3MB(train), 15KB(test)
zhihu-zh
- 特征：messages (content, role)
- 数据量：train(5,621), test(10)
- 大小：12MB(train), 22KB(test)

搜集汇总

数据集介绍

构建方式

smalltalk数据集通过多源异构数据整合构建，涵盖日常对话、专业知识问答、金融分析等多样化场景。采用结构化消息格式存储，每条记录包含角色标识和文本内容，支持中英双语平行语料。数据经过严格清洗和标准化处理，确保语义连贯性和领域覆盖广度，训练集与测试集按科学比例划分以保障模型评估效度。

使用方法

研究者可通过HuggingFace接口直接加载特定领域子集，支持按train-test拆分或全量数据调用。典型应用场景包括对话系统微调、跨语言迁移学习及领域适应性研究。使用时应根据config_name指定语种和主题，消息列表可直接转换为LLM训练所需的提示模板，注意不同子集间的数据分布差异可能影响模型泛化性能。

背景与挑战

背景概述

smalltalk数据集是一个多语言对话数据集，旨在支持自然语言处理领域的研究与应用。该数据集由多个子集构成，涵盖了日常对话、专业知识问答、逻辑推理、金融咨询等多个领域，同时支持中英双语。其构建目的是为了促进对话系统的开发，特别是在多轮对话理解和生成方面的研究。数据集的多样性和广泛性使其成为评估和训练对话模型的重要资源。

当前挑战

smalltalk数据集面临的挑战主要包括两个方面：领域问题的多样性与数据构建的复杂性。在领域问题方面，数据集需要覆盖从日常闲聊到专业咨询的广泛话题，这对模型的泛化能力提出了较高要求。在数据构建过程中，如何确保对话的自然性和信息的准确性是一大挑战，尤其是在跨语言场景下，保持语义一致性和文化适应性尤为重要。此外，数据规模的庞大也带来了存储和处理的效率问题。

常用场景

经典使用场景

在自然语言处理领域，smalltalk数据集因其丰富的多语言对话样本而成为研究对话系统的经典资源。该数据集覆盖日常交流、专业领域讨论及技术问答等多样化场景，特别适用于训练和评估生成式对话模型的语境理解与响应生成能力。其双语平行语料的结构为跨语言对话研究提供了独特价值。

解决学术问题

该数据集有效解决了对话系统中长期存在的语境连贯性保持、多轮对话逻辑一致性等核心问题。通过提供高质量标注的对话序列，研究者能够深入探究对话状态跟踪、意图识别等关键技术，尤其在低资源语言场景下缓解了数据稀缺的困境，推动了对话系统公平性研究。

实际应用

在实际应用层面，该数据集支撑了智能客服系统的语义理解模块开发，优化了电商平台的自动问答体验。金融和技术社区配置的专用子集可直接用于领域对话引擎训练，其社交媒体对话数据则为舆情分析系统提供了重要的语义标注基准。

数据集最近研究