gautisid/Bitext-customer-support-llm-chatbot-training-dataset
Source: Hugging Face. Updated: 2026-04-26. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/gautisid/Bitext-customer-support-llm-chatbot-training-dataset
Description:
This hybrid synthetic dataset is designed for fine-tuning Large Language Models such as GPT, Mistral and OpenELM. It was generated using NLP/NLG technology and automated data labeling tools. Its goal is to demonstrate how verticalization (domain adaptation) for the customer support sector can be achieved with a two-step approach to LLM fine-tuning.
The dataset targets the intent detection use case in the customer service vertical. It contains 27 intents assigned to 10 categories, 26872 question/answer pairs (around 1000 per intent), 30 entity/slot types, and 12 types of language generation tags.
The categories and intents are selected from Bitext's collection of 20 vertical-specific datasets, covering the following industries: Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, and Wealth Management.
The question/answer pairs are generated with a hybrid methodology: natural texts serve as the source, NLP technology extracts seeds from them, and NLG technology expands the seed texts. The whole process is curated by computational linguists.
Each entry contains five fields: flags (language generation tags), instruction (the user request), category (the high-level semantic category), intent, and response (an example of the response expected from a virtual assistant). The language generation tags mark linguistic variation, such as lexical, syntactic, register, and stylistic changes, so that the dataset can be customized for different user profiles.
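To make the entry layout concrete, the following is a minimal sketch of how the five documented fields might be turned into fine-tuning pairs and checked for per-intent balance. The sample records, their field values, and the helper names `to_training_pair` and `intent_counts` are assumptions made for illustration; they are not taken from the dataset or from any Bitext tooling.

```python
from collections import Counter

# Hypothetical records that mimic the five documented fields.
# All values below are invented for illustration only.
records = [
    {
        "flags": "B",
        "instruction": "I need to cancel my order ",
        "category": "ORDER",
        "intent": "cancel_order",
        "response": "Sure, I can help you cancel your order.",
    },
    {
        "flags": "BL",
        "instruction": "how do i cancel an order",
        "category": "ORDER",
        "intent": "cancel_order",
        "response": "Happy to help. Please open your order history.",
    },
    {
        "flags": "BQ",
        "instruction": "where is my refund?",
        "category": "REFUND",
        "intent": "track_refund",
        "response": "Let me check the status of your refund.",
    },
]


def to_training_pair(record):
    """Map one entry to a prompt/completion pair for supervised fine-tuning."""
    return {
        "prompt": record["instruction"].strip(),
        "completion": record["response"].strip(),
    }


def intent_counts(rows):
    """Count entries per intent, e.g. to check the per-intent balance."""
    return Counter(row["intent"] for row in rows)


pairs = [to_training_pair(r) for r in records]
counts = intent_counts(records)
```

On the full dataset, the same counting step would be one way to check the stated balance: 26872 pairs over 27 intents averages about 995 per intent. The records could presumably be loaded with the Hugging Face `datasets` library via `load_dataset` and the repository name above, assuming the repository exposes a default configuration.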
Provider:
gautisid



