MohammadOthman/mo-customer-support-tweets-945k

Name: MohammadOthman/mo-customer-support-tweets-945k
Creator: MohammadOthman
Published: 2024-04-18 13:17:41
License: 暂无描述

Hugging Face2024-04-18 更新2024-06-12 收录

下载链接：

https://hf-mirror.com/datasets/MohammadOthman/mo-customer-support-tweets-945k

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-nc-sa-4.0 task_categories: - text-generation language: - en tags: - Customer Support - Twitter Data - Conversational AI - Fine-tuning size_categories: - 100K<n<1M --- # Customer Support on Twitter Dataset 945k ## Dataset Description ### Context This dataset provides a large corpus of real-world English conversations between consumers and customer support agents on Twitter, designed to drive innovation in Natural Language Processing (NLP) by providing data that better matches the actual language used in contemporary customer support interactions. ### Content Initially, the data included complex threads of conversations involving multiple exchanges between customers and support agents. To transform this data into a more structured format suitable for training language models, the following steps were taken: - **Conversation Extraction**: Each conversation was distilled down from potentially lengthy threads to essential exchanges. This involved identifying the start and end of customer support interactions and ensuring each input (customer's query) was paired with an immediate response (support agent's reply). - **Data Pairing**: The extracted conversations were restructured into pairs of inputs and outputs, where each 'input' is a customer's request or question, and each 'output' is the corresponding response from a support agent. - **Cleaning and Standardization**: To enhance the quality of the dataset for NLP tasks, extensive cleaning and preprocessing were applied, which included: - **Removing Noise**: Unnecessary content such as URLs, HTML tags, and user mentions were removed. - **Normalizing Text**: Emojis were replaced with words, and emoticons were removed or replaced with their textual descriptions to maintain the emotional and contextual nuances without visual elements. - **Expanding Abbreviations**: Internet slangs and contractions were expanded to their full forms to standardize the text and make it more understandable and accessible for language processing models. ### Preprocessing Details - **Chat Slang Conversion**: Common internet abbreviations and slang were expanded to their full words using a predefined dictionary to ensure clarity. - **Contraction Expansion**: Contractions were expanded to their full forms for consistency in language usage. - **Emoji and Emoticon Replacement**: Emojis and emoticons were replaced with corresponding text descriptions to preserve their emotional and contextual significance. - **Cleaning End-of-Line Noise**: Abbreviations at the end of responses, often irrelevant to the context, were removed to maintain the focus on the content relevant to customer support. ### Use Cases This dataset is suitable for various NLP applications, including: - **Fine-Tuning Language Models**: The structured format of input-output pairs makes this dataset ideal for fine-tuning language models on task-specific dialogue understanding and generation. - **Automated Response Suggestion**: Training models to predict customer support responses. - **Analysis of Response Effectiveness**: Evaluating how different response strategies affect customer satisfaction. - **Sentiment Analysis**: Examining how sentiment influences the interaction dynamics in customer support. - **Topic Modeling**: Identifying common themes or issues in customer inquiries to aid in strategic planning for support services.

提供机构：

MohammadOthman

原始信息汇总

Customer Support on Twitter Dataset 945k

数据集描述

背景

本数据集提供了一个大规模的真实世界英语对话语料库，这些对话发生在Twitter上的消费者与客户支持代理之间。该数据集旨在通过提供更符合当代客户支持交互中实际使用的语言的数据，推动自然语言处理（NLP）的创新。

内容

原始数据包含客户和支持代理之间多轮交互的复杂对话线程。为了将这些数据转换为更适合训练语言模型的结构化格式，采取了以下步骤：

对话提取：从可能较长的线程中提炼出每个对话的基本交换，确定客户支持交互的开始和结束，并确保每个输入（客户的查询）与相应的输出（支持代理的回复）配对。
数据配对：提取的对话被重构为输入和输出的配对，其中每个“输入”是客户的请求或问题，每个“输出”是对应的支持代理的回复。
清洗和标准化：为了提高数据集在NLP任务中的质量，进行了广泛的清洗和预处理，包括：
- 去除噪声：移除了不必要的如URL、HTML标签和用户提及等内容。
- 文本规范化：表情符号被替换为文字，表情被移除或替换为其文本描述，以保持情感和上下文细微差别。
- 扩展缩写：网络俚语和缩写被扩展为其完整形式，以标准化文本，使其更易于语言处理模型理解。

预处理细节

网络俚语转换：使用预定义字典将常见的互联网缩写和俚语扩展为其完整单词，以确保清晰度。
缩写扩展：缩写被扩展为其完整形式，以保持语言使用的一致性。
表情和表情符号替换：表情和表情符号被替换为相应的文本描述，以保留其情感和上下文重要性。
清理行末噪声：通常与上下文无关的响应末尾的缩写被移除，以保持对客户支持相关内容的焦点。

使用案例

该数据集适用于多种NLP应用，包括：

微调语言模型：输入-输出对的结构化格式使其成为微调语言模型在特定任务对话理解和生成方面的理想选择。
自动回复建议：训练模型预测客户支持回复。
回复有效性分析：评估不同回复策略对客户满意度的影响。
情感分析：研究情感如何影响客户支持中的交互动态。
主题建模：识别客户咨询中的常见主题或问题，以帮助支持服务的战略规划。

5,000+

优质数据集

54 个

任务类型

进入经典数据集