sdeakin/LLM-Tagged-GoEmotions

Name: sdeakin/LLM-Tagged-GoEmotions
Creator: sdeakin
Published: 2025-12-09 03:27:31
License: not specified on this listing (the dataset card metadata declares cc-by-4.0)

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/sdeakin/LLM-Tagged-GoEmotions


Description:

---
license: cc-by-4.0
task_categories:
- text-classification
- sentence-similarity
- feature-extraction
language:
- en
tags:
- emotion-classification
- text-classification
- explanations
- rationales
- goemotions
- GoEmotions
- synthetic
- llm-generated
- natural-language-processing
- emotions
- affect
pretty_name: 'LLM-Tagged GoEmotions: Llama 3 Labeling of GoEmotions'
size_categories:
- 100K<n<1M
---

# Dataset Card for **LLM-Tagged-GoEmotions**

## Dataset Summary

`LLM-Simple-Emotions.jsonl` contains **211,225 synthetic emotion annotations** generated from the original GoEmotions corpus. Each Reddit utterance is re-annotated using **`llama3:instruct`** (via Ollama) with the **Simple Level-1 Prompt**, which instructs the model to:

* Predict the **primary emotion label(s)** (from GoEmotions)
* Provide a **natural-language explanation** of *why* those emotions were tagged

This dataset is well suited to:

* Single-label and multi-label emotion classification
* Training models that use **rationale/explanation supervision**
* Studying LLM emotional reasoning over text

---

## Supported Tasks

### **Emotion Classification**

Use:

* `data.labels`

### **Explanation Modeling (Optional)**

Use:

* `data.explanation`

to train models that generate text rationales or explanations.

---

## Languages

* **English (`en`)**

---

## Dataset Structure

### **Example Record**

```json
{
  "src_id": "l1_0",
  "model": "llama3:instruct",
  "provider": "ollama-local",
  "prompt": "simple_level1",
  "text": "That game hurt.",
  "data": {
    "labels": ["disappointment"],
    "explanation": "The speaker expresses regret and sadness about the outcome of the game, indicating disappointment."
  }
}
```

---

## Size & Splits

* **Total entries:** 211,225
* **Splits:** Single combined dataset (`train` only)

Users may create custom train/validation/test splits.

---

## Data Collection & Processing

### **Source**

* Original GoEmotions dataset (`CC BY 4.0`)

### **Generation Pipeline**

1. Load each GoEmotions utterance.
2. Apply the Simple Level-1 prompt to `llama3:instruct`.
3. Extract:
   * Emotion label(s)
   * Explanation text
4. Save the structured result into JSONL.

### **Post-Processing**

Minimal cleanup:

* Remove malformed outputs and nonsensical labels
* Normalize labels
* Ensure text + explanation are present

---

## Known Limitations

### **Model Bias**

* Labels and explanations depend on Llama-3's internal reasoning and biases.
* Explanations may be overly confident or simplistic.

---

## Usage

### **Direct JSONL Reading**

```python
import json

# Stream the file: each line holds one JSON record.
with open("LLM-Simple-Emotions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["text"], record["data"]["labels"], record["data"]["explanation"])
```

### **Load with Hugging Face Datasets**

```python
from datasets import load_dataset

# Load the local JSONL file as a single "train" split.
ds = load_dataset(
    "json",
    data_files="LLM-Simple-Emotions.jsonl",
    split="train"
)
```

---

## Citation

Please cite both the **original GoEmotions dataset** and this LLM-generated extension:

```bibtex
@article{demszky2020goemotions,
  title={GoEmotions: A Dataset of Fine-Grained Emotions},
  author={Demszky, Dorottya and others},
  journal={ACL},
  year={2020}
}

@dataset{LLM-Tagged-GoEmotions,
  title={LLM-Tagged GoEmotions: Llama 3 Labeling of GoEmotions},
  author={Sheryl D. and contributors},
  year={2025},
  url={https://huggingface.co/datasets/sdeakin/LLM-Tagged-GoEmotions}
}
```

---

## Contact

For questions or issues, please open an issue on the dataset repository or contact me.
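
The card lists the post-processing steps (drop malformed outputs and nonsensical labels, normalize labels, require text and explanation) without showing how they were implemented. The following is a minimal sketch of one way such a filter could look, not the author's actual code. The function names `clean_record` and `clean_jsonl_lines` are hypothetical, and the normalization rule (strip whitespace, lowercase, keep only names from the 27 GoEmotions categories plus `neutral`) is an assumption.

```python
import json

# The 27 GoEmotions emotion categories plus "neutral".
GOEMOTIONS_LABELS = frozenset({
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
})


def clean_record(record):
    """Return a cleaned copy of one record, or None if it should be dropped.

    Hypothetical helper: the rules below are assumptions, not the
    dataset author's documented procedure.
    """
    data = record.get("data")
    if not isinstance(data, dict):
        return None
    text = record.get("text")
    explanation = data.get("explanation")
    labels = data.get("labels")
    # Require non-empty text and explanation.
    if not isinstance(text, str) or not text.strip():
        return None
    if not isinstance(explanation, str) or not explanation.strip():
        return None
    if not isinstance(labels, list):
        return None
    # Normalize labels and discard anything outside the known label set.
    normalized = []
    for label in labels:
        if not isinstance(label, str):
            continue
        name = label.strip().lower()
        if name in GOEMOTIONS_LABELS and name not in normalized:
            normalized.append(name)
    if not normalized:
        return None
    cleaned = dict(record)
    cleaned["data"] = {**data, "labels": normalized,
                       "explanation": explanation.strip()}
    return cleaned


def clean_jsonl_lines(lines):
    """Yield cleaned records from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed output: skip
        if not isinstance(record, dict):
            continue
        cleaned = clean_record(record)
        if cleaned is not None:
            yield cleaned
```

Passing an open file handle for `LLM-Simple-Emotions.jsonl` to `clean_jsonl_lines` re-checks the published records against these assumed rules. Records that fail a check are dropped silently, so counting the survivors is a quick sanity check.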
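
Because the data ships as a single `train` split and `data.labels` may hold more than one label, two preparation steps usually come up: assigning records to custom splits and encoding labels for multi-label training. The sketch below uses only the standard library. The helper names `assign_split` and `multi_hot` are hypothetical, as are the 10% validation and test fractions. Hashing the `text` field rather than `src_id` is an assumption made so that any repeated utterance lands in one split; the card does not say whether utterances repeat.

```python
import hashlib


def assign_split(key, val_fraction=0.1, test_fraction=0.1):
    """Deterministically map a string key to 'train', 'validation', or 'test'.

    The key is hashed to a number in [0, 1), so the same key always
    receives the same split, independent of file order or random seeds.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


def multi_hot(labels, label_order):
    """Encode a list of label names as a 0/1 vector following label_order."""
    index = {name: i for i, name in enumerate(label_order)}
    vector = [0] * len(label_order)
    for label in labels:
        if label in index:  # labels outside label_order are ignored
            vector[index[label]] = 1
    return vector


# Example with the record shown in the card.
record = {
    "src_id": "l1_0",
    "text": "That game hurt.",
    "data": {"labels": ["disappointment"], "explanation": "..."},
}
split = assign_split(record["text"])
target = multi_hot(record["data"]["labels"], ["anger", "disappointment", "joy"])
```

Here `target` is `[0, 1, 0]`. In practice `label_order` would be the full, fixed list of GoEmotions label names, so that vector positions stay stable across runs.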


Provider:

sdeakin
