
sdeakin/LLM-Tagged-GoEmotions

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/sdeakin/LLM-Tagged-GoEmotions
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
- sentence-similarity
- feature-extraction
language:
- en
tags:
- emotion-classification
- text-classification
- explanations
- rationales
- goemotions
- GoEmotions
- synthetic
- llm-generated
- natural-language-processing
- emotions
- affect
pretty_name: 'LLM-Tagged GoEmotions: Llama 3 Labeling of GoEmotions'
size_categories:
- 100K<n<1M
---

# Dataset Card for **LLM-Tagged-GoEmotions**

## Dataset Summary

`LLM-Simple-Emotions.jsonl` contains **211,225 synthetic emotion annotations** generated from the original GoEmotions corpus. Each Reddit utterance is re-annotated using **`llama3:instruct`** (via Ollama) with the **Simple Level-1 Prompt**, which instructs the model to:

* Predict the **primary emotion label(s)** (from GoEmotions)
* Provide a **natural-language explanation** of *why* those emotions were tagged

This dataset is suited to:

* Single-label and multi-label emotion classification
* Training models that use **rationale/explanation supervision**
* Studying LLM emotional reasoning over text

---

## Supported Tasks

### **Emotion Classification**

Use:

* `data.labels`

### **Explanation Modeling (Optional)**

Use:

* `data.explanation`

to train models that generate text rationales or explanations.

---

## Languages

* **English (`en`)**

---

## Dataset Structure

### **Example Record**

```json
{
  "src_id": "l1_0",
  "model": "llama3:instruct",
  "provider": "ollama-local",
  "prompt": "simple_level1",
  "text": "That game hurt.",
  "data": {
    "labels": ["disappointment"],
    "explanation": "The speaker expresses regret and sadness about the outcome of the game, indicating disappointment."
  }
}
```

---

## Size & Splits

* **Total entries:** 211,225
* **Splits:** Single combined dataset (`train` only)

Users may create custom train/validation/test splits.

---

## Data Collection & Processing

### **Source**

* Original GoEmotions dataset (`CC BY 4.0`)

### **Generation Pipeline**

1. Load each GoEmotions utterance.
2. Apply the Simple Level-1 prompt to `llama3:instruct`.
3. Extract:
   * Emotion label(s)
   * Explanation text
4. Save the structured result into JSONL.

### **Post-Processing**

Minimal cleanup:

* Remove malformed outputs and nonsensical labels
* Normalize labels
* Ensure text + explanation are present

---

## Known Limitations

### **Model Bias**

* Labels and explanations depend on Llama-3’s internal reasoning and biases.
* Explanations may be overly confident or simplistic.

---

## Usage

### **Direct JSONL Reading**

```python
import json

# Each line of the file is one standalone JSON record.
with open("LLM-Simple-Emotions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Labels and explanation are nested under the "data" key.
        print(record["text"], record["data"]["labels"], record["data"]["explanation"])
```

### **Load with Hugging Face Datasets**

```python
from datasets import load_dataset

# Load the local JSONL file as a single "train" split.
ds = load_dataset(
    "json",
    data_files="LLM-Simple-Emotions.jsonl",
    split="train"
)
```

---

## Citation

Please cite both the **original GoEmotions dataset** and this LLM-generated extension:

```bibtex
@article{demszky2020goemotions,
  title={GoEmotions: A Dataset of Fine-Grained Emotions},
  author={Demszky, Dorottya and others},
  journal={ACL},
  year={2020}
}

@dataset{LLM-Tagged-GoEmotions,
  title={LLM-Tagged GoEmotions: Llama 3 Labeling of GoEmotions},
  author={Sheryl D. and contributors},
  year={2025},
  url={https://huggingface.co/datasets/sdeakin/LLM-Tagged-GoEmotions}
}
```

---

## Contact

For questions or issues, please open an issue on the dataset repository or contact me.
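
---

## Example: Custom Splits

The card notes that only a single `train` split ships and that users may create their own train/validation/test splits. The following is a minimal sketch of one way to do that with the standard library, not a method prescribed by the dataset author. The helper names `flatten_record` and `split_records`, the 10%/10% fractions, and the seed are all assumptions made for illustration. The demo reuses the card's single example record with made-up `src_id` values (`demo_0`, `demo_1`, ...) so that it runs without the data file.

```python
import random


def flatten_record(record):
    """Pull the nested fields used for training out of one JSONL record."""
    return {
        "text": record["text"],
        "labels": record["data"]["labels"],
        "explanation": record["data"]["explanation"],
    }


def split_records(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of the records and cut it into train/validation/test lists.

    The fractions and seed are illustrative defaults, not values from the card.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# The record below is the example from the dataset card. It is repeated with
# hypothetical src_id values purely to have something to split.
example = {
    "src_id": "l1_0",
    "model": "llama3:instruct",
    "provider": "ollama-local",
    "prompt": "simple_level1",
    "text": "That game hurt.",
    "data": {
        "labels": ["disappointment"],
        "explanation": "The speaker expresses regret and sadness about the "
                       "outcome of the game, indicating disappointment.",
    },
}
sample = [dict(example, src_id=f"demo_{i}") for i in range(10)]

rows = [flatten_record(r) for r in sample]
train, val, test = split_records(rows, val_frac=0.1, test_frac=0.1, seed=0)
print(len(train), len(val), len(test))
```

With the real file, `sample` would instead be the list of records parsed from `LLM-Simple-Emotions.jsonl` as in the Direct JSONL Reading snippet. A purely random split ignores label balance; stratifying by label may be preferable for rare emotions.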

Provider:
sdeakin