sdeakin/LLM-Tagged-GoEmotions
Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/sdeakin/LLM-Tagged-GoEmotions
---
license: cc-by-4.0
task_categories:
- text-classification
- sentence-similarity
- feature-extraction
language:
- en
tags:
- emotion-classification
- text-classification
- explanations
- rationales
- goemotions
- GoEmotions
- synthetic
- llm-generated
- natural-language-processing
- emotions
- affect
pretty_name: 'LLM-Tagged GoEmotions: Llama 3 Labeling of GoEmotions'
size_categories:
- 100K<n<1M
---
# Dataset Card for **LLM-Tagged-GoEmotions**
## Dataset Summary
`LLM-Simple-Emotions.jsonl` contains **211,225 synthetic emotion annotations** generated from the original GoEmotions corpus.
Each Reddit utterance is re-annotated using **`llama3:instruct`** (via Ollama) with the **Simple Level-1 Prompt**, which instructs the model to:
* Predict the **primary emotion label(s)** (from GoEmotions)
* Provide a **natural-language explanation** of *why* those emotions were tagged

This dataset is ideal for:
* Single-label and multi-label emotion classification
* Training models that use **rationale/explanation supervision**
* Studying LLM emotional reasoning over text
---
## Supported Tasks
### **Emotion Classification**
Use:
* `data.labels`
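
For multi-label training, `data.labels` usually has to become a fixed-length vector. The following is a minimal sketch, assuming the label vocabulary is the standard GoEmotions taxonomy (27 emotions plus `neutral`); the card does not enumerate the vocabulary, and `GOEMOTIONS_LABELS` and `to_multi_hot` are illustrative names rather than part of the dataset.

```python
# Assumption: the label set is the standard GoEmotions taxonomy
# (27 emotions plus "neutral"); the dataset card does not list it.
GOEMOTIONS_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
LABEL_TO_INDEX = {label: i for i, label in enumerate(GOEMOTIONS_LABELS)}


def to_multi_hot(labels):
    """Return a 0/1 vector with one slot per label; unknown labels are skipped."""
    vector = [0] * len(GOEMOTIONS_LABELS)
    for label in labels:
        index = LABEL_TO_INDEX.get(label)
        if index is not None:
            vector[index] = 1
    return vector


# Record shaped like the example in this card.
record = {"text": "That game hurt.", "data": {"labels": ["disappointment"]}}
vector = to_multi_hot(record["data"]["labels"])
```

For single-label classification, one option is to keep only records whose `data.labels` list has exactly one entry and use `LABEL_TO_INDEX` directly.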
### **Explanation Modeling (Optional)**
Use:
* `data.explanation`

This field can be used to train models that generate text rationales or explanations.
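
One way to set up rationale generation is to turn each record into an input/target text pair. This is a sketch only: the template wording and the helper name `to_seq2seq_pair` are assumptions, not something the dataset prescribes.

```python
def to_seq2seq_pair(record):
    """Build an (input, target) pair for explanation generation.

    The template below is an illustrative assumption; any format that
    exposes the text and labels to the model would work.
    """
    labels = ", ".join(record["data"]["labels"])
    source = f"Text: {record['text']}\nEmotions: {labels}\nExplain:"
    target = record["data"]["explanation"]
    return source, target


# Record copied from the example in this card.
record = {
    "text": "That game hurt.",
    "data": {
        "labels": ["disappointment"],
        "explanation": "The speaker expresses regret and sadness about the "
                       "outcome of the game, indicating disappointment.",
    },
}
source, target = to_seq2seq_pair(record)
```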
---
## Languages
* **English (`en`)**
---
## Dataset Structure
### **Example Record**
```json
{
  "src_id": "l1_0",
  "model": "llama3:instruct",
  "provider": "ollama-local",
  "prompt": "simple_level1",
  "text": "That game hurt.",
  "data": {
    "labels": ["disappointment"],
    "explanation": "The speaker expresses regret and sadness about the outcome of the game, indicating disappointment."
  }
}
```
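
The record layout above can be mirrored in a small typed container, which makes field access less error-prone than nested dictionary lookups. This is a sketch based solely on the example record; `EmotionRecord` and `parse_record` are hypothetical names, and the field types are inferred from that single example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmotionRecord:
    src_id: str
    model: str
    provider: str
    prompt: str
    text: str
    labels: List[str]
    explanation: str


def parse_record(raw: dict) -> EmotionRecord:
    """Flatten the nested `data` object into a single typed record."""
    data = raw["data"]
    return EmotionRecord(
        src_id=raw["src_id"],
        model=raw["model"],
        provider=raw["provider"],
        prompt=raw["prompt"],
        text=raw["text"],
        labels=list(data["labels"]),
        explanation=data["explanation"],
    )


example = {
    "src_id": "l1_0",
    "model": "llama3:instruct",
    "provider": "ollama-local",
    "prompt": "simple_level1",
    "text": "That game hurt.",
    "data": {
        "labels": ["disappointment"],
        "explanation": "The speaker expresses regret and sadness about the "
                       "outcome of the game, indicating disappointment.",
    },
}
parsed = parse_record(example)
```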
---
## Size & Splits
* **Total entries:** 211,225
* **Splits:** Single combined dataset (`train` only)

Users may create custom train/validation/test splits.
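
A reproducible split can be derived by hashing a stable key, so the assignment does not depend on file order or a random seed. The sketch below assumes `src_id` is unique per record (the card shows only one example id); the fractions and the name `assign_split` are illustrative.

```python
import hashlib


def assign_split(src_id, val_fraction=0.1, test_fraction=0.1):
    """Map an id to "train", "validation", or "test" deterministically."""
    digest = hashlib.sha256(src_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


# The same id always lands in the same split.
split = assign_split("l1_0")
```

If several records turn out to share the same `text`, hashing `text` instead of `src_id` would keep them in one split and avoid leakage between train and test.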
---
## Data Collection & Processing
### **Source**
* Original GoEmotions dataset (`CC BY 4.0`)
### **Generation Pipeline**
1. Load each GoEmotions utterance.
2. Apply the Simple Level-1 prompt to `llama3:instruct`.
3. Extract:
   * Emotion label(s)
   * Explanation text
4. Save the structured result into JSONL.
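
The steps above can be sketched roughly as follows. This is not the author's pipeline: the prompt wording in `PROMPT_TEMPLATE` is hypothetical (the actual Simple Level-1 prompt is not reproduced in this card), the server address assumes a default local Ollama install, and the model is assumed to reply with a JSON object. The model call is injectable so the flow can be exercised offline with a stub.

```python
import json
import urllib.request

# Hypothetical prompt wording; the real Simple Level-1 prompt is not shown here.
PROMPT_TEMPLATE = (
    "Label the emotions in the text using GoEmotions labels. "
    'Reply with JSON: {{"labels": [...], "explanation": "..."}}\n'
    "Text: {text}"
)


def ollama_generate(prompt, model="llama3:instruct",
                    url="http://localhost:11434/api/generate"):
    """Send one non-streaming request to a local Ollama server (assumed address)."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def annotate(src_id, text, generate=ollama_generate):
    """Steps 2-4 for one utterance: prompt the model, parse the reply, build a record."""
    reply = generate(PROMPT_TEMPLATE.format(text=text))
    data = json.loads(reply)  # raises ValueError on malformed model output
    return {
        "src_id": src_id,
        "model": "llama3:instruct",
        "provider": "ollama-local",
        "prompt": "simple_level1",
        "text": text,
        "data": {"labels": data["labels"], "explanation": data["explanation"]},
    }


def fake_generate(prompt):
    """Stub standing in for the model so the example runs without a server."""
    return json.dumps({"labels": ["disappointment"],
                       "explanation": "Stub explanation."})


record = annotate("l1_0", "That game hurt.", generate=fake_generate)
line = json.dumps(record, ensure_ascii=False)  # one line of the JSONL output
```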
### **Post-Processing**
Minimal cleanup:
* Remove malformed outputs and nonsensical labels
* Normalize labels
* Ensure `text` and `data.explanation` are both present
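
The card does not spell out the exact cleanup rules, so the following is only one plausible reading of the three steps. It assumes that "nonsensical labels" means labels outside the standard GoEmotions taxonomy and that normalization means trimming whitespace and lowercasing; `VALID_LABELS` and `clean_record` are hypothetical names.

```python
# Assumption: valid labels are the 27 GoEmotions emotions plus "neutral".
VALID_LABELS = {
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
}


def clean_record(record):
    """Return a cleaned copy of the record, or None if it should be dropped."""
    text = record.get("text")
    data = record.get("data")
    if not isinstance(text, str) or not text.strip() or not isinstance(data, dict):
        return None
    explanation = data.get("explanation")
    labels = data.get("labels")
    if not isinstance(explanation, str) or not explanation.strip():
        return None
    if not isinstance(labels, list):
        return None
    normalized = []
    for label in labels:
        if isinstance(label, str):
            name = label.strip().lower()
            # Keep known labels only, without duplicates.
            if name in VALID_LABELS and name not in normalized:
                normalized.append(name)
    if not normalized:
        return None
    return {**record,
            "data": {"labels": normalized, "explanation": explanation.strip()}}


cleaned = clean_record(
    {"text": "Hi", "data": {"labels": [" Joy ", "banana"], "explanation": "Happy."}}
)
```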
---
## Known Limitations
### **Model Bias**
* Labels and explanations depend on Llama 3's internal reasoning and biases.
* Explanations may be overly confident or simplistic.
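
A simple first check for skew in the model's labeling is to look at how often each label appears. This is a sketch with made-up sample records; `label_distribution` is an illustrative helper, and no actual dataset statistics are implied.

```python
from collections import Counter


def label_distribution(records):
    """Return each label's share of all label occurrences, most common first."""
    counts = Counter(
        label for record in records for label in record["data"]["labels"]
    )
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Made-up records for illustration only.
sample = [
    {"data": {"labels": ["joy"]}},
    {"data": {"labels": ["joy", "gratitude"]}},
    {"data": {"labels": ["disappointment"]}},
]
distribution = label_distribution(sample)
```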
---
## Usage
### **Direct JSONL Reading**
```python
import json

# Each line of the JSONL file is one standalone JSON record.
with open("LLM-Simple-Emotions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["text"], record["data"]["labels"], record["data"]["explanation"])
```
### **Load with Hugging Face Datasets**
```python
from datasets import load_dataset

# The generic "json" builder reads JSONL files; all rows load as the "train" split.
ds = load_dataset(
    "json",
    data_files="LLM-Simple-Emotions.jsonl",
    split="train",
)
```
---
## Citation
Please cite both the **original GoEmotions dataset** and this LLM-generated extension:
```bibtex
@article{demszky2020goemotions,
  title={GoEmotions: A Dataset of Fine-Grained Emotions},
  author={Demszky, Dorottya and others},
  journal={ACL},
  year={2020}
}
@dataset{LLM-Tagged-GoEmotions,
  title={LLM-Tagged GoEmotions: Llama 3 Labeling of GoEmotions},
  author={Sheryl D. and contributors},
  year={2025},
  url={https://huggingface.co/datasets/sdeakin/LLM-Tagged-GoEmotions}
}
```
---
## Contact
For questions or issues, please open an issue on the dataset repository or contact me.

Provider: sdeakin