LlamaLens-English
ModelScope community. Updated 2025-10-09; indexed 2025-06-21.
Download link:
https://modelscope.cn/datasets/QCRI/LlamaLens-English
Description:
# LlamaLens: Specialized Multilingual LLM Dataset
This dataset supports the research presented in the paper [LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content](https://huggingface.co/papers/2410.15308).
## Overview
LlamaLens is a specialized multilingual LLM designed for analyzing news and social media content. It focuses on 18 NLP tasks, leveraging 52 datasets across Arabic, English, and Hindi. This repository contains the English-language portion of the data.
## Dataset Details
This dataset comprises various sub-datasets focusing on different text classification tasks related to news and social media analysis. A detailed breakdown of the datasets and their statistics is provided in the metadata section above.
## File Format
Each JSONL file in the dataset follows a structured format with the following fields:
- `id`: Unique identifier for each data entry.
- `original_id`: Identifier from the original dataset, if available.
- `input`: The original text that needs to be analyzed.
- `output`: The label assigned to the text after analysis.
- `dataset`: Name of the dataset the entry belongs to.
- `task`: The specific task type.
- `lang`: The language of the input text.
- `instructions`: A brief set of instructions describing how the text should be labeled.
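The field list above can be checked mechanically when loading a file. The following is a minimal sketch, assuming each line of a JSONL file holds exactly one JSON object; the helper name `iter_records` and the file name in the usage comment are hypothetical and not part of the dataset or its code.

```python
import json
from typing import Iterable, Iterator

# Field names taken from the field list in this dataset card.
EXPECTED_FIELDS = {
    "id", "original_id", "input", "output",
    "dataset", "task", "lang", "instructions",
}


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, checking the expected fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        yield record


# Usage with a hypothetical local file name:
# with open("train.jsonl", encoding="utf-8") as f:
#     for record in iter_records(f):
#         print(record["task"], record["output"])
```

Because `iter_records` accepts any iterable of strings, it works on an open file handle as well as on an in-memory list of lines.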
**Example entry in JSONL file:**
```
{
"id": "fb6dd1bb-2ab4-4402-adaa-9be9eea6ca18",
"original_id": null,
"input": "I feel that worldviews that lack the divine tend toward the solipsistic.",
"output": "joy",
"dataset": "Emotion",
"task": "Emotion",
"lang": "en",
"instructions": "Identify if the given text expresses an emotion and specify whether it is joy, love, fear, anger, sadness, or surprise. Return only the label without any explanation, justification, or additional text."
}
```
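As an illustration of how the `instructions`, `input`, and `output` fields might be used together, the sketch below builds a prompt from one record and compares a predicted label with the gold label. The prompt template and the helper names `build_prompt` and `is_correct` are assumptions made for this example; they are not the template or the evaluation code used by the LlamaLens authors.

```python
def build_prompt(record: dict) -> str:
    """Join the instruction and the input text (hypothetical template)."""
    return f"{record['instructions']}\n\nText: {record['input']}\nLabel:"


def is_correct(prediction: str, record: dict) -> bool:
    """Case-insensitive exact match against the gold label in `output`."""
    return prediction.strip().lower() == str(record["output"]).strip().lower()


# The example entry shown above, as a Python dict.
example = {
    "id": "fb6dd1bb-2ab4-4402-adaa-9be9eea6ca18",
    "original_id": None,
    "input": "I feel that worldviews that lack the divine tend toward the solipsistic.",
    "output": "joy",
    "dataset": "Emotion",
    "task": "Emotion",
    "lang": "en",
    "instructions": (
        "Identify if the given text expresses an emotion and specify whether it is "
        "joy, love, fear, anger, sadness, or surprise. Return only the label without "
        "any explanation, justification, or additional text."
    ),
}

prompt = build_prompt(example)
```

Exact-match scoring is only one possible choice; it fits here because the instructions ask for the label alone.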
## Model & Code
- **LlamaLens Model on Hugging Face:** [https://huggingface.co/QCRI/LlamaLens](https://huggingface.co/QCRI/LlamaLens)
- **LlamaLens GitHub Repository:** [https://github.com/firojalam/LlamaLens](https://github.com/firojalam/LlamaLens)
## 📢 Citation
If you use this dataset, please cite our [paper](https://arxiv.org/pdf/2410.15308):
```
@article{kmainasi2024llamalensspecializedmultilingualllm,
title={LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Maram Hasanain and Sahinur Rahman Laskar and Naeemul Hassan and Firoj Alam},
year={2024},
journal={arXiv preprint arXiv:2410.15308},
volume={},
number={},
pages={},
url={https://arxiv.org/abs/2410.15308},
eprint={2410.15308},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider:
maas
Created:
2025-06-17



