LlamaLens-Hindi
Hosted on the ModelScope community. Updated 2025-12-05; indexed 2025-06-21.
Download link:
https://modelscope.cn/datasets/QCRI/LlamaLens-Hindi
# LlamaLens: Specialized Multilingual LLM Dataset
## Overview
LlamaLens is a specialized multilingual LLM designed for analyzing news and social media content. It focuses on 18 NLP tasks, leveraging 52 datasets across Arabic, English, and Hindi.
<p align="center"> <img src="https://huggingface.co/datasets/QCRI/LlamaLens-Arabic/resolve/main/capablities_tasks_datasets.png" style="width: 40%;" id="title-icon"> </p>
## LlamaLens
This repo includes scripts needed to run our full pipeline, including data preprocessing and sampling, instruction dataset creation, model fine-tuning, inference and evaluation.
### Features
- Multilingual support (Arabic, English, Hindi)
- 18 NLP tasks with 52 datasets
- Optimized for news and social media content analysis
## 📂 Dataset Overview
### Hindi Datasets
| **Task** | **Dataset** | **# Labels** | **# Train** | **# Test** | **# Dev** |
| -------------------------- | ----------------------------------------- | ------------ | ----------- | ---------- | --------- |
| Cyberbullying | MC-Hinglish1.0 | 7 | 7,400 | 1,000 | 2,119 |
| Factuality | fake-news | 2 | 8,393 | 2,743 | 1,417 |
| Hate Speech | hate-speech-detection | 2 | 3,327 | 951 | 476 |
| Hate Speech | Hindi-Hostility-Detection-CONSTRAINT-2021 | 15 | 5,718 | 1,651 | 811 |
| Natural Language Inference | Natural_Language_Inference | 2 | 1,251 | 447 | 537 |
| Summarization | xlsum | -- | 70,754 | 8,847 | 8,847 |
| Offensive Speech | Offensive_Speech_Detection | 3 | 2,172 | 636 | 318 |
| Sentiment | Sentiment_Analysis | 3 | 10,039 | 1,259 | 1,258 |
---
## Results
Below, we present the performance of **L-Lens** (LlamaLens), where *"Eng"* refers to the model fine-tuned with English instructions and *"Native"* refers to the model fine-tuned with native-language instructions. Results are compared against the SOTA (where available) and against **Base**, the Llama-Instruct 3.1 baseline. The **Δ** (Delta) column is the difference between the English-instructed LlamaLens model and the SOTA, calculated as (L-Lens-Eng – SOTA). Metric abbreviations: Mi-F1 is micro-averaged F1, W-F1 is weighted F1, R-2 is ROUGE-2, and Acc is accuracy.
---
| **Task** | **Dataset** | **Metric** | **SOTA** | **Base** | **L-Lens-Eng** | **L-Lens-Native** | **Δ (L-Lens (Eng) - SOTA)** |
|:----------------------------------:|:--------------------------------------------:|:----------:|:--------:|:---------------------:|:---------------------:|:--------------------:|:------------------------:|
| Factuality | fake-news | Mi-F1 | -- | 0.759 | 0.994 | 0.993 | -- |
| Hate Speech Detection | hate-speech-detection | Mi-F1 | 0.639 | 0.750 | 0.963 | 0.963 | 0.324 |
| Hate Speech Detection | Hindi-Hostility-Detection-CONSTRAINT-2021 | W-F1 | 0.841 | 0.469 | 0.753 | 0.753 | -0.088 |
| Natural Language Inference | Natural Language Inference | W-F1 | 0.646 | 0.633 | 0.568 | 0.679 | -0.078 |
| News Summarization | xlsum | R-2 | 0.136 | 0.078 | 0.171 | 0.170 | 0.035 |
| Offensive Language Detection | Offensive Speech Detection | Mi-F1 | 0.723 | 0.621 | 0.862 | 0.865 | 0.139 |
| Cyberbullying Detection | MC_Hinglish1 | Acc | 0.609 | 0.233 | 0.625 | 0.627 | 0.016 |
| Sentiment Classification | Sentiment Analysis | Acc | 0.697 | 0.552 | 0.647 | 0.654 | -0.050 |
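
The Δ column can be reproduced directly from the SOTA and L-Lens-Eng columns. The sketch below is only an illustration of that arithmetic: the values are copied from the table above, while the names `ROWS`, `delta`, and `DELTAS` are hypothetical and not part of the LlamaLens scripts.

```python
# Rows copied from the results table: (dataset, SOTA, L-Lens-Eng).
# A SOTA of None marks a dataset with no reported SOTA ("--" in the table).
ROWS = [
    ("fake-news", None, 0.994),
    ("hate-speech-detection", 0.639, 0.963),
    ("Hindi-Hostility-Detection-CONSTRAINT-2021", 0.841, 0.753),
    ("Natural Language Inference", 0.646, 0.568),
    ("xlsum", 0.136, 0.171),
    ("Offensive Speech Detection", 0.723, 0.862),
    ("MC_Hinglish1", 0.609, 0.625),
    ("Sentiment Analysis", 0.697, 0.647),
]


def delta(sota, llens_eng):
    """Return L-Lens-Eng minus SOTA, rounded to 3 decimals, or None if SOTA is missing."""
    if sota is None:
        return None
    return round(llens_eng - sota, 3)


# Map each dataset name to its delta (hypothetical helper structure).
DELTAS = {name: delta(sota, eng) for name, sota, eng in ROWS}

for name, value in DELTAS.items():
    shown = "--" if value is None else format(value, "+.3f")
    print(name, shown)
```

A positive value means the English-instructed model scores above the reported SOTA on that dataset; a negative value means it scores below.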
## File Format
Each JSONL file in the dataset follows a structured format with the following fields:
- `id`: Unique identifier for each data entry.
- `original_id`: Identifier from the original dataset, if available.
- `input`: The original text that needs to be analyzed.
- `output`: The label assigned to the text after analysis.
- `dataset`: Name of the dataset the entry belongs to.
- `task`: The specific task type.
- `lang`: The language of the input text.
- `instructions`: A brief set of instructions describing how the text should be labeled.
**Example entry in JSONL file:**
```
{
"id": "5486ee85-4a70-4b33-8711-fb2a0b6d81e1",
"original_id": null,
"input": "आप और बाकी सभी मुसलमान समाज के लिए आशीर्वाद हैं.",
"output": "not-hateful",
"dataset": "hate-speech-detection",
"task": "Factuality",
"lang": "hi",
"instructions": "Classify the given text as either 'not-hateful' or 'hateful'. Return only the label without any explanation, justification, or additional text."
}
```
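
As a minimal sketch of how such a file might be consumed, the snippet below parses JSONL lines with Python's standard `json` module and checks that each entry carries the fields listed above. The helper names (`parse_entry`, `read_jsonl`, `build_prompt`), the sample record, and the prompt layout are assumptions made for illustration; they are not taken from the LlamaLens replication scripts.

```python
import json

# Field names as listed in the File Format section.
REQUIRED_FIELDS = (
    "id", "original_id", "input", "output",
    "dataset", "task", "lang", "instructions",
)


def parse_entry(line):
    """Parse one JSONL line and check that every documented field is present."""
    entry = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return entry


def read_jsonl(path):
    """Yield validated entries from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_entry(line)


def build_prompt(entry):
    """Join instruction and input text (assumed layout, not the authors' template)."""
    return entry["instructions"] + "\n\n" + entry["input"]


# Hypothetical stand-in record with placeholder values, for demonstration only.
SAMPLE_LINE = json.dumps({
    "id": "example-0001",
    "original_id": None,
    "input": "placeholder text",
    "output": "not-hateful",
    "dataset": "hate-speech-detection",
    "task": "placeholder-task",
    "lang": "hi",
    "instructions": "Classify the given text as either 'not-hateful' or 'hateful'.",
})

sample_entry = parse_entry(SAMPLE_LINE)
print(build_prompt(sample_entry))
```

Note that `original_id` may be `null`, so the check tests for the presence of each key rather than for a non-empty value.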
## Model
[**LlamaLens on Hugging Face**](https://huggingface.co/QCRI/LlamaLens)
## Replication Scripts
[**LlamaLens GitHub Repository**](https://github.com/firojalam/LlamaLens)
## 📢 Citation
If you use this dataset, please cite our [paper](https://arxiv.org/pdf/2410.15308):
```
@article{kmainasi2024llamalensspecializedmultilingualllm,
title={LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Maram Hasanain and Sahinur Rahman Laskar and Naeemul Hassan and Firoj Alam},
year={2024},
journal={arXiv preprint arXiv:2410.15308},
volume={},
number={},
pages={},
url={https://arxiv.org/abs/2410.15308},
eprint={2410.15308},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider: maas
Created: 2025-06-17



