Download link:

https://modelscope.cn/datasets/QCRI/LlamaLens-Hindi

Resource description:

# LlamaLens: Specialized Multilingual LLM Dataset

## Overview

LlamaLens is a specialized multilingual LLM designed for analyzing news and social media content. It focuses on 18 NLP tasks, leveraging 52 datasets across Arabic, English, and Hindi.

<p align="center"> <img src="https://huggingface.co/datasets/QCRI/LlamaLens-Arabic/resolve/main/capablities_tasks_datasets.png" style="width: 40%;" id="title-icon"> </p>

## LlamaLens

This repo includes scripts needed to run our full pipeline, including data preprocessing and sampling, instruction dataset creation, model fine-tuning, inference and evaluation.

### Features

- Multilingual support (Arabic, English, Hindi)
- 18 NLP tasks with 52 datasets
- Optimized for news and social media content analysis

## 📂 Dataset Overview

### Hindi Datasets

| **Task** | **Dataset** | **# Labels** | **# Train** | **# Test** | **# Dev** |
| -------------------------- | ----------------------------------------- | ------------ | ----------- | ---------- | --------- |
| Cyberbullying | MC-Hinglish1.0 | 7 | 7,400 | 1,000 | 2,119 |
| Factuality | fake-news | 2 | 8,393 | 2,743 | 1,417 |
| Hate Speech | hate-speech-detection | 2 | 3,327 | 951 | 476 |
| Hate Speech | Hindi-Hostility-Detection-CONSTRAINT-2021 | 15 | 5,718 | 1,651 | 811 |
| Natural_Language_Inference | Natural_Language_Inference | 2 | 1,251 | 447 | 537 |
| Summarization | xlsum | -- | 70,754 | 8,847 | 8,847 |
| Offensive Speech | Offensive_Speech_Detection | 3 | 2,172 | 636 | 318 |
| Sentiment | Sentiment_Analysis | 3 | 10,039 | 1,259 | 1,258 |

---

## Results

Below, we present the performance of **L-Lens: LlamaLens**, where *"Eng"* refers to the English-instructed model and *"Native"* refers to the model trained with native language instructions. The results are compared against the SOTA (where available) and the Base: **Llama-Instruct 3.1 baseline**. The **Δ** (Delta) column indicates the difference between LlamaLens and the SOTA performance, calculated as (LlamaLens – SOTA).
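As a quick sanity check on how the Δ column is derived, the sketch below recomputes it for three rows of the results table. The `ROWS` dictionary and the `delta` helper are illustrative names, not part of the LlamaLens scripts; the numbers are copied from the SOTA and L-Lens-Eng columns.

```python
# (SOTA, L-Lens-Eng) scores copied from the results table.
# ROWS and delta() are hypothetical names used only for this illustration.
ROWS = {
    "hate-speech-detection": (0.639, 0.963),
    "Hindi-Hostility-Detection-CONSTRAINT-2021": (0.841, 0.753),
    "xlsum": (0.136, 0.171),
}


def delta(sota, llens_eng):
    """Return L-Lens-Eng minus SOTA, rounded to three decimals like the table."""
    return round(llens_eng - sota, 3)


for name, (sota, llens_eng) in ROWS.items():
    # A positive value means LlamaLens scores above the reported SOTA.
    print(f"{name}: {delta(sota, llens_eng):+.3f}")
```

A negative Δ, as for Hindi-Hostility-Detection-CONSTRAINT-2021, means the English-instructed model falls short of the reported SOTA on that dataset.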
| **Task** | **Dataset** | **Metric** | **SOTA** | **Base** | **L-Lens-Eng** | **L-Lens-Native** | **Δ (L-Lens (Eng) - SOTA)** |
|:----------------------------------:|:--------------------------------------------:|:----------:|:--------:|:---------------------:|:---------------------:|:--------------------:|:------------------------:|
| Factuality | fake-news | Mi-F1 | -- | 0.759 | 0.994 | 0.993 | -- |
| Hate Speech Detection | hate-speech-detection | Mi-F1 | 0.639 | 0.750 | 0.963 | 0.963 | 0.324 |
| Hate Speech Detection | Hindi-Hostility-Detection-CONSTRAINT-2021 | W-F1 | 0.841 | 0.469 | 0.753 | 0.753 | -0.088 |
| Natural Language Inference | Natural Language Inference | W-F1 | 0.646 | 0.633 | 0.568 | 0.679 | -0.078 |
| News Summarization | xlsum | R-2 | 0.136 | 0.078 | 0.171 | 0.170 | 0.035 |
| Offensive Language Detection | Offensive Speech Detection | Mi-F1 | 0.723 | 0.621 | 0.862 | 0.865 | 0.139 |
| Cyberbullying Detection | MC_Hinglish1 | Acc | 0.609 | 0.233 | 0.625 | 0.627 | 0.016 |
| Sentiment Classification | Sentiment Analysis | Acc | 0.697 | 0.552 | 0.647 | 0.654 | -0.050 |

## File Format

Each JSONL file in the dataset follows a structured format with the following fields:

- `id`: Unique identifier for each data entry.
- `original_id`: Identifier from the original dataset, if available.
- `input`: The original text that needs to be analyzed.
- `output`: The label assigned to the text after analysis.
- `dataset`: Name of the dataset the entry belongs to.
- `task`: The specific task type.
- `lang`: The language of the input text.
- `instructions`: A brief set of instructions describing how the text should be labeled.

**Example entry in JSONL file:**

```
{
  "id": "5486ee85-4a70-4b33-8711-fb2a0b6d81e1",
  "original_id": null,
  "input": "आप और बाकी सभी मुसलमान समाज के लिए आशीर्वाद हैं.",
  "output": "not-hateful",
  "dataset": "hate-speech-detection",
  "task": "Factuality",
  "lang": "hi",
  "instructions": "Classify the given text as either 'not-hateful' or 'hateful'. Return only the label without any explanation, justification, or additional text."
}
```

## Model

[**LlamaLens on Hugging Face**](https://huggingface.co/QCRI/LlamaLens)

## Replication Scripts

[**LlamaLens GitHub Repository**](https://github.com/firojalam/LlamaLens)

## 📢 Citation

If you use this dataset, please cite our [paper](https://arxiv.org/pdf/2410.15308):

```
@article{kmainasi2024llamalensspecializedmultilingualllm,
  title={LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content},
  author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Maram Hasanain and Sahinur Rahman Laskar and Naeemul Hassan and Firoj Alam},
  year={2024},
  journal={arXiv preprint arXiv:2410.15308},
  volume={},
  number={},
  pages={},
  url={https://arxiv.org/abs/2410.15308},
  eprint={2410.15308},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
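The record layout described under File Format can be read with the Python standard library alone. The following is a minimal sketch, not the project's own loader: `read_jsonl`, `label_counts`, and the temporary `sample.jsonl` file are hypothetical names introduced here, and the check assumes every record carries all eight documented fields.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Field names as listed in the File Format section.
EXPECTED_FIELDS = {
    "id", "original_id", "input", "output",
    "dataset", "task", "lang", "instructions",
}


def read_jsonl(path):
    """Yield one dict per non-empty line, checking the documented fields."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = EXPECTED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing {sorted(missing)}")
            yield record


def label_counts(records):
    """Count how often each `output` label occurs."""
    return Counter(record["output"] for record in records)


if __name__ == "__main__":
    # Write the example entry from the README to a temporary file and read it back.
    sample = {
        "id": "5486ee85-4a70-4b33-8711-fb2a0b6d81e1",
        "original_id": None,
        "input": "आप और बाकी सभी मुसलमान समाज के लिए आशीर्वाद हैं.",
        "output": "not-hateful",
        "dataset": "hate-speech-detection",
        "task": "Factuality",
        "lang": "hi",
        "instructions": "Classify the given text as either 'not-hateful' or 'hateful'.",
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.jsonl"
        path.write_text(json.dumps(sample, ensure_ascii=False) + "\n", encoding="utf-8")
        print(label_counts(read_jsonl(path)))
```

Reading line by line keeps memory use flat for the larger splits such as xlsum, and `ensure_ascii=False` preserves the Devanagari text when records are written back out.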


Use cases: