ClusterlabAi/InstAr-500k
Source: Hugging Face. Updated 2024-07-30; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/ClusterlabAi/InstAr-500k
Overview:
InstAr-500k is a dataset of almost 500,000 Arabic instructions and responses, designed for fine-tuning large language models (LLMs) on Arabic NLP tasks, with the aim of improving LLM performance on Arabic-specific tasks. It combines synthetic and human-crafted data across multiple domains and instruction types. Each record has the features instruction, output, source, task, system, uuid, and topic. The dataset is provided as a single train split of 481,281 examples, with a total size of 1,090,145,730 bytes, which places it in the 100K to 1M size category. Its language is Arabic, and the supported task categories are question-answering, summarization, and text-classification. It is licensed under Apache-2.0.
Provider:
ClusterlabAi
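
The record layout described in the overview can be checked with a few lines of Python. This is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` library is installed, that the repository id is `ClusterlabAi/InstAr-500k` as listed above, and that the column names match the feature names in the overview exactly. The helper names `missing_fields` and `load_instar` are hypothetical and introduced only for illustration.

```python
# Feature names as listed in the overview; assumed to be the exact column names.
EXPECTED_FIELDS = ("instruction", "output", "source", "task", "system", "uuid", "topic")


def missing_fields(record):
    """Return the expected feature names that are absent from a record (a dict)."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def load_instar(split="train", streaming=True):
    """Open the dataset from the Hugging Face Hub (requires network access).

    With streaming=True, records are read lazily, so the full
    1,090,145,730 bytes are not downloaded up front.
    """
    # Imported lazily so missing_fields can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("ClusterlabAi/InstAr-500k", split=split, streaming=streaming)
```

A typical use would be to pull one record with `next(iter(load_instar()))` and pass it to `missing_fields`; an empty list would indicate that the record carries all seven features named in the overview.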



