BHOSAI/Translated_English_Wikipedia_on_Azerbaijani

Name: BHOSAI/Translated_English_Wikipedia_on_Azerbaijani
Creator: BHOSAI
Published: 2024-04-24 12:13:04
License: CC-BY-SA-4.0

Source: Hugging Face. Updated: 2024-04-24. Indexed: 2024-04-19.

Download link:

https://hf-mirror.com/datasets/BHOSAI/Translated_English_Wikipedia_on_Azerbaijani

Resource overview:

This dataset, Translated English Wikipedia Dataset to Azerbaijani, was created by the Artificial Intelligence Research and Development Center of Baku Higher Oil School. Because Azerbaijani-language resources are scarce, it makes the knowledge in English Wikipedia available in Azerbaijani through machine translation with well-known translation models. The translated text is therefore synthetic, though the creators describe its quality as high. The dataset is intended mainly for pre-training Large Language Models (LLMs) to build foundation models. Its source is the wiki40B repository: the creators extracted the training split and translated it with a 1.3B translation model running on 4 RTX4090 GPUs. The source dataset contains 2.5 million articles, of which 280,000 were translated for the v1.1 release.

Provided by:

BHOSAI

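Summary of original information

Dataset overview

Dataset name

Translated English Wikipedia Dataset to Azerbaijani

Dataset source and production

Source: the dataset is derived from the wiki40B repository.
Production: translation was carried out with a 1.3B translation model on 4 RTX4090 GPUs.
Translated content: 280k of the 2.5M articles in the source dataset were translated; this release is v1.1.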
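
The article counts above imply how much of the source the v1.1 release covers. A minimal sketch of that arithmetic follows, assuming the rounded figures quoted on this page (2.5M and 280k) rather than exact counts; the constant and function names are illustrative only.

```python
# Rounded figures quoted on this page; the exact counts may differ slightly.
SOURCE_ARTICLES = 2_500_000
TRANSLATED_ARTICLES = 280_000


def coverage(translated: int, total: int) -> float:
    """Return the fraction of source articles that were translated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return translated / total


share = coverage(TRANSLATED_ARTICLES, SOURCE_ARTICLES)
print(f"v1.1 covers about {share:.1%} of the source articles")
```

With these figures, v1.1 covers roughly 11% of the wiki40B training articles.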

Dataset usage

Primarily intended for pre-training Large Language Models (LLMs) to build foundation models.
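
For readers who want to inspect the corpus before using it for pre-training, one possible approach is to stream a few records with the Hugging Face `datasets` library. The sketch below is not taken from the dataset card. The split name `"train"` is an assumption, and `stream_records` is a hypothetical helper. The column names are not documented on this page, so the code does not rely on any of them.

```python
from typing import Any, Dict, Iterator

# Repository id as listed on this page.
REPO_ID = "BHOSAI/Translated_English_Wikipedia_on_Azerbaijani"


def stream_records(split: str = "train", limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` records without downloading the full dataset.

    Assumption: the repository exposes a split named "train"; consult the
    dataset card if loading fails.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(REPO_ID, split=split, streaming=True)
    for index, record in enumerate(stream):
        if index >= limit:
            break
        yield record
```

Iterating over `stream_records()` and printing `sorted(record.keys())` is a quick way to discover the actual column names. This requires network access and the `datasets` package.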

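Dataset characteristics

Translation quality: the translated text is synthetic, but the creators describe the quality as high.
Language: English Wikipedia translated into Azerbaijani.

License

The dataset is released under the CC-BY-SA-4.0 license.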
