
LocalDoc/climbmix-40b-az

Hugging Face | Updated 2026-03-23 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/LocalDoc/climbmix-40b-az
Resource description:
---
language:
- az
license: mit
task_categories:
- text-generation
- fill-mask
pretty_name: ClimbMix 40B Azerbaijani
size_categories:
- 10M<n<100M
source_datasets:
- karpathy/climbmix-400b-shuffle
tags:
- azerbaijani
- machine-translation
- pretraining
- llm
---

# ClimbMix 40B - Azerbaijani

A large-scale Azerbaijani text dataset created by translating the English [karpathy/climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle) dataset into Azerbaijani using Google Translate.

## Dataset Summary

This dataset contains approximately **40 billion tokens** of Azerbaijani text, making it one of the largest publicly available Azerbaijani language corpora. It is intended for pretraining and fine-tuning large language models (LLMs) for the Azerbaijani language.

| Property         | Value                              |
|------------------|------------------------------------|
| Language         | Azerbaijani (`az`)                 |
| Size             | ~40B tokens                        |
| Format           | Parquet                            |
| Number of files  | 348                                |
| Rows per file    | ~85,000                            |
| Source language  | English                            |
| Translation      | Google Translate (en → az)         |
| Source dataset   | karpathy/climbmix-400b-shuffle     |
| License          | MIT                                |

## Dataset Structure

Each row contains a single field:

| Column | Type   | Description           |
|--------|--------|-----------------------|
| `text` | string | Azerbaijani text body |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("LocalDoc/climbmix-40b-az")
print(ds)
```

Streaming mode (recommended for large-scale use):

```python
from datasets import load_dataset

ds = load_dataset("LocalDoc/climbmix-40b-az", streaming=True)
for sample in ds["train"]:
    print(sample["text"])
    break
```

## Source Data

The original [climbmix-400b-shuffle](https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle) dataset was created by Andrej Karpathy and contains a diverse mixture of English web text covering a wide range of topics including science, technology, history, literature, Q&A, and more.

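For a quick look at the data without downloading all 348 Parquet files, the streaming usage from the Usage section can be wrapped in a small helper. This is a minimal sketch and not part of the dataset card: `peek_texts` and `preview` are hypothetical names, and the code assumes the `train` split and the `text` column described in this card.

```python
from itertools import islice
from typing import Dict, Iterable, List


def peek_texts(rows: Iterable[Dict[str, str]], n: int = 3) -> List[str]:
    """Return the `text` field of the first `n` rows of any row iterable."""
    # islice stops after n rows, so a streaming dataset is never fully consumed.
    return [row["text"] for row in islice(rows, n)]


def preview(n: int = 3) -> None:
    """Stream the first `n` rows from the Hub and print an excerpt of each."""
    # Imported here so peek_texts stays usable without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset("LocalDoc/climbmix-40b-az", streaming=True)
    for text in peek_texts(ds["train"], n):
        print(len(text), "chars:", text[:200])
```

Calling `preview()` requires network access and the `datasets` package. `peek_texts` also works on any local iterable of `{"text": ...}` rows, which makes it easy to try offline.
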
## Translation

All text was translated from English to Azerbaijani using **Google Translate** via automated batch processing. While machine translation introduces certain limitations (see below), the scale of this dataset makes it a valuable resource for Azerbaijani NLP given the scarcity of large native corpora.

## Limitations

- **Machine translation quality**: Text was translated automatically and may contain unnatural phrasing, calques from English, or translation errors, especially for idiomatic expressions and domain-specific terminology.
- **Cultural context**: Some content may not reflect native Azerbaijani cultural context, as it originates from English-language sources.
- **Agglutinative morphology**: Azerbaijani is an agglutinative language with complex suffix structures; machine translation does not always handle these correctly.
- **Recommended use**: Suitable for LLM pretraining and vocabulary learning. Not recommended as the sole resource for high-precision fine-tuning tasks.

## Citation

If you use this dataset in your research, please cite both this dataset and the original source:

```bibtex
@misc{climbmix40b-az,
  title        = {ClimbMix 40B Azerbaijani},
  author       = {LocalDoc},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/LocalDoc/climbmix-40b-az}},
  note         = {Machine-translated from karpathy/climbmix-400b-shuffle}
}
```

## License

This dataset is released under the [MIT License](https://opensource.org/licenses/MIT), consistent with the license of the original source dataset.

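Given the machine-translation caveats listed under Limitations, users may want a cheap sanity filter before training. The sketch below is one possible heuristic, not something the dataset authors describe: it measures how many alphabetic characters are letters of the Azerbaijani Latin alphabet that do not occur in the English alphabet (such as `ə`, `ğ`, `ı`) and flags rows where that share is very low, which may indicate text left in English. The names `az_letter_ratio` and `looks_untranslated` and the `0.01` threshold are assumptions chosen for illustration and are untuned.

```python
# Letters of the Azerbaijani Latin alphabet that are absent from the English alphabet.
AZ_SPECIFIC = set("əğıöüçşƏĞİÖÜÇŞ")


def az_letter_ratio(text: str) -> float:
    """Share of alphabetic characters that are Azerbaijani-specific letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0  # no letters at all (e.g. numeric-only rows)
    return sum(ch in AZ_SPECIFIC for ch in letters) / len(letters)


def looks_untranslated(text: str, min_ratio: float = 0.01) -> bool:
    """Flag text whose ratio is below `min_ratio` (an arbitrary, untuned threshold)."""
    return az_letter_ratio(text) < min_ratio


if looks_untranslated("The quick brown fox"):
    print("flagged: possibly untranslated")
```

Rows without any letters are also flagged, and a high ratio says nothing about translation quality, so this is only a coarse first pass.
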
Provider:
LocalDoc