
karana657/multilingual-nanochat

Source: Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/karana657/multilingual-nanochat
Description:
# Multilingual Dataset (English + Hindi)

## Dataset Description

This dataset contains text data in English and Hindi, prepared for language model training.

## Statistics

- **Total Shards**: 24
- **English Shards**: 12
- **Hindi Shards**: 12
- **Total Size**: 2.63 GB
- **Mixing Strategy**: random

## Shard Format

- Format: Parquet files with zstd compression
- Schema: Single 'text' column containing the text data
- Row Group Size: 1024 documents per row group
- Compression: zstd level 3

## Language Distribution

- English: 1.05 GB (~50.0% of shards)
- Hindi: 1.58 GB (~50.0% of shards)

## Usage

```python
from datasets import load_dataset

# Load the entire dataset
dataset = load_dataset("parquet", data_files="*.parquet")

# Load specific shards
dataset = load_dataset("parquet", data_files=["shard_00000.parquet", "shard_00001.parquet"])
```

## Mixing Strategies

- **interleave**: Shards alternate between English and Hindi
- **random**: All shards are randomly shuffled
- **sequential**: All English shards first, then all Hindi shards
- **ratio:X:Y**: X English shards for every Y Hindi shards
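
The four mixing strategies listed above can be illustrated with a short sketch. This is not the dataset author's code: the function name `mix_shards`, its parameters, the `seed` argument, and the handling of leftover shards when one language runs out are all assumptions made for illustration. It only orders lists of shard names and uses the standard library.

```python
import random
from typing import List, Optional, Sequence


def mix_shards(
    english: Sequence[str],
    hindi: Sequence[str],
    strategy: str = "random",
    seed: Optional[int] = None,
) -> List[str]:
    """Return one ordering of shard names for the given strategy (hypothetical helper)."""
    if strategy == "sequential":
        # All English shards first, then all Hindi shards.
        return list(english) + list(hindi)

    if strategy == "random":
        # Shuffle every shard together; `seed` (an assumption) makes it repeatable.
        mixed = list(english) + list(hindi)
        random.Random(seed).shuffle(mixed)
        return mixed

    if strategy == "interleave":
        # Alternating one-by-one is treated here as the 1:1 ratio case.
        x, y = 1, 1
    elif strategy.startswith("ratio:"):
        parts = strategy.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected 'ratio:X:Y', got {strategy!r}")
        x, y = int(parts[1]), int(parts[2])
        if x <= 0 or y <= 0:
            raise ValueError("X and Y must be positive integers")
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Emit X English shards, then Y Hindi shards, until both lists are used up.
    # Assumption: once one language is exhausted, the other's remaining shards
    # are simply appended in order.
    mixed: List[str] = []
    i = j = 0
    while i < len(english) or j < len(hindi):
        mixed.extend(english[i:i + x])
        i += x
        mixed.extend(hindi[j:j + y])
        j += y
    return mixed


if __name__ == "__main__":
    en = [f"en_{k}" for k in range(4)]
    hi = [f"hi_{k}" for k in range(2)]
    for name in ("interleave", "sequential", "ratio:2:1", "random"):
        print(name, mix_shards(en, hi, name, seed=0))
```

Treating `interleave` as `ratio:1:1` keeps the sketch small; the actual tool may handle uneven shard counts differently.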
Provider:
karana657