karana657/multilingual-nanochat
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/karana657/multilingual-nanochat
Description:
# Multilingual Dataset (English + Hindi)
## Dataset Description
This dataset contains text data in English and Hindi, prepared for language model training.
## Statistics
- **Total Shards**: 24
- **English Shards**: 12
- **Hindi Shards**: 12
- **Total Size**: 2.63 GB
- **Mixing Strategy**: random
## Shard Format
- Format: Parquet files with zstd compression
- Schema: Single 'text' column containing the text data
- Row Group Size: 1024 documents per row group
- Compression: zstd level 3
## Language Distribution
- English: 1.05 GB (12 of 24 shards, 50.0%; about 40% of total size)
- Hindi: 1.58 GB (12 of 24 shards, 50.0%; about 60% of total size)
## Usage
```python
from datasets import load_dataset
# Load the entire dataset (every Parquet shard in the current directory)
dataset = load_dataset("parquet", data_files="*.parquet")
# Load only specific shards
subset = load_dataset("parquet", data_files=["shard_00000.parquet", "shard_00001.parquet"])
```
## Mixing Strategies
- **interleave**: Shards alternate between English and Hindi
- **random**: All shards are randomly shuffled
- **sequential**: All English shards first, then all Hindi shards
- **ratio:X:Y**: X English shards for every Y Hindi shards
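
The four strategies above can be sketched as a single shard-ordering function. This is a minimal illustration of the descriptions in the list, not the code used to build this dataset. The function name `order_shards`, the `seed` parameter, the shard names, and the handling of leftover shards when one language runs out are all assumptions.

```python
import random


def order_shards(english, hindi, strategy="random", seed=0):
    """Return one ordered list of shard names for a mixing strategy.

    Hypothetical helper: `strategy` is one of "interleave", "random",
    "sequential", or "ratio:X:Y" with positive integers X and Y.
    """
    en, hi = list(english), list(hindi)

    if strategy == "sequential":
        # All English shards first, then all Hindi shards.
        return en + hi

    if strategy == "random":
        # Shuffle all shards together; the seed makes the order repeatable.
        mixed = en + hi
        random.Random(seed).shuffle(mixed)
        return mixed

    if strategy == "interleave":
        x, y = 1, 1  # alternate one English shard, one Hindi shard
    elif strategy.startswith("ratio:"):
        _, x_str, y_str = strategy.split(":")
        x, y = int(x_str), int(y_str)
        if x <= 0 or y <= 0:
            raise ValueError("ratio parts must be positive integers")
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Take X English shards, then Y Hindi shards, until both lists are
    # used up. Leftovers of the longer list are appended at the end
    # (an assumption; the card does not specify this case).
    mixed = []
    i = j = 0
    while i < len(en) or j < len(hi):
        mixed.extend(en[i:i + x])
        i += x
        mixed.extend(hi[j:j + y])
        j += y
    return mixed


# Hypothetical shard names, for illustration only.
en_shards = ["en_0", "en_1", "en_2", "en_3"]
hi_shards = ["hi_0", "hi_1"]

print(order_shards(en_shards, hi_shards, "ratio:2:1"))
# ['en_0', 'en_1', 'hi_0', 'en_2', 'en_3', 'hi_1']
```

Treating `interleave` as the special case `ratio:1:1` keeps the alternating logic in one place.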
Provider:
karana657



