damerajee/long_context_hindi
Hugging Face | updated 2024-05-06 | indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/damerajee/long_context_hindi
Resource description:
---
dataset_info:
  features:
  - name: doc_id
    dtype: string
  - name: type
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 25324509618
    num_examples: 806930
  download_size: 9419131940
  dataset_size: 25324509618
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- text-generation
language:
- hi
- en
pretty_name: long_context
size_categories:
- 100K<n<1M
---
# Dataset
This dataset was filtered from the AI4Bharat dataset [sangraha](https://huggingface.co/datasets/ai4bharat/sangraha), which is the largest high-quality, cleaned Indic language pretraining dataset, containing 251B tokens across 22 languages, extracted from curated sources, existing multilingual corpora, and large-scale translations.
This dataset currently contains only Hindi.
# Information
* This dataset is intended mainly for long-context training.
* The minimum document length is `6000` and the maximum is `3754718`.
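
The length bounds above can be re-checked on any subset of records with a few lines of Python. The sketch below assumes that "length" means the number of characters in the `text` field, which the card does not state explicitly. The helper name `text_length_range` and the sample records are hypothetical and only mirror the dataset's three string fields.

```python
from typing import Iterable, Mapping, Tuple


def text_length_range(records: Iterable[Mapping[str, str]]) -> Tuple[int, int]:
    """Return (shortest, longest) length of the `text` field, in characters."""
    shortest = longest = None
    for record in records:
        n = len(record["text"])  # assumption: length is counted in characters
        shortest = n if shortest is None else min(shortest, n)
        longest = n if longest is None else max(longest, n)
    if shortest is None:
        raise ValueError("no records given")
    return shortest, longest


# Hypothetical stand-in records; real rows come from the dataset itself.
sample = [
    {"doc_id": "a", "type": "example", "text": "क" * 6000},
    {"doc_id": "b", "type": "example", "text": "क" * 7500},
]
print(text_length_range(sample))  # (6000, 7500)
```

The function consumes any iterable of records in a single pass, so it does not need the full dataset in memory.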
# Getting started
To download the entire dataset:
```python
from datasets import load_dataset
dataset = load_dataset("damerajee/long_context_hindi")
```
If the dataset is too large to download, you can stream it instead:
```python
from datasets import load_dataset
dataset = load_dataset("damerajee/long_context_hindi", split="train", streaming=True)
```
```python
list(dataset.take(2))  # take() is lazy on a streamed dataset; list() materializes the first two examples
```
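
Because every document is long, printing whole records is rarely useful. The following is a minimal sketch of a preview helper that shows only the identifier, the type, the text length, and the first few characters of each record. The name `preview`, its parameters, and the sample records are hypothetical; the helper relies only on the three fields listed in the card (`doc_id`, `type`, `text`).

```python
from itertools import islice
from typing import Dict, Iterable, List, Mapping, Union


def preview(
    records: Iterable[Mapping[str, str]], n: int = 2, width: int = 80
) -> List[Dict[str, Union[str, int]]]:
    """Summarize the first `n` records without keeping their full text."""
    rows = []
    for record in islice(records, n):  # stops after n items, so streams stay lazy
        text = record["text"]
        rows.append(
            {
                "doc_id": record["doc_id"],
                "type": record["type"],
                "chars": len(text),    # length in characters
                "head": text[:width],  # first `width` characters only
            }
        )
    return rows


# Hypothetical stand-in records with the same fields as the dataset.
fake_stream = iter(
    [
        {"doc_id": "doc-1", "type": "example", "text": "अ" * 6000},
        {"doc_id": "doc-2", "type": "example", "text": "आ" * 9000},
        {"doc_id": "doc-3", "type": "example", "text": "इ" * 12000},
    ]
)
for row in preview(fake_stream, n=2, width=10):
    print(row["doc_id"], row["type"], row["chars"], row["head"])
```

A streamed dataset is itself iterable, so passing the streaming `dataset` from the example above in place of `fake_stream` should work the same way.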
Provider:
damerajee
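Summary of original information
Dataset overview
Basic information
- Name: long_context
- Languages:
  - Hindi (hi)
  - English (en)
- Task category: text generation
- Size category: 100K<n<1M
- License: cc-by-4.0
Dataset structure
- Features:
  - doc_id: string
  - type: string
  - text: string
Data splits
- Train:
  - Number of examples: 806930
  - Bytes: 25324509618
Dataset size
- Download size: 9419131940
- Dataset size: 25324509618
Configuration
- Default configuration:
  - Data file path: data/train-*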



