duongttr/vi-dataset-for-pretrain
Source: Hugging Face. Updated: 2023-08-02. Indexed: 2024-03-04.
Download link: https://hf-mirror.com/datasets/duongttr/vi-dataset-for-pretrain
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 77360702833
    num_examples: 23891116
  - name: validation
    num_bytes: 4064634081
    num_examples: 1257428
  download_size: 2126869688
  dataset_size: 81425336914
task_categories:
- text-generation
language:
- vi
size_categories:
- 10M<n<100M
tags:
- LM
---
# Dataset Card for "vi-dataset-for-pretrain"
This dataset combines multiple Vietnamese datasets for pretraining causal language models (CLMs) such as GPT and GPT-2.
The dataset consists of:
- [`vietgpt/covid_19_news_vi`](https://huggingface.co/datasets/vietgpt/covid_19_news_vi)
- [`hieunguyen1053/binhvq-news-corpus`](https://huggingface.co/datasets/hieunguyen1053/binhvq-news-corpus)
- [`oscar (unshuffled_deduplicated_vi)`](https://huggingface.co/datasets/oscar)
- [`vietgpt/wikipedia_vi`](https://huggingface.co/datasets/vietgpt/wikipedia_vi)
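Since the full dataset is large, it can help to peek at a few rows before committing to a download. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper name `preview_texts` is hypothetical and not part of the dataset or the library.

```python
from itertools import islice

# Repository name as given on this card.
REPO_ID = "duongttr/vi-dataset-for-pretrain"


def preview_texts(n: int = 3, split: str = "validation") -> list:
    """Return the first `n` values of the `text` column using streaming.

    `preview_texts` is an illustrative helper, not an official API.
    """
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields rows lazily instead of downloading the whole split.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return [row["text"] for row in islice(stream, n)]
```

Calling `preview_texts(3)` should return three raw text strings from the validation split; pass `split="train"` to sample the training split instead.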
# Dataset info
| Split | No. of examples | Size |
| --- | --- | --- |
| Train | 23,891,116 | 77.36 GB |
| Validation | 1,257,428 | 4.06 GB |
| **Total** | **25,148,544** | **81.43 GB** |
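The figures in the table can be cross-checked against the `num_bytes` and `num_examples` values in the card metadata. The short sketch below assumes sizes are reported in decimal gigabytes (1 GB = 10^9 bytes), which is consistent with the table; the variable and function names are illustrative only.

```python
# Values copied from the dataset_info metadata of this card.
SPLITS = {
    "train": {"num_bytes": 77_360_702_833, "num_examples": 23_891_116},
    "validation": {"num_bytes": 4_064_634_081, "num_examples": 1_257_428},
}


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (assumed 1 GB = 10**9 bytes)."""
    return round(num_bytes / 1e9, 2)


total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())
# Share of all examples that fall in the validation split.
val_fraction = SPLITS["validation"]["num_examples"] / total_examples

for name, s in SPLITS.items():
    print(f"{name}: {s['num_examples']:,} examples, {to_gb(s['num_bytes'])} GB")
print(f"total: {total_examples:,} examples, {to_gb(total_bytes)} GB")
print(f"validation share: {val_fraction:.1%}")
```

The summed byte count equals the `dataset_size` field in the metadata, and the validation split works out to roughly 5% of all examples.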
Provider: duongttr
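Summary of original information
Dataset overview
Basic information
- Name: vi-dataset-for-pretrain
- Task category: text generation
- Language: Vietnamese
- Size category: 10M<n<100M
- Tags: LM
Dataset composition
- vietgpt/covid_19_news_vi
- hieunguyen1053/binhvq-news-corpus
- oscar (unshuffled_deduplicated_vi)
- vietgpt/wikipedia_vi
Dataset features
- Feature name: text
- Data type: string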
Dataset splits
| Split | No. of examples | Size |
|---|---|---|
| Train | 23,891,116 | 77.36 GB |
| Validation | 1,257,428 | 4.06 GB |
| Total | 25,148,544 | 81.43 GB |
Download information
- Download size: 2.13 GB
- Total dataset size: 81.43 GB



