duyhngoc/OV_Text
Source: Hugging Face. Updated: 2023-07-05. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/duyhngoc/OV_Text
Resource description:
---
annotations_creators:
- no-annotation
language:
- vi
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: OV_Text
size_categories:
- 10K<n<100K
task_categories:
- text-generation
---
# Dataset Card for OV_Text
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
The OV_Text dataset is a collection of Vietnamese sentences sourced from various news articles, with 100,000 sentences in its largest configuration (see the Data Splits table).
The card gives a length breakdown for 10,000 sentences, which matches the total size of the `base` configuration: 5,000 sentences have a length of 50 to 150, and the other 5,000 have a length of 20 to 50. This mix of shorter and longer sentences provides varied text samples for training and testing natural language processing models.
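As a rough illustration of the two length groups described above, the sketch below assigns a sentence to a bucket. The card does not say whether length is measured in words or characters, so counting whitespace-separated tokens is an assumption here. The names `length_bucket`, `SHORT`, and `LONG` are hypothetical, and the boundary value 50, which appears in both ranges, is arbitrarily placed in the longer group.

```python
from typing import Optional

# Length ranges quoted in the dataset description. The unit (words vs.
# characters) is not stated in the card; whitespace tokens are assumed here.
SHORT = (20, 50)   # assumed half-open: 20 <= n < 50
LONG = (50, 150)   # assumed closed: 50 <= n <= 150


def length_bucket(sentence: str) -> Optional[str]:
    """Return "short", "long", or None if the sentence fits neither range."""
    n = len(sentence.split())  # assumption: length = number of whitespace tokens
    if SHORT[0] <= n < SHORT[1]:
        return "short"
    if LONG[0] <= n <= LONG[1]:
        return "long"
    return None
```

Swapping `len(sentence.split())` for `len(sentence)` would give the character-based reading instead.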
### Dataset Summary
### Supported Tasks and Leaderboards
### Languages
## Dataset Structure
### Data Instances
### Data Fields
### Data Splits
| name | train | validation | test |
|---------|--------:|-----------:|-------:|
| small | 1600 | 200 | 200 |
| base | 8000 | 1000 | 1000 |
| large | 95000 | 2500 | 2500 |
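
The split sizes in the table can be checked with a few lines of Python. The sketch below restates the numbers from the table and computes each configuration's total and training fraction. The `load_config` helper is hypothetical: it assumes that the `name` column corresponds to configuration names accepted by `datasets.load_dataset`, which the card does not confirm.

```python
# Split sizes copied from the Data Splits table.
SPLITS = {
    "small": {"train": 1600, "validation": 200, "test": 200},
    "base": {"train": 8000, "validation": 1000, "test": 1000},
    "large": {"train": 95000, "validation": 2500, "test": 2500},
}


def summarize(splits):
    """Return {name: (total, train_fraction)} for each configuration."""
    summary = {}
    for name, sizes in splits.items():
        total = sum(sizes.values())
        summary[name] = (total, sizes["train"] / total)
    return summary


def load_config(name="base"):
    """Load one configuration from the Hub (needs `datasets` and network access).

    Assumption: "small", "base", and "large" are valid configuration names
    for this repository; the card does not state this explicitly.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("duyhngoc/OV_Text", name)


if __name__ == "__main__":
    for name, (total, train_frac) in summarize(SPLITS).items():
        print(f"{name}: {total} sentences, {train_frac:.0%} train")
```

The totals come out to 2,000, 10,000, and 100,000 sentences. The `small` and `base` configurations use an 80/10/10 split, while `large` keeps 95% of its sentences for training.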
## Dataset Creation
### Curation Rationale
### Source Data
### Annotations
## Additional Information
### Licensing Information
The dataset is released under the Apache 2.0 license.
### Citation Information
### Contributions
Provider:
duyhngoc
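Summary of original information
Dataset overview
Basic information
- Name: OV_Text
- Language: Vietnamese (vi)
- License: Apache-2.0
- Multilinguality: monolingual
- Size: 10,000 to 100,000 entries
- Task category: text generation
Dataset description
- Data source: 100,000 sentences from various news articles
- Sentence length distribution: 5,000 sentences with a length of 50 to 150, and another 5,000 sentences with a length of 20 to 50
Dataset structure
- Data splits:
  - small: 1,600 training, 200 validation, 200 test
  - base: 8,000 training, 1,000 validation, 1,000 test
  - large: 95,000 training, 2,500 validation, 2,500 test
License information
- License: Apache 2.0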



