duyhngoc/OV_Text

Name: duyhngoc/OV_Text
Creator: duyhngoc
Published: 2023-07-05 04:59:06
License: Apache-2.0

Source: Hugging Face. Updated 2023-07-05; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/duyhngoc/OV_Text

Resource description:

---
annotations_creators:
- no-annotation
language:
- vi
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: OV_Text
size_categories:
- 10K<n<100K
task_categories:
- text-generation
---

# Dataset Card for OV_Text

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

The OV_Text dataset is a collection of 100,000 sentences sourced from various news articles. Out of 10,000 sentences (the size of the `base` configuration in the splits table), 5,000 have a length between 50 and 150, and the other 5,000 have a length between 20 and 50. This mix of sentence lengths provides a varied set of text samples for training and testing natural language processing models.

### Dataset Summary

### Supported Tasks and Leaderboards

### Languages

## Dataset Structure

### Data Instances

### Data Fields

### Data Splits

| name  |  train | validation |  test |
|-------|-------:|-----------:|------:|
| small |  1,600 |        200 |   200 |
| base  |  8,000 |      1,000 | 1,000 |
| large | 95,000 |      2,500 | 2,500 |

## Dataset Creation

### Curation Rationale

### Source Data

### Annotations

## Additional Information

### Licensing Information

The dataset is released under Apache 2.0.

### Citation Information

### Contributions

Provided by:

duyhngoc

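Summary of original information

Dataset overview

Basic information

Name: OV_Text
Language: Vietnamese (vi)
License: Apache-2.0
Multilinguality: monolingual
Size: 10,000 to 100,000 examples
Task category: text generation

Dataset description

Data source: 100,000 sentences from various news articles
Sentence length distribution: 5,000 sentences with a length of 50 to 150, and another 5,000 with a length of 20 to 50

Dataset structure

Data splits:
- small: 1,600 training, 200 validation, 200 test
- base: 8,000 training, 1,000 validation, 1,000 test
- large: 95,000 training, 2,500 validation, 2,500 test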
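
The split sizes listed above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original card: it totals each configuration and computes the share of each split. The helper names (`SPLIT_SIZES`, `total_size`, `split_fractions`, `load_config`) are hypothetical. `load_config` additionally assumes that the names `small`, `base`, and `large` are valid configuration names for this repository when passed to `datasets.load_dataset`, which the card does not confirm.

```python
# Split sizes copied from the data splits listed above.
SPLIT_SIZES = {
    "small": {"train": 1600, "validation": 200, "test": 200},
    "base": {"train": 8000, "validation": 1000, "test": 1000},
    "large": {"train": 95000, "validation": 2500, "test": 2500},
}


def total_size(config: str) -> int:
    """Return the total number of sentences across all splits of a configuration."""
    return sum(SPLIT_SIZES[config].values())


def split_fractions(config: str) -> dict[str, float]:
    """Return each split's share of the configuration as a fraction of its total."""
    total = total_size(config)
    return {split: count / total for split, count in SPLIT_SIZES[config].items()}


def load_config(config: str = "small"):
    """Load one configuration from the Hugging Face Hub.

    Assumption: the names in the splits list are valid configuration names
    for this repository. Requires the third-party `datasets` package and
    network access.
    """
    # Imported lazily so the arithmetic above runs without the package installed.
    from datasets import load_dataset

    return load_dataset("duyhngoc/OV_Text", config)


if __name__ == "__main__":
    for name in SPLIT_SIZES:
        shares = ", ".join(
            f"{split} {share:.1%}" for split, share in split_fractions(name).items()
        )
        print(f"{name}: {total_size(name)} sentences ({shares})")
```

The totals come to 2,000, 10,000, and 100,000 sentences. `small` and `base` follow an 80/10/10 split, while `large` follows 95/2.5/2.5.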

License information

License: Apache 2.0
