
duyhngoc/OV_Text

Source: Hugging Face. Updated: 2023-07-05. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/duyhngoc/OV_Text
Resource description:

---
annotations_creators:
- no-annotation
language:
- vi
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: OV_Text
size_categories:
- 10K<n<100K
task_categories:
- text-generation
---

# Dataset Card for OV_Text

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

The OV_Text dataset is a collection of 100,000 sentences sourced from various news articles. Of 10,000 sentences in the dataset, 5,000 have a length of 50 to 150, and the other 5,000 have a length of 20 to 50. This mix of sentence lengths provides varied text samples for training and testing natural language processing models.

### Dataset Summary

### Supported Tasks and Leaderboards

### Languages

## Dataset Structure

### Data Instances

### Data Fields

### Data Splits

| name  | train | validation | test |
|-------|------:|-----------:|-----:|
| small |  1600 |        200 |  200 |
| base  |  8000 |       1000 | 1000 |
| large | 95000 |       2500 | 2500 |

## Dataset Creation

### Curation Rationale

### Source Data

### Annotations

## Additional Information

### Licensing Information

The dataset is released under Apache 2.0.

### Citation Information

### Contributions
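The card does not show how to load the data. A minimal sketch of one way to try it with the Hugging Face `datasets` library follows. It assumes that the names in the Data Splits table (`small`, `base`, `large`) are valid configuration names for the repository, which the card does not state; whether the repository loads this way also depends on how it is packaged and on the installed `datasets` version. The helpers `load_ov_text` and `check_split_sizes` are hypothetical names introduced here for illustration.

```python
def load_ov_text(config="base"):
    """Load one OV_Text configuration from the Hugging Face Hub.

    ASSUMPTION: "small", "base" and "large" (the names in the Data Splits
    table) are accepted as configuration names; the card does not confirm it.
    """
    # Imported lazily so the helper below can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset("duyhngoc/OV_Text", config)


def check_split_sizes(dataset, expected):
    """Compare split sizes against expected row counts.

    `dataset` is any mapping from split name to a sized collection
    (a `DatasetDict` behaves this way). Returns a list of
    (split, expected_rows, actual_rows) tuples for every mismatch;
    actual_rows is None when the split is missing.
    """
    mismatches = []
    for split, want in expected.items():
        got = len(dataset[split]) if split in dataset else None
        if got != want:
            mismatches.append((split, want, got))
    return mismatches
```

A possible use, not run here, would be `check_split_sizes(load_ov_text("base"), {"train": 8000, "validation": 1000, "test": 1000})`, which returns an empty list if the downloaded splits match the table.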
Provider:
duyhngoc
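Summary of Original Information

Dataset Overview

Basic Information

  • Name: OV_Text
  • Language: Vietnamese (vi)
  • License: Apache-2.0
  • Multilinguality: monolingual
  • Size: 10,000 to 100,000 entries
  • Task category: text generation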

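Dataset Description

  • Data source: 100,000 sentences drawn from various news articles
  • Sentence length distribution: 5,000 sentences with a length of 50 to 150, and another 5,000 with a length of 20 to 50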
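The two length ranges above can be sketched as a small bucketing function. This is an illustration under stated assumptions, not the authors' procedure: the source does not say whether length is counted in words, characters, or tokens, so the unit is a parameter here, and because both ranges include 50, this sketch arbitrarily places a length of exactly 50 in the shorter range. The name `length_bucket` is hypothetical.

```python
def length_bucket(sentence, unit="words"):
    """Assign a sentence to the 20-50 or 50-150 length range.

    ASSUMPTION: the unit of length is not given in the source, so both
    whitespace-separated words ("words") and characters ("chars") are offered.
    ASSUMPTION: a length of exactly 50 is placed in the shorter range.
    Returns None when the length falls outside both ranges.
    """
    n = len(sentence.split()) if unit == "words" else len(sentence)
    if 20 <= n <= 50:
        return "short (20-50)"
    if 50 < n <= 150:
        return "long (50-150)"
    return None
```

Counting buckets over a list of sentences, for example with `collections.Counter(length_bucket(s) for s in sentences)`, gives a quick way to compare a sample against the stated 5,000 / 5,000 distribution.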

Dataset Structure

  • Data splits:
    • small: 1600 training, 200 validation, 200 test
    • base: 8000 training, 1000 validation, 1000 test
    • large: 95000 training, 2500 validation, 2500 test
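As a quick arithmetic check on the split sizes listed above, the following sketch totals each configuration and computes the share of each split. The numbers are copied from the list; the names `SPLITS` and `summarize` are introduced here only for illustration.

```python
# Split sizes copied from the data splits listed above.
SPLITS = {
    "small": {"train": 1600, "validation": 200, "test": 200},
    "base": {"train": 8000, "validation": 1000, "test": 1000},
    "large": {"train": 95000, "validation": 2500, "test": 2500},
}


def summarize(splits):
    """Return {config: (total, {split: fraction})} for each configuration."""
    summary = {}
    for name, sizes in splits.items():
        total = sum(sizes.values())
        fractions = {split: count / total for split, count in sizes.items()}
        summary[name] = (total, fractions)
    return summary


for name, (total, fractions) in summarize(SPLITS).items():
    shares = ", ".join(f"{split} {frac:.1%}" for split, frac in fractions.items())
    print(f"{name}: {total} sentences ({shares})")
```

With these figures, `small` and `base` total 2,000 and 10,000 sentences in an 80/10/10 split, while `large` totals 100,000 sentences in a 95/2.5/2.5 split.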

License Information

  • License: Apache 2.0