ithieund/viWikiHow-Abs-Sum
Hugging Face · Updated 2022-11-16 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ithieund/viWikiHow-Abs-Sum
Resource description:
# viWikiHow-Abs-Sum
A dataset for the Vietnamese Abstractive Summarization task.
It includes all Vietnamese wikiHow articles that were released in the WikiLingua dataset.
# Introduction
This dataset was extracted from the train/test split of the WikiLingua dataset. Because the target language is Vietnamese, we removed all other files and kept only train.\*.vi, test.\*.vi, and val.\*.vi for the Vietnamese Abstractive Summarization task. The raw files are stored in the *raw* directory. We then ran a Python script to generate ready-to-use data files in TSV and JSONLINE formats, which are stored in the *processed* directory so that future training scripts can use them directly.
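The conversion script itself is not shown here, so the following is only a minimal sketch of what such a step might look like. It assumes that each `*.src.vi` file holds one article per line and that the matching `*.tgt.vi` file holds the corresponding summary on the same line number. The function name `convert_split` and the column names `source` and `target` are hypothetical and may not match the actual processed files.

```python
import csv
import json
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def convert_split(raw_dir, out_dir, raw_name, out_name):
    """Pair line-aligned article/summary files and write TSV and JSON Lines.

    Assumption: line i of <raw_name>.src.vi is summarized by line i of
    <raw_name>.tgt.vi. The "source"/"target" keys are illustrative only.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    sources = read_lines(raw_dir / f"{raw_name}.src.vi")
    targets = read_lines(raw_dir / f"{raw_name}.tgt.vi")
    if len(sources) != len(targets):
        raise ValueError(
            f"{raw_name}: {len(sources)} source lines vs {len(targets)} target lines"
        )

    rows = [{"source": s.strip(), "target": t.strip()}
            for s, t in zip(sources, targets)]

    # TSV output with a header row; the csv module quotes fields when needed.
    with open(out_dir / f"{out_name}.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["source", "target"], delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    # JSON Lines output: one JSON object per line, Vietnamese text kept as-is.
    with open(out_dir / f"{out_name}.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    return len(rows)
```

Under these assumptions, a call such as `convert_split("raw", "processed", "val", "valid")` would account for the rename from `val.*.vi` to `valid.tsv` and `valid.jsonl` seen in the file listing.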
# Directory structure
- raw: contains raw text files from WikiLingua
  - test.src.vi
  - test.tgt.vi
  - train.src.vi
  - train.tgt.vi
  - val.src.vi
  - val.tgt.vi
- processed: contains generated TSV and JSONLINE files
  - test.tsv
  - train.tsv
  - valid.tsv
  - test.jsonl
  - train.jsonl
  - valid.jsonl
  - [and other variants]
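For readers who want to inspect the processed files, here is a hedged sketch of generic readers for both formats. The field names and the presence of a TSV header row are not documented above, so these helpers (`read_jsonl` and `read_tsv`, both hypothetical names) return whatever the files contain without assuming specific keys. The TSV reader also assumes that fields are not quoted, which may not hold for the real files.

```python
import csv
import json


def read_jsonl(path):
    """Return a list of records from a JSON Lines file, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def read_tsv(path):
    """Return a list of rows (lists of strings) from a tab-separated file.

    Assumption: fields are unquoted, so quote characters are treated as
    ordinary text. Any header row is returned as the first row.
    """
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
```

Printing the first item of `read_jsonl("processed/train.jsonl")` is a quick way to check the actual field names before wiring the data into a training script.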
# Credits
- Special thanks to WikiLingua authors: https://github.com/esdurmus/Wikilingua
- Article provided by <a href="https://www.wikihow.com/Main-Page" target="_blank">wikiHow</a>, a wiki that is building the world's largest and highest quality how-to manual. Please edit this article and find author credits at the original wikiHow article on How to Tie a Tie. Content on wikiHow can be shared under a <a href="http://creativecommons.org/licenses/by-nc-sa/3.0/" target="_blank">Creative Commons License</a>.
Provided by:
ithieund
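Summary of original information
Dataset overview
Dataset name
- viWikiHow-Abs-Sum
Dataset purpose
- Vietnamese Abstractive Summarization task.
Data source
- The dataset contains all Vietnamese articles published by WikiHow that were originally included in the WikiLingua dataset.
Data processing
- Raw files are stored in the raw directory: test.src.vi, test.tgt.vi, train.src.vi, train.tgt.vi, val.src.vi, val.tgt.vi
- After processing with a Python script, TSV and JSONLINE files are generated and stored in the processed directory: test.tsv, train.tsv, valid.tsv, test.jsonl, train.jsonl, valid.jsonl, and other variant files
Data formats
- Raw data format: text files
- Processed data format: TSV and JSONLINE
Dataset structure
- raw directory: stores the raw text files.
- processed directory: stores the processed TSV and JSONLINE files.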



