bkai-foundation-models/NewsSapo
Source: Hugging Face. Updated: 2024-03-05. Indexed: 2024-06-22.
Download link:
https://hf-mirror.com/datasets/bkai-foundation-models/NewsSapo
Description:
---
task_categories:
- summarization
- feature-extraction
language:
- vi
pretty_name: Vietnamese NewsSapo Dataset
size_categories:
- 10M<n<100M
---
Vietnamese NewsSapo Dataset
The Vietnamese NewsSapo dataset was constructed to train sentence/passage embeddings. The dataset is structured in a "title-abstract-content" format: each news article is represented by a triple (title, abstract, content). The content is the main text body of the article and has been processed to remove images, videos, and other non-textual elements. The dataset contains 31,728,183 triples.
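As an illustration of how the triple layout could feed embedding training, here is a minimal Python sketch. The class `NewsTriple`, the function `positive_pairs`, and the idea of pairing every two fields of the same article as positives are assumptions made for this example; the dataset card does not state how training pairs are formed, and the sample strings are placeholders rather than real records.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class NewsTriple:
    """One record in the (title, abstract, content) layout (hypothetical class)."""
    title: str
    abstract: str
    content: str


def positive_pairs(triple: NewsTriple) -> Iterator[Tuple[str, str]]:
    """Yield text pairs taken from the same article.

    Assumption: two fields of one article count as a positive pair.
    Pairs with a blank side are skipped.
    """
    candidates = [
        (triple.title, triple.abstract),
        (triple.title, triple.content),
        (triple.abstract, triple.content),
    ]
    for left, right in candidates:
        if left.strip() and right.strip():
            yield left.strip(), right.strip()


# Placeholder record, not taken from the dataset.
example = NewsTriple(
    title="Example headline",
    abstract="A one-sentence summary of the story.",
    content="The full body text of the article.",
)
pairs = list(positive_pairs(example))
```

Each complete article yields three pairs here; a training setup might keep only a subset, for instance title-abstract pairs for sentence-level embeddings and abstract-content pairs for passage-level ones.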
To build this dataset, we followed a two-step process:
Step 1: Collect news data from 2021-11/2023 and combine it with the [Binhvq News Corpus](https://github.com/binhvq/news-corpus) to form a unified dataset.
Step 2: Extract the title, sapo (abstract), and content of each article.
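The two steps above can be sketched in Python as follows. This is not the authors' pipeline: the dictionary keys `"title"`, `"sapo"`, and `"content"`, the whitespace normalization, and the rule of dropping articles with a missing field are all assumptions made for illustration.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]
FIELDS = ("title", "sapo", "content")  # hypothetical key names


def extract_triple(article: Dict[str, str]) -> Optional[Triple]:
    """Step 2: return (title, sapo, content), or None if any part is missing."""
    # Collapse runs of whitespace; a missing key becomes an empty string.
    title, sapo, content = (" ".join((article.get(key) or "").split()) for key in FIELDS)
    if title and sapo and content:
        return (title, sapo, content)
    return None


def build_dataset(*sources: Iterable[Dict[str, str]]) -> Iterator[Triple]:
    """Step 1 and 2: chain several article sources and yield complete triples."""
    for article in chain.from_iterable(sources):
        triple = extract_triple(article)
        if triple is not None:
            yield triple


# Placeholder inputs standing in for the crawled news and the existing corpus.
crawled = [{"title": "A", "sapo": "  lead  text ", "content": "body"}]
corpus = [{"title": "B", "content": "article without a sapo"}]
triples = list(build_dataset(crawled, corpus))
```

Because `build_dataset` is a generator, the sources can be streamed rather than held in memory, which matters at the scale of tens of millions of articles.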
### Please cite our manuscript if this dataset is used for your work
```
@article{duc2024towards,
title={Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models},
author={Nguyen Quang Duc and Le Hai Son and Nguyen Duc Nhan and Nguyen Dich Nhat Minh and Le Thanh Huong and Dinh Viet Sang},
journal={arXiv preprint arXiv:2403.01616},
year={2024}
}
```
Provided by:
bkai-foundation-models
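Summary of original information
Vietnamese NewsSapo Dataset
Overview
The Vietnamese NewsSapo dataset was built to train sentence/passage embeddings. It uses a "title-abstract-content" format in which each news article is represented by a triple (title, abstract, content). The content is the main text body of the article, processed to remove images, videos, and other non-textual elements. The dataset contains 31,728,183 triples.
Dataset construction
- Data collection: news data was collected from November 2021 to 2023 and merged with the Binhvq News Corpus to form a unified dataset.
- Data extraction: the title, abstract, and content were extracted for each article.
Dataset properties
- Task categories: summarization, feature extraction
- Language: Vietnamese
- Dataset size: 10M<n<100M
- Data format: title-abstract-content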



