bkai-foundation-models/NewsSapo

Hugging Face · Updated 2024-03-05 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/bkai-foundation-models/NewsSapo
Resource description:
---
task_categories:
- summarization
- feature-extraction
language:
- vi
pretty_name: Vietnamese NewsSapo Dataset
size_categories:
- 10M<n<100M
---

Vietnamese NewsSapo Dataset

The Vietnamese NewsSapo dataset was constructed to train sentence/passage embeddings. Our dataset is structured in a "title-abstract-contents" format, where each news article is represented by a tuple of (title, abstract, content). The content is the main text body of the article and has been processed to remove images, videos, and other non-textual elements. The dataset contains 31,728,183 triples.

To build this dataset, we followed a two-step process:

Step 1: Collect news data from 2021-11/2023. Combine with [Binhvq News Corpus](https://github.com/binhvq/news-corpus) to form a unified dataset.

Step 2: Extract title-sapo-content for each article.

### Please cite our manuscript if this dataset is used for your work

```
@article{duc2024towards,
  title={Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models},
  author={Nguyen Quang Duc, Le Hai Son, Nguyen Duc Nhan, Nguyen Dich Nhat Minh, Le Thanh Huong, Dinh Viet Sang},
  journal={arXiv preprint arXiv:2403.01616},
  year={2024}
}
```
Provided by:
bkai-foundation-models
Summary of original information

Vietnamese NewsSapo Dataset

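Overview

The Vietnamese NewsSapo dataset was built to train sentence/passage embeddings. It uses a "title-abstract-content" format in which each news article is represented by a (title, abstract, content) triple. The content is the main text body of the article, processed to remove images, videos, and other non-textual elements. The dataset contains 31,728,183 triples.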
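As a minimal sketch of how a (title, abstract, content) triple might feed embedding training, the snippet below turns one article into positive text pairs. The dataset card does not describe a pairing scheme, so the all-pairs strategy, the `NewsTriple` class, the `positive_pairs` helper, and the field names are all hypothetical illustrations, not the authors' method.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterator, Tuple


@dataclass(frozen=True)
class NewsTriple:
    """One article as a (title, abstract, content) triple.

    The field names are assumptions; the published column names may differ.
    """
    title: str
    abstract: str
    content: str


def positive_pairs(triple: NewsTriple) -> Iterator[Tuple[str, str]]:
    """Yield every pair of non-empty fields from one article.

    Texts from the same article are treated as a positive pair
    (an assumed training setup, not one stated by the source).
    """
    fields = [f.strip() for f in (triple.title, triple.abstract, triple.content)]
    non_empty = [f for f in fields if f]
    yield from combinations(non_empty, 2)


# Made-up example record, used only to show the output shape.
example = NewsTriple(
    title="Example headline",
    abstract="A one-sentence lead that summarizes the story.",
    content="The full body text of the article, with images and videos removed.",
)
pairs = list(positive_pairs(example))
```

A complete triple yields three pairs (title-abstract, title-content, abstract-content); a record with an empty field yields fewer.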

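Dataset construction process

  1. Data collection: collect news data from 2021-11/2023 and merge it with the Binhvq News Corpus to form a unified dataset.
  2. Data extraction: extract the title, abstract (sapo), and content of each article.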
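The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions: the dictionary keys (`title`, `sapo`, `content`), the helper names, and the rule of dropping articles with a missing field are hypothetical, since the source does not describe its record layout or any filtering.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Article = Dict[str, Optional[str]]   # hypothetical raw record layout
Triple = Tuple[str, str, str]        # (title, sapo, content)


def merge_corpora(*corpora: Iterable[Article]) -> List[Article]:
    """Step 1 (sketch): concatenate several article collections into one list."""
    merged: List[Article] = []
    for corpus in corpora:
        merged.extend(corpus)
    return merged


def extract_triples(articles: Iterable[Article]) -> List[Triple]:
    """Step 2 (sketch): return (title, sapo, content) for articles that have all three.

    Skipping incomplete articles is an assumption made for this example.
    """
    triples: List[Triple] = []
    for article in articles:
        title = (article.get("title") or "").strip()
        sapo = (article.get("sapo") or "").strip()
        content = (article.get("content") or "").strip()
        if title and sapo and content:
            triples.append((title, sapo, content))
    return triples


# Made-up records standing in for newly collected news and an existing corpus.
crawled = [{"title": "A", "sapo": "Lead A", "content": "Body A"}]
existing = [
    {"title": "B", "sapo": "Lead B", "content": "Body B"},
    {"title": "C", "sapo": None, "content": "Body C"},  # missing sapo, skipped
]
triples = extract_triples(merge_corpora(crawled, existing))
```

Keeping the merge and the extraction as separate functions mirrors the two-step description; a real pipeline at this scale would likely stream records rather than hold them in lists.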

Dataset attributes

  • Task categories: summarization, feature extraction
  • Language: Vietnamese
  • Dataset size: 10M<n<100M
  • Data format: title-abstract-content

Citation

If you use this dataset in your research, please cite the following work:

@article{duc2024towards,
  title={Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models},
  author={Nguyen Quang Duc, Le Hai Son, Nguyen Duc Nhan, Nguyen Dich Nhat Minh, Le Thanh Huong, Dinh Viet Sang},
  journal={arXiv preprint arXiv:2403.01616},
  year={2024}
}
