navjordj/SNL_summarization
Source: Hugging Face (updated 2024-01-23, indexed 2024-03-04)
Download link:
https://hf-mirror.com/datasets/navjordj/SNL_summarization
Resource description:
---
task_categories:
- summarization
- text2text-generation
language:
- 'no'
- nb
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: url
    dtype: string
  - name: date_scraped
    dtype: string
  - name: headline
    dtype: string
  - name: category
    dtype: string
  - name: ingress
    dtype: string
  - name: article
    dtype: string
  splits:
  - name: train
    num_bytes: 26303219.28053567
    num_examples: 10874
  - name: validation
    num_bytes: 1981086.682983145
    num_examples: 819
  - name: test
    num_bytes: 3144582.036481182
    num_examples: 1300
  download_size: 19441287
  dataset_size: 31428888.0
---
# SNL Summarization Dataset
The source of this dataset is a web scrape of SNL (Store Norske Leksikon), a publicly owned Norwegian encyclopedia. Articles in SNL are structured so that the first paragraph (the lead) acts as a summary of the entire article.
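Because the lead serves as the summary, each record maps naturally to a (source, target) pair built from the `article` and `ingress` fields. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed; the helper names `to_pair` and `load_pairs` are illustrative and not part of the dataset.

```python
from typing import Dict, Iterator, Tuple


def to_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Map one record to (source text, target summary).

    Uses the `article` and `ingress` fields listed in the dataset card.
    """
    return record["article"], record["ingress"]


def load_pairs(split: str = "train") -> Iterator[Tuple[str, str]]:
    """Yield (article, ingress) pairs for one split (needs network access)."""
    # Imported lazily so that to_pair() works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("navjordj/SNL_summarization", split=split)
    for record in dataset:
        yield to_pair(record)
```

Calling `next(load_pairs("validation"))` would return the first article and its lead from the validation split.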
## Methodology
From our thesis:
We couldn’t find any existing datasets containing SNL data, so we decided to create our own by scraping articles from SNL.no. The first step involved gathering a list of all article URLs on the site. We extracted the URLs from the sitemaps and retained only those following the format "https://snl.no/name of article" to avoid non-article pages. Next, we scraped the URLs with multiple threads downloading articles at the same time using the Python module grequests and parsed the received HTML using beautifulsoup4. We extracted the text from the lead and the rest of the article text, joining the latter while removing any whitespace. Additionally, we saved metadata such as URLs, headlines, and categories for each article.
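The URL-gathering step could look roughly like the sketch below. It is not the thesis code: the regular expression is an assumption that reads "https://snl.no/name of article" as "exactly one path segment after the host", and the function names are hypothetical. Downloading with grequests and parsing with beautifulsoup4 are left out because the page markup is not described here.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Assumption: an article URL has exactly one path segment after the host.
ARTICLE_URL = re.compile(r"^https://snl\.no/[^/?#]+$")


def extract_locs(sitemap_xml: str) -> List[str]:
    """Return every <loc> value in a sitemap document, ignoring namespaces."""
    root = ET.fromstring(sitemap_xml)
    return [
        el.text.strip()
        for el in root.iter()
        # Tags look like "{namespace}loc"; keep only the local name.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def article_urls(urls: List[str]) -> List[str]:
    """Keep only URLs that match the assumed article pattern."""
    return [url for url in urls if ARTICLE_URL.match(url)]
```

Under this pattern, the site root and any URL with a nested path are dropped, while single-segment URLs are kept.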
To filter out very short articles, we set criteria for keeping an article: the lead had to be at least 100 characters long, and the rest of the article had to be longer than 400 characters.
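The length criteria translate into a simple predicate. This is a sketch under the assumption that length is measured with Python's `len()` on the extracted strings; the constant and function names are illustrative.

```python
MIN_LEAD_CHARS = 100     # the lead must be at least this long
MIN_ARTICLE_CHARS = 400  # the rest of the article must be strictly longer


def keep_article(lead: str, body: str) -> bool:
    """Return True if an article passes both length criteria."""
    return len(lead) >= MIN_LEAD_CHARS and len(body) > MIN_ARTICLE_CHARS
```

Note the asymmetry in the stated criteria: a lead of exactly 100 characters passes, while a body of exactly 400 characters does not.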
Finally, we split the dataset using an 84%/6%/10% split for the train/validation/test sets. This division was chosen to ensure a sufficient amount of data for training our models while still providing an adequate sample size for validation and testing. By allocating a larger portion (84%) of the data for training, our goal was to optimize the model’s learning process. We allocated 6% of the data for validation, which was intended to help fine-tune the model and its hyperparameters, while the remaining 10% was designated for the final evaluation of our model’s performance on unseen data in the test set.
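The stated proportions can be checked against the split sizes in the dataset card metadata. The short calculation below uses only those published counts; the function name is illustrative.

```python
# Example counts per split, taken from the dataset card metadata.
SPLIT_SIZES = {"train": 10874, "validation": 819, "test": 1300}


def split_shares(sizes):
    """Return each split's share of the total as a percentage."""
    total = sum(sizes.values())
    return {name: 100 * count / total for name, count in sizes.items()}


shares = split_shares(SPLIT_SIZES)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
# Prints:
# train: 83.7%
# validation: 6.3%
# test: 10.0%
```

The 12,993 examples therefore match the 84%/6%/10% description once the shares are rounded to whole percentages.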
# License
Please refer to the license of SNL.
# Citation
If you use this dataset in your work, please cite our master's thesis, of which this dataset was a part:
```
@mastersthesis{navjord2023beyond,
title={Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
author={Navjord, J{\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
year={2023},
school={Norwegian University of Life Sciences, {\AA}s}
}
```
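Provider:
navjordj
Summary of original information
Dataset overview
Dataset name
SNL Summarization Dataset
Dataset source
The dataset comes from a web scrape of SNL (Store Norske Leksikon), a publicly owned Norwegian encyclopedia.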
Dataset characteristics
- Task categories: summarization, text-to-text generation
- Language: Norwegian
- Size category: 10K<n<100K
Dataset structure
- Features:
  - id: int64
  - url: string
  - date_scraped: string
  - headline: string
  - category: string
  - ingress: string
  - article: string
- Splits:
  - Train: 10874 examples, 26303219.28053567 bytes
  - Validation: 819 examples, 1981086.682983145 bytes
  - Test: 1300 examples, 3144582.036481182 bytes
- Download size: 19441287 bytes
- Dataset size: 31428888.0 bytes
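Dataset creation method
The dataset was created by crawling article URLs from SNL.no, using the Python modules grequests and beautifulsoup4 for multithreaded downloading and HTML parsing, and extracting the article text and metadata (such as URLs, headlines, and categories). To ensure article quality, text length thresholds were applied, and the data was split into train/validation/test sets in an 84%/6%/10% ratio.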



