navjordj/SNL_summarization

Name: navjordj/SNL_summarization
Creator: navjordj
Published: 2024-01-23 07:25:47
License: No description available

Hugging Face · Updated 2024-01-23 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/navjordj/SNL_summarization
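
As a minimal sketch, assuming the Hugging Face `datasets` library is installed and the dataset is reachable under the repository id shown above, the data could be loaded and turned into (document, summary) pairs roughly as follows. The helper names `load_snl` and `to_pairs` are hypothetical and exist only for illustration; the column names `article` and `ingress` are taken from the dataset card.

```python
from typing import Iterable, List, Mapping, Tuple


def load_snl(repo_id: str = "navjordj/SNL_summarization"):
    """Load all splits from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the rest of this sketch works without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id)


def to_pairs(rows: Iterable[Mapping[str, str]]) -> List[Tuple[str, str]]:
    """Map rows to (source text, reference summary) pairs.

    Assumption: `article` is the model input and `ingress` (the lead)
    is the target summary, as the dataset description suggests.
    """
    return [(row["article"], row["ingress"]) for row in rows]
```

With network access, `to_pairs(load_snl()["train"])` would yield the training pairs; other fields such as `headline` or `category` are simply ignored by this helper.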

Resource description:

---
task_categories:
- summarization
- text2text-generation
language:
- 'no'
- nb
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: url
    dtype: string
  - name: date_scraped
    dtype: string
  - name: headline
    dtype: string
  - name: category
    dtype: string
  - name: ingress
    dtype: string
  - name: article
    dtype: string
  splits:
  - name: train
    num_bytes: 26303219.28053567
    num_examples: 10874
  - name: validation
    num_bytes: 1981086.682983145
    num_examples: 819
  - name: test
    num_bytes: 3144582.036481182
    num_examples: 1300
  download_size: 19441287
  dataset_size: 31428888.0
---

# SNL Summarization Dataset

The source of this dataset is a web scrape of SNL (Store Norske Leksikon), a publicly owned Norwegian encyclopedia. Articles in SNL are structured so that the first paragraph (the lead) acts as a summary of the entire article.

## Methodology

From our thesis:

We couldn’t find any existing datasets containing SNL data, so we decided to create our own by scraping articles from SNL.no. The first step involved gathering a list of all article URLs on the site. We extracted the URLs from the sitemaps and retained only those following the format “https://snl.no/name of article” to avoid non-article pages.

Next, we scraped the URLs with multiple threads downloading articles at the same time using the Python module grequests and parsed the received HTML using beautifulsoup4. We extracted the text from the lead and the rest of the article text, joining the latter while removing any whitespace. Additionally, we saved metadata such as URLs, headlines, and categories for each article.

To filter out very short articles, we set criteria for keeping an article: the lead had to be at least 100 characters long, and the rest of the article had to be longer than 400 characters. Finally, we split the dataset using an 84%/6%/10% split for the train/validation/test sets.

This division was chosen to ensure a sufficient amount of data for training our models while still providing an adequate sample size for validation and testing. By allocating a larger portion (84%) of the data for training, our goal was to optimize the model’s learning process. We allocated 6% of the data for validation, which was intended to help fine-tune the model and its hyperparameters, while the remaining 10% was designated for the final evaluation of our model’s performance on unseen data in the test set.

# License

Please refer to the license of SNL

# Citation

If you are using this dataset in your work, please cite our master thesis which this dataset was a part of

```
@mastersthesis{navjord2023beyond,
  title={Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
  author={Navjord, J{\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
  year={2023},
  school={Norwegian University of Life Sciences, {\AA}s}
}
```

Provider:

navjordj

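Summary of original information

Dataset overview

Dataset name

SNL Summarization Dataset

Dataset source

The dataset comes from a web scrape of SNL (Store Norske Leksikon), a publicly owned Norwegian encyclopedia.

Dataset characteristics

Task categories: summarization, text-to-text generation
Language: Norwegian
Size category: 10K<n<100K

Dataset structure

Features: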
- id: int64
- url: string
- date_scraped: string
- headline: string
- category: string
- ingress: string
- article: string
Splits:
- Train: 10874 examples, 26303219.28053567 bytes
- Validation: 819 examples, 1981086.682983145 bytes
- Test: 1300 examples, 3144582.036481182 bytes
Download size: 19441287 bytes
Dataset size: 31428888.0 bytes
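
As a quick arithmetic check, the example counts listed above can be converted into percentages to see how closely they match the stated 84%/6%/10% split. The variable names are illustrative only.

```python
# Example counts per split, as listed in the dataset card.
split_sizes = {"train": 10874, "validation": 819, "test": 1300}

total = sum(split_sizes.values())  # 12993 examples overall

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)   # 12993
print(shares)  # {'train': 83.7, 'validation': 6.3, 'test': 10.0}
```

The counts correspond to roughly 83.7%, 6.3%, and 10.0% of 12,993 examples, which agrees with the nominal split after rounding.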

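Dataset creation method

The dataset was built by collecting article URLs from SNL.no, downloading the pages with multiple threads using the Python module grequests, and parsing the HTML with beautifulsoup4 to extract the article text and metadata (such as URL, headline, and category). To filter out very short articles, text-length thresholds were applied, and the data was split 84%/6%/10% into train/validation/test sets.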
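
The filtering and splitting steps can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the regular expression is a guess at what "the format https://snl.no/name of article" means (exactly one path segment), the thresholds follow the dataset card (lead of at least 100 characters, body longer than 400 characters), and the seeded shuffle is an assumption, since the source does not say how examples were assigned to splits. All function names are hypothetical.

```python
import random
import re
from typing import Dict, List, Sequence

# Assumption: an article URL has exactly one path segment after the host.
ARTICLE_URL = re.compile(r"^https://snl\.no/[^/?#]+$")

MIN_LEAD_CHARS = 100  # the lead must be at least this long
MIN_BODY_CHARS = 400  # the body must be strictly longer than this


def is_article_url(url: str) -> bool:
    """Return True for URLs that look like https://snl.no/<article name>."""
    return ARTICLE_URL.match(url) is not None


def keep_article(lead: str, body: str) -> bool:
    """Apply the length criteria described in the dataset card."""
    return len(lead) >= MIN_LEAD_CHARS and len(body) > MIN_BODY_CHARS


def split_dataset(items: Sequence, seed: int = 0) -> Dict[str, List]:
    """Shuffle and split 84%/6%/10% (shuffling and seed are assumptions)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 84 // 100
    n_val = len(shuffled) * 6 // 100
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

Because the exact rounding and ordering used by the authors are not documented, this sketch will not reproduce the published split sizes or membership; it only shows the shape of the procedure.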

Citation information

If you use this dataset, please cite the following master's thesis:

@mastersthesis{navjord2023beyond, title={Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers}, author={Navjord, J{\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen}, year={2023}, school={Norwegian University of Life Sciences, {\AA}s} }
