
alexandrainst/nordjylland-news-summarization

Hugging Face | Updated: 2024-05-23 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/alexandrainst/nordjylland-news-summarization
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: text_len
    dtype: int64
  - name: summary_len
    dtype: int64
  splits:
  - name: train
    num_bytes: 118935809
    num_examples: 75219
  - name: val
    num_bytes: 6551332
    num_examples: 4178
  - name: test
    num_bytes: 6670392
    num_examples: 4178
  download_size: 81334629
  dataset_size: 132157533
license: cc0-1.0
task_categories:
- summarization
language:
- da
size_categories:
- 10K<n<100K
---

# Dataset Card for "nordjylland-news-summarization"

## Dataset Description

- **Point of Contact:** [Oliver Kinch](mailto:oliver.kinch@alexandra.dk)
- **Size of dataset:** 148 MB

### Dataset Summary

This dataset consists of pairs of texts and their corresponding summaries, extracted from the Danish newspaper [TV2 Nord](https://www.tv2nord.dk/).

### Supported Tasks and Leaderboards

Summarization is the intended task for this dataset. No leaderboard is active at this point.

### Languages

The dataset is available in Danish (`da`).

## Dataset Structure

An example from the dataset looks as follows.

```
{
  "text": "some text",
  "summary": "some summary",
  "text_len": <number of chars in text>,
  "summary_len": <number of chars in summary>
}
```

### Data Fields

- `text`: a `string` feature.
- `summary`: a `string` feature.
- `text_len`: an `int64` feature.
- `summary_len`: an `int64` feature.

### Dataset Statistics

#### Number of samples

- Train: 75219
- Val: 4178
- Test: 4178

#### Text Length Distribution

- Minimum length: 21
- Maximum length: 35164

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61e0713ac50610f535ed2c88/YBO73NHfW5Ufh0svopGbc.png)

#### Summary Length Distribution

- Minimum length: 12
- Maximum length: 499

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61e0713ac50610f535ed2c88/tSLeODADes_r-V7sED2tH.png)

## Potential Dataset Issues

Within the dataset, there are 181 instances where the length of the summary exceeds the length of the corresponding text.

## Dataset Creation

### Curation Rationale

There are not many large-scale summarization datasets in Danish.

### Source Data

The dataset has been collected through the TV2 Nord API, which can be accessed [here](https://developer.bazo.dk/#876ab6f9-e057-43e3-897a-1563de34397e).

## Additional Information

### Dataset Curators

[Oliver Kinch](https://huggingface.co/oliverkinch) from [The Alexandra Institute](https://alexandra.dk/)

### Licensing Information

The dataset is licensed under the [CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/).
Provider:
alexandrainst
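Summary of original information

Dataset card "nordjylland-news-summarization"

Dataset description

Dataset summary

This dataset consists of pairs of texts and their corresponding summaries, extracted from the Danish newspaper TV2 Nord.

Supported tasks and leaderboards

The dataset is intended for summarization. No leaderboard is currently active.

Languages

The dataset is available in Danish (da).

Dataset structure

An example from the dataset looks as follows:

```
{
  "text": "some text",
  "summary": "some summary",
  "text_len": <number of chars in text>,
  "summary_len": <number of chars in summary>
}
```

Data fields

  • text: a string feature.
  • summary: a string feature.
  • text_len: an int64 feature.
  • summary_len: an int64 feature.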
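The field list above can be turned into a small sanity check. The following is a minimal sketch, not part of the dataset's tooling: `validate_record` and `EXPECTED_TYPES` are hypothetical names, and the check that `text_len` and `summary_len` equal Python's `len()` of the strings is an assumption based on the card's "number of chars" wording.

```python
from typing import Any, List, Mapping

# Field names and types as listed in the dataset card.
EXPECTED_TYPES = {"text": str, "summary": str, "text_len": int, "summary_len": int}


def validate_record(row: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"{name} is not {expected.__name__}")
    if problems:
        return problems
    # Assumption: the *_len fields equal Python's len() of the strings.
    if row["text_len"] != len(row["text"]):
        problems.append("text_len does not match len(text)")
    if row["summary_len"] != len(row["summary"]):
        problems.append("summary_len does not match len(summary)")
    return problems
```

A row shaped like the card's example, with lengths filled in, should produce an empty list; rows with missing or mistyped fields are reported before the length comparison is attempted.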

Dataset statistics

Number of samples

  • Train: 75219
  • Val: 4178
  • Test: 4178
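The sample counts above work out to roughly a 90/5/5 split. The short sketch below shows the arithmetic; `split_shares` and `SPLIT_SIZES` are hypothetical names introduced only for illustration.

```python
# Example counts per split, as listed in the dataset card.
SPLIT_SIZES = {"train": 75219, "val": 4178, "test": 4178}


def split_shares(sizes):
    """Return each split's fraction of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


shares = split_shares(SPLIT_SIZES)
for name, share in shares.items():
    # Prints each split's share as a percentage with one decimal place.
    print(f"{name}: {share:.1%}")
```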

Text length distribution

  • Minimum length: 21
  • Maximum length: 35164

Summary length distribution

  • Minimum length: 12
  • Maximum length: 499
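Minimum and maximum lengths like those above could be recomputed from the `text_len` and `summary_len` fields. The sketch below assumes rows are available as plain dictionaries; `length_range` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the dataset.

```python
def length_range(rows, field):
    """Return (min, max) of an integer length field across rows."""
    values = [row[field] for row in rows]
    if not values:
        raise ValueError("no rows given")
    return min(values), max(values)


# Tiny made-up rows for illustration; not real dataset content.
rows = [
    {"text_len": 120, "summary_len": 40},
    {"text_len": 21, "summary_len": 12},
    {"text_len": 900, "summary_len": 85},
]
print(length_range(rows, "text_len"))
print(length_range(rows, "summary_len"))
```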

Potential dataset issues

The dataset contains 181 instances in which the summary is longer than the corresponding text.
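One possible way to handle such rows is to set them aside before training. The sketch below is an illustration under stated assumptions, not a step the dataset authors describe: `split_by_length_check` is a hypothetical helper, and the sample rows are made up.

```python
def split_by_length_check(rows):
    """Separate rows whose summary is longer than the text from the rest."""
    kept, flagged = [], []
    for row in rows:
        if row["summary_len"] > row["text_len"]:
            flagged.append(row)
        else:
            kept.append(row)
    return kept, flagged


# Made-up rows for illustration; the second one has a summary longer than its text.
sample = [
    {"text_len": 300, "summary_len": 60},
    {"text_len": 25, "summary_len": 80},
    {"text_len": 50, "summary_len": 50},
]
kept, flagged = split_by_length_check(sample)
print(len(kept), len(flagged))
```

If the Hugging Face `datasets` library is installed, a comparable filter could presumably be applied with `Dataset.filter` and a predicate such as `lambda r: r["summary_len"] <= r["text_len"]`; whether dropping these rows is appropriate depends on the use case.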

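Dataset creation

Curation rationale

There are not many large-scale summarization datasets in Danish.

Source data

The dataset was collected through the TV2 Nord API, which is documented at https://developer.bazo.dk/#876ab6f9-e057-43e3-897a-1563de34397e.

Additional information

Dataset curators

Oliver Kinch from The Alexandra Institute

Licensing information

The dataset is licensed under the CC0 license.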
