alexandrainst/nordjylland-news-summarization
Source: Hugging Face. Updated 2024-05-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/alexandrainst/nordjylland-news-summarization
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: summary
    dtype: string
  - name: text_len
    dtype: int64
  - name: summary_len
    dtype: int64
  splits:
  - name: train
    num_bytes: 118935809
    num_examples: 75219
  - name: val
    num_bytes: 6551332
    num_examples: 4178
  - name: test
    num_bytes: 6670392
    num_examples: 4178
  download_size: 81334629
  dataset_size: 132157533
license: cc0-1.0
task_categories:
- summarization
language:
- da
size_categories:
- 10K<n<100K
---
# Dataset Card for "nordjylland-news-summarization"
## Dataset Description
- **Point of Contact:** [Oliver Kinch](mailto:oliver.kinch@alexandra.dk)
- **Size of dataset:** 148 MB
### Dataset Summary
This dataset consists of news texts paired with their corresponding summaries, extracted from the Danish regional news outlet [TV2 Nord](https://www.tv2nord.dk/).
### Supported Tasks and Leaderboards
Summarization is the intended task for this dataset. No leaderboard is active at this point.
### Languages
The dataset is available in Danish (`da`).
## Dataset Structure
Each example in the dataset has the following structure:
```
{
"text": "some text",
"summary": "some summary",
"text_len": <number of chars in text>,
"summary_len": <number of chars in summary>
}
```
### Data Fields
- `text`: a `string` feature.
- `summary`: a `string` feature.
- `text_len`: an `int64` feature.
- `summary_len`: an `int64` feature.
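The field list above can be turned into a small sanity check. The following is a minimal sketch, assuming each example is available as a plain Python dictionary and that `text_len` and `summary_len` equal Python's `len()` of the corresponding strings (the card describes them only as the number of characters). The names `EXPECTED_FIELDS` and `check_record` are hypothetical helpers, not part of the dataset or any library.

```python
from typing import Any, List, Mapping

# Field names and Python types mirroring the card's schema
# (string -> str, int64 -> int).
EXPECTED_FIELDS = {
    "text": str,
    "summary": str,
    "text_len": int,
    "summary_len": int,
}


def check_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}")
    if problems:
        # Skip the length checks when the basic schema is already broken.
        return problems
    # Assumption: the *_len fields equal len() of the matching string.
    if record["text_len"] != len(record["text"]):
        problems.append("text_len does not match len(text)")
    if record["summary_len"] != len(record["summary"]):
        problems.append("summary_len does not match len(summary)")
    return problems
```

A record that passes returns an empty list, so the check can be used as `if not check_record(row): ...` when iterating over a split.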
### Dataset Statistics
#### Number of samples
- Train: 75219
- Val: 4178
- Test: 4178
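As a quick illustration of how the three splits relate, the sketch below computes each split's share of the total from the counts listed above. The names `SPLIT_SIZES` and `split_fractions` are hypothetical and introduced only for this example.

```python
# Split sizes as reported on this card.
SPLIT_SIZES = {"train": 75219, "val": 4178, "test": 4178}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


fractions = split_fractions(SPLIT_SIZES)
for name, share in fractions.items():
    # Print the share as a percentage with one decimal place.
    print(f"{name}: {share:.1%}")
```

With these counts the total is 83575 examples, and the shares come out at roughly 90% train, 5% validation, and 5% test.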
#### Text Length Distribution
- Minimum length: 21
- Maximum length: 35164

#### Summary Length Distribution
- Minimum length: 12
- Maximum length: 499

## Potential Dataset Issues
Within the dataset, there are 181 instances where the length of the summary exceeds the length of the corresponding text.
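One way to locate or exclude such instances is to compare the two length fields directly. The sketch below assumes the examples are available as an iterable of dictionaries with the fields described above; the function names `summary_longer_than_text` and `drop_long_summaries` are hypothetical, and whether these examples should be dropped at all depends on the use case.

```python
from typing import Any, Dict, Iterable, List, Mapping


def summary_longer_than_text(records: Iterable[Mapping[str, Any]]) -> List[int]:
    """Return the positions of records whose summary is longer than the text."""
    return [
        index
        for index, record in enumerate(records)
        if record["summary_len"] > record["text_len"]
    ]


def drop_long_summaries(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only records whose summary is no longer than the text."""
    return [r for r in records if r["summary_len"] <= r["text_len"]]
```

Records where the two lengths are equal are kept, since the issue described above concerns only summaries that exceed the text length.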
## Dataset Creation
### Curation Rationale
There are not many large-scale summarization datasets in Danish.
### Source Data
The dataset has been collected through the TV2 Nord API, which can be accessed [here](https://developer.bazo.dk/#876ab6f9-e057-43e3-897a-1563de34397e).
## Additional Information
### Dataset Curators
[Oliver Kinch](https://huggingface.co/oliverkinch) from [The Alexandra
Institute](https://alexandra.dk/)
### Licensing Information
The dataset is licensed under the [CC0
license](https://creativecommons.org/share-your-work/public-domain/cc0/).
Provider:
alexandrainst