NbAiLab/norwegian_parliament

Name: NbAiLab/norwegian_parliament
Creator: NbAiLab
Published: 2024-04-16 08:15:10
License: No description provided

Source: Hugging Face. Updated 2024-04-16. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/NbAiLab/norwegian_parliament

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- no
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
---

# Dataset Card Creation Guide

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** N/A
- **Repository:** [GitHub](https://github.com/ltgoslo/NorBERT/)
- **Paper:** N/A
- **Leaderboard:** N/A
- **Point of Contact:** -

### Dataset Summary

This is a classification dataset created from a subset of the [Talk of Norway](https://www.nb.no/sprakbanken/ressurskatalog/oai-repo-clarino-uib-no-11509-123/) corpus. It contains text passages from speeches by representatives of two political parties, Fremskrittspartiet and Sosialistisk Venstreparti (SV). Each passage is annotated with the speaker's party and a timestamp. The classification task is to predict, from the text alone, whether a speech was given by a representative of Fremskrittspartiet or of SV.

### Supported Tasks and Leaderboards

This dataset is meant for classification. The first model tested on it was [NB-BERT-base](https://huggingface.co/NbAiLab/nb-bert-base), with a reported F1 score of [81.9](https://arxiv.org/abs/2104.09617). The dataset was also used for testing the North-T5 models, where the [XXL model](https://huggingface.co/north/t5_xxl_NCC) reports an F1 score of 91.8.

Please note that some of the text fields are quite long. Truncating them, for instance when switching tokenizers, makes the task harder.

### Languages

The text in the dataset is in Norwegian.

## Dataset Structure

### Data Fields

- `text`: Text of a speech
- `date`: Date (`YYYY-MM-DD`) the speech was held
- `label`: Political party the speaker was associated with at the time
  - 0
  - 1
- `full_label`: Political party the speaker was associated with at the time
  - Fremskrittspartiet
  - Sosialistisk Venstreparti

### Data Splits

The dataset is split into `train`, `validation`, and `test` splits with the following sizes:

|                    | Train | Valid | Test |
| ------------------ | ----- | ----- | ---- |
| Number of examples | 3600  | 1200  | 1200 |

The dataset is balanced on political party.

## Dataset Creation

This dataset is based on information made publicly available by the Norwegian Parliament (Storting). It was created by the National Library of Norway AI-Lab to benchmark their language models.

## Additional Information

The [Talk of Norway dataset](https://www.nb.no/sprakbanken/ressurskatalog/oai-repo-clarino-uib-no-11509-123/) is also available at the [LTG Talk-of-Norway GitHub](https://github.com/ltgoslo/talk-of-norway).

### Licensing Information

This work is licensed under a Creative Commons Attribution 4.0 International License.

### Citation Information

The following article can be cited when referring to this dataset, since it is the first study to use the dataset for evaluating a language model:

```
@inproceedings{kummervold-etal-2021-operationalizing,
    title = {Operationalizing a National Digital Library: The Case for a {N}orwegian Transformer Model},
    author = {Kummervold, Per E and De la Rosa, Javier and Wetjen, Freddy and Brygfjeld, Svein Arne},
    booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)},
    year = "2021",
    address = "Reykjavik, Iceland (Online)",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
    url = "https://aclanthology.org/2021.nodalida-main.3",
    pages = "20--29",
    abstract = "In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokm{\aa}l and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.",
}
```

Provider:

NbAiLab

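Summary of the original information

Dataset overview

Dataset summary

Source: The dataset is a subset extracted from Talk of Norway.
Content: Text passages from the political parties Fremskrittspartiet and Sosialistisk Venstreparti, annotated with the speaker's party and a timestamp.
Task: Predict from the text alone whether the speaker represents Fremskrittspartiet or Sosialistisk Venstreparti.

Supported tasks and evaluation metrics

Task: Text classification.
Model performance:
- NB-BERT-base reports an F1 score of 81.9.
- North-T5-XXL reports an F1 score of 91.8.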
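
The scores above are F1 values, but the listing does not say how they were averaged. As an illustration only, here is a minimal sketch of a macro-averaged F1 for the two labels `0` and `1`. The choice of macro averaging is an assumption, and the labels and predictions in the example are made up, not taken from the dataset or from any reported run.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up labels and predictions, for illustration only.
example_true = [0, 0, 1, 1]
example_pred = [0, 1, 1, 1]
example_score = macro_f1(example_true, example_pred)
```

Because the dataset is described as balanced on political party, macro and weighted averages should be close on the full test split, but the two can still differ from a single-class (binary) F1.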

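Languages

Language: Norwegian.

Dataset structure

Data fields:
- text: Text of the speech.
- date: Date of the speech (format: YYYY-MM-DD).
- label: Party of the speaker (0 or 1).
- full_label: Party of the speaker (Fremskrittspartiet or Sosialistisk Venstreparti).
Data splits:
- train: 3600 examples.
- validation: 1200 examples.
- test: 1200 examples.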
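
The field and split descriptions above can be turned into a small sanity check. The sketch below assumes that `date` is stored as a string and that the Hugging Face `datasets` package can load the dataset under the name shown on this page. Neither is confirmed by the listing. The helper names (`validate_record`, `split_fractions`, `load_splits`) and the sample record are hypothetical.

```python
import re
from datetime import datetime
from typing import Dict, List

# Split sizes and party names as stated in the dataset description.
EXPECTED_SPLIT_SIZES = {"train": 3600, "validation": 1200, "test": 1200}
PARTIES = {"Fremskrittspartiet", "Sosialistisk Venstreparti"}
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text must be a non-empty string")
    date = record.get("date")
    if not isinstance(date, str) or not DATE_PATTERN.match(date):
        problems.append("date must be a YYYY-MM-DD string")
    else:
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            problems.append("date is not a valid calendar date")
    if record.get("label") not in (0, 1):
        problems.append("label must be 0 or 1")
    if record.get("full_label") not in PARTIES:
        problems.append("full_label must be one of the two party names")
    return problems


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Share of all examples that falls into each split."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_splits():
    """Assumption: `datasets` is installed and the hub is reachable."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset("NbAiLab/norwegian_parliament")


# Hypothetical record; the listing does not say which integer label maps
# to which party, so the pairing below is an assumption.
sample = {
    "text": "Eksempeltekst fra en tale.",
    "date": "2015-03-12",
    "label": 0,
    "full_label": "Fremskrittspartiet",
}
sample_problems = validate_record(sample)
fractions = split_fractions(EXPECTED_SPLIT_SIZES)
```

With the stated sizes, the splits correspond to 60% train, 20% validation, and 20% test of 6000 examples in total.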

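Dataset creation

Source data: Based on publicly available information from the Norwegian Parliament (Storting).
Purpose: Created by the National Library of Norway AI-Lab to benchmark its language models.

Licensing information

License: This work is licensed under a Creative Commons Attribution 4.0 International License.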

Citation information

Reference:

@inproceedings{kummervold-etal-2021-operationalizing,
    title = {Operationalizing a National Digital Library: The Case for a {N}orwegian Transformer Model},
    author = {Kummervold, Per E and De la Rosa, Javier and Wetjen, Freddy and Brygfjeld, Svein Arne},
    booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)},
    year = "2021",
    address = "Reykjavik, Iceland (Online)",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
    url = "https://aclanthology.org/2021.nodalida-main.3",
    pages = "20--29",
    abstract = {In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokm{\aa}l and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.},
}
