UppsalaNLP/sou-corpus

Name: UppsalaNLP/sou-corpus
Creator: UppsalaNLP
Published: 2026-01-23 22:51:26
License: CC BY 4.0

Hugging Face · Updated 2026-01-23 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/UppsalaNLP/sou-corpus

Description:

---
license: cc-by-4.0
language:
- sv
task_categories:
- text-generation
- token-classification
tags:
- swedish
- government-reports
- dependency-parsing
- universal-dependencies
- nlp
size_categories:
- 100K<n<1M
source_datasets:
- original
---

# SOU Corpus - Swedish Government Official Reports

Cleaned and dependency-parsed Swedish Government Official Reports (Statens offentliga utredningar) from 1994-2020.

## Dataset Description

This dataset contains sentence-segmented and dependency-parsed text from Swedish Government Official Reports. The original documents were cleaned, processed, and annotated with Universal Dependencies-style parsing.

### Fields

- **document_id**: Original document identifier (can be linked to Riksdagen open data)
- **text_type**: Type of text section
  - `full_text`: Main report body
  - `summary_swedish`: Standard Swedish summary
  - `summary_simple_swedish`: Simple Swedish (lättläst) summary
  - `summary_english`: English summary
- **section**: Section header from the document
- **text**: Plain text sentence
- **parsed**: Dependency-parsed sentence (token//POS//deprel//head format)

### Parsed Format

Each token in the `parsed` field follows the format:

```
word//POS_TAG//DEPENDENCY_RELATION//HEAD_INDEX
```

Tokens within a sentence are separated by single spaces. Example:

```
Sverige//PM|NOM//nsubj//3 är//VB|PRS|AKT//cop//3 ett//DT|NEU|SIN|IND//det//3 land//NN|NEU|SIN|IND|NOM//ROOT//3
```

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("UppsalaNLP/sou-corpus")

# Access train/test splits
train = dataset["train"]
test = dataset["test"]

# Example
print(train[0]["text"])
print(train[0]["section"])
print(train[0]["text_type"])
```

### Extract Tokens

```python
def parse_tokens(parsed_str):
    """Split a `parsed` string into a list of token dicts."""
    tokens = []
    for t in parsed_str.split(' '):
        parts = t.split('//')
        # Skip anything that does not have all four fields
        if len(parts) >= 4:
            tokens.append({
                'word': parts[0],
                'pos': parts[1],
                'deprel': parts[2],
                # Fall back to 0 when the head index is not a plain number
                'head': int(parts[3]) if parts[3].isdigit() else 0
            })
    return tokens

tokens = parse_tokens(train[0]["parsed"])
```

## Source Documents

The source documents were obtained from [Riksdagens öppna data](http://data.riksdagen.se). Original document URLs follow the pattern: `https://data.riksdagen.se/dokument/{document_id}.html`

## Citation

```bibtex
@inproceedings{durlich-etal-2022-cause,
    title = "Cause and Effect in Governmental Reports: Two Data Sets for Causality Detection in Swedish",
    author = "D{\"u}rlich, Luise and Reimann, Sebastian and Finnveden, Gustav and Nivre, Joakim and Stymne, Sara",
    booktitle = "Proceedings of the First Workshop on Natural Language Processing for Political Sciences",
    month = jun,
    year = "2022",
    address = "Marseilles, France"
}
```

## License

This dataset is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

## Links

- [Uppsala NLP](https://huggingface.co/UppsalaNLP)
- [GitHub Repository](https://github.com/UppsalaNLP/SOU-corpus)
- [Riksdagen Open Data](http://data.riksdagen.se)
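
## Further Example

The `parsed` token format and the document URL pattern described above can be combined in a small helper. The sketch below is not part of the dataset's own tooling: `split_token` and `document_url` are hypothetical names, and it assumes every token is well formed (four `//`-separated fields with a numeric head). Treating the first `|`-separated part of the POS field as the main tag and the rest as morphological features is also an assumption, based only on the example sentence.

```python
def split_token(token):
    """Split one 'word//POS//deprel//head' token into a dict.

    Assumes a well-formed token; a malformed one raises ValueError.
    """
    # Split from the right so a word that itself contains '//' stays intact.
    word, pos, deprel, head = token.rsplit("//", 3)
    # Assumption: first '|'-separated part is the tag, the rest are features.
    tag, *features = pos.split("|")
    return {
        "word": word,
        "tag": tag,
        "features": features,
        "deprel": deprel,
        "head": int(head),
    }


def document_url(document_id):
    """Build a Riksdagen URL from a document_id using the documented pattern."""
    return f"https://data.riksdagen.se/dokument/{document_id}.html"


# The example sentence from the Parsed Format section.
example = (
    "Sverige//PM|NOM//nsubj//3 är//VB|PRS|AKT//cop//3 "
    "ett//DT|NEU|SIN|IND//det//3 land//NN|NEU|SIN|IND|NOM//ROOT//3"
)
tokens = [split_token(t) for t in example.split(" ")]
for tok in tokens:
    print(tok["word"], tok["tag"], tok["features"], tok["deprel"], tok["head"])
```

Splitting from the right is a deliberate choice here: a plain `split('//')` would cut a word such as a URL into extra fields, whereas the last three fields of a token never contain `//` in the documented format.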

Provider:

UppsalaNLP
