
UppsalaNLP/sou-corpus

Source: Hugging Face. Updated 2026-01-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/UppsalaNLP/sou-corpus
Resource description:
---
license: cc-by-4.0
language:
- sv
task_categories:
- text-generation
- token-classification
tags:
- swedish
- government-reports
- dependency-parsing
- universal-dependencies
- nlp
size_categories:
- 100K<n<1M
source_datasets:
- original
---

# SOU Corpus - Swedish Government Official Reports

Cleaned and dependency-parsed Swedish Government Official Reports (Statens offentliga utredningar) from 1994-2020.

## Dataset Description

This dataset contains sentence-segmented and dependency-parsed text from Swedish Government Official Reports. The original documents were cleaned, processed, and annotated with Universal Dependencies-style parsing.

### Fields

- **document_id**: Original document identifier (can be linked to Riksdagen open data)
- **text_type**: Type of text section
  - `full_text`: Main report body
  - `summary_swedish`: Standard Swedish summary
  - `summary_simple_swedish`: Simple Swedish (lättläst) summary
  - `summary_english`: English summary
- **section**: Section header from the document
- **text**: Plain text sentence
- **parsed**: Dependency-parsed sentence (token//POS//deprel//head format)

### Parsed Format

Each token in the `parsed` field follows the format below. Tokens within a sentence are separated by single spaces.

```
word//POS_TAG//DEPENDENCY_RELATION//HEAD_INDEX
```

Example:

```
Sverige//PM|NOM//nsubj//3 är//VB|PRS|AKT//cop//3 ett//DT|NEU|SIN|IND//det//3 land//NN|NEU|SIN|IND|NOM//ROOT//3
```

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("UppsalaNLP/sou-corpus")

# Access train/test splits
train = dataset["train"]
test = dataset["test"]

# Example
print(train[0]["text"])
print(train[0]["section"])
print(train[0]["text_type"])
```

### Extract Tokens

```python
def parse_tokens(parsed_str):
    """Convert a `parsed` string into a list of token dictionaries."""
    tokens = []
    # Tokens are separated by whitespace
    for t in parsed_str.split():
        # Split from the right so a word that itself contains '//' stays intact
        parts = t.rsplit('//', 3)
        if len(parts) == 4:
            tokens.append({
                'word': parts[0],
                'pos': parts[1],
                'deprel': parts[2],
                # Fall back to 0 when the head index is not a plain number
                'head': int(parts[3]) if parts[3].isdigit() else 0
            })
    return tokens

tokens = parse_tokens(train[0]["parsed"])
```

## Source Documents

The source documents were obtained from [Riksdagens öppna data](http://data.riksdagen.se). Original document URLs follow the pattern:

`https://data.riksdagen.se/dokument/{document_id}.html`

## Citation

```bibtex
@inproceedings{durlich-etal-2022-cause,
    title = "Cause and Effect in Governmental Reports: Two Data Sets for Causality Detection in Swedish",
    author = "D{\"u}rlich, Luise and Reimann, Sebastian and Finnveden, Gustav and Nivre, Joakim and Stymne, Sara",
    booktitle = "Proceedings of the First Workshop on Natural Language Processing for Political Sciences",
    month = jun,
    year = "2022",
    address = "Marseilles, France"
}
```

## License

This dataset is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

## Links

- [Uppsala NLP](https://huggingface.co/UppsalaNLP)
- [GitHub Repository](https://github.com/UppsalaNLP/SOU-corpus)
- [Riksdagen Open Data](http://data.riksdagen.se)
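
As a supplement to the parsed format and the source URL pattern described in this card, the following is a minimal sketch that works offline on the example sentence from the card rather than on downloaded data. The helper names `split_token`, `deprel_counts`, and `document_url` are hypothetical and are not part of the dataset or the `datasets` library. The sketch assumes every token has exactly four `//`-separated fields with a numeric head; it makes no assumption about whether head indices are 0-based or 1-based.

```python
from collections import Counter

# Example sentence copied from the dataset card
EXAMPLE = (
    "Sverige//PM|NOM//nsubj//3 är//VB|PRS|AKT//cop//3 "
    "ett//DT|NEU|SIN|IND//det//3 land//NN|NEU|SIN|IND|NOM//ROOT//3"
)


def split_token(token):
    """Split one 'word//POS//deprel//head' token into its four fields.

    Splitting from the right keeps a word that itself contains '//'
    (for example a URL) in one piece. Raises ValueError on malformed input.
    """
    word, pos, deprel, head = token.rsplit("//", 3)
    return word, pos, deprel, int(head)


def deprel_counts(parsed_str):
    """Count how often each dependency relation occurs in one sentence."""
    return Counter(split_token(t)[2] for t in parsed_str.split())


def document_url(document_id):
    """Build the source URL from the pattern given in the card."""
    return f"https://data.riksdagen.se/dokument/{document_id}.html"


rows = [split_token(t) for t in EXAMPLE.split()]
counts = deprel_counts(EXAMPLE)

for word, pos, deprel, head in rows:
    print(f"{word}\t{pos}\t{deprel}\t{head}")

# "EXAMPLE_ID" is a placeholder, not a real document identifier
print(document_url("EXAMPLE_ID"))
```

The same functions can be applied to the `parsed` and `document_id` fields of any row once the dataset is loaded. Raising on malformed tokens, instead of silently skipping them, is a design choice that makes unexpected formatting visible early.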
Provider:
UppsalaNLP