
rcds/swiss_doc2doc_ir

Source: Hugging Face. Updated 2023-07-20. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/rcds/swiss_doc2doc_ir
Resource description:
---
annotations_creators:
- machine-generated
language:
- de
- fr
- it
language_creators:
- expert-generated
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
pretty_name: 'Swiss Doc2doc Information Retrieval'
size_categories:
- 100K<n<1M
source_datasets:
- original
tags: []
task_categories:
- text-classification
task_ids:
- entity-linking-classification
---

# Dataset Card for Swiss Doc2doc Information Retrieval

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Swiss Doc2doc Information Retrieval is a multilingual, diachronic dataset of 131K Swiss Federal Supreme Court (FSCS) cases annotated with law citations and ruling citations, posing a challenging text classification task. As unique labels, we use the decision_id of cited rulings and the uuid of cited law articles, both of which can be found in the SwissCourtRulingCorpus. We also provide additional metadata, i.e., the publication year, the legal area, and the canton of origin per case, to promote robustness and fairness studies in the critical area of legal NLP.

### Supported Tasks and Leaderboards

Swiss Doc2doc IR can be used as an information retrieval task over the documents in Swiss Legislation (https://huggingface.co/datasets/rcds/swiss_legislation) and Swiss Leading Decisions (https://huggingface.co/datasets/rcds/swiss_leading_decisions).

### Languages

Switzerland has four official languages, three of which are represented in the dataset (German 86K, French 30K, and Italian 10K). The decisions are written by the judges and clerks in the language of the proceedings.

## Dataset Structure

### Data Instances

```
{
  "decision_id": "000127ef-17d2-4ded-8621-c0c962c18fd5",
  "language": "de",
  "year": 2018,
  "chamber": "CH_BGer_008",
  "region": "Federation",
  "origin_chamber": 47,
  "origin_court": 8,
  "origin_canton": 151,
  "law_area": "social_law",
  "law_sub_area": ,
  "laws": "['75488867-c001-4eb9-93b9-04264ea91f55', 'e6b06567-1236-4210-adb3-e11c26e497d5', '04bf6369-99cb-41fa-8aff-413679bc8c18', ...]",
  "cited_rulings": "['fe8a76b3-8b0f-4f27-a277-2d887140e7ab', '16fef75e-e8d5-4a51-8230-a9ca3676c8a9', '6d21b282-3b23-41dd-9350-6ba5386df9b1', '302fd9f3-e78a-4a9f-9f8d-cde51fcbdfe7']",
  "facts": "Sachverhalt: A. A._, geboren 1954, war ab November 2002 als Pflegehilfe im Altersheim C._ angestellt. Am 23. Dezember 2002 meldete sie sich erstmals unter Hinweis auf Depressionen ...",
  "considerations": "Erwägungen: 1. 1.1. Die Beschwerde kann wegen Rechtsverletzung gemäss Art. 95 und Art. 96 BGG erhoben werden. Das Bundesgericht wendet das ...",
  "rulings": "Demnach erkennt das Bundesgericht: 1. Die Beschwerde wird abgewiesen. 2. Die Gerichtskosten von Fr. 800.- werden der Beschwerdeführerin ..."
}
```

### Data Fields

```
decision_id: (str) a unique identifier for the document
language: (str) one of (de, fr, it)
year: (int) the publication year
chamber: (str) the chamber of the case
region: (str) the region of the case
origin_chamber: (str) the chamber of the origin case
origin_court: (str) the court of the origin case
origin_canton: (str) the canton of the origin case
law_area: (str) the law area of the case
law_sub_area: (str) the law sub area of the case
laws: (str) a list of law ids
cited_rulings: (str) a list of cited ruling ids
facts: (str) the facts of the case
considerations: (str) the considerations of the case
rulings: (str) the rulings of the case
```

### Data Splits

The dataset was split by date:

- Train: 2002-2015
- Validation: 2016-2017
- Test: 2018-2022

| Language | Subset | Number of Documents (Training/Validation/Test) |
|------------|------------|------------------------------------------------|
| German | **de** | 86'832 (59'170 / 19'002 / 8'660) |
| French | **fr** | 46'203 (30'513 / 10'816 / 4'874) |
| Italian | **it** | 8'306 (5'673 / 1'855 / 778) |

## Dataset Creation

### Curation Rationale

The dataset was created by Stern et al. (2023).

### Source Data

#### Initial Data Collection and Normalization

The original data are available at the Swiss Federal Supreme Court (https://www.bger.ch) in unprocessed formats (HTML). The documents were downloaded from the Entscheidsuche portal (https://entscheidsuche.ch) in HTML.

#### Who are the source language producers?

The original data are published by the Swiss Federal Supreme Court (https://www.bger.ch) in unprocessed formats (HTML). The documents were downloaded from the Entscheidsuche portal (https://entscheidsuche.ch) in HTML.

### Annotations

#### Annotation process

The decisions have been annotated with the citation ids using HTML tags and parsers. For more details, see the datasets on laws (rcds/swiss_legislation) and rulings (rcds/swiss_rulings).

#### Who are the annotators?

Stern annotated the citations. Metadata is published by the Swiss Federal Supreme Court (https://www.bger.ch).

### Personal and Sensitive Information

The dataset contains publicly available court decisions from the Swiss Federal Supreme Court. Personal or sensitive information has been anonymized by the court before publication according to the following guidelines: https://www.bger.ch/home/juridiction/anonymisierungsregeln.html.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

We release the data under CC-BY-4.0, which complies with the court licensing (https://www.bger.ch/files/live/sites/bger/files/pdf/de/urteilsveroeffentlichung_d.pdf).

© Swiss Federal Supreme Court, 2002-2022

The copyright for the editorial content of this website and the consolidated texts, which is owned by the Swiss Federal Supreme Court, is licensed under the Creative Commons Attribution 4.0 International licence. This means that you can re-use the content provided you acknowledge the source and indicate any changes you have made.

Source: https://www.bger.ch/files/live/sites/bger/files/pdf/de/urteilsveroeffentlichung_d.pdf

### Citation Information

Please cite our [ArXiv-Preprint](https://arxiv.org/abs/2306.09237)

```
@misc{rasiah2023scale,
  title={SCALE: Scaling up the Complexity for Advanced Language Model Evaluation},
  author={Vishvaksenan Rasiah and Ronja Stern and Veton Matoshi and Matthias Stürmer and Ilias Chalkidis and Daniel E. Ho and Joel Niklaus},
  year={2023},
  eprint={2306.09237},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@Stern5497](https://github.com/stern5497) for adding this dataset.
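
As a usage illustration for the retrieval setup described under Supported Tasks, the sketch below scores a ranked list of candidate document ids against the ids a case actually cites. The card does not prescribe an evaluation metric; recall@k, the function name `recall_at_k`, and the toy ids are assumptions chosen only for illustration.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant (cited) ids that appear in the top-k ranked ids."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_ids)
    if not relevant:
        # A case without citations has nothing to retrieve; report 0.0 by convention.
        return 0.0
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


# Toy example with hypothetical ids (not taken from the dataset).
ranked = ["r3", "r1", "r9", "r2"]   # output of some retrieval system
cited = ["r1", "r2"]                # ground truth, e.g. parsed from cited_rulings

print(recall_at_k(ranked, cited, k=2))
print(recall_at_k(ranked, cited, k=4))
```

In this framing, each case is a query and the ids in `laws` or `cited_rulings` are its relevant documents; averaging the score over all queries in a split gives one number per system.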
Provider:
rcds
Summary of original information

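Dataset overview

Dataset name: Swiss Doc2doc Information Retrieval

Dataset description: A multilingual, diachronic dataset of 131K Swiss Federal Supreme Court (FSCS) cases annotated with law citations and ruling citations, intended for a challenging text classification task. The dataset uses the decision_id of cited rulings and the uuid of cited law articles as unique labels, and provides additional metadata, such as the publication year, the legal area, and the canton of origin of each case, to support robustness and fairness studies in legal NLP.

Languages: German, French, Italian

License: CC-BY-SA-4.0

Dataset size: 100K<n<1M

Task category: text classification

Task ID: entity linking classification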

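Dataset structure

Data instances: Each instance contains the unique case identifier, the language, the publication year, the chamber that heard the case, the region, the chamber and court of the origin case, the canton, the law area, the law sub area, the list of law ids, the list of cited ruling ids, the facts of the case, the considerations, and the rulings.

Data fields:

  • decision_id: unique identifier of the document
  • language: language (de, fr, it)
  • year: publication year
  • chamber: chamber that heard the case
  • region: region of the case
  • origin_chamber: chamber of the origin case
  • origin_court: court of the origin case
  • origin_canton: canton of the origin case
  • law_area: law area
  • law_sub_area: law sub area
  • laws: list of law ids
  • cited_rulings: list of cited ruling ids
  • facts: facts of the case
  • considerations: considerations of the case
  • rulings: rulings of the case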
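
The dataset card types `laws` and `cited_rulings` as `str`, and its sample instance shows them as stringified Python lists. Assuming that serialization holds for every row, a minimal sketch for turning such a field back into a list could look like the following; the helper name `parse_id_list` is hypothetical, and the record is a shortened copy of the card's sample.

```python
import ast
from typing import List


def parse_id_list(raw: str) -> List[str]:
    """Parse a stringified list such as "['id-1', 'id-2']" into a list of str."""
    if not raw or not raw.strip():
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval().
    value = ast.literal_eval(raw)
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return [str(item) for item in value]


# Shortened record based on the sample instance in the dataset card.
record = {
    "decision_id": "000127ef-17d2-4ded-8621-c0c962c18fd5",
    "cited_rulings": "['fe8a76b3-8b0f-4f27-a277-2d887140e7ab', "
                     "'16fef75e-e8d5-4a51-8230-a9ca3676c8a9']",
}

cited = parse_id_list(record["cited_rulings"])
print(len(cited), cited[0])
```

If the actual files store these fields differently (for example as native lists), the parsing step is unnecessary and should be skipped.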

Data splits: The dataset is split by date into a training set (2002-2015), a validation set (2016-2017), and a test set (2018-2022).
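
The year ranges above can be expressed as a small lookup. This is only a sketch of the stated boundaries, assuming the split is decided purely by the `year` field; the function name `split_for_year` is hypothetical and is not part of the dataset tooling.

```python
def split_for_year(year: int) -> str:
    """Map a publication year to the split named in the dataset card."""
    if 2002 <= year <= 2015:
        return "train"
    if 2016 <= year <= 2017:
        return "validation"
    if 2018 <= year <= 2022:
        return "test"
    raise ValueError(f"year {year} is outside the documented range 2002-2022")


# The card's sample instance has year 2018, which falls in the test range.
print(split_for_year(2018))
```

Because the test set contains only the most recent years, a model is evaluated on decisions published after everything it was trained on.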

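Dataset creation

Source data: The original data come from the Swiss Federal Supreme Court and were downloaded in HTML from the Entscheidsuche portal.

Annotation process: The decisions were annotated with citation ids using HTML tags and parsers.

Annotators: Stern annotated the citations. Metadata is published by the Swiss Federal Supreme Court.

Personal and sensitive information: The dataset contains publicly available court decisions from the Swiss Federal Supreme Court; personal or sensitive information has been anonymized according to the court's guidelines.

Licensing information: The dataset is released under CC-BY-4.0, which complies with the court licensing.

Citation information: Please cite the arXiv preprint 2306.09237.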
