
taltwi/LaCour

Hugging Face | Updated 2026-04-08 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/taltwi/LaCour
Description:
---
language_creators:
- found
- machine-generated
language:
- en
- fr
- ru
- es
- hr
- it
- pt
- tr
- pl
- lt
- de
- uk
- hu
- nl
- sq
- ro
- sr
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
- n<1K
pretty_name: LaCour!
tags:
- legal
- hearing
- oral argument
- transcript
- echr
- dialog
dataset_info:
- config_name: documents
  features:
  - name: id
    dtype: int64
  - name: webcast_id
    dtype: string
  - name: hearing_title
    dtype: string
  - name: hearing_date
    dtype: string
  - name: hearing_type
    dtype: string
  - name: application_number
    list: string
  - name: case_id
    dtype: string
  - name: case_name
    dtype: string
  - name: case_url
    dtype: string
  - name: ecli
    dtype: string
  - name: type
    dtype: string
  - name: document_date
    dtype: string
  - name: importance
    dtype: int64
  - name: articles
    list: string
  - name: respondent_government
    list: string
  - name: issue
    dtype: string
  - name: strasbourg_caselaw
    dtype: string
  - name: external_sources
    dtype: string
  - name: conclusion
    dtype: string
  - name: separate_opinion
    dtype: string
  splits:
  - name: train
    num_bytes: 968588
    num_examples: 369
  download_size: 383241
  dataset_size: 968588
- config_name: transcripts
  features:
  - name: id
    dtype: int64
  - name: webcast_id
    dtype: string
  - name: segment_id
    dtype: int64
  - name: speaker_role
    dtype: string
  - name: speaker_name
    dtype: string
  - name: sequence_id
    dtype: int64
  - name: begin
    dtype: string
  - name: end
    dtype: string
  - name: language
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 20732998
    num_examples: 88920
  download_size: 8452356
  dataset_size: 20732998
viewer: true
configs:
- config_name: documents
  data_files:
  - split: train
    path: documents/train-*
- config_name: transcripts
  data_files:
  - split: train
    path: transcripts/train-*
---

## Dataset Description

- **Homepage:** https://trusthlt.org/lacour
- **Repository:** https://github.com/trusthlt/lacour-corpus
- **Paper:** https://doi.org/10.1007/s10506-024-09428-4

### Dataset Summary

This dataset
contains transcribed court hearings sourced from official hearings of the __European Court of Human Rights__ ([https://www.echr.coe.int/webcasts-of-hearings](https://www.echr.coe.int/webcasts-of-hearings)). The hearings are 154 selected webcasts (videos) from 2012-2022 in their original language (no interpretation). With manual annotation for language labels and automatic processing of the extracted audio with [pyannote](https://huggingface.co/pyannote/speaker-diarization) and [whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), the resulting dataset contains 4000 speaker turns and 88920 individual lines. The dataset contains two subsets, the transcripts and the metadata with linked documents. The transcripts are additionally available as .txt or .xml.

### Languages

The largest shares of the transcripts are in: English, French

A smaller portion also contains the following languages: Russian, Spanish, Croatian, Italian, Portuguese, Turkish, Polish, Lithuanian, German, Ukrainian, Hungarian, Dutch, Albanian, Romanian, Serbian

The collected metadata is in: English

## Dataset Structure

### Data Instances

Each instance in transcripts represents an entire segment of a transcript, similar to a conversation turn in a dialog.

```
{
  'id': 0,
  'webcast_id': '4927011_26052021',
  'segment_id': 0,
  'speaker_role': 'Announcer',
  'speaker_name': 'UNK',
  'sequence_id': 0,
  'begin': '10.74',
  'end': '11.42',
  'language': 'fr',
  'text': 'La Cour!'
}
```

Each instance in documents represents information on a document in hudoc associated with a hearing, together with the metadata of that hearing. The actual document is linked and can also be found in [hudoc](https://hudoc.echr.coe.int) with the case_id. Note: `hearing_type` states the type of the hearing, `type` states the type of the document. If the hearing is a "Grand Chamber hearing", the "CHAMBER" document refers to a different hearing.

```
{
  'id': 16,
  'webcast_id': '1232311_02102012',
  'hearing_title': 'Michaud v. France (nos. 12323/11)',
  'hearing_date': '2012-10-02 00:00:00',
  'hearing_type': 'Chamber hearing',
  'application_number': ['12323/11'],
  'case_id': '001-115377',
  'case_name': 'CASE OF MICHAUD v. FRANCE',
  'case_url': 'https://hudoc.echr.coe.int/eng?i=001-115377',
  'ecli': 'ECLI:CE:ECHR:2012:1206JUD001232311',
  'type': 'CHAMBER',
  'document_date': '2012-12-06 00:00:00',
  'importance': 1,
  'articles': ['8', '8-1', '8-2', '34', '35'],
  'respondent_government': ['FRA'],
  'issue': 'Decision of the National Bar Council of 12 July 2007 “adopting regulations on internal procedures for implementing the obligation to combat money laundering and terrorist financing, and an internal supervisory mechanism to guarantee compliance with those procedures” ; Article 21-1 of the Law of 31 December 1971 ; Law no. 2004-130 of 11 February 2004 ; Monetary and Financial Code',
  'strasbourg_caselaw': 'André and Other v. France, no 18603/03, 24 July 2008;Bosphorus Hava Yollari Turizm ve Ticaret Anonim Sirketi v. Ireland [GC], no 45036/98, ECHR 2005-VI;Burden v. the United Kingdom [GC], no 13378/05, §§ 33-34, ECHR 2008;Campbell v. the United Kingdom, 25 March 1992, §§ 44 and 46-48, Series A no 233;Dudgeon v. the United Kingdom, 22 October 1981, § 41, Series A no 28;Ekinci and Akalin v. Turkey, no 77097/01, § 47, 30 January 2007;Frérot v. France, no 70204/01, §§ 53-54, 12 June 2007;Grifhorst v. France, no 28336/02, § 93, 26 February 2009;Johnston and Others v. Ireland, 18 December 1986, § 42, Series A no 112;Kokkinakis v. Greece, 25 May 1993, § 52, Series A no 260-A;Kopp v. Switzerland, 25 March 1998, Reports of Judgments and Decisions 1998-II;M.S.S. v. Belgium and Greece [GC], no 30696/09, ECHR 2011;Marckx v. Belgium, 13 June 1979, § 27, Series A no 31;Mor v. France, no 28198/09, 15 December 2011;Niemietz v. Germany, 16 December 1992, Series A no 251-B;Norris v. Ireland, 26 October 1988, §§ 30-34 and 38, Series A no 142;Roemen and Schmit v. Luxembourg, no 51772/99, ECHR 2003-IV;Sallinen and Others v. Finland, no 50882/99, 27 September 2005;Schönenberger and Durmaz v. Switzerland, 20 June 1988, Series A no 137;Silver and Others v. the United Kingdom, 25 March 1983, §§ 56-88, Series A no 61;Wieser and Bicos Beteiligungen GmbH v. Austria, no 74336/01, §§ 65-66, ECHR 2007-IV;Xavier Da Silveira v. France, no 43757/05, § 36-37 and 43, 21 January 2010',
  'external_sources': 'Directive 91/308/EEC, 10 June 1991;Article 6 of the Treaty on European Union;Charter of Fundamental Rights of the European Union;Articles 169, 170, 173, 175, 177, 184 and 189 of the Treaty establishing the European Community;Recommendations 12 and 16 of the financial action task force (“FATF”) on money laundering;Council of Europe Convention on Laundering, Search, Seizure and Confiscation of the Proceeds from Crime and on the Financing of Terrorism (16 May 2005)',
  'conclusion': 'Remainder inadmissible;No violation of Article 8 - Right to respect for private and family life (Article 8-1 - Respect for correspondence;Respect for private life)',
  'separate_opinion': 'FALSE'
}
```

### Data Fields

transcripts:

* id: the identifier
* webcast_id: the identifier for the hearing
* segment_id: the identifier of the current speaker segment in the current hearing
* speaker_name: the name of the speaker (not given for Applicant, Government or Third Party)
* speaker_role: the role/party the speaker represents (`Announcer` for announcements, `Judge` for judges, `JudgeP` for judge president, `Applicant` for representatives of the applicant, `Government` for representatives of the respondent government, `ThirdParty` for representatives of third party interveners)
* sequence_id: id to keep the order of speech texts within a segment
* begin: the timestamp of the beginning of the line (in seconds)
* end: the timestamp of the end of the line (in seconds)
* language: the language spoken (in ISO 639-1)
* text: the spoken line

documents:

* id: the identifier
* webcast_id: the identifier for the hearing (allows linking to transcripts)
* hearing_title: the title of the hearing
* hearing_date: the date of the hearing
* hearing_type: the type of hearing (Grand Chamber, Chamber or Grand Chamber Judgment Hearing)
* application_number: the application numbers which are associated with the hearing and case
* case_id: the id of the case
* case_name: the name of the case
* case_url: the direct link to the document
* ecli: the ECLI (European Case Law Identifier)
* type: the type of the document
* document_date: the date of the document
* importance: the importance score of the case (1 is the highest importance, key case)
* articles: the articles of the European Convention on Human Rights that the case concerns
* respondent_government: the code of the respondent government(s) (in ISO-3166 Alpha-3)
* issue: the references to the issue of the case
* strasbourg_caselaw: the list of cases in the ECHR which are relevant to the current case
* external_sources: the relevant references outside of the ECHR
* conclusion: the short textual description of the conclusion
* separate_opinion: the indicator if there is a separate opinion

### Data Splits

The dataset is only split into a train set.

## Dataset Creation

### Curation Rationale

This dataset provides partly corrected transcribed webcasts to enable the processing of hearings in legal NLP. No specific task is given.

### Source Data

#### Data Collection

The data was collected by transcribing the publicly available [webcasts of the ECHR](https://www.echr.coe.int/webcasts-of-hearings) with the help of [pyannote](https://huggingface.co/pyannote/speaker-diarization) and [whisper-large-v2](https://huggingface.co/openai/whisper-large-v2). The documents were sourced from the [ECHR hudoc database](https://hudoc.echr.coe.int).

#### Who are the source producers?

Participants in hearings before the ECHR for the audio and video material. Employees and judges of the ECHR for the documents.
### Annotations

#### Annotation process

**language identification**

Spoken languages were manually identified by research assistants. Disagreements were discussed to achieve the final language label.

**transcript correction**

All parts spoken by Judge or Judge President are corrected for the languages English and French by research assistants with a high proficiency in the respective language.

#### Personal and Sensitive Information

The dataset contains names of judges and other participants in the hearings. Due to those names being available in the public court material, we did not remove them. The machine-generated transcripts may also contain names, which were neither checked nor removed. In case of sensitive information, we rely on the provided material to provide protection (occasionally bleeping out names which should not have been mentioned in webcasts, appropriate anonymization in the documents).

## Additional Information

Download the transcripts and linked documents:

```python
from datasets import load_dataset

lacour_transcripts = load_dataset("TrustHLT/LaCour", "transcripts")  # default config
lacour_documents = load_dataset("TrustHLT/LaCour", "documents")
```

Formatted versions of the transcripts in .txt and .xml and more information on the collection and creation can be found on [GitHub](https://github.com/trusthlt/lacour-corpus).

### Citation Information

Please cite this data using:

```bibtex
@article{Held2024LaCour,
  title = {LaCour!: enabling research on argumentation in hearings of the European Court of Human Rights},
  author = {Held, Lena and Habernal, Ivan},
  year = 2024,
  month = nov,
  journal = {Artificial Intelligence and Law},
  publisher = {Springer Science and Business Media LLC},
  doi = {10.1007/s10506-024-09428-4},
  issn = {1572-8382},
  url = {http://dx.doi.org/10.1007/s10506-024-09428-4}
}
```
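
### Example: rebuilding speaker turns and linking documents

The field descriptions in this card say that `sequence_id` orders the lines within a segment and that `webcast_id` links transcripts to documents. The sketch below illustrates one way to use those two fields. It is not part of the official tooling. The helper names `build_turns` and `documents_by_webcast` are hypothetical, and the sample rows are illustrative: only the `La Cour!` line is taken from this card, while the other rows and the `000-000000` case id are made up.

```python
from collections import defaultdict

# Illustrative rows following the `transcripts` schema.
# Only the first row comes from the dataset card; the rest are invented.
transcript_rows = [
    {"webcast_id": "4927011_26052021", "segment_id": 0, "sequence_id": 0,
     "speaker_role": "Announcer", "language": "fr", "text": "La Cour!"},
    {"webcast_id": "4927011_26052021", "segment_id": 1, "sequence_id": 1,
     "speaker_role": "JudgeP", "language": "en", "text": "is now in session."},
    {"webcast_id": "4927011_26052021", "segment_id": 1, "sequence_id": 0,
     "speaker_role": "JudgeP", "language": "en", "text": "The hearing"},
]

# Illustrative row following the `documents` schema (case_id is a placeholder).
document_rows = [
    {"webcast_id": "4927011_26052021", "case_id": "000-000000", "type": "CHAMBER"},
]


def build_turns(rows):
    """Merge transcript lines into speaker turns per (webcast_id, segment_id)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["webcast_id"], row["segment_id"])].append(row)

    turns = []
    for (webcast_id, segment_id), lines in sorted(grouped.items()):
        # sequence_id keeps the order of lines inside one segment.
        lines.sort(key=lambda r: r["sequence_id"])
        turns.append({
            "webcast_id": webcast_id,
            "segment_id": segment_id,
            "speaker_role": lines[0]["speaker_role"],
            "text": " ".join(line["text"] for line in lines),
        })
    return turns


def documents_by_webcast(rows):
    """Index document rows by webcast_id so they can be looked up per hearing."""
    index = defaultdict(list)
    for row in rows:
        index[row["webcast_id"]].append(row)
    return index


turns = build_turns(transcript_rows)
docs = documents_by_webcast(document_rows)

for turn in turns:
    linked_cases = [d["case_id"] for d in docs[turn["webcast_id"]]]
    print(turn["speaker_role"], "|", turn["text"], "|", linked_cases)
```

With the real data, the rows of the `train` split of each configuration could be passed in instead of the sample lists, since iterating over a `datasets.Dataset` yields one dictionary per row. Note that one hearing may be linked to several documents, which is why the index maps each `webcast_id` to a list.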
Provider:
taltwi