Name: Gshgrapes/lex_glue
Creator: Gshgrapes
Published: 2026-02-23 07:08:05
License: No description available

Download link:

https://hf-mirror.com/datasets/Gshgrapes/lex_glue

Resource description:

--- annotations_creators: - found language_creators: - found language: - en license: - cc-by-4.0 multilinguality: - monolingual size_categories: - 10K<n<100K source_datasets: - extended task_categories: - question-answering - text-classification task_ids: - multi-class-classification - multi-label-classification - multiple-choice-qa - topic-classification pretty_name: LexGLUE config_names: - case_hold - ecthr_a - ecthr_b - eurlex - ledgar - scotus - unfair_tos dataset_info: - config_name: case_hold features: - name: context dtype: string - name: endings sequence: string - name: label dtype: class_label: names: '0': '0' '1': '1' '2': '2' '3': '3' '4': '4' splits: - name: train num_bytes: 74781706 num_examples: 45000 - name: test num_bytes: 5989952 num_examples: 3600 - name: validation num_bytes: 6474603 num_examples: 3900 download_size: 47303537 dataset_size: 87246261 - config_name: ecthr_a features: - name: text sequence: string - name: labels sequence: class_label: names: '0': '2' '1': '3' '2': '5' '3': '6' '4': '8' '5': '9' '6': '10' '7': '11' '8': '14' '9': P1-1 splits: - name: train num_bytes: 89637449 num_examples: 9000 - name: test num_bytes: 11884168 num_examples: 1000 - name: validation num_bytes: 10985168 num_examples: 1000 download_size: 53352586 dataset_size: 112506785 - config_name: ecthr_b features: - name: text sequence: string - name: labels sequence: class_label: names: '0': '2' '1': '3' '2': '5' '3': '6' '4': '8' '5': '9' '6': '10' '7': '11' '8': '14' '9': P1-1 splits: - name: train num_bytes: 89657649 num_examples: 9000 - name: test num_bytes: 11886928 num_examples: 1000 - name: validation num_bytes: 10987816 num_examples: 1000 download_size: 53352494 dataset_size: 112532393 - config_name: eurlex features: - name: text dtype: string - name: labels sequence: class_label: names: '0': '100163' '1': '100168' '2': '100169' '3': '100170' '4': '100171' '5': '100172' '6': '100173' '7': '100174' '8': '100175' '9': '100176' '10': '100177' '11': '100179' 
'12': '100180' '13': '100183' '14': '100184' '15': '100185' '16': '100186' '17': '100187' '18': '100189' '19': '100190' '20': '100191' '21': '100192' '22': '100193' '23': '100194' '24': '100195' '25': '100196' '26': '100197' '27': '100198' '28': '100199' '29': '100200' '30': '100201' '31': '100202' '32': '100204' '33': '100205' '34': '100206' '35': '100207' '36': '100212' '37': '100214' '38': '100215' '39': '100220' '40': '100221' '41': '100222' '42': '100223' '43': '100224' '44': '100226' '45': '100227' '46': '100229' '47': '100230' '48': '100231' '49': '100232' '50': '100233' '51': '100234' '52': '100235' '53': '100237' '54': '100238' '55': '100239' '56': '100240' '57': '100241' '58': '100242' '59': '100243' '60': '100244' '61': '100245' '62': '100246' '63': '100247' '64': '100248' '65': '100249' '66': '100250' '67': '100252' '68': '100253' '69': '100254' '70': '100255' '71': '100256' '72': '100257' '73': '100258' '74': '100259' '75': '100260' '76': '100261' '77': '100262' '78': '100263' '79': '100264' '80': '100265' '81': '100266' '82': '100268' '83': '100269' '84': '100270' '85': '100271' '86': '100272' '87': '100273' '88': '100274' '89': '100275' '90': '100276' '91': '100277' '92': '100278' '93': '100279' '94': '100280' '95': '100281' '96': '100282' '97': '100283' '98': '100284' '99': '100285' splits: - name: train num_bytes: 390770241 num_examples: 55000 - name: test num_bytes: 59739094 num_examples: 5000 - name: validation num_bytes: 41544476 num_examples: 5000 download_size: 208028049 dataset_size: 492053811 - config_name: ledgar features: - name: text dtype: string - name: label dtype: class_label: names: '0': Adjustments '1': Agreements '2': Amendments '3': Anti-Corruption Laws '4': Applicable Laws '5': Approvals '6': Arbitration '7': Assignments '8': Assigns '9': Authority '10': Authorizations '11': Base Salary '12': Benefits '13': Binding Effects '14': Books '15': Brokers '16': Capitalization '17': Change In Control '18': Closings '19': Compliance With 
Laws '20': Confidentiality '21': Consent To Jurisdiction '22': Consents '23': Construction '24': Cooperation '25': Costs '26': Counterparts '27': Death '28': Defined Terms '29': Definitions '30': Disability '31': Disclosures '32': Duties '33': Effective Dates '34': Effectiveness '35': Employment '36': Enforceability '37': Enforcements '38': Entire Agreements '39': Erisa '40': Existence '41': Expenses '42': Fees '43': Financial Statements '44': Forfeitures '45': Further Assurances '46': General '47': Governing Laws '48': Headings '49': Indemnifications '50': Indemnity '51': Insurances '52': Integration '53': Intellectual Property '54': Interests '55': Interpretations '56': Jurisdictions '57': Liens '58': Litigations '59': Miscellaneous '60': Modifications '61': No Conflicts '62': No Defaults '63': No Waivers '64': Non-Disparagement '65': Notices '66': Organizations '67': Participations '68': Payments '69': Positions '70': Powers '71': Publicity '72': Qualifications '73': Records '74': Releases '75': Remedies '76': Representations '77': Sales '78': Sanctions '79': Severability '80': Solvency '81': Specific Performance '82': Submission To Jurisdiction '83': Subsidiaries '84': Successors '85': Survival '86': Tax Withholdings '87': Taxes '88': Terminations '89': Terms '90': Titles '91': Transactions With Affiliates '92': Use Of Proceeds '93': Vacations '94': Venues '95': Vesting '96': Waiver Of Jury Trials '97': Waivers '98': Warranties '99': Withholdings splits: - name: train num_bytes: 43358291 num_examples: 60000 - name: test num_bytes: 6845581 num_examples: 10000 - name: validation num_bytes: 7143588 num_examples: 10000 download_size: 27650585 dataset_size: 57347460 - config_name: scotus features: - name: text dtype: string - name: label dtype: class_label: names: '0': '1' '1': '2' '2': '3' '3': '4' '4': '5' '5': '6' '6': '7' '7': '8' '8': '9' '9': '10' '10': '11' '11': '12' '12': '13' splits: - name: train num_bytes: 178959316 num_examples: 5000 - name: test 
num_bytes: 76213279 num_examples: 1400 - name: validation num_bytes: 75600243 num_examples: 1400 download_size: 173411399 dataset_size: 330772838 - config_name: unfair_tos features: - name: text dtype: string - name: labels sequence: class_label: names: '0': Limitation of liability '1': Unilateral termination '2': Unilateral change '3': Content removal '4': Contract by using '5': Choice of law '6': Jurisdiction '7': Arbitration splits: - name: train num_bytes: 1041782 num_examples: 5532 - name: test num_bytes: 303099 num_examples: 1607 - name: validation num_bytes: 452111 num_examples: 2275 download_size: 865604 dataset_size: 1796992 configs: - config_name: case_hold data_files: - split: train path: case_hold/train-* - split: test path: case_hold/test-* - split: validation path: case_hold/validation-* - config_name: ecthr_a data_files: - split: train path: ecthr_a/train-* - split: test path: ecthr_a/test-* - split: validation path: ecthr_a/validation-* - config_name: ecthr_b data_files: - split: train path: ecthr_b/train-* - split: test path: ecthr_b/test-* - split: validation path: ecthr_b/validation-* - config_name: eurlex data_files: - split: train path: eurlex/train-* - split: test path: eurlex/test-* - split: validation path: eurlex/validation-* - config_name: ledgar data_files: - split: train path: ledgar/train-* - split: test path: ledgar/test-* - split: validation path: ledgar/validation-* - config_name: scotus data_files: - split: train path: scotus/train-* - split: test path: scotus/test-* - split: validation path: scotus/validation-* - config_name: unfair_tos data_files: - split: train path: unfair_tos/train-* - split: test path: unfair_tos/test-* - split: validation path: unfair_tos/validation-* ---

# Dataset Card for "LexGLUE"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/coastalcph/lex-glue
- **Repository:** https://github.com/coastalcph/lex-glue
- **Paper:** https://arxiv.org/abs/2110.00976
- **Leaderboard:** https://github.com/coastalcph/lex-glue
- **Point of Contact:** [Ilias Chalkidis](mailto:ilias.chalkidis@di.ku.dk)

### Dataset Summary

Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent more difficult SuperGLUE (Wang et al., 2019), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce the *Legal General Language Understanding Evaluation (LexGLUE) benchmark*, a benchmark dataset to evaluate the performance of NLP methods in legal tasks. LexGLUE is based on seven existing legal NLP datasets, selected using criteria largely from SuperGLUE.

As in GLUE and SuperGLUE (Wang et al., 2019b,a), one of our goals is to push towards generic (or ‘foundation’) models that can cope with multiple NLP tasks, in our case legal NLP tasks possibly with limited task-specific fine-tuning.
Another goal is to provide a convenient and informative entry point for NLP researchers and practitioners wishing to explore or develop methods for legal NLP. Having these goals in mind, the datasets we include in LexGLUE and the tasks they address have been simplified in several ways to make it easier for newcomers and generic models to address all tasks.

The LexGLUE benchmark is accompanied by experimental infrastructure that relies on the Hugging Face Transformers library and resides at: https://github.com/coastalcph/lex-glue.

### Supported Tasks and Leaderboards

The supported tasks are the following:

<table>
<tr><td>Dataset</td><td>Source</td><td>Sub-domain</td><td>Task Type</td><td>Classes</td></tr>
<tr><td>ECtHR (Task A)</td><td><a href="https://aclanthology.org/P19-1424/">Chalkidis et al. (2019)</a></td><td>ECHR</td><td>Multi-label classification</td><td>10+1</td></tr>
<tr><td>ECtHR (Task B)</td><td><a href="https://aclanthology.org/2021.naacl-main.22/">Chalkidis et al. (2021a)</a></td><td>ECHR</td><td>Multi-label classification</td><td>10+1</td></tr>
<tr><td>SCOTUS</td><td><a href="http://scdb.wustl.edu">Spaeth et al. (2020)</a></td><td>US Law</td><td>Multi-class classification</td><td>14</td></tr>
<tr><td>EUR-LEX</td><td><a href="https://arxiv.org/abs/2109.00904">Chalkidis et al. (2021b)</a></td><td>EU Law</td><td>Multi-label classification</td><td>100</td></tr>
<tr><td>LEDGAR</td><td><a href="https://aclanthology.org/2020.lrec-1.155/">Tuggener et al. (2020)</a></td><td>Contracts</td><td>Multi-class classification</td><td>100</td></tr>
<tr><td>UNFAIR-ToS</td><td><a href="https://arxiv.org/abs/1805.01217">Lippi et al. (2019)</a></td><td>Contracts</td><td>Multi-label classification</td><td>8+1</td></tr>
<tr><td>CaseHOLD</td><td><a href="https://arxiv.org/abs/2104.08671">Zheng et al. (2021)</a></td><td>US Law</td><td>Multiple choice QA</td><td>n/a</td></tr>
</table>

#### ecthr_a

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the articles of the ECHR that were violated (if any).

#### ecthr_b

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the articles of the ECHR that were allegedly violated (considered by the court).

#### scotus

The US Supreme Court (SCOTUS) is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases which have not been sufficiently well resolved by lower courts. This is a single-label multi-class classification task: given a document (court opinion), the task is to predict the relevant issue area. The 14 issue areas cluster 278 issues whose focus is on the subject matter of the controversy (dispute).

#### eurlex

European Union (EU) legislation is published on the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The current version of EuroVoc contains more than 7k concepts referring to various activities of the EU and its Member States (e.g., economics, health-care, trade). Given a document, the task is to predict its EuroVoc labels (concepts).

#### ledgar

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

#### unfair_tos

The UNFAIR-ToS dataset contains 50 Terms of Service (ToS) from on-line platforms (e.g., YouTube, Ebay, Facebook, etc.). The dataset has been annotated at the sentence level with 8 types of unfair contractual terms (sentences), meaning terms that potentially violate user rights according to European consumer law.

#### case_hold

The CaseHOLD (Case Holdings on Legal Decisions) dataset includes multiple choice questions about holdings of US court cases from the Harvard Law Library case law corpus. Holdings are short summaries of legal rulings that accompany referenced decisions relevant for the present case. The input consists of an excerpt (or prompt) from a court decision, containing a reference to a particular case, while the holding statement is masked out. The model must identify the correct (masked) holding statement from a selection of five choices.

The current leaderboard includes several Transformer-based (Vaswani et al., 2017) pre-trained language models, which achieve state-of-the-art performance in most NLP tasks (Bommasani et al., 2021) and NLU benchmarks (Wang et al., 2019a).

Results reported by [Chalkidis et al. (2021)](https://arxiv.org/abs/2110.00976):

*Task-wise Test Results*

<table>
<tr><td>Dataset</td><td>ECtHR A</td><td>ECtHR B</td><td>SCOTUS</td><td>EUR-LEX</td><td>LEDGAR</td><td>UNFAIR-ToS</td><td>CaseHOLD</td></tr>
<tr><td>Model</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td></tr>
<tr><td>TFIDF+SVM</td><td>64.7 / 51.7</td><td>74.6 / 65.1</td><td>78.2 / 69.5</td><td>71.3 / 51.4</td><td>87.2 / 82.4</td><td>95.4 / 78.8</td><td>n/a</td></tr>
<tr><td colspan="8" style='text-align:center'>Medium-sized Models (L=12, H=768, A=12)</td></tr>
<tr><td>BERT</td><td>71.2 / 63.6</td><td>79.7 / 73.4</td><td>68.3 / 58.3</td><td>71.4 / 57.2</td><td>87.6 / 81.8</td><td>95.6 / 81.3</td><td>70.8</td></tr>
<tr><td>RoBERTa</td><td>69.2 / 59.0</td><td>77.3 / 68.9</td><td>71.6 / 62.0</td><td>71.9 / 57.9</td><td>87.9 / 82.3</td><td>95.2 / 79.2</td><td>71.4</td></tr>
<tr><td>DeBERTa</td><td>70.0 / 60.8</td><td>78.8 / 71.0</td><td>71.1 / 62.7</td><td>72.1 / 57.4</td><td>88.2 / 83.1</td><td>95.5 / 80.3</td><td>72.6</td></tr>
<tr><td>Longformer</td><td>69.9 / 64.7</td><td>79.4 / 71.7</td><td>72.9 / 64.0</td><td>71.6 / 57.7</td><td>88.2 / 83.0</td><td>95.5 / 80.9</td><td>71.9</td></tr>
<tr><td>BigBird</td><td>70.0 / 62.9</td><td>78.8 / 70.9</td><td>72.8 / 62.0</td><td>71.5 / 56.8</td><td>87.8 / 82.6</td><td>95.7 / 81.3</td><td>70.8</td></tr>
<tr><td>Legal-BERT</td><td>70.0 / 64.0</td><td>80.4 / 74.7</td><td>76.4 / 66.5</td><td>72.1 / 57.4</td><td>88.2 / 83.0</td><td>96.0 / 83.0</td><td>75.3</td></tr>
<tr><td>CaseLaw-BERT</td><td>69.8 / 62.9</td><td>78.8 / 70.3</td><td>76.6 / 65.9</td><td>70.7 / 56.6</td><td>88.3 / 83.0</td><td>96.0 / 82.3</td><td>75.4</td></tr>
<tr><td colspan="8" style='text-align:center'>Large-sized Models (L=24, H=1024, A=16)</td></tr>
<tr><td>RoBERTa</td><td>73.8 / 67.6</td><td>79.8 / 71.6</td><td>75.5 / 66.3</td><td>67.9 / 50.3</td><td>88.6 / 83.6</td><td>95.8 / 81.6</td><td>74.4</td></tr>
</table>

*Averaged (Mean over Tasks) Test Results*

<table>
<tr><td>Averaging</td><td>Arithmetic</td><td>Harmonic</td><td>Geometric</td></tr>
<tr><td>Model</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td><td>μ-F1 / m-F1</td></tr>
<tr><td colspan="4" style='text-align:center'>Medium-sized Models (L=12, H=768, A=12)</td></tr>
<tr><td>BERT</td><td>77.8 / 69.5</td><td>76.7 / 68.2</td><td>77.2 / 68.8</td></tr>
<tr><td>RoBERTa</td><td>77.8 / 68.7</td><td>76.8 / 67.5</td><td>77.3 / 68.1</td></tr>
<tr><td>DeBERTa</td><td>78.3 / 69.7</td><td>77.4 / 68.5</td><td>77.8 / 69.1</td></tr>
<tr><td>Longformer</td><td>78.5 / 70.5</td><td>77.5 / 69.5</td><td>78.0 / 70.0</td></tr>
<tr><td>BigBird</td><td>78.2 / 69.6</td><td>77.2 / 68.5</td><td>77.7 / 69.0</td></tr>
<tr><td>Legal-BERT</td><td>79.8 / 72.0</td><td>78.9 / 70.8</td><td>79.3 / 71.4</td></tr>
<tr><td>CaseLaw-BERT</td><td>79.4 / 70.9</td><td>78.5 / 69.7</td><td>78.9 / 70.3</td></tr>
<tr><td colspan="4" style='text-align:center'>Large-sized Models (L=24, H=1024, A=16)</td></tr>
<tr><td>RoBERTa</td><td>79.4 / 70.8</td><td>78.4 / 69.1</td><td>78.9 / 70.0</td></tr>
</table>

### Languages

We only consider English datasets, to make experimentation easier for researchers across the globe.

## Dataset Structure

### Data Instances

#### ecthr_a

An example of 'train' looks as follows.
```json
{
  "text": ["8. The applicant was arrested in the early morning of 21 October 1990 ...", ...],
  "labels": [6]
}
```

#### ecthr_b

An example of 'train' looks as follows.
```json
{
  "text": ["8. The applicant was arrested in the early morning of 21 October 1990 ...", ...],
  "labels": [5, 6]
}
```

#### scotus

An example of 'train' looks as follows.
```json
{
  "text": "Per Curiam\nSUPREME COURT OF THE UNITED STATES\nRANDY WHITE, WARDEN v. ROGER L. WHEELER\n Decided December 14, 2015\nPER CURIAM.\nA death sentence imposed by a Kentucky trial court and\naffirmed by the ...",
  "label": 8
}
```

#### eurlex

An example of 'train' looks as follows.
```json
{
  "text": "COMMISSION REGULATION (EC) No 1629/96 of 13 August 1996 on an invitation to tender for the refund on export of wholly milled round grain rice to certain third countries ...",
  "labels": [4, 20, 21, 35, 68]
}
```

#### ledgar

An example of 'train' looks as follows.
```json
{
  "text": "All Taxes shall be the financial responsibility of the party obligated to pay such Taxes as determined by applicable law and neither party is or shall be liable at any time for any of the other party ...",
  "label": 32
}
```

#### unfair_tos

An example of 'train' looks as follows.
```json
{
  "text": "tinder may terminate your account at any time without notice if it believes that you have violated this agreement.",
  "labels": [2]
}
```

#### case_hold

An example of 'test' looks as follows.
```json
{
  "context": "In Granato v. City and County of Denver, No. CIV 11-0304 MSK/BNB, 2011 WL 3820730 (D.Colo. Aug. 20, 2011), the Honorable Marcia S. Krieger, now-Chief United States District Judge for the District of Colorado, ruled similarly: At a minimum, a party asserting a Mo-nell claim must plead sufficient facts to identify ... to act pursuant to City or State policy, custom, decision, ordinance, re d 503, 506-07 (3d Cir.l985)(<HOLDING>).",
  "endings": [
    "holding that courts are to accept allegations in the complaint as being true including monell policies and writing that a federal court reviewing the sufficiency of a complaint has a limited task",
    "holding that for purposes of a class certification motion the court must accept as true all factual allegations in the complaint and may draw reasonable inferences therefrom",
    "recognizing that the allegations of the complaint must be accepted as true on a threshold motion to dismiss",
    "holding that a court need not accept as true conclusory allegations which are contradicted by documents referred to in the complaint",
    "holding that where the defendant was in default the district court correctly accepted the fact allegations of the complaint as true"
  ],
  "label": 0
}
```

### Data Fields

#### ecthr_a

- `text`: a list of `string` features (list of factual paragraphs (facts) from the case description).
- `labels`: a list of classification labels (a list of violated ECHR articles, if any).

<details>
<summary>List of ECHR articles</summary>
"Article 2", "Article 3", "Article 5", "Article 6", "Article 8", "Article 9", "Article 10", "Article 11", "Article 14", "Article 1 of Protocol 1"
</details>

#### ecthr_b

- `text`: a list of `string` features (list of factual paragraphs (facts) from the case description).
- `labels`: a list of classification labels (a list of articles considered).

<details>
<summary>List of ECHR articles</summary>
"Article 2", "Article 3", "Article 5", "Article 6", "Article 8", "Article 9", "Article 10", "Article 11", "Article 14", "Article 1 of Protocol 1"
</details>

#### scotus

- `text`: a `string` feature (the court opinion).
- `label`: a classification label (the relevant issue area).
<details>
<summary>List of issue areas</summary>
(1, Criminal Procedure), (2, Civil Rights), (3, First Amendment), (4, Due Process), (5, Privacy), (6, Attorneys), (7, Unions), (8, Economic Activity), (9, Judicial Power), (10, Federalism), (11, Interstate Relations), (12, Federal Taxation), (13, Miscellaneous), (14, Private Action)
</details>

#### eurlex

- `text`: a `string` feature (an EU law).
- `labels`: a list of classification labels (a list of relevant EUROVOC concepts).

<details>
<summary>List of EUROVOC concepts</summary>
The list is long, comprising 100 EUROVOC concepts. The EUROVOC concept descriptors are available <a href="https://raw.githubusercontent.com/nlpaueb/multi-eurlex/master/data/eurovoc_descriptors.json">here</a>.
</details>

#### ledgar

- `text`: a `string` feature (a contract provision/paragraph).
- `label`: a classification label (the type of contract provision).

<details>
<summary>List of contract provision types</summary>
"Adjustments", "Agreements", "Amendments", "Anti-Corruption Laws", "Applicable Laws", "Approvals", "Arbitration", "Assignments", "Assigns", "Authority", "Authorizations", "Base Salary", "Benefits", "Binding Effects", "Books", "Brokers", "Capitalization", "Change In Control", "Closings", "Compliance With Laws", "Confidentiality", "Consent To Jurisdiction", "Consents", "Construction", "Cooperation", "Costs", "Counterparts", "Death", "Defined Terms", "Definitions", "Disability", "Disclosures", "Duties", "Effective Dates", "Effectiveness", "Employment", "Enforceability", "Enforcements", "Entire Agreements", "Erisa", "Existence", "Expenses", "Fees", "Financial Statements", "Forfeitures", "Further Assurances", "General", "Governing Laws", "Headings", "Indemnifications", "Indemnity", "Insurances", "Integration", "Intellectual Property", "Interests", "Interpretations", "Jurisdictions", "Liens", "Litigations", "Miscellaneous", "Modifications", "No Conflicts", "No Defaults", "No Waivers", "Non-Disparagement", "Notices", "Organizations", "Participations", "Payments", "Positions", "Powers", "Publicity", "Qualifications", "Records", "Releases", "Remedies", "Representations", "Sales", "Sanctions", "Severability", "Solvency", "Specific Performance", "Submission To Jurisdiction", "Subsidiaries", "Successors", "Survival", "Tax Withholdings", "Taxes", "Terminations", "Terms", "Titles", "Transactions With Affiliates", "Use Of Proceeds", "Vacations", "Venues", "Vesting", "Waiver Of Jury Trials", "Waivers", "Warranties", "Withholdings"
</details>

#### unfair_tos

- `text`: a `string` feature (a ToS sentence).
- `labels`: a list of classification labels (a list of unfair types, if any).

<details>
<summary>List of unfair types</summary>
"Limitation of liability", "Unilateral termination", "Unilateral change", "Content removal", "Contract by using", "Choice of law", "Jurisdiction", "Arbitration"
</details>

#### case_hold

- `context`: a `string` feature (a context sentence incl. a masked holding statement).
- `endings`: a list of `string` features (a list of candidate holding statements).
- `label`: a classification label (the id of the original/correct holding).
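The field layouts above suggest three small preprocessing steps: turning a multi-label `labels` list into a 0/1 vector, mapping label ids back to names, and pairing a `case_hold` context with each candidate ending. The following is a minimal sketch using only the Python standard library. The helper names (`to_multi_hot`, `decode_labels`, `case_hold_pairs`) and the toy record are hypothetical and not part of the dataset or of any library; the field names (`labels`, `context`, `endings`, `label`) and the label names follow the card's metadata.

```python
from typing import Dict, List, Sequence, Tuple

# Label names for the `unfair_tos` config, copied from the card's metadata.
UNFAIR_TOS_LABELS: List[str] = [
    "Limitation of liability",
    "Unilateral termination",
    "Unilateral change",
    "Content removal",
    "Contract by using",
    "Choice of law",
    "Jurisdiction",
    "Arbitration",
]


def to_multi_hot(label_ids: Sequence[int], num_classes: int) -> List[int]:
    """Turn a `labels` list into a 0/1 vector of length `num_classes`."""
    vector = [0] * num_classes
    for label_id in label_ids:
        if not 0 <= label_id < num_classes:
            raise ValueError(f"label id {label_id} is outside [0, {num_classes})")
        vector[label_id] = 1
    return vector


def decode_labels(label_ids: Sequence[int], names: Sequence[str]) -> List[str]:
    """Map integer label ids back to their human-readable names."""
    return [names[label_id] for label_id in label_ids]


def case_hold_pairs(example: Dict) -> List[Tuple[str, str, bool]]:
    """Pair the context with every candidate ending and flag the correct one."""
    return [
        (example["context"], ending, index == example["label"])
        for index, ending in enumerate(example["endings"])
    ]


# Hypothetical record shaped like an `unfair_tos` example (not real data).
toy = {"text": "hypothetical ToS sentence", "labels": [1, 6]}
print(to_multi_hot(toy["labels"], len(UNFAIR_TOS_LABELS)))  # [0, 1, 0, 0, 0, 0, 1, 0]
print(decode_labels(toy["labels"], UNFAIR_TOS_LABELS))  # ['Unilateral termination', 'Jurisdiction']
```

An empty `labels` list (the "if any" case in the field descriptions) becomes an all-zero vector, which is the usual target shape for a multi-label classifier. `case_hold_pairs` returns one (context, ending, is_correct) triple per candidate, which is one common way to feed a multiple-choice model.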
### Data Splits

<table>
<tr><td>Dataset</td><td>Training</td><td>Development</td><td>Test</td><td>Total</td></tr>
<tr><td>ECtHR (Task A)</td><td>9,000</td><td>1,000</td><td>1,000</td><td>11,000</td></tr>
<tr><td>ECtHR (Task B)</td><td>9,000</td><td>1,000</td><td>1,000</td><td>11,000</td></tr>
<tr><td>SCOTUS</td><td>5,000</td><td>1,400</td><td>1,400</td><td>7,800</td></tr>
<tr><td>EUR-LEX</td><td>55,000</td><td>5,000</td><td>5,000</td><td>65,000</td></tr>
<tr><td>LEDGAR</td><td>60,000</td><td>10,000</td><td>10,000</td><td>80,000</td></tr>
<tr><td>UNFAIR-ToS</td><td>5,532</td><td>2,275</td><td>1,607</td><td>9,414</td></tr>
<tr><td>CaseHOLD</td><td>45,000</td><td>3,900</td><td>3,600</td><td>52,500</td></tr>
</table>

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

<table>
<tr><td>Dataset</td><td>Source</td><td>Sub-domain</td><td>Task Type</td></tr>
<tr><td>ECtHR (Task A)</td><td><a href="https://aclanthology.org/P19-1424/">Chalkidis et al. (2019)</a></td><td>ECHR</td><td>Multi-label classification</td></tr>
<tr><td>ECtHR (Task B)</td><td><a href="https://aclanthology.org/2021.naacl-main.22/">Chalkidis et al. (2021a)</a></td><td>ECHR</td><td>Multi-label classification</td></tr>
<tr><td>SCOTUS</td><td><a href="http://scdb.wustl.edu">Spaeth et al. (2020)</a></td><td>US Law</td><td>Multi-class classification</td></tr>
<tr><td>EUR-LEX</td><td><a href="https://arxiv.org/abs/2109.00904">Chalkidis et al. (2021b)</a></td><td>EU Law</td><td>Multi-label classification</td></tr>
<tr><td>LEDGAR</td><td><a href="https://aclanthology.org/2020.lrec-1.155/">Tuggener et al. (2020)</a></td><td>Contracts</td><td>Multi-class classification</td></tr>
<tr><td>UNFAIR-ToS</td><td><a href="https://arxiv.org/abs/1805.01217">Lippi et al. (2019)</a></td><td>Contracts</td><td>Multi-label classification</td></tr>
<tr><td>CaseHOLD</td><td><a href="https://arxiv.org/abs/2104.08671">Zheng et al. (2021)</a></td><td>US Law</td><td>Multiple choice QA</td></tr>
</table>

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Curators

*Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras.*
*LexGLUE: A Benchmark Dataset for Legal Language Understanding in English.*
*2022. In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland.*

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

[*Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras.*
*LexGLUE: A Benchmark Dataset for Legal Language Understanding in English.*
*2022. In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland.*](https://arxiv.org/abs/2110.00976)
```
@inproceedings{chalkidis-etal-2021-lexglue,
        title={LexGLUE: A Benchmark Dataset for Legal Language Understanding in English},
        author={Chalkidis, Ilias and Jana, Abhik and Hartung, Dirk and Bommarito, Michael and Androutsopoulos, Ion and Katz, Daniel Martin and Aletras, Nikolaos},
        year={2022},
        booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
        address={Dublin, Ireland},
}
```

### Contributions

Thanks to [@iliaschalkidis](https://github.com/iliaschalkidis) for adding this dataset.
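As a side note on the averaged test results reported in the leaderboard section: the arithmetic, harmonic, and geometric columns aggregate a model's seven per-task scores in three different ways. The sketch below is an illustration of those three means, not the benchmark's official evaluation code, and the function names are hypothetical. Applied to the BERT μ-F1 scores copied from the task-wise table, it should reproduce the BERT row of the averaged table (77.8 / 76.7 / 77.2) to one decimal, assuming the published averages were computed the same way.

```python
import math
from typing import Sequence


def arithmetic_mean(scores: Sequence[float]) -> float:
    """Plain average of the per-task scores."""
    return sum(scores) / len(scores)


def harmonic_mean(scores: Sequence[float]) -> float:
    """Reciprocal of the average reciprocal; penalizes weak tasks the most."""
    return len(scores) / sum(1.0 / score for score in scores)


def geometric_mean(scores: Sequence[float]) -> float:
    """Exponential of the average logarithm of the scores."""
    return math.exp(sum(math.log(score) for score in scores) / len(scores))


# BERT micro-F1 per task, copied from the task-wise test results table
# (ECtHR A, ECtHR B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, CaseHOLD).
BERT_MICRO_F1 = [71.2, 79.7, 68.3, 71.4, 87.6, 95.6, 70.8]

for name, mean_fn in [
    ("arithmetic", arithmetic_mean),
    ("harmonic", harmonic_mean),
    ("geometric", geometric_mean),
]:
    print(f"{name}: {mean_fn(BERT_MICRO_F1):.1f}")
```

For positive scores the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, so the gap between the three columns gives a rough sense of how unevenly a model performs across tasks.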

Use cases: