ddrg/super_eurlex

Name: ddrg/super_eurlex
Creator: ddrg
Published: 2023-11-14 06:18:46
License: MIT

Source: Hugging Face. Updated 2023-11-14; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ddrg/super_eurlex


Resource description:

---
annotations_creators:
- found
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
language_creators:
- found
license:
- mit
multilinguality:
- multilingual
size_categories:
- 1M<n<10M
source_datasets:
- original
tags:
- legal documents
- corpus
- eurlex
- html
task_categories:
- text-classification
- fill-mask
task_ids:
- multi-class-classification
- multi-label-classification
pretty_name: 'SuperEURLEX: A Corpus of Plain Text and HTML from EURLEX, Annotated for multiple Legal Domain Text Classification Tasks.'
---

# Dataset Card for SuperEURLEX

This dataset contains over 4.6M legal documents from EURLEX with annotations.
Over 3.7M of these 4.6M documents are also available in HTML format.

This dataset can be used for pretraining language models as well as for testing them on legal text classification tasks.

Use this dataset as follows:
```python
from datasets import load_dataset
config = "0.DE"  # {sector}.{lang}[.html]
dataset = load_dataset("ddrg/super_eurlex", config, split='train')
```

## Dataset Details

### Dataset Description

This dataset was scraped from [EURLEX](https://eur-lex.europa.eu/homepage.html).
It contains more than 4.6M legal documents in plain text and over 3.7M in HTML format.
The documents are separated by language (the dataset covers 24 official European languages) and by sector.

#### The table below shows the number of documents per language:

|    |     Raw |    HTML |
|---:|--------:|--------:|
| BG |  29,778 |  27,718 |
| CS |  94,439 |  91,754 |
| DA | 398,559 | 300,488 |
| DE | 384,179 | 265,724 |
| EL | 167,502 | 117,009 |
| EN | 456,212 | 354,186 |
| ES | 253,821 | 201,400 |
| ET | 142,183 | 139,690 |
| FI | 238,143 | 214,206 |
| FR | 427,011 | 305,592 |
| GA |  19,673 |  19,437 |
| HR |  37,200 |  35,944 |
| HU |  69,275 |  66,334 |
| IT | 358,637 | 259,936 |
| LT |  62,975 |  61,139 |
| LV | 105,433 | 102,105 |
| MT |  46,695 |  43,969 |
| NL | 345,276 | 237,366 |
| PL | 146,502 | 143,490 |
| PT | 369,571 | 314,148 |
| RO |  47,398 |  45,317 |
| SK | 100,718 |  98,192 |
| SL | 170,583 | 166,646 |
| SV | 172,926 | 148,656 |

- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]

### Dataset Sources [optional]

- **Repository:** https://huggingface.co/datasets/ddrg/super_eurlex/tree/main
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

### As Corpus for:
- **Pretraining of Language Models with self-supervised tasks** like Masked Language Modeling and Next Sentence Prediction
- Legal Text Analysis

### As Dataset for evaluation on the following tasks:
- *eurovoc*-Concepts Prediction i.e. which tags apply? (Multi-Label Classification (large scale))
    - An example for this task is given below
- *subject-matter* Prediction i.e. which other tags apply? (Multi-Label Classification)
- *form* Classification i.e. what kind of document is it? (Multi-Class)
- And more

### Example for Use Of EUROVOC-Concepts

```python
from datasets import load_dataset
import transformers as tr
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
import evaluate
import uuid

# ==================== #
#     Prepare Data     #
# ==================== #
CONFIG = "3.EN"  # {sector}.{lang}[.html]
MODEL_NAME = "distilroberta-base"
dataset = load_dataset("ddrg/super_eurlex", CONFIG, split='train')
tokenizer = tr.AutoTokenizer.from_pretrained(MODEL_NAME)

# Remove Unlabeled Rows
def remove_nulls(batch):
    return [(sample is not None) for sample in batch["eurovoc"]]
dataset = dataset.filter(remove_nulls, batched=True, keep_in_memory=True)

# Tokenize Text
def tokenize(batch):
    return tokenizer(batch["text_cleaned"], truncation=True, padding="max_length")
# Keep in Memory is optional (The Dataset is large though and can easily use up a lot of memory)
dataset = dataset.map(tokenize, batched=True, keep_in_memory=True)

# Create Label Column by encoding Eurovoc Concepts
encoder = MultiLabelBinarizer()
# List of all Possible Labels
eurovoc_concepts = dataset["eurovoc"]
encoder.fit(eurovoc_concepts)
def encode_labels(batch):
    batch["label"] = encoder.transform(batch["eurovoc"])
    return batch
dataset = dataset.map(encode_labels, batched=True, keep_in_memory=True)

# Split into Train and Test Set
dataset = dataset.train_test_split(0.2)

# ==================== #
#  Load & Train Model  #
# ==================== #
model = tr.AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(encoder.classes_),
    problem_type="multi_label_classification",
)
metric = evaluate.load("JP-SystemsX/nDCG", experiment_id=uuid.uuid4())
def compute_metric(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels, k=5)

# Set Hyperparameters
# Note: We stay mostly with default values to keep the example short
# Though more hyperparameters should be set and tuned in practice
train_args = tr.TrainingArguments(
    output_dir="./cache",
    per_device_train_batch_size=16,
    num_train_epochs=20
)
trainer = tr.Trainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    compute_metrics=compute_metric,
)
trainer.train()  # This will take a while

print(trainer.evaluate(dataset["test"]))
# >>> {'eval_loss': 0.0018887673504650593, 'eval_nDCG@5': 0.8072531683578489, 'eval_runtime': 663.8582, 'eval_samples_per_second': 32.373, 'eval_steps_per_second': 4.048, 'epoch': 20.0}
```

### Out-of-Scope Use

[More Information Needed]

## Dataset Structure

This dataset is divided into multiple splits by _Sector x Language x Format_.

Sector refers to the kind of document it belongs to:

- **0:** Consolidated acts
- **1:** Treaties
- **2:** International agreements
- **3:** Legislation
- **4:** Complementary legislation
- **5:** Preparatory acts and working documents
- **6:** Case-law
- **7:** National transposition measures
- **8:** References to national case-law concerning EU law
- **9:** Parliamentary questions
- **C:** Other documents published in the Official Journal C series
- **E:** EFTA documents

Language refers to each of the 24 official European languages that were included at the date of the dataset creation:

- BG ~ Bulgarian
- CS ~ Czech
- DA ~ Danish
- DE ~ German
- EL ~ Greek
- EN ~ English
- ES ~ Spanish
- ET ~ Estonian
- FI ~ Finnish
- FR ~ French
- GA ~ Irish
- HR ~ Croatian
- HU ~ Hungarian
- IT ~ Italian
- LT ~ Lithuanian
- LV ~ Latvian
- MT ~ Maltese
- NL ~ Dutch
- PL ~ Polish
- PT ~ Portuguese
- RO ~ Romanian
- SK ~ Slovak
- SL ~ Slovenian
- SV ~ Swedish

Format refers to plain text (default) or HTML format (.html).

> Note: Plain text generally contains more documents because not all documents were available in HTML format, but those that were are included in both formats.

Those splits are named the following way:

`{sector}.{lang}[.html]`

For example:

- `3.EN` would be English legislative documents in plain text format
- `3.EN.html` would be the same in HTML format

Each _Sector_
has its own set of metadata:

<details><summary>Sector 0 (Consolidated acts)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty

</details>

<details><summary>Sector 1 (Treaties)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _current_consolidated_version_ ~ date when this version of the document was consolidated `Format DD/MM/YYYY`
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information

</details>

<details><summary>Sector 2 (International agreements)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information
- _latest_consolidated_version_ ~ `Format DD/MM/YYYY`
- _current_consolidated_version_ ~ `Format DD/MM/YYYY`

</details>

<details><summary>Sector 3 (Legislation)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information
- _latest_consolidated_version_ ~ `Format DD/MM/YYYY`
- _current_consolidated_version_ ~ `Format DD/MM/YYYY`

</details>

<details><summary>Sector 4 (Complementary legislation)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information
- _latest_consolidated_version_ ~ `Format DD/MM/YYYY`
- _current_consolidated_version_ ~ `Format DD/MM/YYYY`

</details>

<details><summary>Sector 5 (Preparatory acts and working documents)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information
- _latest_consolidated_version_ ~ `Format DD/MM/YYYY`

</details>

<details><summary>Sector 6 (Case-law)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information
- _case-law_directory_code_before_lisbon_ ~ Classification system used for case law before Treaty of Lisbon came into effect (2009), each code reflects a particular area of EU law

</details>

<details><summary>Sector 7 (National transposition measures)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _transposed_legal_acts_ ~ national laws that exist in EU member states as a direct result of the need to comply with EU directives

</details>

<details><summary>Sector 8 (References to national case-law concerning EU law)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _case-law_directory_code_before_lisbon_ ~ Classification system used for case law before Treaty of Lisbon came into effect (2009), each code reflects a particular area of EU law
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information

</details>

<details><summary>Sector 9 (Parliamentary questions)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information

</details>

<details><summary>Sector C (Other documents published in the Official Journal C series)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information

</details>

<details><summary>Sector E (EFTA documents)</summary>

- _celex_id_ ~ Unique Identifier for each document
- _text_cleaned_ (Plain Text) **or** _text_html_raw_ (HTML Format)
- _form_ ~ Kind of Document e.g. Consolidated text, or Treaty
- _directory_code_ ~ Information to structure documents in some kind of directory structure by topic e.g. `'03.50.30.00 Agriculture / Approximation of laws and health measures / Animal health and zootechnics'`
- _subject_matter_ ~ Keywords that provide general overview of content in a document see [here](https://eur-lex.europa.eu/content/e-learning/browsing_options.html) for more information
- _eurovoc_ ~ Keywords that describe document content based on the European Vocabulary see [here](https://eur-lex.europa.eu/browse/eurovoc.html) for more information

</details>

## Dataset Creation

### Curation Rationale

This dataset was created for the creation and/or evaluation of pretrained Legal Language Models.

### Source Data

#### Data Collection and Processing

We used the [EURLEX-Web-Scrapper Repo](https://github.com/JP-SystemsX/Eurlex-Web-Scrapper) for the data collection process.

#### Who are the source data producers?

The source data stems from the [EURLEX-Website](https://eur-lex.europa.eu/) and was therefore produced by various entities within the European Union.

#### Personal and Sensitive Information

No Personal or Sensitive Information is included to the best of our knowledge.

## Bias, Risks, and Limitations

- We removed HTML documents from which we couldn't extract plain text under the assumption that those are **corrupted files**. However, we can't guarantee that we removed all.
- The extraction of plain text from legal HTML documents can lead to **formatting issues** e.g. the extraction of text from tables might mix up the order such that it becomes nearly incomprehensible.
- This dataset might contain many **missing values** in the metadata columns as not every document was annotated in the same way.

[More Information Needed]

### Recommendations

- Consider removing rows with missing values for the task before training a model on it.

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]

Provider:

ddrg

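Summary of original information

Dataset overview

Basic information

Name: SuperEURLEX
Description: Contains over 4.6M legal documents, of which over 3.7M are also available in HTML format. The dataset is suitable for pretraining language models and for testing them on legal text classification tasks.
Languages: 24 official European languages (BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, GA, HR, HU, IT, LT, LV, MT, NL, PL, PT, RO, SK, SL, SV)
License: MIT
Multilinguality: multilingual
Size: 1M<n<10M
Source datasets: original
Task categories: text classification, fill-mask
Task IDs: multi-class classification, multi-label classification
Pretraining use: language model pretraining, e.g. masked language modeling and next sentence prediction
Evaluation tasks: tag (EuroVoc concept) prediction (multi-label classification), subject-matter prediction (multi-label classification), document form classification (multi-class classification)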

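Dataset structure

Splitting scheme: split by sector, language, and format
Sector: the kind of document, e.g. consolidated acts, treaties, international agreements
Language: the 24 official European languages
Format: plain text (default) or HTML
Naming rule: {sector}.{lang}[.html]
Examples: 3.EN (English legislative documents, plain text), 3.EN.html (the same in HTML format)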
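
The naming rule above can be sketched as a small helper. This is a minimal illustration, not part of the dataset's tooling: `build_config_name`, `SECTORS`, and `LANGUAGES` are hypothetical names, and the allowed codes are copied from the sector and language lists in the dataset card.

```python
# Sector and language codes as listed in the dataset card.
SECTORS = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "C", "E"}
LANGUAGES = {
    "BG", "CS", "DA", "DE", "EL", "EN", "ES", "ET", "FI", "FR", "GA", "HR",
    "HU", "IT", "LT", "LV", "MT", "NL", "PL", "PT", "RO", "SK", "SL", "SV",
}


def build_config_name(sector: str, lang: str, html: bool = False) -> str:
    """Hypothetical helper: compose a `{sector}.{lang}[.html]` config name."""
    sector = sector.upper()
    lang = lang.upper()
    if sector not in SECTORS:
        raise ValueError(f"unknown sector: {sector!r}")
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language: {lang!r}")
    # The ".html" suffix selects the HTML variant; plain text has no suffix.
    return f"{sector}.{lang}" + (".html" if html else "")
```

The returned string is what would be passed as the second argument to `load_dataset("ddrg/super_eurlex", ...)`. Validating the codes up front gives a clearer error than a failed lookup of a misspelled config.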

Dataset contents

Document metadata: each sector has its own set of metadata, such as celex_id (unique document identifier), text_cleaned (plain text) or text_html_raw (HTML format), form (kind of document), and so on
Example metadata:
- Sector 0 (Consolidated acts): celex_id, text_cleaned/text_html_raw, form
- Sector 1 (Treaties): celex_id, text_cleaned/text_html_raw, form, subject_matter, current_consolidated_version, directory_code, eurovoc
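
As a rough sketch of how the per-sector metadata and the format-dependent text column fit together, the snippet below derives the expected column names from a config name. `text_column`, `expected_columns`, and `SECTOR_METADATA` are hypothetical names introduced for illustration, and only the two sectors listed above are covered; the column order is an assumption.

```python
# Extra metadata columns for the two sectors listed above (from the dataset card).
SECTOR_METADATA = {
    "0": ["form"],
    "1": [
        "form",
        "subject_matter",
        "current_consolidated_version",
        "directory_code",
        "eurovoc",
    ],
}


def text_column(config_name: str) -> str:
    """Return the text column: HTML configs use text_html_raw, others text_cleaned."""
    return "text_html_raw" if config_name.endswith(".html") else "text_cleaned"


def expected_columns(config_name: str) -> list:
    """List the columns a config is expected to carry (order is illustrative)."""
    sector = config_name.split(".")[0]
    return ["celex_id", text_column(config_name)] + SECTOR_METADATA[sector]
```

For instance, `expected_columns("0.DE")` yields `celex_id`, `text_cleaned`, and `form`, while an `.html` config swaps in `text_html_raw`.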

Usage

Loading the dataset:

```python
from datasets import load_dataset
config = "0.DE"  # {sector}.{lang}[.html]
dataset = load_dataset("ddrg/super_eurlex", config, split="train")
```

Notes

The dataset may contain formatting issues and missing values; cleaning and preprocessing the data before use is recommended.
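
One possible cleaning step is to drop rows whose label column is missing before training. The sketch below is a batched predicate in the shape that `Dataset.filter(..., batched=True)` expects; the name `has_label` is hypothetical, and treating an empty value as missing is an assumption rather than something the dataset card specifies.

```python
def has_label(batch, column="eurovoc"):
    """Return one boolean per row: True when the label is present and non-empty.

    `batch` is a dict mapping column names to lists of values, as passed by
    `Dataset.filter` when `batched=True`.
    """
    # Assumption: an empty list or empty string counts as a missing label.
    return [value is not None and len(value) > 0 for value in batch[column]]
```

With a loaded split this could be applied as `dataset.filter(has_label, batched=True)`, or for another column via `fn_kwargs={"column": "subject_matter"}`.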
