joelniklaus/MultiLegalPile_Wikipedia_Filtered

Name: joelniklaus/MultiLegalPile_Wikipedia_Filtered
Creator: joelniklaus
Published: 2022-11-29 21:52:23
License: No description available

Hugging Face · Updated 2022-11-29 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/joelniklaus/MultiLegalPile_Wikipedia_Filtered


Description:

---
annotations_creators:
- other
language_creators:
- found
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
license:
- cc-by-4.0
multilinguality:
- multilingual
paperswithcode_id: null
pretty_name: "MultiLegalPile_Wikipedia_Filtered: A filtered version of the MultiLegalPile dataset, together with wikipedia articles."
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- fill-mask
---

# Dataset Card for MultiLegalPile_Wikipedia_Filtered: A filtered version of the MultiLegalPile dataset, together with Wikipedia articles

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** [Joel Niklaus](mailto:joel.niklaus.2@bfh.ch)

### Dataset Summary

The Multi_Legal_Pile is a large-scale multilingual legal dataset suited for pretraining language models.
It spans 24 languages and four legal text types.

### Supported Tasks and Leaderboards

The dataset supports the fill-mask task.

### Languages

The following languages are supported:
bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

## Dataset Structure

Files are named in the following format:
`{language}_{text_type}_{shard}.jsonl.xz`

`text_type` is one of the following:

- caselaw
- contracts
- legislation
- other
- wikipedia

Use the dataset like this:

```python
from datasets import load_dataset

config = 'en_contracts'  # {language}_{text_type}
dataset = load_dataset('joelito/Multi_Legal_Pile', config, split='train', streaming=True)
```

`config` is a combination of language and text_type, e.g. 'en_contracts' or 'de_caselaw'.
To load all the languages or all the text_types, use 'all' instead of the language or text_type (e.g., 'all_legislation').

### Data Instances

The file format is jsonl.xz, and a `train` and a `validation` split are available.
Some configurations are very small or non-existent, so they might lack a train split or not be present at all.

The complete dataset consists of five large subsets:

- [Native Multi Legal Pile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile)
- [Eurlex Resources](https://huggingface.co/datasets/joelito/eurlex_resources)
- [MC4 Legal](https://huggingface.co/datasets/joelito/mc4_legal)
- [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law)
- [EU Wikipedias](https://huggingface.co/datasets/joelito/EU_Wikipedias)

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

This dataset has been created by combining the following datasets:
Native Multi Legal Pile, Eurlex Resources, MC4 Legal, Pile of Law, EU Wikipedias.
It has been filtered to remove short documents (fewer than 64 whitespace-separated tokens) and documents with more than 30% punctuation or numbers (see prepare_legal_data.py for more details).

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
TODO add citation
```

### Contributions

Thanks to [@JoelNiklaus](https://github.com/joelniklaus) for adding this dataset.

Provided by:

joelniklaus

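Summary of Original Information

Dataset Overview

Dataset Description

Dataset Summary

Multi_Legal_Pile is a large-scale multilingual legal dataset suited for pretraining language models. It covers 24 languages and four legal text types.

Supported Tasks and Leaderboards

The dataset supports the fill-mask task.

Languages

The following languages are supported: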

bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

Dataset Structure

Files are named in the following format: {language}_{text_type}_{shard}.jsonl.xz.

text_type is one of the following:

caselaw
contracts
legislation
other
wikipedia
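
The naming pattern above can be unpacked with a small helper. This is only a sketch: it assumes the shard part contains no underscore, and the file name in the usage line is a made-up example, not a file confirmed to exist in the repository.

```python
from typing import NamedTuple

# Text types listed in the dataset card.
TEXT_TYPES = {"caselaw", "contracts", "legislation", "other", "wikipedia"}
SUFFIX = ".jsonl.xz"


class ShardName(NamedTuple):
    language: str
    text_type: str
    shard: str


def parse_shard_name(filename: str) -> ShardName:
    """Split '{language}_{text_type}_{shard}.jsonl.xz' into its three parts.

    Assumption: none of the three parts contains an underscore.
    """
    if not filename.endswith(SUFFIX):
        raise ValueError(f"expected a {SUFFIX} file, got {filename!r}")
    parts = filename[: -len(SUFFIX)].split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three underscore-separated parts in {filename!r}")
    language, text_type, shard = parts
    if text_type not in TEXT_TYPES:
        raise ValueError(f"unknown text_type {text_type!r}")
    return ShardName(language, text_type, shard)


# Hypothetical file name, used only for illustration.
print(parse_shard_name("en_contracts_0.jsonl.xz"))
```

The `language` and `text_type` parts joined with an underscore give the config name that the card passes to `load_dataset`, for example 'en_contracts'.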

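Data Instances

The file format is jsonl.xz, and a train and a validation split are available. Some configurations are very small or non-existent, so they might lack a train split or not be present at all.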
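
For a shard that has already been downloaded, the jsonl.xz format can be read with the Python standard library alone. The sketch below assumes one UTF-8 JSON object per line; the card does not document the record fields, so the function yields each record as a plain dict without interpreting it.

```python
import json
import lzma
from typing import Iterator


def iter_jsonl_xz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of an xz-compressed JSONL file."""
    # lzma.open in text mode decompresses on the fly, so the shard is never
    # fully loaded into memory.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Streaming line by line keeps memory use flat, which matters for the larger configurations.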

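Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

This dataset has been created by combining the following datasets: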

Native Multi Legal Pile
Eurlex Resources
MC4 Legal
Pile of Law
EU Wikipedias

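The dataset has been filtered to remove short documents (fewer than 64 whitespace-separated tokens) and documents with more than 30% punctuation or numbers (see prepare_legal_data.py for more details).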
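
The two filter rules can be approximated as follows. This is a sketch under stated assumptions, not the logic of prepare_legal_data.py, which is not reproduced here: it measures the 30% threshold at the character level over non-whitespace characters, and it counts only ASCII punctuation from `string.punctuation`. The function and constant names are illustrative.

```python
import string

MIN_TOKENS = 64         # documents with fewer whitespace-separated tokens are dropped
MAX_NOISE_RATIO = 0.30  # maximum share of punctuation or digit characters (assumed per character)


def keep_document(text: str) -> bool:
    """Return True if the text passes both filters described in the card."""
    if len(text.split()) < MIN_TOKENS:
        return False
    # Assumption: the ratio is taken over non-whitespace characters.
    chars = [c for c in text if not c.isspace()]
    noisy = sum(1 for c in chars if c.isdigit() or c in string.punctuation)
    return noisy / len(chars) <= MAX_NOISE_RATIO
```

A token-level ratio would be an equally plausible reading of the card; the script referenced above is the authority on which one was actually used.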

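Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information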

TODO add citation

Contributions

Thanks to @JoelNiklaus for adding this dataset.
