gboleda/wikicorpus

Name: gboleda/wikicorpus
Creator: gboleda
Published: 2024-01-18 11:18:14
License: GFDL (per the dataset card metadata)

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/gboleda/wikicorpus

Resource overview:

---
pretty_name: Wikicorpus
annotations_creators:
- machine-generated
- no-annotation
language_creators:
- found
language:
- ca
- en
- es
license:
- gfdl
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
- 10M<n<100M
- 1M<n<10M
source_datasets:
- original
task_categories:
- fill-mask
- text-classification
- text-generation
- token-classification
task_ids:
- language-modeling
- masked-language-modeling
- part-of-speech
paperswithcode_id: null
tags:
- word-sense-disambiguation
- lemmatization
dataset_info:
- config_name: raw_ca
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 263170192
    num_examples: 143883
  download_size: 96437841
  dataset_size: 263170192
- config_name: raw_es
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 671295359
    num_examples: 259409
  download_size: 252926918
  dataset_size: 671295359
- config_name: raw_en
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3388801074
    num_examples: 1359146
  download_size: 1346378932
  dataset_size: 3388801074
- config_name: tagged_ca
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: sentence
    sequence: string
  - name: lemmas
    sequence: string
  - name: pos_tags
    sequence: string
  - name: wordnet_senses
    sequence: string
  splits:
  - name: train
    num_bytes: 1666129919
    num_examples: 2016221
  download_size: 226390380
  dataset_size: 1666129919
- config_name: tagged_es
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: sentence
    sequence: string
  - name: lemmas
    sequence: string
  - name: pos_tags
    sequence: string
  - name: wordnet_senses
    sequence: string
  splits:
  - name: train
    num_bytes: 4100040390
    num_examples: 5039367
  download_size: 604910899
  dataset_size: 4100040390
- config_name: tagged_en
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: sentence
    sequence: string
  - name: lemmas
    sequence: string
  - name: pos_tags
    sequence: string
  - name: wordnet_senses
    sequence: string
  splits:
  - name: train
    num_bytes: 18077275300
    num_examples: 26350272
  download_size: 2477450893
  dataset_size: 18077275300
config_names:
- raw_ca
- raw_en
- raw_es
- tagged_ca
- tagged_en
- tagged_es
---

# Dataset Card for Wikicorpus

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.cs.upc.edu/~nlp/wikicorpus/
- **Repository:**
- **Paper:** https://www.cs.upc.edu/~nlp/papers/reese10.pdf
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words.

The corpora have been annotated with lemma and part of speech information using the open source library FreeLing. Also, they have been sense annotated with the state of the art Word Sense Disambiguation algorithm UKB. As UKB assigns WordNet senses, and WordNet has been aligned across languages via the InterLingual Index, this sort of annotation opens the way to massive explorations in lexical semantics that were not possible before.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

Each sub-dataset is monolingual in the languages:

- ca: Catalan
- en: English
- es: Spanish

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The WikiCorpus is licensed under the same license as Wikipedia, that is, the [GNU Free Documentation License](http://www.fsf.org/licensing/licenses/fdl.html)

### Citation Information

```
@inproceedings{reese-etal-2010-wikicorpus,
    title = "{W}ikicorpus: A Word-Sense Disambiguated Multilingual {W}ikipedia Corpus",
    author = "Reese, Samuel and Boleda, Gemma and Cuadros, Montse and Padr{\'o}, Llu{\'i}s and Rigau, German",
    booktitle = "Proceedings of the Seventh International Conference on Language Resources and Evaluation ({LREC}'10)",
    month = may,
    year = "2010",
    address = "Valletta, Malta",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2010/pdf/222_Paper.pdf",
    abstract = "This article presents a new freely available trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia and has been automatically enriched with linguistic information. To our knowledge, this is the largest such corpus that is freely available to the community: In its present version, it contains over 750 million words. The corpora have been annotated with lemma and part of speech information using the open source library FreeLing. Also, they have been sense annotated with the state of the art Word Sense Disambiguation algorithm UKB. As UKB assigns WordNet senses, and WordNet has been aligned across languages via the InterLingual Index, this sort of annotation opens the way to massive explorations in lexical semantics that were not possible before. We present a first attempt at creating a trilingual lexical resource from the sense-tagged Wikipedia corpora, namely, WikiNet. Moreover, we present two by-products of the project that are of use for the NLP community: An open source Java-based parser for Wikipedia pages developed for the construction of the corpus, and the integration of the WSD algorithm UKB in FreeLing.",
}
```

### Contributions

Thanks to [@albertvillanova](https://github.com/albertvillanova) for adding this dataset.
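The card metadata names six configurations, each combining an annotation level (`raw` or `tagged`) with a language code (`ca`, `en`, `es`), and each with a single `train` split. A minimal sketch of loading one of them follows. It assumes the Hugging Face `datasets` package is installed and that the repository id shown on this page resolves. The helper names `config_name` and `load_wikicorpus` are hypothetical and not part of any library, and some versions of `datasets` may need extra arguments that are not shown here.

```python
# Sketch only: assumes the third-party `datasets` package and network access.
KINDS = ("raw", "tagged")        # annotation levels listed in the card metadata
LANGUAGES = ("ca", "en", "es")   # Catalan, English, Spanish


def config_name(kind: str, lang: str) -> str:
    """Build a configuration name such as 'tagged_es' and validate its parts."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind {kind!r}; expected one of {KINDS}")
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language {lang!r}; expected one of {LANGUAGES}")
    return f"{kind}_{lang}"


def load_wikicorpus(kind: str, lang: str, streaming: bool = True):
    """Hypothetical helper: load the 'train' split of one configuration."""
    # Imported lazily so the name helper above works without the dependency.
    from datasets import load_dataset

    return load_dataset(
        "gboleda/wikicorpus",      # repository id as shown on this page
        config_name(kind, lang),
        split="train",             # the only split listed in the metadata
        streaming=streaming,       # iterate without downloading everything first
    )
```

With streaming enabled, `next(iter(load_wikicorpus("raw", "ca")))` would be expected to yield a record with the `id`, `title`, and `text` fields listed for the raw configurations. Streaming is the default here because the metadata lists `tagged_en` at roughly 18 GB once prepared.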

Provider:

gboleda

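Summary of original information

Dataset card for Wikicorpus

Dataset description

Dataset name: Wikicorpus
Annotation creators:
- machine-generated
- no-annotation
Language creators: found
Languages:
- Catalan (ca)
- English (en)
- Spanish (es)
License: GFDL
Multilinguality: monolingual
Size categories:
- 100K<n<1M
- 10M<n<100M
- 1M<n<10M
Source datasets: original
Task categories:
- fill-mask
- text-classification
- text-generation
- token-classification
Task IDs:
- language-modeling
- masked-language-modeling
- part-of-speech
Tags:
- word-sense-disambiguation
- lemmatization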

Dataset structure

Configuration names and features

raw_ca

Features:
- id: string
- title: string
- text: string
Splits:
- train
  - Bytes: 263170192
  - Examples: 143883
Download size: 96437841 bytes
Dataset size: 263170192 bytes

raw_es

Features:
- id: string
- title: string
- text: string
Splits:
- train
  - Bytes: 671295359
  - Examples: 259409
Download size: 252926918 bytes
Dataset size: 671295359 bytes

raw_en

Features:
- id: string
- title: string
- text: string
Splits:
- train
  - Bytes: 3388801074
  - Examples: 1359146
Download size: 1346378932 bytes
Dataset size: 3388801074 bytes

tagged_ca

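Features:
- id: string
- title: string
- sentence: sequence of strings
- lemmas: sequence of strings
- pos_tags: sequence of strings
- wordnet_senses: sequence of strings
Splits:
- train
  - Bytes: 1666129919
  - Examples: 2016221
Download size: 226390380 bytes
Dataset size: 1666129919 bytes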

tagged_es

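Features:
- id: string
- title: string
- sentence: sequence of strings
- lemmas: sequence of strings
- pos_tags: sequence of strings
- wordnet_senses: sequence of strings
Splits:
- train
  - Bytes: 4100040390
  - Examples: 5039367
Download size: 604910899 bytes
Dataset size: 4100040390 bytes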

tagged_en

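Features:
- id: string
- title: string
- sentence: sequence of strings
- lemmas: sequence of strings
- pos_tags: sequence of strings
- wordnet_senses: sequence of strings
Splits:
- train
  - Bytes: 18077275300
  - Examples: 26350272
Download size: 2477450893 bytes
Dataset size: 18077275300 bytes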
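
The tagged configurations store four string sequences per record: `sentence`, `lemmas`, `pos_tags`, and `wordnet_senses`. The sketch below shows one way to walk such a record token by token. It assumes the four sequences are aligned position by position, which the card does not state explicitly, so the helper checks lengths before zipping. The function name `iter_tokens` is hypothetical, and the sample record uses made-up placeholder values, not real corpus content.

```python
from typing import Any, Iterator, Mapping, Tuple

# Field names of the tagged_* configurations, as listed in the metadata.
SEQUENCE_FIELDS = ("sentence", "lemmas", "pos_tags", "wordnet_senses")


def iter_tokens(record: Mapping[str, Any]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (token, lemma, pos_tag, wordnet_sense) tuples for one record.

    Assumption: the four sequences are parallel, one entry per token.
    """
    columns = [list(record[name]) for name in SEQUENCE_FIELDS]
    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        raise ValueError(f"sequence fields differ in length: {sorted(lengths)}")
    return zip(*columns)


# Placeholder record: every value here is invented for illustration only.
example = {
    "id": "0",
    "title": "Placeholder title",
    "sentence": ["tok_1", "tok_2", "tok_3"],
    "lemmas": ["lemma_1", "lemma_2", "lemma_3"],
    "pos_tags": ["POS_1", "POS_2", "POS_3"],
    "wordnet_senses": ["sense_1", "sense_2", "sense_3"],
}

for token, lemma, pos, sense in iter_tokens(example):
    print(token, lemma, pos, sense)
```

Raising on a length mismatch, instead of letting `zip` truncate silently, makes it obvious when the alignment assumption does not hold for a given record.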

Configuration names

raw_ca
raw_en
raw_es
tagged_ca
tagged_en
tagged_es
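
The per-configuration figures listed above can be turned into a rough storage estimate before choosing what to download. The sketch below copies the example counts and byte sizes from the metadata and derives the mean bytes per example and the ratio of download size to prepared dataset size. The dictionary layout and the function name `summarize` are illustrative choices, not part of the dataset or of any library.

```python
# (num_examples, dataset_size_bytes, download_size_bytes), copied from the metadata.
CONFIGS = {
    "raw_ca": (143883, 263170192, 96437841),
    "raw_es": (259409, 671295359, 252926918),
    "raw_en": (1359146, 3388801074, 1346378932),
    "tagged_ca": (2016221, 1666129919, 226390380),
    "tagged_es": (5039367, 4100040390, 604910899),
    "tagged_en": (26350272, 18077275300, 2477450893),
}


def summarize(configs):
    """Return mean bytes per example and download/dataset ratio per configuration."""
    return {
        name: {
            "bytes_per_example": dataset_bytes / examples,
            "download_ratio": download_bytes / dataset_bytes,
        }
        for name, (examples, dataset_bytes, download_bytes) in configs.items()
    }


TOTAL_DATASET_BYTES = sum(size for _, size, _ in CONFIGS.values())
TOTAL_DOWNLOAD_BYTES = sum(download for _, _, download in CONFIGS.values())

for name, row in summarize(CONFIGS).items():
    print(
        f"{name:<10} {row['bytes_per_example']:>10.1f} B/example  "
        f"download/dataset = {row['download_ratio']:.3f}"
    )
print(f"all configs: {TOTAL_DOWNLOAD_BYTES / 1e9:.2f} GB to download, "
      f"{TOTAL_DATASET_BYTES / 1e9:.2f} GB prepared")
```

Taken together, the six configurations come to about 5 GB of downloads and about 28 GB once prepared, with `tagged_en` accounting for most of both.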
