softcatala/catalan-dictionary
Source: Hugging Face. Updated 2022-10-24; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/softcatala/catalan-dictionary
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- ca
license:
- gpl-2.0
- lgpl-2.1
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
pretty_name: catalan-dictionary
---
# Dataset Card for catalan-dictionary
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:** https://github.com/Softcatala/catalan-dict-tools
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
Human-curated Catalan word lists with part-of-speech labels. The dataset contains 1,180,773 word forms, including verbs, nouns, adjectives, proper names, and toponyms. These word lists are used to build applications such as Catalan spellcheckers and verb-querying tools.
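As a rough illustration of how such word lists can back a spellchecker or a verb-query tool, the sketch below builds two in-memory indexes from (form, lemma, POS tag) triples. It is a minimal example under stated assumptions, not the way Softcatalà's tools are implemented: the function names `build_indexes` and `is_known_form` are hypothetical, and only the first sample row comes from this card; the other rows are hand-written for illustration.

```python
from collections import defaultdict

# Illustrative (form, lemma, POS tag) rows. Only the first row appears in this
# dataset card; the other two are hand-written examples, not copied from the data.
ROWS = [
    ("cantaré", "cantar", "VMIF1S00"),
    ("cantarà", "cantar", "VMIF3S00"),
    ("cantes", "cantar", "VMIP2S00"),
]


def build_indexes(rows):
    """Return (form -> set of (lemma, tag)) and (lemma -> sorted list of forms)."""
    by_form = defaultdict(set)
    by_lemma = defaultdict(set)
    for form, lemma, tag in rows:
        by_form[form].add((lemma, tag))
        by_lemma[lemma].add(form)
    forms_by_lemma = {lemma: sorted(forms) for lemma, forms in by_lemma.items()}
    return dict(by_form), forms_by_lemma


def is_known_form(word, by_form):
    """Spellchecker-style membership test: is this exact form in the word list?"""
    return word in by_form


by_form, forms_by_lemma = build_indexes(ROWS)
```

With these indexes, `is_known_form("cantaré", by_form)` answers a spellcheck-style question, while `forms_by_lemma["cantar"]` lists every known inflected form of a verb.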
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
Catalan (`ca`).
## Dataset Structure
The dataset contains 3 columns:
* Form (e.g. cantaré)
* Lemma (e.g. cantar)
* POS tag (e.g. VMIF1S00)
The POS tags are explained in the FreeLing Catalan tagset documentation (the link points to the verb section): https://freeling-user-manual.readthedocs.io/en/latest/tagsets/tagset-ca/#part-of-speech-verb
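To make the positional tag format concrete, here is a minimal sketch that decodes the first six characters of a verb tag such as `VMIF1S00`. It assumes the EAGLES-style layout used by the FreeLing tagset (category, type, mood, tense, person, number); the lookup tables are partial, the function name `decode_verb_tag` is hypothetical, and any characters after the sixth position are left uninterpreted because this card does not describe them.

```python
# Partial lookup tables for verb tags, assuming an EAGLES-style positional layout.
# Codes that are not listed are reported as "unknown (<code>)".
VERB_TYPE = {"M": "main", "A": "auxiliary", "S": "semiauxiliary"}
MOOD = {
    "I": "indicative", "S": "subjunctive", "M": "imperative",
    "N": "infinitive", "G": "gerund", "P": "participle",
}
TENSE = {"P": "present", "I": "imperfect", "F": "future", "S": "past", "C": "conditional"}
PERSON = {"1": "first", "2": "second", "3": "third"}
NUMBER = {"S": "singular", "P": "plural"}


def _lookup(table, code):
    """Translate one tag character; '0' means the feature does not apply."""
    if code == "0":
        return None
    return table.get(code, f"unknown ({code})")


def decode_verb_tag(tag):
    """Decode positions 1-6 of a verb POS tag into a feature dictionary."""
    if len(tag) < 6 or tag[0] != "V":
        raise ValueError(f"not a verb tag: {tag!r}")
    return {
        "category": "verb",
        "type": _lookup(VERB_TYPE, tag[1]),
        "mood": _lookup(MOOD, tag[2]),
        "tense": _lookup(TENSE, tag[3]),
        "person": _lookup(PERSON, tag[4]),
        "number": _lookup(NUMBER, tag[5]),
    }


features = decode_verb_tag("VMIF1S00")
```

For the example row above, `features` describes "cantaré" as a main verb in the indicative future, first person singular.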
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[LGPL 2.1](https://www.gnu.org/licenses/old-licenses/lgpl-2.1.html).
[GPL 2.0](https://www.gnu.org/licenses/old-licenses/gpl-2.0.html).
### Citation Information
[More Information Needed]
### Contributions
Softcatalà
Jaume Ortolà
Joan Moratinos



