varox34/telugu-dataset

Name: varox34/telugu-dataset
Creator: varox34
Published: 2024-01-10 16:59:49
License: No description available

Hugging Face · Updated 2024-01-10 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/varox34/telugu-dataset

Resource description:

---
YAML tags: null
annotations_creators:
- expert-generated
language:
- te
language_creators:
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: UD_Spanish-AnCora
source_datasets: []
task_categories:
- token-classification
task_ids:
- part-of-speech
---

# UD_Spanish-AnCora

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Website:** https://github.com/UniversalDependencies/UD_Spanish-AnCora
- **Point of Contact:** [Daniel Zeman](zeman@ufal.mff.cuni.cz)

### Dataset Summary

This dataset is composed of the annotations from the [AnCora corpus](http://clic.ub.edu/corpus/), projected on the [Universal Dependencies treebank](https://universaldependencies.org/). We use the POS annotations of this corpus as part of the EvalEs Spanish language benchmark.

### Supported Tasks and Leaderboards

POS tagging

### Languages

The dataset is in Spanish (`es-ES`)

## Dataset Structure

### Data Instances

Three conllu files.

Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

1) Word lines containing the annotation of a word/token in 10 fields separated by single tab characters (see below).
2) Blank lines marking sentence boundaries.
3) Comment lines starting with hash (#).

### Data Fields

Word lines contain the following fields:

1) ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
2) FORM: Word form or punctuation symbol.
3) LEMMA: Lemma or stem of word form.
4) UPOS: Universal part-of-speech tag.
5) XPOS: Language-specific part-of-speech tag; underscore if not available.
6) FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7) HEAD: Head of the current word, which is either a value of ID or zero (0).
8) DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9) DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
10) MISC: Any other annotation.

From: [https://universaldependencies.org](https://universaldependencies.org/guidelines.html)

### Data Splits

- es_ancora-ud-train.conllu
- es_ancora-ud-dev.conllu
- es_ancora-ud-test.conllu

## Dataset Creation

### Curation Rationale

[N/A]

### Source Data

[UD_Spanish-AnCora](https://github.com/UniversalDependencies/UD_Spanish-AnCora)

#### Initial Data Collection and Normalization

The original annotation was done in a constituency framework as a part of the [AnCora project](http://clic.ub.edu/corpus/) at the University of Barcelona. It was converted to dependencies by the [Universal Dependencies team](https://universaldependencies.org/) and used in the CoNLL 2009 shared task. The CoNLL 2009 version was later converted to HamleDT and to Universal Dependencies.

For more information on the AnCora project, visit the [AnCora site](http://clic.ub.edu/corpus/).

To learn about Universal Dependencies, visit the webpage [https://universaldependencies.org](https://universaldependencies.org)

#### Who are the source language producers?

For more information on the AnCora corpus and its sources, visit the [AnCora site](http://clic.ub.edu/corpus/).

### Annotations

#### Annotation process

For more information on the first AnCora annotation, visit the [AnCora site](http://clic.ub.edu/corpus/).

#### Who are the annotators?

For more information on the AnCora annotation team, visit the [AnCora site](http://clic.ub.edu/corpus/).

### Personal and Sensitive Information

No personal or sensitive information included.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset contributes to the development of language models in Spanish.

### Discussion of Biases

[N/A]

### Other Known Limitations

[N/A]

## Additional Information

### Dataset Curators

[N/A]

### Licensing Information

This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC Attribution 4.0 International License</a>.

### Citation Information

The following paper must be cited when using this corpus:

Taulé, M., M.A. Martí, M. Recasens (2008) 'Ancora: Multilevel Annotated Corpora for Catalan and Spanish', Proceedings of 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

To cite the Universal Dependencies project:

Rueter, J. (Creator), Erina, O. (Contributor), Klementeva, J. (Contributor), Ryabov, I. (Contributor), Tyers, F. M. (Contributor), Zeman, D. (Contributor), Nivre, J. (Creator) (15 Nov 2020). Universal Dependencies version 2.7 Erzya JR. Universal Dependencies Consortium.

### Contributions

[N/A]

Provider:

varox34

Summary of original information

UD_Spanish-AnCora dataset overview

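Dataset Description

Dataset Summary

This dataset is composed of annotations from the AnCora corpus, projected onto the Universal Dependencies treebank. The POS annotations of this corpus are used as part of the EvalEs Spanish language benchmark.

Supported Tasks and Leaderboards

POS tagging

Languages

The dataset is in Spanish (es-ES)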

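Dataset Structure

Data Instances

Three conllu files.

Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

Word lines containing the annotation of a word/token in 10 tab-separated fields.
Blank lines marking sentence boundaries.
Comment lines starting with hash (#).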
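
The three line types can be told apart with a couple of string checks. The following is a minimal sketch, not code from the dataset authors; the helper name `classify_line` and the sample sentence are invented for illustration and do not come from the corpus.

```python
def classify_line(line: str) -> str:
    """Return 'blank', 'comment', or 'word' for a single CoNLL-U line."""
    stripped = line.rstrip("\n")
    if stripped == "":
        return "blank"      # sentence boundary
    if stripped.startswith("#"):
        return "comment"    # sentence-level comment or metadata
    return "word"           # 10 tab-separated annotation fields


# Invented sample (hypothetical, not taken from the corpus).
sample = (
    "# text = El gato duerme.\n"
    "1\tEl\tel\tDET\t_\t_\t2\tdet\t_\t_\n"
    "\n"
)
kinds = [classify_line(line) for line in sample.splitlines()]
print(kinds)
```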

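Data Fields

Word lines contain the following fields:

ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
FORM: Word form or punctuation symbol.
LEMMA: Lemma or stem of word form.
UPOS: Universal part-of-speech tag.
XPOS: Language-specific part-of-speech tag; underscore if not available.
FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD: Head of the current word, which is either a value of ID or zero (0).
DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
MISC: Any other annotation.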
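
As a rough illustration of the field layout, a word line can be split on tabs and mapped onto the ten field names listed above. This is a sketch under stated assumptions: `parse_word_line` and `id_kind` are hypothetical helpers, and the sample line is invented rather than taken from the corpus.

```python
FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def parse_word_line(line: str) -> dict:
    """Split one word line into a dict keyed by the 10 field names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def id_kind(token_id: str) -> str:
    """Classify the ID field: range, decimal, or plain integer."""
    if "-" in token_id:
        return "multiword"  # range such as 3-4
    if "." in token_id:
        return "empty"      # decimal such as 0.1 or 5.1
    return "word"           # plain integer index


# Invented sample line (hypothetical, not taken from the corpus).
token = parse_word_line("1\tEl\tel\tDET\t_\t_\t2\tdet\t_\t_")
print(token["FORM"], token["UPOS"], id_kind(token["ID"]))
```

Keeping every value as a string avoids guessing at types: HEAD is an integer for ordinary words, but underscores are legal in several columns, so conversion is best left to the caller.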

Data Splits

es_ancora-ud-train.conllu
es_ancora-ud-dev.conllu
es_ancora-ud-test.conllu
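
For the POS tagging task, each split can be reduced to sentences of (FORM, UPOS) pairs. The sketch below is one possible reader, not an official loader; `read_pos_sentences` is a hypothetical name, the in-memory sample is invented, and skipping multiword ranges and empty nodes is a design choice made here, not something the dataset card prescribes.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def read_pos_sentences(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (FORM, UPOS) pairs per sentence."""
    sentence: List[Tuple[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):          # comment line
            continue
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue                      # skip multiword ranges and empty nodes
        sentence.append((cols[1], cols[3]))  # FORM, UPOS
    if sentence:                          # file without a trailing blank line
        yield sentence


# Invented sample standing in for a split file (not taken from the corpus).
sample = io.StringIO(
    "# sent_id = demo-1\n"
    "1\tEl\tel\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tgato\tgato\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tduerme\tdormir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
sentences = list(read_pos_sentences(sample))
print(sentences)
```

To run this on a real split, one would pass an open file handle instead, for example `open("es_ancora-ud-train.conllu", encoding="utf-8")`, assuming the file has been downloaded to the working directory.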

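Dataset Creation

Source Data

The original annotation was done in a constituency framework as part of the AnCora project at the University of Barcelona. It was converted to dependencies by the Universal Dependencies team and used in the CoNLL 2009 shared task. The CoNLL 2009 version was later converted to HamleDT and to Universal Dependencies.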

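Annotations

Annotation process

For more information on the AnCora annotation, visit the AnCora site.

Annotators

For more information on the AnCora annotation team, visit the AnCora site.

Personal and Sensitive Information

No personal or sensitive information included.

Considerations for Using the Data

Social Impact of Dataset

This dataset contributes to the development of language models in Spanish.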

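Additional Information

Licensing Information

This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC Attribution 4.0 International License</a>.

Citation Information

The following paper must be cited when using this corpus: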

Taulé, M., M.A. Martí, M. Recasens (2008) Ancora: Multilevel Annotated Corpora for Catalan and Spanish, Proceedings of 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

To cite the Universal Dependencies project:

Rueter, J. (Creator), Erina, O. (Contributor), Klementeva, J. (Contributor), Ryabov, I. (Contributor), Tyers, F. M. (Contributor), Zeman, D. (Contributor), Nivre, J. (Creator) (15 Nov 2020). Universal Dependencies version 2.7 Erzya JR. Universal Dependencies Consortium.
