varox34/telugu-dataset

Source: Hugging Face. Updated 2024-01-10. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/varox34/telugu-dataset
Resource description:
---
YAML tags: null
annotations_creators:
- expert-generated
language:
- te
language_creators:
- found
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: UD_Spanish-AnCora
source_datasets: []
task_categories:
- token-classification
task_ids:
- part-of-speech
---

# UD_Spanish-AnCora

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Website:** https://github.com/UniversalDependencies/UD_Spanish-AnCora
- **Point of Contact:** [Daniel Zeman](zeman@ufal.mff.cuni.cz)

### Dataset Summary

This dataset is composed of the annotations from the [AnCora corpus](http://clic.ub.edu/corpus/), projected on the [Universal Dependencies treebank](https://universaldependencies.org/). We use the POS annotations of this corpus as part of the EvalEs Spanish language benchmark.

### Supported Tasks and Leaderboards

POS tagging

### Languages

The dataset is in Spanish (`es-ES`).

## Dataset Structure

### Data Instances

Three conllu files.

Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

1) Word lines containing the annotation of a word/token in 10 fields separated by single tab characters (see below).
2) Blank lines marking sentence boundaries.
3) Comment lines starting with hash (#).

### Data Fields

Word lines contain the following fields:

1) ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
2) FORM: Word form or punctuation symbol.
3) LEMMA: Lemma or stem of word form.
4) UPOS: Universal part-of-speech tag.
5) XPOS: Language-specific part-of-speech tag; underscore if not available.
6) FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7) HEAD: Head of the current word, which is either a value of ID or zero (0).
8) DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
9) DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
10) MISC: Any other annotation.

From: [https://universaldependencies.org](https://universaldependencies.org/guidelines.html)

### Data Splits

- es_ancora-ud-train.conllu
- es_ancora-ud-dev.conllu
- es_ancora-ud-test.conllu

## Dataset Creation

### Curation Rationale

[N/A]

### Source Data

[UD_Spanish-AnCora](https://github.com/UniversalDependencies/UD_Spanish-AnCora)

#### Initial Data Collection and Normalization

The original annotation was done in a constituency framework as a part of the [AnCora project](http://clic.ub.edu/corpus/) at the University of Barcelona. It was converted to dependencies by the [Universal Dependencies team](https://universaldependencies.org/) and used in the CoNLL 2009 shared task.
The CoNLL 2009 version was later converted to HamleDT and to Universal Dependencies.

For more information on the AnCora project, visit the [AnCora site](http://clic.ub.edu/corpus/).

To learn about Universal Dependencies, visit the webpage [https://universaldependencies.org](https://universaldependencies.org)

#### Who are the source language producers?

For more information on the AnCora corpus and its sources, visit the [AnCora site](http://clic.ub.edu/corpus/).

### Annotations

#### Annotation process

For more information on the first AnCora annotation, visit the [AnCora site](http://clic.ub.edu/corpus/).

#### Who are the annotators?

For more information on the AnCora annotation team, visit the [AnCora site](http://clic.ub.edu/corpus/).

### Personal and Sensitive Information

No personal or sensitive information included.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset contributes to the development of language models in Spanish.

### Discussion of Biases

[N/A]

### Other Known Limitations

[N/A]

## Additional Information

### Dataset Curators

[N/A]

### Licensing Information

This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC Attribution 4.0 International License</a>.

### Citation Information

The following paper must be cited when using this corpus:

Taulé, M., M.A. Martí, M. Recasens (2008) 'Ancora: Multilevel Annotated Corpora for Catalan and Spanish', Proceedings of 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

To cite the Universal Dependencies project:

Rueter, J. (Creator), Erina, O. (Contributor), Klementeva, J. (Contributor), Ryabov, I. (Contributor), Tyers, F. M. (Contributor), Zeman, D. (Contributor), Nivre, J. (Creator) (15 Nov 2020). Universal Dependencies version 2.7 Erzya JR. Universal Dependencies Consortium.

### Contributions

[N/A]
Provider:
varox34
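Summary of Original Information

UD_Spanish-AnCora Dataset Overview

Dataset Description

Dataset Summary

This dataset consists of annotations from the AnCora corpus, projected onto the Universal Dependencies treebank. We use the POS annotations of this corpus as part of the EvalEs Spanish language benchmark.

Supported Tasks and Leaderboards

  • POS tagging

Languages

The dataset is in Spanish (es-ES).

Dataset Structure

Data Instances

Three conllu files.

Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

  1. Word lines containing the annotation of a word/token in 10 fields separated by single tab characters.
  2. Blank lines marking sentence boundaries.
  3. Comment lines starting with hash (#).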
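
The file conventions and the three line types described above can be checked with a few string operations. The following is a minimal Python sketch, not part of the dataset's own tooling: the function names `follows_file_conventions` and `classify_line` are hypothetical, and the two-token sample sentence is invented for illustration rather than taken from the corpus.

```python
import unicodedata


def follows_file_conventions(raw: bytes) -> bool:
    """Check the file-level rules: UTF-8, NFC, LF-only breaks, trailing LF."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    if "\r" in text or not text.endswith("\n"):
        return False
    return unicodedata.normalize("NFC", text) == text


def classify_line(line: str) -> str:
    """Return 'blank', 'comment' or 'word' for one line (LF already removed)."""
    if line == "":
        return "blank"
    if line.startswith("#"):
        return "comment"
    return "word"


# Invented sample (assumption): one comment line, two word lines, one blank line.
SAMPLE = (
    "# text = Hola mundo\n"
    "1\tHola\thola\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tmundo\tmundo\tNOUN\t_\t_\t1\tvocative\t_\t_\n"
    "\n"
)

# Drop the empty string that split() leaves after the final LF.
kinds = [classify_line(line) for line in SAMPLE.split("\n")[:-1]]
print(kinds)
print(follows_file_conventions(SAMPLE.encode("utf-8")))
```

Comparing the text with its own NFC normalization keeps the check independent of the Python version; files that use CRLF line breaks or lack the final LF are rejected before normalization is tested.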

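Data Fields

Word lines contain the following fields:

  1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
  2. FORM: Word form or punctuation symbol.
  3. LEMMA: Lemma or stem of word form.
  4. UPOS: Universal part-of-speech tag.
  5. XPOS: Language-specific part-of-speech tag; underscore if not available.
  6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  7. HEAD: Head of the current word, which is either a value of ID or zero (0).
  8. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  10. MISC: Any other annotation.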
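
As an illustration of the ten fields, the sketch below splits one word line into a small record and distinguishes the three kinds of ID. It is a hedged example under stated assumptions: the names `WordLine`, `parse_word_line`, `id_kind` and `parse_feats` are hypothetical, the sample line is invented, and the `Name=Value` pairs joined by `|` in FEATS follow the general CoNLL-U convention rather than anything stated in this description.

```python
from dataclasses import dataclass


@dataclass
class WordLine:
    id: str      # kept as text because it may be "3", "3-4" or "3.1"
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_word_line(line: str) -> WordLine:
    """Split a word line on single tab characters into its 10 fields."""
    fields = line.split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 tab-separated fields, got {len(fields)}")
    return WordLine(*fields)


def id_kind(word_id: str) -> str:
    """Classify an ID as a multiword-token range, an empty node or a plain word."""
    if "-" in word_id:
        return "multiword"
    if "." in word_id:
        return "empty"
    return "word"


def parse_feats(feats: str) -> dict:
    """Turn 'Gender=Fem|Number=Plur' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Invented example line (assumption), not copied from the corpus.
word = parse_word_line("1\tcasas\tcasa\tNOUN\t_\tGender=Fem|Number=Plur\t0\troot\t_\t_")
print(word.form, word.upos, id_kind(word.id), parse_feats(word.feats))
```

HEAD is left as a string here so that the record can also hold multiword-token lines, where most fields are underscores; callers that need an integer head can convert it after checking `id_kind`.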

Data Splits

  • es_ancora-ud-train.conllu
  • es_ancora-ud-dev.conllu
  • es_ancora-ud-test.conllu
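
For the POS tagging task, each split file can be reduced to sentences of (FORM, UPOS) pairs. The following is a rough Python sketch, assuming the file content has already been read into a string; the function name `read_pos_sentences` is hypothetical and the sample sentence is invented. Skipping multiword-token ranges and empty nodes is one possible design choice, made here so that each syntactic word is counted once.

```python
from typing import Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_pos_sentences(text: str) -> Iterator[Sentence]:
    """Yield one list of (FORM, UPOS) pairs per sentence in CoNLL-U text."""
    sentence: Sentence = []
    for line in text.split("\n"):
        if line.startswith("#"):
            continue  # comment line
        if line == "":
            if sentence:  # blank line closes the current sentence
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        word_id, form, upos = fields[0], fields[1], fields[3]
        if "-" in word_id or "." in word_id:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append((form, upos))
    if sentence:  # file without a final blank line
        yield sentence


# Invented sample (assumption): "del" is a multiword token covering words 2-3.
SAMPLE = (
    "# text = Vengo del mar\n"
    "1\tVengo\tvenir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2-3\tdel\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tde\tde\tADP\t_\t_\t4\tcase\t_\t_\n"
    "3\tel\tel\tDET\t_\t_\t4\tdet\t_\t_\n"
    "4\tmar\tmar\tNOUN\t_\t_\t1\tobl\t_\t_\n"
    "\n"
)

sentences = list(read_pos_sentences(SAMPLE))
print(sentences)
```

To apply this to one of the listed splits, the text could be obtained with `pathlib.Path("es_ancora-ud-train.conllu").read_text(encoding="utf-8")`, assuming the file has been downloaded to the working directory.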

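Dataset Creation

Source Data

The original annotation was done in a constituency framework as a part of the AnCora project at the University of Barcelona. It was converted to dependencies by the Universal Dependencies team and used in the CoNLL 2009 shared task. The CoNLL 2009 version was later converted to HamleDT and to Universal Dependencies.

Annotations

Annotation Process

For more information on the AnCora annotation, visit the AnCora site.

Annotators

For more information on the AnCora annotation team, visit the AnCora site.

Personal and Sensitive Information

No personal or sensitive information included.

Considerations for Using the Data

Social Impact of Dataset

This dataset contributes to the development of language models in Spanish.

Additional Information

Licensing Information

This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC Attribution 4.0 International License</a>.

Citation Information

The following paper must be cited when using this corpus:

Taulé, M., M.A. Martí, M. Recasens (2008) Ancora: Multilevel Annotated Corpora for Catalan and Spanish, Proceedings of 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).

To cite the Universal Dependencies project:

Rueter, J. (Creator), Erina, O. (Contributor), Klementeva, J. (Contributor), Ryabov, I. (Contributor), Tyers, F. M. (Contributor), Zeman, D. (Contributor), Nivre, J. (Creator) (15 Nov 2020). Universal Dependencies version 2.7 Erzya JR. Universal Dependencies Consortium.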
