erayyildiz/turkish_ner

Name: erayyildiz/turkish_ner
Creator: erayyildiz
Published: 2024-01-18 11:17:29
License: cc-by-4.0

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/erayyildiz/turkish_ner

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- expert-generated
language:
- tr
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: TurkishNer
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: domain
    dtype:
      class_label:
        names:
          '0': architecture
          '1': basketball
          '2': book
          '3': business
          '4': education
          '5': fictional_universe
          '6': film
          '7': food
          '8': geography
          '9': government
          '10': law
          '11': location
          '12': military
          '13': music
          '14': opera
          '15': organization
          '16': people
          '17': religion
          '18': royalty
          '19': soccer
          '20': sports
          '21': theater
          '22': time
          '23': travel
          '24': tv
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-PERSON
          '2': I-PERSON
          '3': B-ORGANIZATION
          '4': I-ORGANIZATION
          '5': B-LOCATION
          '6': I-LOCATION
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 177658278
    num_examples: 532629
  download_size: 204393976
  dataset_size: 177658278
---

# Dataset Card for turkish_ner

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://arxiv.org/abs/1702.02363
- **Repository:** [Needs More Information]
- **Paper:** http://arxiv.org/abs/1702.02363
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** erayyildiz@ktu.edu.tr

### Dataset Summary

Automatically annotated Turkish corpus for named entity recognition and text categorization using large-scale gazetteers. The constructed gazetteers contain approximately 300K entities with thousands of fine-grained entity types under 25 different domains.

### Supported Tasks and Leaderboards

[Needs More Information]

### Languages

Turkish

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

There's only the training set.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

H. Bahadir Sahin, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren and Omer Ozan Sonmez

### Licensing Information

Creative Commons Attribution 4.0 International

### Citation Information

@article{DBLP:journals/corr/SahinTYES17,
  author = {H. Bahadir Sahin and Caglar Tirkaz and Eray Yildiz and Mustafa Tolga Eren and Omer Ozan Sonmez},
  title = {Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers},
  journal = {CoRR},
  volume = {abs/1702.02363},
  year = {2017},
  url = {http://arxiv.org/abs/1702.02363},
  archivePrefix = {arXiv},
  eprint = {1702.02363},
  timestamp = {Mon, 13 Aug 2018 16:46:36 +0200},
  biburl = {https://dblp.org/rec/journals/corr/SahinTYES17.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

### Contributions

Thanks to [@merveenoyan](https://github.com/merveenoyan) for adding this dataset.

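Provided by:

erayyildiz

Summary of original information

Dataset card for TurkishNer

Dataset description

Dataset summary

An automatically annotated Turkish corpus for named entity recognition and text categorization, built using large-scale gazetteers. The constructed gazetteers contain approximately 300K entities with thousands of fine-grained entity types across 25 domains.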
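
To make the idea of gazetteer-based annotation concrete, the following is a minimal sketch of greedy longest-match tagging. It is an illustration only and is not taken from the paper: the toy `GAZETTEER`, the `annotate` function, and the example tokens are all hypothetical, and the authors' actual pipeline may differ substantially.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer mapping a token tuple to a coarse entity type.
# The real gazetteers (about 300K entities) are not reproduced here.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Mustafa", "Kemal", "Atatürk"): "PERSON",
    ("Ankara",): "LOCATION",
    ("Türk", "Hava", "Yolları"): "ORGANIZATION",
}
# Longest entry length, used to bound the lookup window.
MAX_LEN = max(len(key) for key in GAZETTEER)


def annotate(tokens: List[str]) -> List[str]:
    """Tag tokens with BIO labels using greedy longest-match lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-token names win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            entity_type = GAZETTEER.get(tuple(tokens[i:i + n]))
            if entity_type is not None:
                tags[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{entity_type}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical tokenized sentence, for illustration only.
sentence = ["Mustafa", "Kemal", "Atatürk", "Ankara", "'da", "yaşadı", "."]
print(list(zip(sentence, annotate(sentence))))
```

Exact-match lookup like this cannot resolve ambiguous names or inflected forms, which is one reason machine-generated labels of this kind should be treated as noisy.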

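Supported tasks and leaderboards

[Needs More Information]

Languages

Turkish

Dataset structure

Data instances

[More Information Needed]

Data fields

id: string
tokens: sequence of strings
domain: class label, one of the following domains: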
- 0: architecture
- 1: basketball
- 2: book
- 3: business
- 4: education
- 5: fictional_universe
- 6: film
- 7: food
- 8: geography
- 9: government
- 10: law
- 11: location
- 12: military
- 13: music
- 14: opera
- 15: organization
- 16: people
- 17: religion
- 18: royalty
- 19: soccer
- 20: sports
- 21: theater
- 22: time
- 23: travel
- 24: tv
ner_tags: sequence of class labels, each one of the following tags:
- 0: O
- 1: B-PERSON
- 2: I-PERSON
- 3: B-ORGANIZATION
- 4: I-ORGANIZATION
- 5: B-LOCATION
- 6: I-LOCATION
- 7: B-MISC
- 8: I-MISC
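
The fields above store `domain` and `ner_tags` as integer ids. The sketch below shows one way to map those ids back to names and to collapse BIO tags into entity spans. The label lists are copied from the card; the `extract_entities` helper and the `example` record are hypothetical and were written for illustration, not taken from the dataset.

```python
from typing import List, Tuple

# Label names in id order, copied from the field listing in the card.
DOMAINS = [
    "architecture", "basketball", "book", "business", "education",
    "fictional_universe", "film", "food", "geography", "government",
    "law", "location", "military", "music", "opera", "organization",
    "people", "religion", "royalty", "soccer", "sports", "theater",
    "time", "travel", "tv",
]
NER_TAGS = [
    "O", "B-PERSON", "I-PERSON", "B-ORGANIZATION", "I-ORGANIZATION",
    "B-LOCATION", "I-LOCATION", "B-MISC", "I-MISC",
]


def extract_entities(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Collapse BIO tag ids into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag_id in zip(tokens, tag_ids):
        tag = NER_TAGS[tag_id]
        if tag.startswith("B-"):
            # A new entity starts; flush any open span first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the open span and ignore this token.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical record shaped like the schema above (not a real row).
example = {
    "id": "0",
    "tokens": ["Orhan", "Pamuk", "İstanbul", "'da", "doğdu", "."],
    "domain": 2,
    "ner_tags": [1, 2, 5, 0, 0, 0],
}
print(DOMAINS[example["domain"]])
print(extract_entities(example["tokens"], example["ner_tags"]))
```

Dropping stray `I-` tags is a design choice made here for simplicity; a stricter reader might prefer to raise an error or to start a new span instead.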

Data splits

Only a training split is provided (532,629 examples).
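
Because no validation or test split ships with the data, users who need one have to carve it out themselves. The following is a minimal sketch of a deterministic split keyed on the `id` field; the `assign_split` helper and the 10% default are assumptions made for illustration, not part of the dataset or its card.

```python
import hashlib


def assign_split(example_id: str, validation_fraction: float = 0.1) -> str:
    """Deterministically assign an example to 'train' or 'validation'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "train"


# The same id always lands in the same split, regardless of row order.
for example_id in ["0", "1", "2"]:
    print(example_id, assign_split(example_id))
```

Hashing the id rather than shuffling keeps the assignment stable across runs and machines, at the cost of only approximating the requested fraction.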

Dataset creation

Curation rationale

[More Information Needed]

Source data

Initial data collection and normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and sensitive information

[More Information Needed]

Considerations for using the data

Social impact of dataset

[More Information Needed]

Discussion of biases

[More Information Needed]

Other known limitations

[More Information Needed]

Additional information

Dataset curators

H. Bahadir Sahin, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren and Omer Ozan Sonmez

Licensing information

Creative Commons Attribution 4.0 International

Citation information

@article{DBLP:journals/corr/SahinTYES17,
  author = {H. Bahadir Sahin and Caglar Tirkaz and Eray Yildiz and Mustafa Tolga Eren and Omer Ozan Sonmez},
  title = {Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers},
  journal = {CoRR},
  volume = {abs/1702.02363},
  year = {2017},
  url = {http://arxiv.org/abs/1702.02363},
  archivePrefix = {arXiv},
  eprint = {1702.02363},
  timestamp = {Mon, 13 Aug 2018 16:46:36 +0200},
  biburl = {https://dblp.org/rec/journals/corr/SahinTYES17.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @merveenoyan for adding this dataset.
