
erayyildiz/turkish_ner

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/erayyildiz/turkish_ner
Resource overview:
---
annotations_creators:
- machine-generated
language_creators:
- expert-generated
language:
- tr
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: TurkishNer
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: domain
    dtype:
      class_label:
        names:
          '0': architecture
          '1': basketball
          '2': book
          '3': business
          '4': education
          '5': fictional_universe
          '6': film
          '7': food
          '8': geography
          '9': government
          '10': law
          '11': location
          '12': military
          '13': music
          '14': opera
          '15': organization
          '16': people
          '17': religion
          '18': royalty
          '19': soccer
          '20': sports
          '21': theater
          '22': time
          '23': travel
          '24': tv
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-PERSON
          '2': I-PERSON
          '3': B-ORGANIZATION
          '4': I-ORGANIZATION
          '5': B-LOCATION
          '6': I-LOCATION
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 177658278
    num_examples: 532629
  download_size: 204393976
  dataset_size: 177658278
---

# Dataset Card for turkish_ner

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://arxiv.org/abs/1702.02363
- **Repository:** [Needs More Information]
- **Paper:** http://arxiv.org/abs/1702.02363
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** erayyildiz@ktu.edu.tr

### Dataset Summary

An automatically annotated Turkish corpus for named entity recognition and text categorization, built using large-scale gazetteers. The constructed gazetteers contain approximately 300K entities with thousands of fine-grained entity types under 25 different domains.

### Supported Tasks and Leaderboards

[Needs More Information]

### Languages

Turkish

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

The dataset provides only a training split.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

H. Bahadir Sahin, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren and Omer Ozan Sonmez

### Licensing Information

Creative Commons Attribution 4.0 International

### Citation Information

    @article{DBLP:journals/corr/SahinTYES17,
      author        = {H. Bahadir Sahin and Caglar Tirkaz and Eray Yildiz and Mustafa Tolga Eren and Omer Ozan Sonmez},
      title         = {Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers},
      journal       = {CoRR},
      volume        = {abs/1702.02363},
      year          = {2017},
      url           = {http://arxiv.org/abs/1702.02363},
      archivePrefix = {arXiv},
      eprint        = {1702.02363},
      timestamp     = {Mon, 13 Aug 2018 16:46:36 +0200},
      biburl        = {https://dblp.org/rec/journals/corr/SahinTYES17.bib},
      bibsource     = {dblp computer science bibliography, https://dblp.org}
    }

### Contributions

Thanks to [@merveenoyan](https://github.com/merveenoyan) for adding this dataset.
Provider:
erayyildiz
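Summary of original information

Dataset Card for TurkishNer

Dataset Description

Dataset Summary

An automatically annotated Turkish corpus for named entity recognition and text categorization, built using large-scale gazetteers. The constructed gazetteers contain approximately 300K entities with thousands of fine-grained entity types under 25 different domains.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Turkish

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

  • id: string
  • tokens: sequence of strings
  • domain: class label, one of the following domains: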
    • 0: architecture
    • 1: basketball
    • 2: book
    • 3: business
    • 4: education
    • 5: fictional_universe
    • 6: film
    • 7: food
    • 8: geography
    • 9: government
    • 10: law
    • 11: location
    • 12: military
    • 13: music
    • 14: opera
    • 15: organization
    • 16: people
    • 17: religion
    • 18: royalty
    • 19: soccer
    • 20: sports
    • 21: theater
    • 22: time
    • 23: travel
    • 24: tv
  • ner_tags: sequence of class labels, drawn from the following tags:
    • 0: O
    • 1: B-PERSON
    • 2: I-PERSON
    • 3: B-ORGANIZATION
    • 4: I-ORGANIZATION
    • 5: B-LOCATION
    • 6: I-LOCATION
    • 7: B-MISC
    • 8: I-MISC
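
The integer ids in the field listing above can be decoded back into label names and grouped into entity spans. The following is a minimal sketch, assuming the tags follow the usual BIO convention that the `B-`/`I-` prefixes suggest; the function name `extract_entities` and the example sentence are hypothetical and not taken from the dataset. In practice the records would likely come from the Hugging Face `datasets` library (for example via `load_dataset`), which is not called here so that the sketch stays self-contained.

```python
# Label tables copied from the field listing above (index == class id).
DOMAINS = [
    "architecture", "basketball", "book", "business", "education",
    "fictional_universe", "film", "food", "geography", "government",
    "law", "location", "military", "music", "opera", "organization",
    "people", "religion", "royalty", "soccer", "sports", "theater",
    "time", "travel", "tv",
]
NER_TAGS = [
    "O", "B-PERSON", "I-PERSON", "B-ORGANIZATION", "I-ORGANIZATION",
    "B-LOCATION", "I-LOCATION", "B-MISC", "I-MISC",
]


def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Assumption: an I- tag whose type differs from the open span
    (or that has no open span) is treated as the start of a new span.
    """
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Close the open span, if any, and record it.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag_id in zip(tokens, tag_ids):
        tag = NER_TAGS[tag_id]
        if tag == "O":
            flush()
            current_type, current_tokens = None, []
        elif tag.startswith("B-") or tag[2:] != current_type:
            flush()
            current_type, current_tokens = tag[2:], [token]
        else:
            # I- tag continuing the open span of the same type.
            current_tokens.append(token)
    flush()
    return entities


# Hypothetical record shaped like the fields above (not real dataset content).
example = {
    "id": "0",
    "tokens": ["Mustafa", "Kemal", "Ankara", "'da", "yaşadı"],
    "domain": 16,
    "ner_tags": [1, 2, 5, 0, 0],
}
print(DOMAINS[example["domain"]])
print(extract_entities(example["tokens"], example["ner_tags"]))
```

Keeping the label tables as plain lists means the class id doubles as the list index, which mirrors how the ids are declared in the dataset metadata.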

Data Splits

The dataset provides only a training split.
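
Because only a training split is provided, anyone who needs held-out data has to carve it out themselves. The sketch below shows one possible way to do that with a seeded shuffle over example indices; the function name `split_indices`, the 10% validation fraction, and the seed are illustrative assumptions, not choices made by the dataset authors.

```python
import random


def split_indices(num_examples, validation_fraction=0.1, seed=42):
    """Return (train_indices, validation_indices) from a seeded shuffle."""
    indices = list(range(num_examples))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_validation = int(num_examples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


# Small demonstration; the real train split has 532629 examples.
train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))
```

Splitting over indices rather than records keeps the sketch independent of how the examples are stored.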

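Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators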

H. Bahadir Sahin, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren and Omer Ozan Sonmez

Licensing Information

Creative Commons Attribution 4.0 International

Citation Information

@article{DBLP:journals/corr/SahinTYES17,
  author        = {H. Bahadir Sahin and Caglar Tirkaz and Eray Yildiz and Mustafa Tolga Eren and Omer Ozan Sonmez},
  title         = {Automatically Annotated Turkish Corpus for Named Entity Recognition and Text Categorization using Large-Scale Gazetteers},
  journal       = {CoRR},
  volume        = {abs/1702.02363},
  year          = {2017},
  url           = {http://arxiv.org/abs/1702.02363},
  archivePrefix = {arXiv},
  eprint        = {1702.02363},
  timestamp     = {Mon, 13 Aug 2018 16:46:36 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/SahinTYES17.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @merveenoyan for adding this dataset.
