ficsort/SzegedNER

Name: ficsort/SzegedNER
Creator: ficsort
Published: 2022-11-02 15:56:22
License: not specified

Source: Hugging Face. Updated 2022-11-02; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ficsort/SzegedNER


Resource description:

---
annotations_creators:
- expert-generated
language:
- hu
language_creators:
- other
license: []
multilinguality:
- monolingual
paperswithcode_id: null
pretty_name: SzegedNER
size_categories:
- 1K<n<10K
source_datasets:
- original
tags:
- hungarian
- szeged
- ner
task_categories:
- token-classification
task_ids:
- named-entity-recognition
---

# Introduction

The recognition and classification of proper nouns and names in plain text is of key importance in Natural Language Processing (NLP) as it has a beneficial effect on the performance of various types of applications, including Information Extraction, Machine Translation, Syntactic Parsing/Chunking, etc.

## Corpus of Business Newswire Texts (business)

The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts. A significant part of these texts has been annotated with Named Entity class labels in line with the annotation standards used on the CoNLL-2003 shared task.

Statistical data on Named Entities occurring in the corpus:

```
       | tokens | phrases
------ | ------ | -------
non NE | 200067 |
PER    | 1921   | 982
ORG    | 20433  | 10533
LOC    | 1501   | 1294
MISC   | 2041   | 1662
```

### Reference

> György Szarvas, Richárd Farkas, László Felföldi, András Kocsor, János Csirik: Highly accurate Named Entity corpus for Hungarian. International Conference on Language Resources and Evaluation 2006, Genova (Italy)

## Criminal NE corpus (criminal)

The Hungarian National Corpus and its Heti Világgazdaság (HVG) subcorpus provided the basis for corpus text selection: articles related to the topic of financially liable offences were selected and annotated for the categories person, organization, location and miscellaneous.

There are two annotated versions of the corpus. When preparing the tag-for-meaning annotation, our linguists took into consideration the context in which the Named Entity under investigation occurred, thus, it was not the primary sense of the Named Entity that determined the tag (e.g. Manchester=LOC) but its contextual reference (e.g. Manchester won the Premier League=ORG). As for tag-for-tag annotation, these cases were not differentiated: tags were always given on the basis of the primary sense.

Statistical data on Named Entities occurring in the corpus:

```
       | tag-for-meaning | tag-for-tag
------ | --------------- | -----------
non NE | 200067          |
PER    | 8101            | 8121
ORG    | 8782            | 9480
LOC    | 5049            | 5391
MISC   | 1917            | 854
```

## Metadata

    dataset_info:
    - config_name: business
      features:
      - name: id
        dtype: string
      - name: tokens
        sequence: string
      - name: ner_tags
        sequence:
          class_label:
            names:
              0: O
              1: B-PER
              2: I-PER
              3: B-ORG
              4: I-ORG
              5: B-LOC
              6: I-LOC
              7: B-MISC
              8: I-MISC
      - name: document_id
        dtype: string
      - name: sentence_id
        dtype: string
      splits:
      - name: original
        num_bytes: 4452207
        num_examples: 9573
      - name: test
        num_bytes: 856798
        num_examples: 1915
      - name: train
        num_bytes: 3171931
        num_examples: 6701
      - name: validation
        num_bytes: 423478
        num_examples: 957
      download_size: 0
      dataset_size: 8904414
    - config_name: criminal
      features:
      - name: id
        dtype: string
      - name: tokens
        sequence: string
      - name: ner_tags
        sequence:
          class_label:
            names:
              0: O
              1: B-PER
              2: I-PER
              3: B-ORG
              4: I-ORG
              5: B-LOC
              6: I-LOC
              7: B-MISC
              8: I-MISC
      - name: document_id
        dtype: string
      - name: sentence_id
        dtype: string
      splits:
      - name: original
        num_bytes: 2807970
        num_examples: 5375
      - name: test
        num_bytes: 520959
        num_examples: 1089
      - name: train
        num_bytes: 1989662
        num_examples: 3760
      - name: validation
        num_bytes: 297349
        num_examples: 526
      download_size: 0
      dataset_size: 5615940
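
The `ner_tags` feature stores integer class ids that follow a BIO scheme (`B-` opens an entity, `I-` continues it, `O` is outside any entity). As a minimal sketch of how such a sequence could be turned back into entity spans, the snippet below uses the label order listed in the metadata. The function name `decode_entities` and the example sentence are illustrative assumptions, not part of the dataset; in particular, tagging "Premier League" as MISC is a guess made only for the demonstration.

```python
# Label names in the order given by the ner_tags class_label in the metadata.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    Hypothetical helper: a stray I- tag with no matching open span
    is treated as the start of a new span.
    """
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        label = LABELS[tag_id]
        # An I- tag of the same type extends the open span.
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
            continue
        # Any other label closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        # A non-O label opens a new span.
        if label != "O":
            current_type, current_tokens = label[2:], [token]
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Made-up English example in the spirit of the tag-for-meaning illustration;
# the corpus itself is Hungarian and these tags are assumed.
tokens = ["Manchester", "won", "the", "Premier", "League"]
tag_ids = [3, 0, 0, 7, 8]
print(decode_entities(tokens, tag_ids))
```

Under tag-for-tag annotation the first token would presumably carry a LOC tag instead (id 5), which is the only difference the two annotation versions would show for this sentence.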

Provided by:

ficsort

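Summary of original information

SzegedNER dataset overview

Basic information

Dataset name: SzegedNER
Language: Hungarian
Dataset size: 1K<n<10K
Multilinguality: monolingual
Task category: token classification
Task ID: named entity recognition
Annotation creators: expert-generated
Source datasets: original
Tags: hungarian, szeged, ner

Dataset contents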

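Corpus of Business Newswire Texts (business)

Source: a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts.
Named entity statistics:
```
       | tokens | phrases
------ | ------ | -------
non NE | 200067 |
PER    | 1921   | 982
ORG    | 20433  | 10533
LOC    | 1501   | 1294
MISC   | 2041   | 1662
```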
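
Reading the two columns as the number of tokens carrying a given entity label and the number of entity phrases of that type (an interpretation, since the card does not define them further), the average phrase length per class can be computed directly from the business table. The sketch below only restates the published counts; the variable names are illustrative.

```python
# (tokens, phrases) per entity class, copied from the business statistics.
counts = {
    "PER": (1921, 982),
    "ORG": (20433, 10533),
    "LOC": (1501, 1294),
    "MISC": (2041, 1662),
}

# Average number of tokens per entity phrase for each class.
tokens_per_phrase = {
    label: tokens / phrases for label, (tokens, phrases) in counts.items()
}

for label, ratio in tokens_per_phrase.items():
    print(f"{label}: {ratio:.2f} tokens per phrase")
```

On these figures PER and ORG phrases average close to two tokens each, while LOC and MISC phrases are mostly single tokens.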

Criminal NE corpus (criminal)

Source: the Hungarian National Corpus and its Heti Világgazdaság (HVG) subcorpus; articles related to the topic of financially liable offences.
Named entity statistics:
```
       | tag-for-meaning | tag-for-tag
------ | --------------- | -----------
non NE | 200067          |
PER    | 8101            | 8121
ORG    | 8782            | 9480
LOC    | 5049            | 5391
MISC   | 1917            | 854
```

Metadata

Corpus of Business Newswire Texts (business)

Configuration name: business
Features:
- id: string
- tokens: sequence of strings
- ner_tags: sequence of class labels
  - 0: O
  - 1: B-PER
  - 2: I-PER
  - 3: B-ORG
  - 4: I-ORG
  - 5: B-LOC
  - 6: I-LOC
  - 7: B-MISC
  - 8: I-MISC
- document_id: string
- sentence_id: string
Splits:
- original: 4452207 bytes, 9573 examples
- test: 856798 bytes, 1915 examples
- train: 3171931 bytes, 6701 examples
- validation: 423478 bytes, 957 examples
Download size: 0 bytes
Dataset size: 8904414 bytes

Criminal NE corpus (criminal)

Configuration name: criminal
Features:
- id: string
- tokens: sequence of strings
- ner_tags: sequence of class labels
  - 0: O
  - 1: B-PER
  - 2: I-PER
  - 3: B-ORG
  - 4: I-ORG
  - 5: B-LOC
  - 6: I-LOC
  - 7: B-MISC
  - 8: I-MISC
- document_id: string
- sentence_id: string
Splits:
- original: 2807970 bytes, 5375 examples
- test: 520959 bytes, 1089 examples
- train: 1989662 bytes, 3760 examples
- validation: 297349 bytes, 526 examples
Download size: 0 bytes
Dataset size: 5615940 bytes
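
The split sizes suggest that `original` holds the full corpus and that `train`, `validation`, and `test` partition it, since the example counts add up in both configurations. This is an inference from the numbers, not something the card states. A small sketch that checks the arithmetic and reports each split's share (the helper name `split_shares` is made up for illustration):

```python
# Example counts per split, copied from the metadata of both configurations.
SPLITS = {
    "business": {"original": 9573, "train": 6701, "validation": 957, "test": 1915},
    "criminal": {"original": 5375, "train": 3760, "validation": 526, "test": 1089},
}


def split_shares(sizes):
    """Return the summed size of the non-original splits and each split's share."""
    parts = {name: n for name, n in sizes.items() if name != "original"}
    total = sum(parts.values())
    return total, {name: n / total for name, n in parts.items()}


for config, sizes in SPLITS.items():
    total, shares = split_shares(sizes)
    # The three splits together should account for every example in "original".
    assert total == sizes["original"], config
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{config}: {total} examples ({summary})")
```

Both configurations come out at roughly a 70/10/20 train/validation/test division. Anyone training on `original` and evaluating on `test` would therefore be evaluating on seen data if this reading is correct.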
