klintan/swedish_ner_corpus

Name: klintan/swedish_ner_corpus
Creator: klintan
Published: 2024-01-18 11:16:38
License: no description provided

Hugging Face · Updated 2024-01-18 · Indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/klintan/swedish_ner_corpus


Resource overview:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- sv
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: Swedish NER Corpus
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': '0'
          '1': LOC
          '2': MISC
          '3': ORG
          '4': PER
  splits:
  - name: train
    num_bytes: 2032630
    num_examples: 6886
  - name: test
    num_bytes: 755234
    num_examples: 2453
  download_size: 1384558
  dataset_size: 2787864
---

# Dataset Card for Swedish NER Corpus

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/klintan/swedish-ner-corpus](https://github.com/klintan/swedish-ner-corpus)
- **Repository:** [https://github.com/klintan/swedish-ner-corpus](https://github.com/klintan/swedish-ner-corpus)
- **Point of contact:** [Andreas Klintberg](mailto:ankl@kth.se)

### Dataset Summary

Webbnyheter 2012 from Språkbanken, semi-manually annotated and adapted for CoreNLP Swedish NER. "Semi-manually" here means that the annotations were bootstrapped from Swedish gazetteers and then manually corrected and reviewed by two independent annotators who are native Swedish speakers. No inter-annotator agreement was calculated.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

Swedish

## Dataset Structure

### Data Instances

A sample dataset instance is provided below:

```json
{
  "id": "3",
  "ner_tags": [4, 4, 0, 0, 0, 0, 0, 0, 3, 3, 0],
  "tokens": ["Margaretha", "Fahlgren", ",", "professor", "i", "litteraturvetenskap", ",", "vice-rektor", "Uppsala", "universitet", "."]
}
```

### Data Fields

- `id`: ID of the sentence
- `tokens`: the tokens of the sentence
- `ner_tags`: the NER tag of each token

Full fields:

```json
{
  "id": {
    "feature_type": "Value",
    "dtype": "string"
  },
  "tokens": {
    "feature_type": "Sequence",
    "feature": {
      "feature_type": "Value",
      "dtype": "string"
    }
  },
  "ner_tags": {
    "feature_type": "Sequence",
    "dtype": "int32",
    "feature": {
      "feature_type": "ClassLabel",
      "dtype": "int32",
      "class_names": {
        "0": "0",
        "1": "LOC",
        "2": "MISC",
        "3": "ORG",
        "4": "PER"
      }
    }
  }
}
```

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The original data was provided by Språkbanken and consists of news from Swedish newspapers' websites.

### Licensing Information

https://github.com/klintan/swedish-ner-corpus/blob/master/LICENSE

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) for adding this dataset.

Provider:

klintan

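Summary of Original Information

Dataset Card - Swedish NER Corpus

Dataset Description

Dataset Summary

The Swedish NER Corpus is derived from Språkbanken's Webbnyheter 2012, semi-manually annotated and adapted for CoreNLP Swedish NER. The semi-manual annotation was bootstrapped from Swedish gazetteers and then manually corrected and reviewed by two independent annotators who are native Swedish speakers. No inter-annotator agreement was calculated.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Swedish

Dataset Structure

Data Instances

A sample dataset instance is shown below:

{ "id": "3", "ner_tags": [4, 4, 0, 0, 0, 0, 0, 0, 3, 3, 0], "tokens": ["Margaretha", "Fahlgren", ",", "professor", "i", "litteraturvetenskap", ",", "vice-rektor", "Uppsala", "universitet", "."] }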
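The integer tags in the sample instance are easier to read once they are mapped back to label names. The following is a minimal sketch in plain Python, assuming the label order given in the card's class list (`0`, `LOC`, `MISC`, `ORG`, `PER`); the names `LABEL_NAMES` and `decode_tags` are illustrative and not part of the dataset or any library.

```python
# Label names in the order listed by the card (index == integer tag).
LABEL_NAMES = ["0", "LOC", "MISC", "ORG", "PER"]


def decode_tags(tokens, ner_tags, label_names=LABEL_NAMES):
    """Pair each token with the string name of its integer tag."""
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")
    return [(token, label_names[tag]) for token, tag in zip(tokens, ner_tags)]


# The sample instance from the card.
sample = {
    "id": "3",
    "ner_tags": [4, 4, 0, 0, 0, 0, 0, 0, 3, 3, 0],
    "tokens": ["Margaretha", "Fahlgren", ",", "professor", "i",
               "litteraturvetenskap", ",", "vice-rektor", "Uppsala",
               "universitet", "."],
}

decoded = decode_tags(sample["tokens"], sample["ner_tags"])
for token, label in decoded:
    print(f"{token}\t{label}")
```

Here `Margaretha` and `Fahlgren` come out as `PER`, and `Uppsala` and `universitet` as `ORG`; every other token carries the outside label `0`.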

Data Fields

id: ID of the sentence
tokens: the tokens of the sentence
ner_tags: the NER tag of each token
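Because the token and tag fields are parallel sequences, multi-token entities can be recovered by grouping neighbouring tokens. The sketch below assumes that a run of consecutive identical non-zero tags forms one entity. That is an assumption, since the tag set has no B-/I- prefixes and two adjacent entities of the same type would be merged. `extract_entities` is a hypothetical helper name.

```python
from itertools import groupby

# Label names in the order listed by the card (index == integer tag).
LABEL_NAMES = ["0", "LOC", "MISC", "ORG", "PER"]
OUTSIDE = 0  # tag used for tokens that are not part of any entity


def extract_entities(tokens, ner_tags):
    """Group runs of identical non-outside tags into (text, label, start) tuples."""
    entities = []
    position = 0
    for tag, group in groupby(zip(tokens, ner_tags), key=lambda pair: pair[1]):
        words = [token for token, _ in group]
        if tag != OUTSIDE:
            entities.append((" ".join(words), LABEL_NAMES[tag], position))
        position += len(words)  # advance to the start index of the next run
    return entities


tokens = ["Margaretha", "Fahlgren", ",", "professor", "i",
          "litteraturvetenskap", ",", "vice-rektor", "Uppsala",
          "universitet", "."]
ner_tags = [4, 4, 0, 0, 0, 0, 0, 0, 3, 3, 0]

entities = extract_entities(tokens, ner_tags)
print(entities)
```

For the sample sentence this yields one `PER` entity starting at token 0 and one `ORG` entity starting at token 8.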

Full fields:

{ "id": { "feature_type": "Value", "dtype": "string" }, "tokens": { "feature_type": "Sequence", "feature": { "feature_type": "Value", "dtype": "string" } }, "ner_tags": { "feature_type": "Sequence", "dtype": "int32", "feature": { "feature_type": "ClassLabel", "dtype": "int32", "class_names": { "0": "0", "1": "LOC", "2": "MISC", "3": "ORG", "4": "PER" } } } }
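The field description above can be turned into a lightweight sanity check for individual records. This is a sketch under stated assumptions, not an official validator: it checks the types and the 0 to 4 tag range implied by the five class names, and it additionally assumes that `tokens` and `ner_tags` must have equal length, which the schema itself does not state. `validate_instance` is a hypothetical name.

```python
NUM_CLASSES = 5  # tags 0..4 per the card: "0", LOC, MISC, ORG, PER


def validate_instance(instance):
    """Return a list of problems; an empty list means the instance fits the schema."""
    problems = []
    if not isinstance(instance.get("id"), str):
        problems.append("id must be a string")

    tokens = instance.get("tokens")
    tags = instance.get("ner_tags")
    tokens_ok = isinstance(tokens, list) and all(isinstance(t, str) for t in tokens)
    tags_ok = isinstance(tags, list) and all(
        isinstance(t, int) and 0 <= t < NUM_CLASSES for t in tags
    )
    if not tokens_ok:
        problems.append("tokens must be a list of strings")
    if not tags_ok:
        problems.append("ner_tags must be a list of integers in 0..4")
    # Assumption: one tag per token, so both sequences have the same length.
    if tokens_ok and tags_ok and len(tokens) != len(tags):
        problems.append("tokens and ner_tags must have the same length")
    return problems


sample = {
    "id": "3",
    "ner_tags": [4, 4, 0, 0, 0, 0, 0, 0, 3, 3, 0],
    "tokens": ["Margaretha", "Fahlgren", ",", "professor", "i",
               "litteraturvetenskap", ",", "vice-rektor", "Uppsala",
               "universitet", "."],
}
print(validate_instance(sample))
```

Returning a list of problems instead of raising on the first one makes it easy to report every issue in a record at once.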

Data Splits

[More Information Needed]
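Although this section is left blank, the metadata header of the card lists a train split of 6886 examples (2032630 bytes) and a test split of 2453 examples (755234 bytes). The short sketch below only does arithmetic on those published figures to show the relative split sizes; the dictionary layout and variable names are illustrative.

```python
# Split figures copied from the card's metadata header.
SPLITS = {
    "train": {"num_examples": 6886, "num_bytes": 2032630},
    "test": {"num_examples": 2453, "num_bytes": 755234},
}

total_examples = sum(info["num_examples"] for info in SPLITS.values())
total_bytes = sum(info["num_bytes"] for info in SPLITS.values())

# Fraction of all examples that falls into each split.
shares = {
    name: info["num_examples"] / total_examples for name, info in SPLITS.items()
}

for name, info in SPLITS.items():
    print(f"{name}: {info['num_examples']} examples ({shares[name]:.1%})")
print(f"total: {total_examples} examples, {total_bytes} bytes")
```

The byte total computed here, 2787864, matches the `dataset_size` value in the same metadata header, and the train split accounts for roughly 74% of the 9339 examples.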

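Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The original data was provided by Språkbanken and consists of news from Swedish newspapers' websites.

Licensing Information

https://github.com/klintan/swedish-ner-corpus/blob/master/LICENSE

Citation Information

[More Information Needed]

Contributions

Thanks to @abhishekkrthakur for adding this dataset.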
