TurkuNLP/turku_ner_corpus
收藏Hugging Face2024-01-18 更新2024-06-15 收录
下载链接:
https://hf-mirror.com/datasets/TurkuNLP/turku_ner_corpus
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- fi
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: Turku NER corpus
dataset_info:
features:
- name: id
dtype: string
- name: tokens
sequence: string
- name: ner_tags
sequence:
class_label:
names:
'0': B-DATE
'1': B-EVENT
'2': B-LOC
'3': B-ORG
'4': B-PER
'5': B-PRO
'6': I-DATE
'7': I-EVENT
'8': I-LOC
'9': I-ORG
'10': I-PER
'11': I-PRO
'12': O
splits:
- name: train
num_bytes: 3257447
num_examples: 12217
- name: validation
num_bytes: 364223
num_examples: 1364
- name: test
num_bytes: 416644
num_examples: 1555
download_size: 1659911
dataset_size: 4038314
---
# Dataset Card for Turku NER corpus
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://turkunlp.org/fin-ner.html
- **Repository:** https://github.com/TurkuNLP/turku-ner-corpus/
- **Paper:** https://www.aclweb.org/anthology/2020.lrec-1.567/
- **Leaderboard:** [If the dataset supports an active leaderboard, add link here]()
- **Point of Contact:** {jouni.a.luoma,mhtoin,maria.h.pyykonen,mavela,sampo.pyysalo}@utu.f
### Dataset Summary
[More Information Needed]
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
[More Information Needed]
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) for adding this dataset.
提供机构:
TurkuNLP
原始信息汇总
数据集卡片 for Turku NER corpus
数据集描述
数据集概述
- 数据集名称: Turku NER corpus
- 语言: 芬兰语 (fi)
- 许可证: CC BY-NC-SA 4.0
- 多语言性: 单语种
- 数据集大小: 10K<n<100K
- 源数据: 原始数据
- 任务类别: 词性标注
- 任务ID: 命名实体识别
数据集结构
数据字段
- id: 字符串类型
- tokens: 字符串序列
- ner_tags: 标签序列,包含以下类别:
- 0: B-DATE
- 1: B-EVENT
- 2: B-LOC
- 3: B-ORG
- 4: B-PER
- 5: B-PRO
- 6: I-DATE
- 7: I-EVENT
- 8: I-LOC
- 9: I-ORG
- 10: I-PER
- 11: I-PRO
- 12: O
数据分割
- 训练集:
- 字节数: 3257447
- 样本数: 12217
- 验证集:
- 字节数: 364223
- 样本数: 1364
- 测试集:
- 字节数: 416644
- 样本数: 1555
数据集大小
- 下载大小: 1659911 字节
- 数据集大小: 4038314 字节



