iahlt/arabic_ner_mafat_folds
Source: Hugging Face. Updated 2024-01-13. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/iahlt/arabic_ner_mafat_folds
Resource description:
---
dataset_info:
- config_name: fold_1
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: raw_tags
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': U-ANG
          '1': L-ANG
          '2': B-ANG
          '3': I-ANG
          '4': U-DUC
          '5': I-DUC
          '6': L-DUC
          '7': B-DUC
          '8': I-EVE
          '9': L-EVE
          '10': B-EVE
          '11': U-EVE
          '12': L-FAC
          '13': U-FAC
          '14': I-FAC
          '15': B-FAC
          '16': L-GPE
          '17': B-GPE
          '18': I-GPE
          '19': U-GPE
          '20': U-INFORMAL
          '21': I-INFORMAL
          '22': L-INFORMAL
          '23': B-INFORMAL
          '24': U-LOC
          '25': I-LOC
          '26': L-LOC
          '27': B-LOC
          '28': I-MISC
          '29': U-MISC
          '30': B-MISC
          '31': L-MISC
          '32': O
          '33': I-ORG
          '34': L-ORG
          '35': U-ORG
          '36': B-ORG
          '37': L-PER
          '38': B-PER
          '39': I-PER
          '40': U-PER
          '41': I-TIMEX
          '42': L-TIMEX
          '43': U-TIMEX
          '44': B-TIMEX
          '45': U-TTL
          '46': L-TTL
          '47': B-TTL
          '48': I-TTL
          '49': B-WOA
          '50': L-WOA
          '51': U-WOA
          '52': I-WOA
  - name: record
    dtype: string
  splits:
  - name: train
    num_bytes: 87741254
    num_examples: 30000
  - name: validation
    num_bytes: 28643406
    num_examples: 10000
  - name: test
    num_bytes: 28643406
    num_examples: 10000
  download_size: 45076618
  dataset_size: 145028066
- config_name: fold_2
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: raw_tags
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': U-ANG
          '1': L-ANG
          '2': B-ANG
          '3': I-ANG
          '4': U-DUC
          '5': I-DUC
          '6': L-DUC
          '7': B-DUC
          '8': I-EVE
          '9': L-EVE
          '10': B-EVE
          '11': U-EVE
          '12': L-FAC
          '13': U-FAC
          '14': I-FAC
          '15': B-FAC
          '16': L-GPE
          '17': B-GPE
          '18': I-GPE
          '19': U-GPE
          '20': U-INFORMAL
          '21': I-INFORMAL
          '22': L-INFORMAL
          '23': B-INFORMAL
          '24': U-LOC
          '25': I-LOC
          '26': L-LOC
          '27': B-LOC
          '28': I-MISC
          '29': U-MISC
          '30': B-MISC
          '31': L-MISC
          '32': O
          '33': I-ORG
          '34': L-ORG
          '35': U-ORG
          '36': B-ORG
          '37': L-PER
          '38': B-PER
          '39': I-PER
          '40': U-PER
          '41': I-TIMEX
          '42': L-TIMEX
          '43': U-TIMEX
          '44': B-TIMEX
          '45': U-TTL
          '46': L-TTL
          '47': B-TTL
          '48': I-TTL
          '49': B-WOA
          '50': L-WOA
          '51': U-WOA
          '52': I-WOA
  - name: record
    dtype: string
  splits:
  - name: train
    num_bytes: 86867948
    num_examples: 30000
  - name: validation
    num_bytes: 29516712
    num_examples: 10000
  - name: test
    num_bytes: 29516712
    num_examples: 10000
  download_size: 45337784
  dataset_size: 145901372
- config_name: fold_3
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: raw_tags
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': U-ANG
          '1': L-ANG
          '2': B-ANG
          '3': I-ANG
          '4': U-DUC
          '5': I-DUC
          '6': L-DUC
          '7': B-DUC
          '8': I-EVE
          '9': L-EVE
          '10': B-EVE
          '11': U-EVE
          '12': L-FAC
          '13': U-FAC
          '14': I-FAC
          '15': B-FAC
          '16': L-GPE
          '17': B-GPE
          '18': I-GPE
          '19': U-GPE
          '20': U-INFORMAL
          '21': I-INFORMAL
          '22': L-INFORMAL
          '23': B-INFORMAL
          '24': U-LOC
          '25': I-LOC
          '26': L-LOC
          '27': B-LOC
          '28': I-MISC
          '29': U-MISC
          '30': B-MISC
          '31': L-MISC
          '32': O
          '33': I-ORG
          '34': L-ORG
          '35': U-ORG
          '36': B-ORG
          '37': L-PER
          '38': B-PER
          '39': I-PER
          '40': U-PER
          '41': I-TIMEX
          '42': L-TIMEX
          '43': U-TIMEX
          '44': B-TIMEX
          '45': U-TTL
          '46': L-TTL
          '47': B-TTL
          '48': I-TTL
          '49': B-WOA
          '50': L-WOA
          '51': U-WOA
          '52': I-WOA
  - name: record
    dtype: string
  splits:
  - name: train
    num_bytes: 87175881
    num_examples: 30000
  - name: validation
    num_bytes: 29208779
    num_examples: 10000
  - name: test
    num_bytes: 29208779
    num_examples: 10000
  download_size: 45201250
  dataset_size: 145593439
- config_name: fold_4
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: raw_tags
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': U-ANG
          '1': L-ANG
          '2': B-ANG
          '3': I-ANG
          '4': U-DUC
          '5': I-DUC
          '6': L-DUC
          '7': B-DUC
          '8': I-EVE
          '9': L-EVE
          '10': B-EVE
          '11': U-EVE
          '12': L-FAC
          '13': U-FAC
          '14': I-FAC
          '15': B-FAC
          '16': L-GPE
          '17': B-GPE
          '18': I-GPE
          '19': U-GPE
          '20': U-INFORMAL
          '21': I-INFORMAL
          '22': L-INFORMAL
          '23': B-INFORMAL
          '24': U-LOC
          '25': I-LOC
          '26': L-LOC
          '27': B-LOC
          '28': I-MISC
          '29': U-MISC
          '30': B-MISC
          '31': L-MISC
          '32': O
          '33': I-ORG
          '34': L-ORG
          '35': U-ORG
          '36': B-ORG
          '37': L-PER
          '38': B-PER
          '39': I-PER
          '40': U-PER
          '41': I-TIMEX
          '42': L-TIMEX
          '43': U-TIMEX
          '44': B-TIMEX
          '45': U-TTL
          '46': L-TTL
          '47': B-TTL
          '48': I-TTL
          '49': B-WOA
          '50': L-WOA
          '51': U-WOA
          '52': I-WOA
  - name: record
    dtype: string
  splits:
  - name: train
    num_bytes: 87368897
    num_examples: 30000
  - name: validation
    num_bytes: 29015763
    num_examples: 10000
  - name: test
    num_bytes: 29015763
    num_examples: 10000
  download_size: 45120027
  dataset_size: 145400423
configs:
- config_name: fold_1
  data_files:
  - split: train
    path: fold_1/train-*
  - split: validation
    path: fold_1/validation-*
  - split: test
    path: fold_1/test-*
- config_name: fold_2
  data_files:
  - split: train
    path: fold_2/train-*
  - split: validation
    path: fold_2/validation-*
  - split: test
    path: fold_2/test-*
- config_name: fold_3
  data_files:
  - split: train
    path: fold_3/train-*
  - split: validation
    path: fold_3/validation-*
  - split: test
    path: fold_3/test-*
- config_name: fold_4
  data_files:
  - split: train
    path: fold_4/train-*
  - split: validation
    path: fold_4/validation-*
  - split: test
    path: fold_4/test-*
---
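The `ner_tags` feature stores integer class IDs, and the label names carry the prefixes B, I, L, U plus a bare O. The sketch below assumes these prefixes have their usual BILOU meaning (Begin, Inside, Last, Unit, Outside), which the card does not state explicitly. It shows one way to map IDs back to names and group them into entity spans. The names `LABELS`, `decode_tags`, and `bilou_to_spans` are illustrative and not part of the dataset.

```python
# Label names in ID order (0..52), copied from the class_label mapping above.
LABELS = (
    "U-ANG L-ANG B-ANG I-ANG U-DUC I-DUC L-DUC B-DUC "
    "I-EVE L-EVE B-EVE U-EVE L-FAC U-FAC I-FAC B-FAC "
    "L-GPE B-GPE I-GPE U-GPE U-INFORMAL I-INFORMAL L-INFORMAL B-INFORMAL "
    "U-LOC I-LOC L-LOC B-LOC I-MISC U-MISC B-MISC L-MISC O "
    "I-ORG L-ORG U-ORG B-ORG L-PER B-PER I-PER U-PER "
    "I-TIMEX L-TIMEX U-TIMEX B-TIMEX U-TTL L-TTL B-TTL I-TTL "
    "B-WOA L-WOA U-WOA I-WOA"
).split()


def decode_tags(tag_ids):
    """Map a sequence of integer ner_tags to their label names."""
    return [LABELS[i] for i in tag_ids]


def bilou_to_spans(labels):
    """Group BILOU labels into (entity_type, start, end) triples, end exclusive.

    Assumption: B opens a span, I continues it, L closes it, U is a
    single-token entity, O is outside. Malformed sequences are skipped
    rather than repaired.
    """
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        if prefix == "U":
            spans.append((etype, i, i + 1))
            start, current = None, None
        elif prefix == "B":
            start, current = i, etype
        elif prefix == "I" and current == etype:
            continue  # still inside the open span
        elif prefix == "L" and current == etype:
            spans.append((etype, start, i + 1))
            start, current = None, None
        else:
            # "O" or a tag inconsistent with the open span: drop the open span
            start, current = None, None
    return spans
```

For example, the ID sequence `[38, 39, 37, 32, 19]` decodes to `B-PER I-PER L-PER O U-GPE`, which under the BILOU assumption yields one three-token PER span and one single-token GPE span.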
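Dataset information:
- Configuration name: fold_1
Feature fields:
- Field: id (identifier), data type: string
- Field: tokens, type: sequence of strings
- Field: raw_tags (raw tags), type: sequence of strings
- Field: ner_tags (named entity recognition tags), type: sequence of class labels, with the following ID-to-label mapping: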
0: U-ANG
1: L-ANG
2: B-ANG
3: I-ANG
4: U-DUC
5: I-DUC
6: L-DUC
7: B-DUC
8: I-EVE
9: L-EVE
10: B-EVE
11: U-EVE
12: L-FAC
13: U-FAC
14: I-FAC
15: B-FAC
16: L-GPE
17: B-GPE
18: I-GPE
19: U-GPE
20: U-INFORMAL
21: I-INFORMAL
22: L-INFORMAL
23: B-INFORMAL
24: U-LOC
25: I-LOC
26: L-LOC
27: B-LOC
28: I-MISC
29: U-MISC
30: B-MISC
31: L-MISC
32: O
33: I-ORG
34: L-ORG
35: U-ORG
36: B-ORG
37: L-PER
38: B-PER
39: I-PER
40: U-PER
41: I-TIMEX
42: L-TIMEX
43: U-TIMEX
44: B-TIMEX
45: U-TTL
46: L-TTL
47: B-TTL
48: I-TTL
49: B-WOA
50: L-WOA
51: U-WOA
52: I-WOA
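- Field: record (raw record), data type: string
Data splits:
- Name: train, bytes: 87741254, examples: 30000
- Name: validation, bytes: 28643406, examples: 10000
- Name: test, bytes: 28643406, examples: 10000
Download size: 45076618 bytes, total dataset size: 145028066 bytes
- Configuration name: fold_2
Feature fields are identical to fold_1. Data splits:
- Name: train, bytes: 86867948, examples: 30000
- Name: validation, bytes: 29516712, examples: 10000
- Name: test, bytes: 29516712, examples: 10000
Download size: 45337784 bytes, total dataset size: 145901372 bytes
- Configuration name: fold_3
Feature fields are identical to fold_1. Data splits:
- Name: train, bytes: 87175881, examples: 30000
- Name: validation, bytes: 29208779, examples: 10000
- Name: test, bytes: 29208779, examples: 10000
Download size: 45201250 bytes, total dataset size: 145593439 bytes
- Configuration name: fold_4
Feature fields are identical to fold_1. Data splits:
- Name: train, bytes: 87368897, examples: 30000
- Name: validation, bytes: 29015763, examples: 10000
- Name: test, bytes: 29015763, examples: 10000
Download size: 45120027 bytes, total dataset size: 145400423 bytes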
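The reported sizes can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed above and verifies that each fold's total dataset size equals the sum of its three split sizes. The names `FOLDS` and `sizes_consistent` are illustrative.

```python
# (train, validation, test) byte counts and the reported total per fold,
# copied from the metadata above.
FOLDS = {
    "fold_1": ((87741254, 28643406, 28643406), 145028066),
    "fold_2": ((86867948, 29516712, 29516712), 145901372),
    "fold_3": ((87175881, 29208779, 29208779), 145593439),
    "fold_4": ((87368897, 29015763, 29015763), 145400423),
}

# Example counts are the same in every fold.
EXAMPLES = {"train": 30000, "validation": 10000, "test": 10000}


def sizes_consistent(folds):
    """Return {fold_name: True/False} for sum(split bytes) == reported total."""
    return {name: sum(splits) == total for name, (splits, total) in folds.items()}


if __name__ == "__main__":
    print(sizes_consistent(FOLDS))
    total = sum(EXAMPLES.values())
    for split, n in EXAMPLES.items():
        print(f"{split}: {n / total:.0%} of examples")
```

All four folds pass this check, and the example counts correspond to a 60/20/20 split. The validation and test splits also have identical byte counts within every fold, which may indicate that they hold the same examples. That is an inference from the numbers, not something the card confirms.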
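Configurations:
- Configuration name: fold_1
Data files:
- Split: train, path: fold_1/train-*
- Split: validation, path: fold_1/validation-*
- Split: test, path: fold_1/test-*
- Configuration name: fold_2
Data files:
- Split: train, path: fold_2/train-*
- Split: validation, path: fold_2/validation-*
- Split: test, path: fold_2/test-*
- Configuration name: fold_3
Data files:
- Split: train, path: fold_3/train-*
- Split: validation, path: fold_3/validation-*
- Split: test, path: fold_3/test-*
- Configuration name: fold_4
Data files:
- Split: train, path: fold_4/train-*
- Split: validation, path: fold_4/validation-*
- Split: test, path: fold_4/test-*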
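Each configuration name selects one fold, and each fold exposes the same three splits. A minimal sketch of loading one fold follows, assuming the Hugging Face `datasets` library is installed and the repository is reachable over the network. The helpers `data_file_pattern` and `load_fold` are illustrative names, not part of any library.

```python
# Configuration and split names as listed above.
FOLD_CONFIGS = [f"fold_{i}" for i in range(1, 5)]
SPLITS = ("train", "validation", "test")


def data_file_pattern(config, split):
    """Rebuild the data-file glob listed above for a config/split pair."""
    if config not in FOLD_CONFIGS or split not in SPLITS:
        raise ValueError(f"unknown config or split: {config!r}, {split!r}")
    return f"{config}/{split}-*"


def load_fold(config):
    """Load one fold as a DatasetDict (requires `datasets` and network access)."""
    if config not in FOLD_CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset("iahlt/arabic_ner_mafat_folds", config)


if __name__ == "__main__":
    fold = load_fold("fold_1")
    print(fold)  # shows the train, validation, and test splits
    # The ClassLabel feature carries the ID-to-name mapping for ner_tags.
    print(fold["train"].features["ner_tags"].feature.names[:5])
```

Iterating over `FOLD_CONFIGS` and calling `load_fold` for each name gives a simple loop for cross-validation style experiments across all four folds.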
Provider:
iahlt



