HebArabNlpProject/ShamNER

Hugging Face, updated 2025-07-11, indexed 2025-08-09
Download link:
https://hf-mirror.com/datasets/HebArabNlpProject/ShamNER
Description:
ShamNER is a curated corpus of Levantine Arabic sentences annotated for named entities. It also includes doubly annotated data for checking consistency across human annotators. The dataset spans several annotation rounds: the rounds from the pilot through round five are manual annotations whose quality improves with each round, and round six is synthetic data that was post-edited by humans. Evaluation follows a strict span-novelty rule: entity surface forms in the validation and test sets do not appear in the training set. The dataset is tokenizer-agnostic: it stores only raw sentences and character-level spans, so users can regenerate BIO tags with any tokenizer.
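The two technical points in the description, regenerating BIO tags from character spans and checking that held-out entities are unseen in training, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `(start, end, label)` span layout, the `(text, spans)` record shape, the `PER`/`LOC` labels, the whitespace tokenizer, the example sentences, and the exact-string definition of a "surface form". The dataset's real field names, label set, and matching rule may differ.

```python
import re
from typing import List, Set, Tuple

# Assumed layout (hypothetical): character offsets, end-exclusive, plus a label.
Span = Tuple[int, int, str]
Record = Tuple[str, List[Span]]  # (raw sentence, its entity spans)


def whitespace_tokenize(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each token B-/I-/O according to its overlap with the spans."""
    tagged = []
    for tok_start, tok_end in whitespace_tokenize(text):
        tag = "O"
        for start, end, label in spans:
            if tok_start < end and tok_end > start:  # token overlaps this span
                # The first overlapping token starts at or before the span start.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tagged.append((text[tok_start:tok_end], tag))
    return tagged


def leaked_surface_forms(train: List[Record], heldout: List[Record]) -> Set[str]:
    """Held-out entity strings that also occur as entity strings in training.

    Uses exact string equality; the dataset's own rule may normalize differently.
    """
    train_forms = {text[s:e] for text, spans in train for s, e, _ in spans}
    heldout_forms = {text[s:e] for text, spans in heldout for s, e, _ in spans}
    return heldout_forms & train_forms


def find_span(text: str, surface: str, label: str) -> Span:
    """Helper for the example: locate a surface string and build its span."""
    start = text.index(surface)
    return (start, start + len(surface), label)


# Made-up example sentences, not taken from the corpus.
text = "سامي راح على بيروت"
spans = [find_span(text, "سامي", "PER"), find_span(text, "بيروت", "LOC")]

heldout_text = "بيروت حلوة"
train = [(text, spans)]
heldout = [(heldout_text, [find_span(heldout_text, "بيروت", "LOC")])]

for token, tag in spans_to_bio(text, spans):
    print(token, tag)
print(leaked_surface_forms(train, heldout))
```

Swapping `whitespace_tokenize` for any tokenizer that reports character offsets leaves `spans_to_bio` unchanged, which is the practical benefit of storing spans rather than token-level tags. A non-empty result from `leaked_surface_forms` would indicate a violation of the span-novelty rule under the exact-match assumption.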
Provider:
HebArabNlpProject