HIT-TMG/Hansel
Source: Hugging Face · Updated: 2023-03-13 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/HIT-TMG/Hansel
Resource description:
---
annotations_creators:
- crowdsourced
- found
language:
- zh
language_creators:
- found
- crowdsourced
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
paperswithcode_id: hansel
pretty_name: Hansel
size_categories:
- 1M<n<10M
- 1K<n<10K
source_datasets:
- original
tags: []
task_categories:
- text-retrieval
task_ids:
- entity-linking-retrieval
dataset_info:
- config_name: wiki
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: start
    dtype: int64
  - name: end
    dtype: int64
  - name: mention
    dtype: string
  - name: gold_id
    dtype: string
  splits:
  - name: train
  - name: validation
- config_name: hansel-few-shot
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: start
    dtype: int64
  - name: end
    dtype: int64
  - name: mention
    dtype: string
  - name: gold_id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: test
- config_name: hansel-zero-shot
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: start
    dtype: int64
  - name: end
    dtype: int64
  - name: mention
    dtype: string
  - name: gold_id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: test
---
# Dataset Card for "Hansel"
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Splits](#data-splits)
- [Citation](#citation)
## Dataset Description
- **Homepage:** https://github.com/HITsz-TMG/Hansel
- **Paper:** https://arxiv.org/abs/2207.13005
Hansel is a high-quality human-annotated Chinese entity linking (EL) dataset, focusing on tail entities and emerging entities:
- The test set contains few-shot (FS) and zero-shot (ZS) slices with roughly 10K examples in total, and uses Wikidata as the corresponding knowledge base.
- The training and validation sets are from Wikipedia hyperlinks, useful for large-scale pretraining of Chinese EL systems.
Please see our [WSDM 2023](https://www.wsdm-conference.org/2023/) paper [**"Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark"**](https://dl.acm.org/doi/10.1145/3539597.3570418) to learn more about our dataset.
For the models in the paper and our processed knowledge base, please see our [GitHub repository](https://github.com/HITsz-TMG/Hansel).
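The configurations declared in the card metadata (`wiki`, `hansel-few-shot`, `hansel-zero-shot`) can be loaded with the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: `load_hansel` and `HANSEL_CONFIGS` are hypothetical names introduced here, and it assumes the repository id `HIT-TMG/Hansel` resolves on the Hub. Whether the call works unchanged may depend on the installed `datasets` version.

```python
# Config and split names as listed in the card metadata.
HANSEL_CONFIGS = {
    "wiki": ("train", "validation"),
    "hansel-few-shot": ("test",),
    "hansel-zero-shot": ("test",),
}


def load_hansel(config: str, split: str):
    """Load one Hansel split (hypothetical helper, not part of the dataset)."""
    if config not in HANSEL_CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in HANSEL_CONFIGS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the checks above run even without the library installed.
    from datasets import load_dataset

    # Assumption: the Hub repository id is "HIT-TMG/Hansel".
    return load_dataset("HIT-TMG/Hansel", config, split=split)
```

For example, `load_hansel("hansel-zero-shot", "test")` would request the zero-shot test slice, while `load_hansel("wiki", "test")` fails early because the `wiki` config only lists `train` and `validation`.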
## Dataset Structure
### Data Instances
{"id": "hansel-eval-zs-1463",
"text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,成为了流媒体巨头旗下的新片。近日,这部备受关注的影片确定了自己的档期:2021年。虽然具体时间未定,但影片却已经实实在在地向前迈出了一步。",
"start": 29,
"end": 32,
"mention": "匹诺曹",
"gold_id": "Q73895818",
"source": "https://www.1905.com/news/20181107/1325389.shtml",
"domain": "news"
}
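In the instance above, `start` and `end` behave like character offsets into `text` with an exclusive end, so slicing recovers the `mention`. The sketch below checks this on that one example only; the helper names are hypothetical, the text is shortened to its first sentence, and the Wikidata URL pattern is an assumption about how a `gold_id` is usually resolved, not something the card specifies.

```python
example = {
    "id": "hansel-eval-zs-1463",
    # Text shortened to its first sentence for brevity.
    "text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,成为了流媒体巨头旗下的新片。",
    "start": 29,
    "end": 32,
    "mention": "匹诺曹",
    "gold_id": "Q73895818",
}


def span_matches_mention(ex: dict) -> bool:
    """Return True if text[start:end] equals the annotated mention."""
    return ex["text"][ex["start"]:ex["end"]] == ex["mention"]


def mark_mention(ex: dict, left: str = "[", right: str = "]") -> str:
    """Return the text with the mention span wrapped in markers."""
    text, start, end = ex["text"], ex["start"], ex["end"]
    return text[:start] + left + text[start:end] + right + text[end:]


def wikidata_url(qid: str) -> str:
    """Build a Wikidata page URL from a QID (assumed URL pattern)."""
    return f"https://www.wikidata.org/wiki/{qid}"


print(span_matches_mention(example))
print(mark_mention(example))
print(wikidata_url(example["gold_id"]))
```

A check like `span_matches_mention` is a cheap sanity test to run over a split before training, since an off-by-one in offset handling silently corrupts every mention.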
### Data Splits
| Split | # Mentions | # Entities | Domain |
| ---- | ---- | ---- | ---- |
| Train | 9,879,813 | 541,058 | Wikipedia |
| Validation | 9,674 | 6,320 | Wikipedia |
| Hansel-FS | 5,260 | 2,720 | News, Social Media |
| Hansel-ZS | 4,715 | 4,046 | News, Social Media, E-books, etc. |
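The "10K examples" figure for the test set follows from the table: the two test slices together hold 5,260 + 4,715 mentions. The short sketch below redoes that arithmetic and derives a mentions-per-entity ratio for each split. The ratio is only a descriptive statistic computed here, not a number reported in the card, and entity counts are not summed across splits because the same entity may appear in more than one.

```python
# (mentions, entities) per split, copied from the table above.
splits = {
    "Train": (9_879_813, 541_058),
    "Validation": (9_674, 6_320),
    "Hansel-FS": (5_260, 2_720),
    "Hansel-ZS": (4_715, 4_046),
}

# Total size of the human-annotated test set (FS + ZS mentions).
test_mentions = splits["Hansel-FS"][0] + splits["Hansel-ZS"][0]
print(test_mentions)  # 9975

# Average number of mentions per distinct entity in each split.
mentions_per_entity = {
    name: mentions / entities for name, (mentions, entities) in splits.items()
}
for name, ratio in mentions_per_entity.items():
    print(f"{name}: {ratio:.2f}")
```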
## Citation
If you find our dataset useful, please cite us.
```bibtex
@inproceedings{xu2022hansel,
author = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing},
title = {Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark},
year = {2023},
publisher = {Association for Computing Machinery},
url = {https://doi.org/10.1145/3539597.3570418},
booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining},
pages = {832–840}
}
```
Provider:
HIT-TMG
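Summary of original information
Dataset overview: Hansel
Dataset description
- Name: Hansel
- Language: Chinese
- License: cc-by-sa-4.0
- Task category: text-retrieval
- Task ID: entity-linking-retrieval
- Dataset size: 1M<n<10M and 1K<n<10K
- Data source: original
- Dataset features:
  - Focuses on tail entities and emerging entities
  - The test set contains few-shot (FS) and zero-shot (ZS) slices, roughly 10K examples in total, and uses Wikidata as the knowledge base
  - The training and validation sets come from Wikipedia hyperlinks
Dataset structure
Data instances
- Fields: id, text, start, end, mention, gold_id, source, domain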