Hansel

ModelScope community · Updated 2025-12-04 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/HIT-TMG/Hansel
Description:
# Dataset Card for "Hansel"

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Splits](#data-splits)
- [Citation](#citation)

## Dataset Description
- **Homepage:** https://github.com/HITsz-TMG/Hansel
- **Paper:** https://arxiv.org/abs/2207.13005

Hansel is a high-quality, human-annotated Chinese entity linking (EL) dataset focusing on tail entities and emerging entities:

- The test set contains few-shot (FS) and zero-shot (ZS) slices, has 10K examples, and uses Wikidata as the corresponding knowledge base.
- The training and validation sets are from Wikipedia hyperlinks, useful for large-scale pretraining of Chinese EL systems.

Please see our [WSDM 2023](https://www.wsdm-conference.org/2023/) paper [**"Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark"**](https://dl.acm.org/doi/10.1145/3539597.3570418) to learn more about our dataset.

For models in the paper and our processed knowledge base, please see our [GitHub repository](https://github.com/HITsz-TMG/Hansel).

## Dataset Structure

### Data Instances

    {
      "id": "hansel-eval-zs-1463",
      "text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,成为了流媒体巨头旗下的新片。近日,这部备受关注的影片确定了自己的档期:2021年。虽然具体时间未定,但影片却已经实实在在地向前迈出了一步。",
      "start": 29,
      "end": 32,
      "mention": "匹诺曹",
      "gold_id": "Q73895818",
      "source": "https://www.1905.com/news/20181107/1325389.shtml",
      "domain": "news"
    }

### Data Splits

| | # Mentions | # Entities | Domain |
| ---- | ---- | ---- | ---- |
| Train | 9,879,813 | 541,058 | Wikipedia |
| Validation | 9,674 | 6,320 | Wikipedia |
| Hansel-FS | 5,260 | 2,720 | News, Social Media |
| Hansel-ZS | 4,715 | 4,046 | News, Social Media, E-books, etc. |

## Citation

If you find our dataset useful, please cite us.

```bibtex
@inproceedings{xu2022hansel,
  author    = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing},
  title     = {Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark},
  year      = {2023},
  publisher = {Association for Computing Machinery},
  url       = {https://doi.org/10.1145/3539597.3570418},
  booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining},
  pages     = {832--840}
}
```
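
As a reading aid for the record shown under Data Instances, here is a minimal Python sketch of how its fields might be interpreted. It assumes that `start` and `end` are character offsets into `text` with `end` exclusive; this is inferred only from the single example on this card, not from official documentation. The helper names `check_span` and `wikidata_url` are hypothetical and are not part of any Hansel tooling. The last lines add up the two test-slice mention counts from the Data Splits table.

```python
import json

# Record copied verbatim from the "Data Instances" section of this card.
RECORD = (
    '{"id": "hansel-eval-zs-1463", '
    '"text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,'
    '成为了流媒体巨头旗下的新片。近日,这部备受关注的影片确定了自己的档期:2021年。'
    '虽然具体时间未定,但影片却已经实实在在地向前迈出了一步。", '
    '"start": 29, "end": 32, "mention": "匹诺曹", "gold_id": "Q73895818", '
    '"source": "https://www.1905.com/news/20181107/1325389.shtml", "domain": "news"}'
)


def check_span(example: dict) -> bool:
    """Return True if text[start:end] reproduces the mention string.

    Assumption: offsets count characters and `end` is exclusive.
    """
    span = example["text"][example["start"]:example["end"]]
    return span == example["mention"]


def wikidata_url(gold_id: str) -> str:
    """Build the Wikidata page URL for a QID such as 'Q73895818'."""
    return f"https://www.wikidata.org/wiki/{gold_id}"


example = json.loads(RECORD)
print(check_span(example))
print(wikidata_url(example["gold_id"]))

# Mention counts of the two test slices, taken from the Data Splits table.
TEST_SLICES = {"Hansel-FS": 5260, "Hansel-ZS": 4715}
total_test_mentions = sum(TEST_SLICES.values())
print(total_test_mentions)
```

The summed slice sizes give 9,975 mentions, which is consistent with the "10K examples" figure quoted in the dataset description. If the offset assumption did not hold for some record, `check_span` would return `False`, which makes it a cheap sanity check to run before training or evaluation.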
Provider:
maas
Created:
2025-01-07