HIT-TMG/Hansel | Natural Language Processing Dataset | Entity Linking Dataset

Source: Hugging Face. Updated 2023-03-13; indexed 2024-03-04.

Natural language processing

Entity linking

Download link:

https://hf-mirror.com/datasets/HIT-TMG/Hansel

Resource description:

---
annotations_creators:
- crowdsourced
- found
language:
- zh
language_creators:
- found
- crowdsourced
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
paperswithcode_id: hansel
pretty_name: Hansel
size_categories:
- 1M<n<10M
- 1K<n<10K
source_datasets:
- original
tags: []
task_categories:
- text-retrieval
task_ids:
- entity-linking-retrieval
dataset_info:
- config_name: wiki
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: start
    dtype: int64
  - name: end
    dtype: int64
  - name: mention
    dtype: string
  - name: gold_id
    dtype: string
  splits:
  - name: train
  - name: validation
- config_name: hansel-few-shot
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: start
    dtype: int64
  - name: end
    dtype: int64
  - name: mention
    dtype: string
  - name: gold_id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: test
- config_name: hansel-zero-shot
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: start
    dtype: int64
  - name: end
    dtype: int64
  - name: mention
    dtype: string
  - name: gold_id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: test
---

# Dataset Card for "Hansel"

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Splits](#data-splits)
- [Citation](#citation)

## Dataset Description

- **Homepage:** https://github.com/HITsz-TMG/Hansel
- **Paper:** https://arxiv.org/abs/2207.13005

Hansel is a high-quality, human-annotated Chinese entity linking (EL) dataset that focuses on tail entities and emerging entities:

- The test set contains few-shot (FS) and zero-shot (ZS) slices, has 10K examples, and uses Wikidata as the corresponding knowledge base.
- The training and validation sets are from Wikipedia hyperlinks, useful for large-scale pretraining of Chinese EL systems.

Please see our [WSDM 2023](https://www.wsdm-conference.org/2023/) paper [**"Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark"**](https://dl.acm.org/doi/10.1145/3539597.3570418) to learn more about our dataset.

For models in the paper and our processed knowledge base, please see our [GitHub repository](https://github.com/HITsz-TMG/Hansel).

## Dataset Structure

### Data Instances

    {
      "id": "hansel-eval-zs-1463",
      "text": "1905电影网讯已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》，在上个月顺利被网飞公司买下，成为了流媒体巨头旗下的新片。近日，这部备受关注的影片确定了自己的档期：2021年。虽然具体时间未定，但影片却已经实实在在地向前迈出了一步。",
      "start": 29,
      "end": 32,
      "mention": "匹诺曹",
      "gold_id": "Q73895818",
      "source": "https://www.1905.com/news/20181107/1325389.shtml",
      "domain": "news"
    }

### Data Splits

| Split | # Mentions | # Entities | Domain |
| ---- | ---- | ---- | ---- |
| Train | 9,879,813 | 541,058 | Wikipedia |
| Validation | 9,674 | 6,320 | Wikipedia |
| Hansel-FS | 5,260 | 2,720 | News, Social Media |
| Hansel-ZS | 4,715 | 4,046 | News, Social Media, E-books, etc. |

## Citation

If you find our dataset useful, please cite us.

```bibtex
@inproceedings{xu2022hansel,
  author = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing},
  title = {Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark},
  year = {2023},
  publisher = {Association for Computing Machinery},
  url = {https://doi.org/10.1145/3539597.3570418},
  booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining},
  pages = {832–840}
}
```
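The card metadata declares three configurations (`wiki`, `hansel-few-shot`, `hansel-zero-shot`) and their splits. As a minimal sketch, the helper below validates a configuration/split pair against that metadata before calling `load_dataset` from the Hugging Face `datasets` library. The helper names `check_config_split` and `load_hansel` are hypothetical. Whether the download succeeds as written depends on the installed library version and on how the repository is packaged, so treat the loading call as an assumption rather than a tested recipe.

```python
# Config and split names are copied from the dataset card metadata.
CONFIG_SPLITS = {
    "wiki": ("train", "validation"),
    "hansel-few-shot": ("test",),
    "hansel-zero-shot": ("test",),
}


def check_config_split(config: str, split: str) -> None:
    """Raise ValueError if the card does not declare this config/split pair."""
    if config not in CONFIG_SPLITS:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIG_SPLITS)}"
        )
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(
            f"config {config!r} has splits {CONFIG_SPLITS[config]}, not {split!r}"
        )


def load_hansel(config: str, split: str):
    """Load one Hansel split from the Hugging Face Hub (needs network access)."""
    check_config_split(config, split)
    # Imported lazily so the validation helper works without the library installed.
    from datasets import load_dataset

    # Assumption: "HIT-TMG/Hansel" (taken from the download URL) is the Hub id.
    return load_dataset("HIT-TMG/Hansel", config, split=split)
```

For example, `load_hansel("hansel-zero-shot", "test")` would request the zero-shot test slice, while `load_hansel("wiki", "test")` fails fast with a `ValueError` because the `wiki` configuration only declares `train` and `validation`.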

Provider:

HIT-TMG

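Summary of original information

Dataset overview: Hansel

Dataset description

Name: Hansel
Language: Chinese
License: cc-by-sa-4.0
Task category: text-retrieval
Task ID: entity-linking-retrieval
Dataset size: 1M<n<10M and 1K<n<10K
Data source: original
Dataset characteristics:
- Focuses on tail entities and emerging entities
- The test set contains few-shot (FS) and zero-shot (ZS) slices, 10K examples in total, and uses Wikidata as the knowledge base
- The training and validation sets come from Wikipedia hyperlinks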

Dataset structure

Data instances

Fields: id, text, start, end, mention, gold_id, source, domain
Example (JSON):

    {
      "id": "hansel-eval-zs-1463",
      "text": "1905电影网讯已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》，在上个月顺利被网飞公司买下，成为了流媒体巨头旗下的新片。近日，这部备受关注的影片确定了自己的档期：2021年。虽然具体时间未定，但影片却已经实实在在地向前迈出了一步。",
      "start": 29,
      "end": 32,
      "mention": "匹诺曹",
      "gold_id": "Q73895818",
      "source": "https://www.1905.com/news/20181107/1325389.shtml",
      "domain": "news"
    }
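The `start`, `end`, and `mention` fields suggest a simple consistency check. The sketch below assumes that `start` and `end` are character offsets into `text` with `end` exclusive. The source does not state this, but it fits the example above, where `end - start` is 3 and the mention has three characters. The helper name `span_matches` and the toy record are hypothetical and are not taken from the dataset.

```python
def span_matches(record: dict) -> bool:
    """Return True if text[start:end] equals the mention string.

    Assumes character offsets with an exclusive end index.
    """
    text, start, end = record["text"], record["start"], record["end"]
    # Reject offsets that fall outside the text or are out of order.
    if not (0 <= start <= end <= len(text)):
        return False
    return text[start:end] == record["mention"]


# Hypothetical toy record for illustration only (not a dataset row).
toy = {
    "id": "toy-1",
    "text": "《匹诺曹》定档2021年",
    "start": 1,
    "end": 4,
    "mention": "匹诺曹",
    "gold_id": "Q73895818",
}

print(span_matches(toy))                           # True
print(span_matches({**toy, "start": 2, "end": 5}))  # False
```

A check like this is best run against the raw dataset files, since a copy of the text rendered on a web page may have lost or gained whitespace and shifted the offsets.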

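Data splits

| Split | # Mentions | # Entities | Domain |
| ---- | ---- | ---- | ---- |
| Train | 9,879,813 | 541,058 | Wikipedia |
| Validation | 9,674 | 6,320 | Wikipedia |
| Hansel-FS | 5,260 | 2,720 | News, Social Media |
| Hansel-ZS | 4,715 | 4,046 | News, Social Media, E-books, etc. |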
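The "10K examples" figure for the test set can be checked against the split table. The short sketch below copies the mention and entity counts from the table, sums the two test slices, and computes a mentions-per-entity ratio for each split. The names `SPLITS` and `mentions_per_entity` are illustrative.

```python
# (mentions, entities) per split, copied from the table above.
SPLITS = {
    "Train": (9_879_813, 541_058),
    "Validation": (9_674, 6_320),
    "Hansel-FS": (5_260, 2_720),
    "Hansel-ZS": (4_715, 4_046),
}


def mentions_per_entity(split: str) -> float:
    """Average number of mentions per distinct entity in a split."""
    mentions, entities = SPLITS[split]
    return mentions / entities


# The two test slices together hold just under 10K mentions.
test_mentions = SPLITS["Hansel-FS"][0] + SPLITS["Hansel-ZS"][0]
print(test_mentions)  # 9975

for name in SPLITS:
    print(f"{name}: {mentions_per_entity(name):.2f} mentions per entity")
```

The ratio is much higher for the Wikipedia training split than for the two test slices, which matches the dataset's stated focus on tail and emerging entities.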

Citation

BibTeX:

    @inproceedings{xu2022hansel,
      author = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing},
      title = {Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark},
      year = {2023},
      publisher = {Association for Computing Machinery},
      url = {https://doi.org/10.1145/3539597.3570418},
      booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining},
      pages = {832–840}
    }


