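bigbio/scifact | Scientific claim verification dataset | Text classification dataset

Source: hugging_face. Updated 2022-12-22; indexed 2024-03-04.

Scientific claim verification

Text classification

Download link: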

https://hf-mirror.com/datasets/bigbio/scifact
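A minimal sketch of loading one config with the Hugging Face `datasets` library. It rests on several assumptions: that the link above mirrors the Hub repository `bigbio/scifact`, that config names are the lowercase, underscore-joined forms of the config headings in the dataset card (for example `scifact_claims_source`), and that the installed `datasets` version can load this repository. `config_name` and `load_scifact` are hypothetical helper names, not part of the dataset or the library.

```python
def config_name(heading: str) -> str:
    """Turn a card heading such as 'Scifact Claims Source' into a config name.

    Assumption: config names are the lowercase, underscore-joined headings.
    """
    return "_".join(heading.lower().split())


def load_scifact(heading: str):
    """Load one SciFact config from the Hub (requires network access)."""
    # Imported lazily so config_name() works without `datasets` installed.
    from datasets import load_dataset

    # Depending on the installed `datasets` version, repositories that ship
    # a loading script may need extra options or may not be supported.
    return load_dataset("bigbio/scifact", name=config_name(heading))
```

Calling `load_scifact("Scifact Labelprediction Bigbio Pairs")` would then request the config named `scifact_labelprediction_bigbio_pairs`, if the naming assumption holds.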


Resource description:

---
language:
- en
bigbio_language:
- English
license: cc-by-nc-2.0
multilinguality: monolingual
bigbio_license_shortname: CC_BY_NC_2p0
pretty_name: SciFact
homepage: https://scifact.apps.allenai.org/
bigbio_pubmed: False
bigbio_public: True
bigbio_tasks:
- TEXT_PAIRS_CLASSIFICATION
---

# Dataset Card for SciFact

## Dataset Description

- **Homepage:** https://scifact.apps.allenai.org/
- **Pubmed:** False
- **Public:** True
- **Tasks:** TXT2CLASS

### Scifact Corpus Source

SciFact is a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts, and annotated with labels and rationales. This config has abstracts and document ids.

### Scifact Claims Source

{_DESCRIPTION_BASE} This config connects the claims to the evidence and doc ids.

### Scifact Rationale Bigbio Pairs

{_DESCRIPTION_BASE} This task is the following: given a claim and a text span composed of one or more sentences from an abstract, predict a label from ("rationale", "not_rationale") indicating if the span is evidence (can be supporting or refuting) for the claim. This roughly corresponds to the second task outlined in Section 5 of the paper.

### Scifact Labelprediction Bigbio Pairs

{_DESCRIPTION_BASE} This task is the following: given a claim and a text span composed of one or more sentences from an abstract, predict a label from ("SUPPORT", "NOINFO", "CONTRADICT") indicating if the span supports, provides no info, or contradicts the claim. This roughly corresponds to the third task outlined in Section 5 of the paper.

## Citation Information

```
@article{wadden2020fact,
  author    = {David Wadden and Shanchuan Lin and Kyle Lo and Lucy Lu Wang and Madeleine van Zuylen and Arman Cohan and Hannaneh Hajishirzi},
  title     = {Fact or Fiction: Verifying Scientific Claims},
  year      = {2020},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2020.emnlp-main.609},
  doi       = {10.18653/v1/2020.emnlp-main.609},
  pages     = {7534--7550},
  biburl    = {},
  bibsource = {}
}
```

Provider:

bigbio

Summary of original information

Dataset overview

Basic information

Name: SciFact
Language: English
License: CC BY-NC 2.0
Multilinguality: monolingual
Public status: public
PubMed: no

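Dataset contents

Dataset size: 1.4K expert-written scientific claims, each paired with an evidence-containing abstract and annotated with labels and rationales.
Data structure:
- SciFact Corpus Source: contains the abstracts and document IDs.
- SciFact Claims Source: connects the claims to the evidence and document IDs.
- SciFact Rationale Bigbio Pairs: given a claim and a text span (one or more sentences from an abstract), decide whether the span is evidence for the claim; labels are ("rationale", "not_rationale").
- SciFact Labelprediction Bigbio Pairs: given a claim and a text span, decide whether the span supports, provides no information on, or contradicts the claim; labels are ("SUPPORT", "NOINFO", "CONTRADICT").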
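To make the rationale task concrete, here is a minimal sketch of how claim/span pairs could be built from one abstract. The field names `text_1`, `text_2`, and `label`, the `Pair` class, and the `rationale_pairs` helper are assumptions for illustration, not taken from the dataset card, and the sketch simplifies each span to a single sentence even though the card allows multi-sentence spans.

```python
from dataclasses import dataclass
from typing import List, Set

# Label names as given in the dataset card.
RATIONALE_LABELS = ("rationale", "not_rationale")


@dataclass
class Pair:
    text_1: str  # the claim (field name is an assumption)
    text_2: str  # a span from the abstract (field name is an assumption)
    label: str


def rationale_pairs(claim: str, sentences: List[str], evidence_idx: Set[int]) -> List[Pair]:
    """Pair the claim with every abstract sentence.

    Sentences whose index is in `evidence_idx` are labeled "rationale";
    all others are labeled "not_rationale".
    """
    return [
        Pair(claim, sent, RATIONALE_LABELS[0] if i in evidence_idx else RATIONALE_LABELS[1])
        for i, sent in enumerate(sentences)
    ]
```

A classifier for this task would then take `text_1` and `text_2` as input and predict `label`.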

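Task types

Main task: TEXT_PAIRS_CLASSIFICATION
Specific tasks:
- Decide whether a text span is evidence for a claim.
- Decide whether a text span supports, provides no information on, or contradicts a claim.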
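As an illustration of how predictions for the three-way label task might be scored, the sketch below computes accuracy and a confusion table over the labels "SUPPORT", "NOINFO", and "CONTRADICT". The choice of metrics and the helper names `accuracy` and `confusion` are assumptions for illustration; they are not the evaluation protocol of the SciFact paper.

```python
from collections import Counter
from typing import Dict, List

# Label names as given in the dataset card.
LABELS = ("SUPPORT", "NOINFO", "CONTRADICT")


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def confusion(gold: List[str], pred: List[str]) -> Dict[str, Counter]:
    """Count predictions per gold label: table[gold_label][predicted_label]."""
    table = {label: Counter() for label in LABELS}
    for g, p in zip(gold, pred):
        table[g][p] += 1  # raises KeyError if a gold label is not in LABELS
    return table
```

The confusion table shows which labels a model mixes up, for example whether "NOINFO" spans are mostly mistaken for "SUPPORT".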


