DataScienceUIBK/ComplexTempQA | Temporal question answering dataset | Natural language processing dataset

Source: Hugging Face. Updated 2024-09-22. Indexed 2024-06-15.

Temporal question answering

Natural language processing

Download link:

https://hf-mirror.com/datasets/DataScienceUIBK/ComplexTempQA

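Resource summary:

ComplexTempQA is a large-scale dataset designed for complex temporal question answering (TQA). It contains over 100 million question-answer pairs covering events, entities, and time periods from 1987 to 2023. Questions fall into three main types (attribute, comparison, and counting), each further divided into subtypes related to events, entities, or time periods. The dataset also includes rich metadata, such as unique identifiers, question text, answers, question types, difficulty ratings, and timeframes. It can be used to evaluate and train the temporal reasoning abilities of large language models, and it supports research in temporal question answering, information retrieval, and language understanding.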

Provider:

DataScienceUIBK

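Summary of original information

ComplexTempQA dataset

ComplexTempQA is a large-scale dataset for complex temporal question answering (TQA). It contains over 100 million question-answer pairs, making it one of the largest datasets in the TQA field. The dataset was generated from Wikipedia and Wikidata and covers a 36-year span (1987-2023).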

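Dataset description

ComplexTempQA divides questions into three main types:

Attribute questions
Comparison questions
Counting questions

Each type is further subdivided according to whether the question relates to an event, an entity, or a time period.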

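Question types and counts

Question type	Subtype	Count
Attribute	Event	83,798
Attribute	Entity	84,079
Attribute	Time	9,454
Comparison	Event	25,353,340
Comparison	Entity	74,678,117
Comparison	Time	54,022,952
Counting	Event	18,325
Counting	Entity	10,798
Counting	Time	12,732
Multi-hop		76,933
Unnamed event		8,707,123
Total		100,228,457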

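Metadata

id: Unique identifier of each question.
question: Text of the question.
answer: Answer to the question.
type: Question type according to the dataset taxonomy.
rating: Difficulty rating of the question (0 means easy, 1 means hard).
timeframe: Time range the question relates to.
question_entity: List of Wikidata IDs for the entities in the question.
answer_entity: List of Wikidata IDs for the entities in the answer.
question_country: List of Wikidata IDs for the countries associated with the entities or events in the question.
answer_country: List of Wikidata IDs for the countries associated with the entities or events in the answer.
is_unnamed: Indicates whether the question contains an implicitly described event (1 means yes, 0 means no).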
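As a rough illustration of how the metadata fields above might be handled in code, the following Python sketch wraps one record in a dataclass. The class name `TempQARecord` and the helpers `parse_record` and `hard_unnamed` are hypothetical names introduced here, not part of the dataset's own tooling. The concrete formats of `answer` and `timeframe` are not specified in this listing, so the sketch leaves both untyped rather than guessing.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class TempQARecord:
    """One question-answer pair with the metadata fields listed above (hypothetical wrapper)."""
    id: str
    question: str
    answer: Any      # ASSUMPTION: answer format is not specified here, so it stays untyped
    type: str        # question type from the dataset taxonomy
    rating: int      # 0 = easy, 1 = hard
    timeframe: Any   # ASSUMPTION: timeframe format is not specified here, so it stays untyped
    question_entity: List[str] = field(default_factory=list)
    answer_entity: List[str] = field(default_factory=list)
    question_country: List[str] = field(default_factory=list)
    answer_country: List[str] = field(default_factory=list)
    is_unnamed: int = 0  # 1 = the question describes an event implicitly


LIST_FIELDS = ("question_entity", "answer_entity", "question_country", "answer_country")


def parse_record(raw: Dict[str, Any]) -> TempQARecord:
    """Build a record from a raw dict; missing list fields default to empty lists."""
    return TempQARecord(
        id=str(raw["id"]),
        question=raw["question"],
        answer=raw["answer"],
        type=raw["type"],
        rating=int(raw["rating"]),
        timeframe=raw.get("timeframe"),
        is_unnamed=int(raw.get("is_unnamed", 0)),
        **{name: list(raw.get(name) or []) for name in LIST_FIELDS},
    )


def hard_unnamed(records: Iterable[TempQARecord]) -> List[TempQARecord]:
    """Keep only questions rated hard that contain an implicitly described event."""
    return [r for r in records if r.rating == 1 and r.is_unnamed == 1]
```

Filtering on `rating` and `is_unnamed` in this way is one possible route to a harder evaluation subset; other selections over `type` or the country fields would follow the same pattern.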

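Dataset characteristics

Size

ComplexTempQA contains over 100 million question-answer pairs, focusing on events, entities, and time periods between 1987 and 2023.

Complexity

The questions require advanced reasoning skills, including multi-hop question answering, temporal aggregation, and comparison across time.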
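To make these reasoning skills concrete, here is a small Python sketch of the underlying operations on toy data. The events, dates, and function names are all invented for illustration and are not taken from the dataset; real questions would first require resolving the events and their dates, for example from Wikidata.

```python
from datetime import date

# Toy, invented events mapped to start dates. Not taken from the dataset.
EVENTS = {
    "event_a": date(1990, 5, 1),
    "event_b": date(2001, 9, 15),
    "event_c": date(2001, 11, 3),
}


def happened_before(first: str, second: str, events=EVENTS) -> bool:
    """Comparison across time: did `first` start before `second`?"""
    return events[first] < events[second]


def count_in_years(start_year: int, end_year: int, events=EVENTS) -> int:
    """Temporal aggregation: how many events started within the inclusive year range?"""
    return sum(start_year <= d.year <= end_year for d in events.values())


def years_between(first: str, second: str, events=EVENTS) -> int:
    """Multi-hop style step: look up two dates, then combine them into one answer."""
    return abs(events[first].year - events[second].year)
```

A question answering model has to perform the equivalent of these lookups and combinations implicitly, which is what makes such questions harder than single-fact retrieval.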

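Taxonomy

The dataset follows its own taxonomy, which classifies questions into attribute, comparison, and counting types to give broad coverage of temporal queries.

Evaluation

The dataset has been evaluated for readability, for how easy the questions are to answer before and after a web search, and for overall clarity. Human raters assessed a subset of the questions to ensure high quality.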

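Uses

Evaluation and training

ComplexTempQA can be used to:

Evaluate the temporal reasoning abilities of large language models (LLMs)
Fine-tune language models to improve temporal understanding
Develop and test retrieval-augmented generation (RAG) systems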
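For the evaluation use case, one simple option is to score model outputs by exact match and break the result down by question type. The sketch below assumes records are dicts with the `id`, `type`, and `answer` fields described in the metadata, and that predictions come as a dict keyed by `id`. Exact match after light normalization is a choice made for this sketch, not a metric prescribed by the dataset, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def normalize(text) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(str(text).lower().split())


def accuracy_by_type(examples: Iterable[Mapping], predictions: Mapping[str, str]) -> Dict[str, float]:
    """Exact-match accuracy per question type; a missing prediction counts as wrong."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ex in examples:
        qtype = ex["type"]
        totals[qtype] += 1
        predicted = predictions.get(ex["id"], "")
        if normalize(predicted) == normalize(ex["answer"]):
            correct[qtype] += 1
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}
```

Reporting accuracy per type rather than as a single number matters here because the comparison questions vastly outnumber the attribute and counting questions, so an overall average would be dominated by them.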

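Research applications

The dataset supports research in:

Temporal question answering
Information retrieval
Language understanding

Adaptation and continual learning

The temporal metadata in ComplexTempQA helps in developing online adaptation and continual training methods, and it facilitates the exploration of time-based learning and evaluation.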
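One way the temporal metadata could support continual training is by ordering or splitting questions chronologically. The Python sketch below assumes, purely for illustration, that each record's `timeframe` has already been parsed into a `(start_year, end_year)` pair; the actual format of the field is not specified in this listing, so the `year_of` argument lets a caller substitute a different accessor. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def end_year(record: Dict) -> int:
    # ASSUMPTION: timeframe is a (start_year, end_year) pair; adapt to the real format.
    return record["timeframe"][1]


def split_at_cutoff(
    records: Sequence[Dict], cutoff_year: int, year_of: Callable[[Dict], int] = end_year
) -> Tuple[List[Dict], List[Dict]]:
    """Split into questions up to the cutoff year (training) and after it (held-out evaluation)."""
    earlier = [r for r in records if year_of(r) <= cutoff_year]
    later = [r for r in records if year_of(r) > cutoff_year]
    return earlier, later


def yearly_batches(
    records: Sequence[Dict], year_of: Callable[[Dict], int] = end_year
) -> List[Tuple[int, List[Dict]]]:
    """Group questions by year, in chronological order, for step-by-step continual training."""
    buckets = defaultdict(list)
    for r in records:
        buckets[year_of(r)].append(r)
    return [(year, buckets[year]) for year in sorted(buckets)]
```

Training on the earlier split and evaluating on the later one gives a simple check of how well a model handles questions about periods it was not trained on.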

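AI-collected summary

Dataset introduction

Construction

ComplexTempQA was built from Wikipedia and Wikidata as a collection of question-answer pairs covering historical events, entities, and time spans from 1987 to 2023. It uses three main question types (attribute, comparison, and counting), each further divided into subtypes related to events, entities, or time, yielding over 100 million question-answer pairs intended to provide broad data support for complex temporal question answering.

Features

ComplexTempQA is notable for its scale, the complexity of its question types, and its detailed metadata. Beyond the large number of question-answer pairs, the questions require advanced reasoning skills, including multi-hop question answering, temporal aggregation, and comparison across time. The dataset's taxonomy classifies questions into attribute, comparison, and counting types to give broad coverage of temporal queries. In addition, a subset of the questions was assessed by human raters as a quality check.

Usage

Users can obtain the ComplexTempQA dataset and the accompanying code from the dataset's GitHub page. The dataset is suited to evaluating the temporal reasoning abilities of large language models, fine-tuning language models to strengthen their temporal understanding, and developing and testing retrieval-augmented generation systems. Its temporal metadata also supports online adaptation and continual training of language models, and helps in exploring time-based learning and evaluation methods.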

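Background and challenges

Background overview

ComplexTempQA is a large-scale resource for research on complex temporal question answering (TQA), motivated by the need to process temporal information. It was developed by the DataScienceUIBK team and draws on Wikipedia and Wikidata to provide over 100 million question-answer pairs spanning 1987 to 2023. It aims to advance the understanding and answering of questions about events, entities, and time periods, and it is relevant to natural language processing work on temporal understanding and question answering.

Current challenges

The main challenge posed by ComplexTempQA lies in the complexity and diversity of its questions, which require models to have advanced reasoning abilities such as multi-hop question answering, temporal aggregation, and comparison across time. During construction, ensuring accurate data, a reasonable distribution of questions, and relevant answers was a further challenge. In addition, making efficient use of the dataset's temporal metadata to support online adaptation and continual learning strategies is important for improving how large language models handle temporal information.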

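Common scenarios

Typical use cases

In research on complex temporal question answering, ComplexTempQA, with its scale and fine-grained categories, serves as a resource for testing and training the temporal reasoning abilities of large language models. Researchers use it to evaluate how models perform on complex queries involving events, entities, and time spans.

Practical applications

In practice, ComplexTempQA supports work in areas such as information retrieval and language understanding. Developers can use it to build question answering systems for scenarios such as news analysis, historical research, and event tracking, with the aim of improving the efficiency and accuracy of information processing.

Derived related work

ComplexTempQA has given rise to related research, including but not limited to evaluating the temporal reasoning abilities of large language models, developing retrieval-augmented generation systems, and studying online adaptation and continual training methods. This work further extends the range of uses for the dataset.

The above content was collected and summarized by AI.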
