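DataScienceUIBK/ComplexTempQA | Temporal question answering dataset | Natural language processing dataset
ComplexTempQA Dataset
ComplexTempQA is a large-scale dataset for complex temporal question answering (TQA). It contains over 100 million question-answer pairs, making it one of the largest datasets in the TQA field. The dataset is generated from Wikipedia and Wikidata and covers a 36-year span (1987-2023).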
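Dataset Description
ComplexTempQA divides questions into three main types:
- Attribute questions
- Comparison questions
- Counting questions
Each type is further subdivided according to whether the question relates to an event, an entity, or a time period.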
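Question Types and Counts
Question type | Subtype | Count |
---|---|---|
Attribute | Event | 83,798 |
Attribute | Entity | 84,079 |
Attribute | Time | 9,454 |
Comparison | Event | 25,353,340 |
Comparison | Entity | 74,678,117 |
Comparison | Time | 54,022,952 |
Counting | Event | 18,325 |
Counting | Entity | 10,798 |
Counting | Time | 12,732 |
Multi-hop | | 76,933 |
Unnamed event | | 8,707,123 |
Total | | 100,228,457 |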
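Metadata
- id: A unique identifier for each question.
- question: The text of the question.
- answer: The answer to the question.
- type: The question type according to the dataset taxonomy.
- rating: The difficulty rating of the question (0 means easy, 1 means hard).
- timeframe: The time range the question relates to.
- question_entity: A list of Wikidata IDs for the entities in the question.
- answer_entity: A list of Wikidata IDs for the entities in the answer.
- question_country: A list of Wikidata IDs for the countries associated with the entities or events in the question.
- answer_country: A list of Wikidata IDs for the countries associated with the entities or events in the answer.
- is_unnamed: Indicates whether the question contains an implicitly described event (1 means yes, 0 means no).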
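The fields above can be pictured as a small record type. The following is a minimal sketch, not the dataset's official loader: the Python types (for example, a string `answer` and a `(start_year, end_year)` tuple for `timeframe`) are assumptions, the helper `select` is hypothetical, and the sample records are made-up placeholders rather than real dataset rows.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Record:
    """One question-answer pair, mirroring the metadata fields listed above.

    The concrete types are assumptions made for illustration only.
    """
    id: str
    question: str
    answer: str                      # assumed to be a plain string
    type: str                        # question type from the taxonomy
    rating: int                      # 0 = easy, 1 = hard
    timeframe: Tuple[int, int]       # assumed (start_year, end_year)
    question_entity: List[str] = field(default_factory=list)
    answer_entity: List[str] = field(default_factory=list)
    question_country: List[str] = field(default_factory=list)
    answer_country: List[str] = field(default_factory=list)
    is_unnamed: int = 0              # 1 = implicitly described event


def select(records: List[Record],
           rating: Optional[int] = None,
           is_unnamed: Optional[int] = None) -> List[Record]:
    """Hypothetical helper: keep records matching the given flags.

    A filter left as None is not applied.
    """
    selected = []
    for record in records:
        if rating is not None and record.rating != rating:
            continue
        if is_unnamed is not None and record.is_unnamed != is_unnamed:
            continue
        selected.append(record)
    return selected


# Made-up placeholder records (not taken from the dataset).
SAMPLE = [
    Record(id="ex-1", question="Placeholder easy question?", answer="A",
           type="attribute", rating=0, timeframe=(1990, 1995)),
    Record(id="ex-2", question="Placeholder hard question?", answer="B",
           type="comparison", rating=1, timeframe=(2001, 2010), is_unnamed=1),
]

hard_unnamed = select(SAMPLE, rating=1, is_unnamed=1)
```

Keeping the flags as optional keyword arguments lets the same helper pick out, say, only the hard questions or only the questions about unnamed events.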
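Dataset Characteristics
Size
ComplexTempQA contains over 100 million question-answer pairs, focusing on events, entities, and time periods between 1987 and 2023.
Complexity
The questions require advanced reasoning skills, including multi-hop question answering, temporal aggregation, and comparison across time.
Taxonomy
The dataset follows its own taxonomy, which divides questions into attribute, comparison, and counting types to give broad coverage of temporal queries.
Evaluation
The dataset has been evaluated for readability, for how easy the questions are to answer before and after a web search, and for overall clarity. Human raters assessed a subset of the questions to ensure quality.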
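Uses
Evaluation and Training
ComplexTempQA can be used to:
- Evaluate the temporal reasoning abilities of large language models (LLMs)
- Fine-tune language models to improve their temporal understanding
- Develop and test retrieval-augmented generation (RAG) systems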
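As an illustration of the first use, the sketch below scores a model by exact-match accuracy over question-answer pairs. It is one simple metric under stated assumptions, not the evaluation protocol of the dataset authors: `predict` stands in for any model call and is hypothetical, records are assumed to be dictionaries with `question` and `answer` keys, and the normalization rules are an arbitrary choice.

```python
import re
from typing import Callable, Dict, Iterable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)


def exact_match_accuracy(records: Iterable[Dict[str, str]],
                         predict: Callable[[str], str]) -> float:
    """Fraction of records whose normalized prediction equals the answer.

    `predict` is a hypothetical callable mapping a question to an answer
    string, for example a thin wrapper around an LLM or a RAG pipeline.
    """
    total = 0
    correct = 0
    for record in records:
        total += 1
        prediction = normalize(predict(record["question"]))
        if prediction == normalize(str(record["answer"])):
            correct += 1
    # An empty input yields 0.0 rather than a division error.
    return correct / total if total else 0.0
```

Exact match is strict: a correct answer phrased differently counts as wrong, so a softer metric such as token-level overlap may suit free-form answers better.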
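Research Applications
The dataset supports research in:
- Temporal question answering
- Information retrieval
- Language understanding
Adaptation and Continual Learning
ComplexTempQA's temporal metadata supports the development of online adaptation and continual training methods, and enables the exploration of time-based learning and evaluation.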
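One way the temporal metadata could support such experiments is a chronological split: train on questions about an earlier period and evaluate on a later one. The sketch below is a hypothetical illustration and assumes each record's `timeframe` is a `(start_year, end_year)` tuple, which may differ from the stored format; the function name `split_by_cutoff` is also an assumption.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_cutoff(records: List[Record],
                    cutoff_year: int) -> Tuple[List[Record], List[Record]]:
    """Split records into (past, future) around a cutoff year.

    Records whose timeframe ends on or before the cutoff go to `past`;
    records whose timeframe starts after it go to `future`. Records that
    straddle the cutoff are left out so the two sets do not overlap in time.
    """
    past: List[Record] = []
    future: List[Record] = []
    for record in records:
        start_year, end_year = record["timeframe"]  # assumed (start, end)
        if end_year <= cutoff_year:
            past.append(record)
        elif start_year > cutoff_year:
            future.append(record)
    return past, future
```

Dropping the straddling records costs some data but keeps the evaluation set free of time periods seen during training.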
