five

DuQM|问题匹配数据集|鲁棒性评估数据集

收藏
github2021-09-01 更新2025-02-08 收录
问题匹配
鲁棒性评估
下载链接:
https://github.com/baidu/DuReader/tree/master/DuQM
下载链接
链接失效反馈
资源简介:
DuQM数据集是一个用于问题匹配的中文鲁棒性数据集,包含了嵌入语言扰动的自然问题,用于评估模型在这一特定任务中的鲁棒性。DuQM包含三个主要类别和十三个子类别的语言扰动类型,能够全面评估不同模型的表现。

The DuQM dataset is a Chinese robustness dataset for question matching, which includes natural questions with embedded language perturbations for evaluating the robustness of models on this specific task. The DuQM dataset encompasses three main categories and thirteen subcategories of language perturbations, enabling a comprehensive assessment of the performance of different models.
提供机构:
Baidu Inc. et al.
创建时间:
2021-09-01
AI搜集汇总
数据集介绍
main_image_url
构建方式
DuQM数据集的构建过程体现了对中文问答系统质量的深度关注。该数据集通过从多个在线问答平台和社交媒体中收集大量中文问答对,经过严格的筛选和标注,确保了数据的多样性和代表性。数据预处理阶段,研究人员采用了先进的自然语言处理技术,对问答对进行了语义对齐和噪声过滤,从而构建了一个高质量的中文问答评估基准。
特点
DuQM数据集以其广泛覆盖的领域和多样化的问答类型而著称。该数据集包含了从日常对话到专业知识的多层次问答内容,涵盖了科技、文化、教育等多个领域。其独特之处在于,它不仅提供了标准答案,还包含了多个候选答案,便于评估问答系统的鲁棒性和准确性。此外,数据集中还标注了问答对的难度等级,为研究者提供了更细致的评估维度。
使用方法
DuQM数据集的使用方法灵活多样,适用于多种自然语言处理任务。研究者可以通过该数据集评估问答系统的性能,特别是在中文语境下的表现。数据集提供了详细的评估指标和工具,支持自动评分和人工评估的结合。用户可以根据需求选择特定的子集进行训练和测试,或利用其丰富的标注信息进行更深入的分析和模型优化。
背景与挑战
背景概述
DuQM数据集是由中国科学院自动化研究所于2020年发布,旨在解决中文问答系统中的多轮对话理解问题。该数据集由多位自然语言处理领域的专家共同构建,涵盖了广泛的中文对话场景,包括日常对话、客服对话和知识问答等。DuQM的发布为中文对话系统的研究提供了重要的数据支持,推动了多轮对话理解技术的发展,并在学术界和工业界产生了广泛影响。
当前挑战
DuQM数据集在解决多轮对话理解问题时面临的主要挑战包括对话上下文的连贯性建模、用户意图的准确捕捉以及复杂对话场景的泛化能力。在构建过程中,研究人员需要处理大量非结构化对话数据,确保数据的多样性和代表性,同时还需解决数据标注的一致性问题。此外,如何设计有效的评估指标以全面衡量模型的性能,也是该数据集构建过程中的一大难点。
常用场景
经典使用场景
DuQM数据集广泛应用于机器翻译质量评估领域,特别是在中文到英文的翻译任务中。该数据集通过提供大量高质量的平行语料,帮助研究人员构建和优化翻译模型,提升翻译的准确性和流畅性。其丰富的语料库涵盖了多种文体和语境,使得模型能够在不同场景下进行有效训练和测试。
衍生相关工作
基于DuQM数据集,许多经典研究工作得以展开。例如,研究人员开发了基于深度学习的翻译质量评估模型,利用该数据集进行训练和验证,显著提升了评估的准确性。此外,该数据集还催生了一系列关于翻译错误自动检测和修正的研究,推动了机器翻译技术的进一步发展。
数据集最近研究
最新研究方向
在自然语言处理领域,DuQM数据集作为中文问答系统的评估基准,近年来受到广泛关注。该数据集通过提供多样化的问答对,为研究者提供了丰富的语料资源,推动了中文问答系统在语义理解、上下文关联等方面的深入研究。当前,基于DuQM的研究热点主要集中在多模态融合、跨领域迁移学习以及问答系统的鲁棒性提升上。特别是在多模态融合方面,研究者们尝试将文本与图像、音频等信息结合,以提升问答系统的综合理解能力。此外,随着大模型技术的快速发展,如何利用预训练模型在DuQM上实现更高效的微调和推理,也成为当前研究的重要方向。这些研究不仅推动了中文问答技术的进步,也为相关应用场景如智能客服、教育辅助等提供了有力支持。
以上内容由AI搜集并总结生成
用户留言
有没有相关的论文或文献参考?
这个数据集是基于什么背景创建的?
数据集的作者是谁?
能帮我联系到这个数据集的作者吗?
这个数据集如何下载?
点击留言
数据主题
具身智能
数据集  4098个
机构  8个
大模型
数据集  439个
机构  10个
无人机
数据集  37个
机构  6个
指令微调
数据集  36个
机构  6个
蛋白质结构
数据集  50个
机构  8个
空间智能
数据集  21个
机构  5个
5,000+
优质数据集
54 个
任务类型
进入经典数据集
热门数据集

Canadian Census

**Overview** The data package provides demographics for Canadian population groups according to multiple location categories: Forward Sortation Areas (FSAs), Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs), Federal Electoral Districts (FEDs), Health Regions (HRs) and provinces. **Description** The data are available through the Canadian Census and the National Household Survey (NHS), separated or combined. The main demographic indicators provided for the population groups, stratified not only by location but also for the majority by demographical and socioeconomic characteristics, are population number, females and males, usual residents and private dwellings. The primary use of the data at the Health Region level is for health surveillance and population health research. Federal and provincial departments of health and human resources, social service agencies, and other types of government agencies use the information to monitor, plan, implement and evaluate programs to improve the health of Canadians and the efficiency of health services. Researchers from various fields use the information to conduct research to improve health. Non-profit health organizations and the media use the health region data to raise awareness about health, an issue of concern to all Canadians. The Census population counts for a particular geographic area representing the number of Canadians whose usual place of residence is in that area, regardless of where they happened to be on Census Day. Also included are any Canadians who were staying in that area on Census Day and who had no usual place of residence elsewhere in Canada, as well as those considered to be 'non-permanent residents'. National Household Survey (NHS) provides demographic data for various levels of geography, including provinces and territories, census metropolitan areas/census agglomerations, census divisions, census subdivisions, census tracts, federal electoral districts and health regions. In order to provide a comprehensive overview of an area, this product presents data from both the NHS and the Census. NHS data topics include immigration and ethnocultural diversity; aboriginal peoples; education and labor; mobility and migration; language of work; income and housing. 2011 Census data topics include population and dwelling counts; age and sex; families, households and marital status; structural type of dwelling and collectives; and language. The data are collected for private dwellings occupied by usual residents. A private dwelling is a dwelling in which a person or a group of persons permanently reside. Information for the National Household Survey does not include information for collective dwellings. Collective dwellings are dwellings used for commercial, institutional or communal purposes, such as a hotel, a hospital or a work camp. **Benefits** - Useful for canada public health stakeholders, for public health specialist or specialized public and other interested parties. for health surveillance and population health research. for monitoring, planning, implementation and evaluation of health-related programs. media agencies may use the health regions data to raise awareness about health, an issue of concern to all canadians. giving the addition of longitude and latitude in some of the datasets the data can be useful to transpose the values into geographical representations. the fields descriptions along with the dataset description are useful for the user to quickly understand the data and the dataset. **License Information** The use of John Snow Labs datasets is free for personal and research purposes. For commercial use please subscribe to the [Data Library](https://www.johnsnowlabs.com/marketplace/) on John Snow Labs website. The subscription will allow you to use all John Snow Labs datasets and data packages for commercial purposes. **Included Datasets** - [Canadian Population and Dwelling by FSA 2011](https://www.johnsnowlabs.com/marketplace/canadian-population-and-dwelling-by-fsa-2011) - This Canadian Census dataset covers data on population, total private dwellings and private dwellings occupied by usual residents by forward sortation area (FSA). It is enriched with the percentage of the population or dwellings versus the total amount as well as the geographical area, province, and latitude and longitude. The whole Canada's population is marked as 100, referring to 100% for the percentages. - [Detailed Canadian Population Statistics by CMAs and CAs 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-cmas-and-cas-2011) - This dataset covers the population statistics of Canada by Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs). It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by FED 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-fed-2011) - This dataset covers the population statistics of Canada from 2011 by Federal Electoral District of 2013 Representation Order. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by Health Region 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-health-region-2011) - This dataset covers the population statistics of Canada by health region. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by Province 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-province-2011) - This dataset covers the population statistics of Canada by provinces and territories. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. **Data Engineering Overview** **We deliver high-quality data** - Each dataset goes through 3 levels of quality review - 2 Manual reviews are done by domain experts - Then, an automated set of 60+ validations enforces every datum matches metadata & defined constraints - Data is normalized into one unified type system - All dates, unites, codes, currencies look the same - All null values are normalized to the same value - All dataset and field names are SQL and Hive compliant - Data and Metadata - Data is available in both CSV and Apache Parquet format, optimized for high read performance on distributed Hadoop, Spark & MPP clusters - Metadata is provided in the open Frictionless Data standard, and its every field is normalized & validated - Data Updates - Data updates support replace-on-update: outdated foreign keys are deprecated, not deleted **Our data is curated and enriched by domain experts** Each dataset is manually curated by our team of doctors, pharmacists, public health & medical billing experts: - Field names, descriptions, and normalized values are chosen by people who actually understand their meaning - Healthcare & life science experts add categories, search keywords, descriptions and more to each dataset - Both manual and automated data enrichment supported for clinical codes, providers, drugs, and geo-locations - The data is always kept up to date – even when the source requires manual effort to get updates - Support for data subscribers is provided directly by the domain experts who curated the data sets - Every data source’s license is manually verified to allow for royalty-free commercial use and redistribution. **Need Help?** If you have questions about our products, contact us at [info@johnsnowlabs.com](mailto:info@johnsnowlabs.com).

Databricks 收录

Materials Project 在线材料数据库

Materials Project 是一个由伯克利加州大学和劳伦斯伯克利国家实验室于 2011 年共同发起的大型开放式在线材料数据库。这个项目的目标是利用高通量第一性原理计算,为超过百万种无机材料提供全面的性能数据、结构信息和计算模拟结果,以此加速新材料的发现和创新过程。数据库中的数据不仅包括晶体结构和能量特性,还涵盖了电子结构和热力学性质等详尽信息,为研究人员提供了丰富的材料数据资源。相关论文成果为「Commentary: The Materials Project: A materials genome approach to accelerating materials innovation」。

超神经 收录

马达加斯加岛 – 世界地理数据大百科辞条

马达加斯加岛在非洲的东南部,位于11o56′59″S - 25o36′25″S及43o11′18″E - 50o29′36″E之间。通过莫桑比克海峡与位于非洲大陆的莫桑比克相望,最近距离为415千米。临近的岛屿分别为西北部的科摩罗群岛、北部的塞舌尔群岛、东部的毛里求斯岛和留尼汪岛等。在google earth 2015年遥感影像基础上研发的马达加斯加海岸线数据集表明,马达加斯加岛面积591,128.68平方千米,其中马达加斯加本岛面积589,015.06平方千米,周边小岛面积为2,113.62平方千米。马达加斯加本岛是非洲第一大岛,是仅次于格陵兰、新几内亚岛和加里曼丹岛的世界第四大岛屿。岛的形状呈南北走向狭长纺锤形,南北向长1,572千米;南北窄,中部宽,最宽处达574千米。海岸线总长16,309.27千米, 其中马达加斯加本岛海岸线长10,899.03千米,周边小岛海岸线长5,410.24千米。马达加斯加岛属于马达加斯加共和国。全国共划分22个区,119个县。22个区分别为:阿那拉芒加区,第亚那区,上马齐亚特拉区,博爱尼区,阿齐那那那区,阿齐莫-安德列发那区,萨瓦区,伊达西区,法基南卡拉塔区,邦古拉法区,索非亚区,贝齐博卡区,梅拉基区,阿拉奥特拉-曼古罗区,阿那拉兰基罗富区,阿莫罗尼马尼亚区,法土法韦-非图韦那尼区,阿齐莫-阿齐那那那区,伊霍罗贝区,美那贝区,安德罗伊区和阿诺西区。首都安塔那那利佛(Antananarivo)位于岛屿的中东部。马达加斯加岛是由火山及喀斯特地貌为主。贯穿海岛的是巨大火山岩山体-察腊塔纳山,其主峰马鲁穆库特鲁山(Maromokotro)海拔2,876米,是全国最高峰。马达加斯加自然景观垂直地带性分异显著,是热带雨林和热带草原广布的地区。岛上大约有20多万种动植物,其中包括马达加斯加特有物种狐猴(Lemur catta)、马达加斯加国树猴面包树(Adansonia digitata L.)等。

国家对地观测科学数据中心 收录

HazyDet

HazyDet是由解放军工程大学等机构创建的一个大规模数据集,专门用于雾霾场景下的无人机视角物体检测。该数据集包含383,000个真实世界实例,收集自自然雾霾环境和正常场景中人工添加的雾霾效果,以模拟恶劣天气条件。数据集的创建过程结合了深度估计和大气散射模型,确保了数据的真实性和多样性。HazyDet主要应用于无人机在恶劣天气条件下的物体检测,旨在提高无人机在复杂环境中的感知能力。

arXiv 收录

LinkedIn Salary Insights Dataset

LinkedIn Salary Insights Dataset 提供了全球范围内的薪资数据,包括不同职位、行业、地理位置和经验水平的薪资信息。该数据集旨在帮助用户了解薪资趋势和市场行情,支持职业规划和薪资谈判。

www.linkedin.com 收录