CMRC 2018 | Machine Reading Comprehension Dataset | Chinese Text Processing Dataset

arXiv | Updated 2019-08-29 | Indexed 2024-06-21
Machine reading comprehension
Chinese text processing
Download link:
https://github.com/ymcui/cmrc2018
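Resource overview:
CMRC 2018 is a Chinese machine reading comprehension dataset created jointly by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology and iFLYTEK's State Key Laboratory of Cognitive Intelligence. It contains nearly 20,000 questions annotated by human experts. The passages are Wikipedia paragraphs, and the varied question-answer pairs are intended to improve machine understanding of Chinese text. During construction, preprocessing consisted of downloading the Chinese portion of Wikipedia and converting it to Simplified Chinese so that the text is standardized. Beyond evaluating machine reading comprehension systems, the dataset also supports cross-lingual research, particularly on complex questions that require reasoning over multiple clues.
Provider:
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
Created:
2018-10-17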
AI-Compiled Summary
Dataset Introduction
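Construction
CMRC 2018 was built for the Chinese reading comprehension task by extracting question-answer pairs from a large body of Chinese text. Construction involved several steps, including text preprocessing, question generation, and answer annotation, so that every question-answer pair has a clear context and an accurate answer. The data also went through multiple rounds of manual verification to ensure accuracy and reliability.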
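As a rough illustration of what "a clear context and an accurate answer" can mean in practice, the sketch below checks that each annotated answer really occurs at its stated offset in the passage. It assumes a SQuAD-style record layout (`context`, `qas`, `answers`, `text`, `answer_start`); those field names, the helper `misaligned_answers`, and the sample record are assumptions made for illustration and are not taken from the dataset files.

```python
# Hypothetical record in a SQuAD-style layout. The field names and the
# sample text are illustrative assumptions, not copied from the dataset.
sample = {
    "context": "哈尔滨是黑龙江省的省会。",
    "qas": [
        {
            "id": "EXAMPLE_0",
            "question": "黑龙江省的省会是哪里?",
            "answers": [{"text": "哈尔滨", "answer_start": 0}],
        }
    ],
}


def misaligned_answers(paragraph):
    """Return (question id, answer text) pairs whose answer text is not
    found at the stated character offset in the context."""
    context = paragraph["context"]
    problems = []
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            text = answer["text"]
            if context[start:start + len(text)] != text:
                problems.append((qa["id"], text))
    return problems


print(misaligned_answers(sample))  # an empty list means every span lines up
```

A check like this is cheap to run over a whole file before training, and it catches offset errors that would otherwise surface only as poor model scores.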
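Features
CMRC 2018 is characterized by varied content and varied question types. As the resource overview notes, its passages are Wikipedia paragraphs, and they span many topics. Questions range from factual to inferential, which allows a broad assessment of a model's reading comprehension ability. With nearly 20,000 questions, the dataset is moderate in size: small enough to train on efficiently, yet still challenging.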
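Usage
CMRC 2018 is used mainly to train and evaluate Chinese reading comprehension models. Users load the dataset and work with training, validation, and test splits for model training and tuning. During training, the model learns the relationship between passages and questions and gradually improves its reading comprehension. During evaluation, users compare the model's predicted answers against the reference answers in the dataset to measure performance.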
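The comparison between predicted and reference answers can be sketched as follows. This is a minimal sketch, not the official evaluation script: it assumes exact match and a character-overlap F1 as the two scores, and the function names and the punctuation list are choices made here for illustration.

```python
from collections import Counter
import string

# Punctuation to ignore when comparing answers (ASCII plus a few common
# full-width marks). This list is an assumption, not the official one.
PUNCTUATION = set(string.punctuation) | set(",。!?、;:“”‘’()《》")


def normalize(text):
    """Lowercase the text and drop whitespace and punctuation."""
    return "".join(
        ch for ch in text.lower()
        if not ch.isspace() and ch not in PUNCTUATION
    )


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def char_f1(prediction, references):
    """Best character-overlap F1 against any of the reference answers."""
    pred = normalize(prediction)
    best = 0.0
    for ref in references:
        gold = normalize(ref)
        # Multiset intersection counts characters shared by both strings.
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred)
        recall = overlap / len(gold)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


print(exact_match("哈尔滨。", ["哈尔滨"]))
print(char_f1("哈尔滨市", ["哈尔滨"]))
```

Scoring at the character level sidesteps Chinese word segmentation, which is why it is used in this sketch. Averaging both scores over all questions gives dataset-level figures.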
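Background and Challenges
Background
CMRC 2018 (Chinese Machine Reading Comprehension) is a dataset dedicated to the Chinese machine reading comprehension task. It was created by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology together with iFLYTEK's State Key Laboratory of Cognitive Intelligence. Released in 2018, it aims to advance Chinese natural language processing, and machine reading comprehension in particular. The dataset is built on large-scale Chinese text and covers many types of question-answer pairs, with the goal of simulating how humans behave in reading comprehension. Its release gave researchers a standardized test platform and encouraged innovation and optimization in related algorithms and models.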
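Current Challenges
Building CMRC 2018 involved several challenges. First, the complexity of Chinese, including polysemy, varied grammatical structures, and differences in cultural background, makes annotation harder. Second, the dataset needs to cover a wide range of topics and domains so that models generalize, which places demands on its diversity and representativeness. Third, the machine reading comprehension task itself requires a model to understand context and produce accurate answers, which sets a high bar for deep understanding and reasoning. Finally, scale and quality pull against each other: keeping the data volume up while keeping quality high was a key problem during construction.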
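Development History
Creation and Updates
CMRC 2018 was created in 2018 by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology together with iFLYTEK's State Key Laboratory of Cognitive Intelligence, with the aim of advancing research on Chinese machine reading comprehension. Its latest version was released in 2018, and no official update has followed.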
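Key Milestones
The release of CMRC 2018 was an important milestone for Chinese natural language processing. It introduced a large-scale Chinese reading comprehension task that tests a model's ability to understand and reason over Chinese text. The release accelerated progress in Chinese machine reading comprehension and gave later research a valuable benchmark. The CMRC 2018 competition also drew many research teams, which further drove technical progress and innovation in the field.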
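Current Status
Since its release, CMRC 2018 has become a cornerstone of research on Chinese machine reading comprehension. It provides researchers with training and test data, and its competitions and evaluation campaigns have prompted a large body of new research. Its use has now extended to related areas such as intelligent question answering systems, text summarization, and information retrieval. Although newer datasets have appeared in recent years, CMRC 2018 remains important because of its pioneering role and wide adoption.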
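Timeline
  • 2018: CMRC 2018 is first released, with the aim of evaluating performance on Chinese machine reading comprehension.
  • 2019: CMRC 2018 is widely discussed and used at academic conferences and becomes an important benchmark for Chinese natural language processing.
  • 2020: Research based on CMRC 2018 increases markedly, advancing Chinese reading comprehension technology.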
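Common Scenarios
Classic Use Cases
In natural language processing, CMRC 2018 is known for its Chinese reading comprehension tasks. It is used mainly to evaluate how well models extract and understand information in Chinese text. The classic use case is training and testing machine reading comprehension models that extract key information from a given Chinese passage and answer questions about it. In this way, researchers can assess and improve a model's depth and accuracy of understanding in Chinese.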
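One common way for a model to extract an answer from a passage is to score every position as a possible answer start and end, then pick the best-scoring span. The sketch below shows only that final selection step. It assumes the per-position scores already exist, treats each character as one position for simplicity, and uses made-up scores; the function name `best_span` and the parameter `max_answer_len` are hypothetical.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j]
    subject to i <= j < i + max_answer_len. Indices are inclusive."""
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    best = None
    for i, start_score in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


# Made-up passage and scores, one score per character (an assumption).
context = "哈尔滨是黑龙江省的省会。"
start_scores = [2.0, 0.1, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
end_scores = [0.0, 0.2, 1.5, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0]

start, end, score = best_span(start_scores, end_scores)
print(context[start:end + 1])
```

The length cap keeps the search from returning implausibly long answers, and requiring the start to come before the end rules out invalid spans.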
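Academic Problems Addressed
CMRC 2018 addresses a long-standing need in Chinese natural language processing: a standardized test platform for evaluating and comparing how different models perform on Chinese text understanding. This has advanced Chinese machine reading comprehension and also offers a useful reference for cross-lingual reading comprehension research, raising both the research quality and the practical value of Chinese natural language processing.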
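Derived Work
Researchers have built a range of follow-up work on CMRC 2018. For example, some studies propose improved reading comprehension models based on the dataset that perform better on complex Chinese contexts. Others use the dataset to train and evaluate cross-lingual reading comprehension models, exploring what different languages share and where they differ. This work has enriched the theory of Chinese natural language processing and provided further technical support for applications.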
The content above was collected and summarized by AI.