
MATINF (Maternal and Infant Dataset)

OpenDataLab, updated 2026-05-24, indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/MATINF
Resource description:
The Maternal and Infant (MATINF) Dataset is a large-scale, jointly annotated dataset for classification, question answering, and summarization in the Chinese maternal and infant care domain. Each entry has four fields: Question (Q), Description (D), Category (C), and Answer (A). Nearly 2 million question-answer pairs, each carrying a fine-grained, manually labeled category, were collected from large Chinese maternal and infant care QA websites. The authors then cleaned the data both automatically and manually, removing: (1) categories with too few samples; (2) entries whose Description field is shorter than the Question field; (3) entries in which any field exceeds 256 characters; (4) entries with formatting errors found by manual inspection. The 1.07 million entries that remained after cleaning make up MATINF.
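
The automatic part of the cleaning described above can be illustrated with a minimal Python sketch. It is not the authors' code: the field names (`question`, `description`, `category`, `answer`), the `min_class_size` threshold, and the order in which the rules are applied are all assumptions made for illustration, since the description states none of them. Rule (4) relies on manual inspection and is left out.

```python
from collections import Counter

MAX_FIELD_LEN = 256  # rule (3): limit stated in the dataset description

# Hypothetical field names; the description only names the fields Q, D, C and A.
FIELDS = ("question", "description", "category", "answer")


def clean_entries(entries, min_class_size=2):
    """Apply cleaning rules (1)-(3) to a list of dict entries.

    min_class_size is an assumed threshold; the source does not say how many
    samples a category needs in order to be kept.
    """
    kept = [
        e for e in entries
        # rule (2): drop entries whose description is shorter than the question
        if len(e["description"]) >= len(e["question"])
        # rule (3): drop entries in which any field exceeds 256 characters
        and all(len(e[f]) <= MAX_FIELD_LEN for f in FIELDS)
    ]

    # rule (1): drop categories with too few remaining samples
    counts = Counter(e["category"] for e in kept)
    return [e for e in kept if counts[e["category"]] >= min_class_size]
```

Here `len` counts Unicode code points, which matches a character-based limit for Chinese text. Counting categories after the per-entry filters is a design choice of this sketch; counting them first would also be consistent with the description.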
Provider:
OpenDataLab
Created:
2022-08-16