Mxode/University-News-Instruction-Zh

Hugging Face · Updated 2023-10-04 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Mxode/University-News-Instruction-Zh
Resource description:
---
license: apache-2.0
task_categories:
- zero-shot-classification
- summarization
- text-generation
tags:
- news
- campus
size_categories:
- 100K<n<1M
language:
- zh
---

University campus news, about 65k × 3 (task types) records. **The data has been lightly de-identified, with author names and similar details masked as far as possible.** Everything is already arranged in instruction form, with the following format:

```json
{
    "id": <id>,
    "category": "(title_summarize|news_classify|news_generate)",
    "instruction": <the specific instruction>,
    "input": <empty>,
    "output": <the output for the instruction>
}
```

There are three task types: **title summarization**, **column classification**, and **news generation**. The samples are generated by recombining the title, column, and content found in the news metadata, so every target is taken directly from the source and is accurate by construction. Each field has been flattened to a **single line**. Below is one sample for each task type:

```json
// Title summarization
{
    "id": 22106,
    "category": "title_summarize",
    "instruction": "Please write a headline for the following news article:\nClick the image to watch the video Selecting students with potential for scientific and technological innovation is a new approach to independent admission at Haicheng Maritime University. Under the rules, any graduating high school student who won first prize or higher in a provincial-level science and technology innovation competition during high school may apply. This year, more than 300 students from Jiangsu, Zhejiang, and Shanghai applied to enter Haicheng Maritime University under this category. After expert screening, 82 students sat the exam this morning. The morning interview does not test Chinese, math, or English; candidates must demonstrate their innovation potential through their own scientific and technological achievements. Reporters found that the students' projects were highly varied, covering frontier disciplines such as biology and microelectronics. After the morning screening, only half of the students moved on to the special test in the afternoon, which also does not test Chinese, math, or English and instead examines divergent thinking, interests, and potential. Based on the test results, the university will finalize the list of students recognized under the independent admission scheme. Students on the list will be admitted as long as their college entrance exam scores reach the local first-tier admission threshold.",
    "input": "",
    "output": "A suitable headline for this news article is: \"No Chinese, Math, or English Exams: Haicheng Maritime University Independently Selects Students with Innovation Potential\""
}
```

```json
// Column classification
{
    "id": 134781,
    "category": "news_classify",
    "instruction": "Summarize the category of the following news article:\n## Vice Dean of the University of Glasgow's School of Medicine Visits the School of Life Sciences at Haicheng Maritime University\n\nOn November 20, a delegation led by Chris Loughrey, Vice Dean of the School of Medicine at the University of Glasgow, visited the School of Life Science and Technology at Haicheng Maritime University. Zhang Xiaojun, Vice Dean of the School of Life Sciences, attended the meeting, and several faculty members including Wei Fang, Wang Xiaolei, and Xu Qin took part in the exchange. The University of Glasgow and Haicheng Maritime University are both members of Universitas 21, an international university exchange organization, and cooperate closely in life sciences and medicine. The two sides introduced their disciplines, curricula, and student training in the life sciences. During the exchange, both sides expressed the hope of gradually deepening mutual understanding under this broad cooperation framework and of building more substantive cooperation in talent training, scientific research, and other areas. The University of Glasgow is one of the four oldest universities in the Commonwealth. Its research funding and teaching quality both rank among the top ten in the UK, and it has produced 7 Nobel laureates and 3 British Prime Ministers. The university has four colleges; the School of Medicine has more than 2,000 faculty and staff and more than 5,000 students, and its seven research centers cover molecular/cellular/systems biology, plants, microbiology, immunology, cancer, biodiversity, neuroscience, cardiovascular medicine, and other fields.",
    "input": "",
    "output": "The category of this news article is: Comprehensive News"
}
```

```json
// News generation
{
    "id": 61402,
    "category": "news_generate",
    "instruction": "Help me write a news article for the Comprehensive News column with the headline \"School of Mechanical and Power Engineering Holds 'Walk with Pressure' as Part of Its Mental Health Month Series\"",
    "input": "",
    "output": "\"School of Mechanical and Power Engineering Holds 'Walk with Pressure' as Part of Its Mental Health Month Series\":\n\nOn May 27, the \"Walk with Pressure\" lecture, part of the Mental Health Month series of the School of Mechanical and Power Engineering, was held at Gaotian Auditorium. Li Yi, a part-time psychological counselor at Haicheng Railway College, was invited as the keynote speaker to share how to understand stress correctly and cope with it actively. The event was hosted by Wang Ruirui of the School of Mechanical and Power Engineering. Compared with undergraduates, who receive relatively comprehensive psychological education, graduate students face more practical problems and have a greater need for mental health knowledge in order to manage stress and emotions well. Li Yi first drew on psychology to help students understand the nature of stress from physiological, psychological, and other perspectives. He pointed out that not all stress is harmful: negative chronic stress does harm, so sensibly avoiding harmful stress is especially important, and the key is to find one's own optimal stress level. Li Yi then explained the relationship between psychological stress and cognition, emotion, and behavior. Through an explanation of the \"body map of emotions\", students gained a deeper appreciation of the saying \"emotions enrich experience, emotions express needs, emotions trigger behavior, and emotions connect us with others\". Finally, Li Yi offered advice on coping with stress in terms of time management, simplifying life, recognizing needs, living in the moment, and being kind to others, and encouraged everyone to \"live with an imperfect self and walk with pressure\". The school's Mental Health Month series has run for 4 years and reached more than 2,700 graduate students, aiming to help students and supervisors learn more about psychology through varied activities and to guide everyone to care for themselves and for others."
}
```

The quantiles of the data (length of `instruction` + `output`) are given here as well, so that samples can be trimmed as needed:

```
count    196101.000000
mean       1059.199912
std         571.623822
min          32.000000
0%           32.000000
5%          369.000000
10%         471.000000
15%         539.000000
20%         596.000000
25%         649.000000
30%         700.000000
35%         753.000000
40%         806.000000
45%         862.000000
50%         920.000000
55%         983.000000
60%        1053.000000
65%        1132.000000
70%        1225.000000
75%        1337.000000
80%        1473.000000
85%        1652.000000
90%        1898.000000
95%        2270.000000
max        3054.000000
```
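A table like the one above can be approximated with a few lines of code. The following is a minimal sketch using only the Python standard library, not the author's actual script. It assumes that "length" means the character count of `instruction` plus `output`, which the source does not state explicitly. The helper names `combined_length` and `length_quantiles` and the toy records are hypothetical.

```python
import statistics


def combined_length(record: dict) -> int:
    """Character count of `instruction` + `output` (assumed definition of length)."""
    return len(record["instruction"]) + len(record["output"])


def length_quantiles(records: list[dict]) -> dict[str, float]:
    """Return min, the 5%..95% quantiles in steps of 5, and max."""
    lengths = [combined_length(r) for r in records]
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%. The "inclusive" method
    # treats the data as the whole population and interpolates linearly.
    cuts = statistics.quantiles(lengths, n=20, method="inclusive")
    table = {"min": float(min(lengths))}
    table.update({f"{5 * (i + 1)}%": q for i, q in enumerate(cuts)})
    table["max"] = float(max(lengths))
    return table


# Toy records standing in for the real dataset (hypothetical data):
# combined lengths are 1, 2, ..., 101.
toy = [{"instruction": "x" * n, "output": "y"} for n in range(0, 101)]
stats = length_quantiles(toy)
print(stats["50%"], stats["95%"])
```

Choosing a cut-off from such a table (for example the 95% row) gives a simple, data-driven length budget for training.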
Provider:
Mxode
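Original information summary

Dataset overview

Basic information

  • License: Apache 2.0
  • Task categories
    • Zero-shot classification
    • Summarization
    • Text generation
  • Tags: news, campus
  • Size: 100K<n<1M
  • Language: Chinese

Data content

  • Volume: about 65k × 3 (task types) records

  • Data format: JSON objects with the fields `id`, `category` (one of `title_summarize`, `news_classify`, `news_generate`), `instruction` (the specific instruction), `input` (empty), and `output` (the output for the instruction)

  • Task types

    • Title summarization
    • Column classification
    • News generation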
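The record format listed above is simple enough to check mechanically before training. The sketch below is one possible validator, written as an illustration only: `validate_record`, `ALLOWED_CATEGORIES`, and `REQUIRED_FIELDS` are hypothetical names, and the rule that `input` must be an empty string is inferred from the samples rather than documented as a guarantee.

```python
ALLOWED_CATEGORIES = {"title_summarize", "news_classify", "news_generate"}
REQUIRED_FIELDS = ("id", "category", "instruction", "input", "output")


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems  # Further checks need all fields to be present.
    if record["category"] not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {record['category']}")
    if record["input"] != "":
        problems.append("input is expected to be empty (assumption from the samples)")
    for name in ("instruction", "output"):
        if not isinstance(record[name], str) or not record[name]:
            problems.append(f"{name} must be a non-empty string")
    return problems


good = {"id": 1, "category": "news_classify", "instruction": "a", "input": "", "output": "b"}
bad = {"id": 2, "category": "translate", "instruction": "a", "input": "", "output": ""}
print(validate_record(good))
print(validate_record(bad))
```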

Collected summary
Dataset introduction
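Construction
The Mxode/University-News-Instruction-Zh dataset targets text mining on university news. It takes the metadata (title, column, and body) of about 65,000 campus news articles and de-identifies it to mask author names and other sensitive information. These elements are then recombined into three kinds of instruction-style samples: title summarization, column classification, and news generation. Because every target is drawn from the metadata itself, the samples are accurate by construction. All fields are flattened to single-line text, which makes them easy for a model pipeline to parse.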
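The recombination step described above can be sketched as follows. This is an illustration under stated assumptions, not the author's generation script: `build_samples` and its `start_id` parameter are hypothetical, the English prompt templates are modeled on the three sample records in the dataset card (the real dataset uses Chinese prompts and may use more than one wording per task), and the `id` assignment is invented.

```python
def build_samples(news: dict, start_id: int = 0) -> list[dict]:
    """Recombine one news record (title, column, content) into three instruction samples."""
    title, column, content = news["title"], news["column"], news["content"]
    # (category, instruction, output) triples; templates are illustrative only.
    triples = [
        (
            "title_summarize",
            f"Please write a headline for the following news article:\n{content}",
            f'A suitable headline for this news article is: "{title}"',
        ),
        (
            "news_classify",
            f"Summarize the category of the following news article:\n## {title}\n\n{content}",
            f"The category of this news article is: {column}",
        ),
        (
            "news_generate",
            f'Help me write a news article for the {column} column with the headline "{title}"',
            f'"{title}":\n\n{content}',
        ),
    ]
    return [
        {"id": start_id + i, "category": cat, "instruction": ins, "input": "", "output": out}
        for i, (cat, ins, out) in enumerate(triples)
    ]


# Hypothetical news record used only for demonstration.
news = {
    "title": "Robotics Club Wins Regional Prize",
    "column": "Comprehensive News",
    "content": "The campus robotics club took first place at the regional contest.",
}
samples = build_samples(news)
for sample in samples:
    print(sample["category"], "->", sample["output"][:40])
```

Since each output is copied from the same record that produced the instruction, no manual labeling is involved, which is the basis of the accuracy claim.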
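Features
The dataset covers three typical tasks on campus news (summarization, classification, and generation), giving a model several distinct training scenarios over the same articles. The data is de-identified, so the substance of each article is kept while personal details are protected. Every sample is a structured instruction whose output corresponds exactly to it, which keeps the task definitions consistent. The dataset also publishes detailed length statistics, including the mean, standard deviation, and percentiles, so users can trim samples to fit their compute budget.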
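Usage
Users can load the JSON data directly; each record contains the task category, the instruction, and the expected output. For training, records can be filtered by category (`title_summarize`, `news_classify`, `news_generate`) to build a task-specific supervised or instruction-tuning pipeline. Because the full length distribution is published, users can compare their model's input limit with the quantiles (for example, the 95% quantile of the combined `instruction` + `output` length is 2270, presumably counted in characters) and trim over-long samples to reduce training time and memory use. This ready-to-use structure keeps preprocessing effort low.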
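The filter-then-trim workflow described above might look like the sketch below. Several things here are assumptions: the source does not say whether the files are JSON Lines or a single JSON array (one object per line is assumed), "trimming" is interpreted as dropping over-long samples rather than truncating them, and the names `parse_jsonl`, `select`, and `MAX_COMBINED_LEN` are hypothetical.

```python
import json
from typing import Iterable, Iterator

# 95% quantile of the combined length, taken from the published statistics.
MAX_COMBINED_LEN = 2270


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line (assumed JSON Lines layout)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def select(records: Iterable[dict], category: str, max_len: int = MAX_COMBINED_LEN) -> list[dict]:
    """Keep records of one task category whose combined length fits the budget."""
    return [
        r for r in records
        if r["category"] == category
        and len(r["instruction"]) + len(r["output"]) <= max_len
    ]


# In-memory stand-in for a data file (hypothetical records).
raw_lines = [
    json.dumps({"id": 1, "category": "news_classify", "instruction": "a" * 100, "input": "", "output": "b"}),
    json.dumps({"id": 2, "category": "news_classify", "instruction": "a" * 3000, "input": "", "output": "b"}),
    json.dumps({"id": 3, "category": "news_generate", "instruction": "a" * 50, "input": "", "output": "b"}),
    "",
]
records = list(parse_jsonl(raw_lines))
kept = select(records, "news_classify")
print([r["id"] for r in kept])
```

With the Hugging Face `datasets` library installed, `load_dataset("Mxode/University-News-Instruction-Zh")` is the usual way to fetch a dataset from the Hub, although the split names and file layout are not described in the source.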
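Background and challenges
Background overview
In natural language processing, university campus news is a useful corpus for studying instruction tuning and multi-task learning because of its distinctive style and regular structure. The Mxode/University-News-Instruction-Zh dataset, published on Hugging Face by Mxode and last updated on 2023-10-04, focuses on automated processing of Chinese campus news. Through three core tasks (title summarization, column classification, and news generation), it can be used to explore how well language models adapt to a specific vertical domain. The dataset is based on about 65,000 de-identified news articles whose metadata is recombined into exact instruction-output pairs, providing a resource for research on news understanding and generation in educational settings.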
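Current challenges
The dataset addresses several tasks in automated campus news processing: condensing a long article into a headline, assigning an article to the correct column, and generating a coherent article from a structured instruction. The main construction challenge is de-identification, which must hide authors and other sensitive information while keeping the text semantically intact. Generating instructions by recombining metadata also requires balancing variety in task phrasing against data accuracy, since highly templated samples can lead a model to overfit. In addition, text lengths vary widely, so a model must handle inputs and outputs of variable length robustly.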
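Common scenarios
Typical use cases
The dataset provides corpus material for instruction tuning. Its three tasks (title summarization, column classification, and news generation) yield structured training samples that support evaluating and improving models on zero-shot classification, summarization, and text generation. The typical use is multi-task learning: researchers can train a general language model on the dataset to improve its understanding and generation of Chinese news text, especially for vertical applications in education.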
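Academic problems addressed
The dataset helps fill the shortage of high-quality instruction data for Chinese educational news and can serve as a standardized evaluation benchmark. It offers a way to quantify instruction-following ability and supports research on generalization in zero-shot settings. Because the targets come from exact news metadata, it supports work on summarization, classification, and generation algorithms and gives an empirical basis for probing the limits of models in domain-specific language understanding.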
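Derived related work
According to this summary, work derived from the dataset mainly concerns instruction tuning of Chinese pre-trained models, for example multi-task learning experiments on architectures such as T5 and BART. Related studies reportedly explore adaptive training strategies for the news domain to improve generalization with limited data. The dataset is also said to have motivated evaluation methods for educational text generation, such as metrics that combine factual consistency with style matching, which are relevant to trustworthy generation in vertical-domain language models.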
The content above was collected and summarized by 遇见数据集.