liwu/MNBVC
Dataset Overview
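Basic Information
- Name: MNBVC
- Language: Chinese
- License: MIT
- Multilinguality: monolingual
- Data source: original data
- Task categories:
  - Text generation
  - Fill-mask
- Task IDs:
  - Language modeling
  - Masked language modeling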
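Dataset Introduction
The MNBVC dataset was created by the Liwu (里屋) community on the Chinese internet and aims to provide the largest Chinese internet corpus. The dataset is updated continuously, and additional uncleaned data is available through GitHub.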
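Data Subsets
The MNBVC dataset contains the following subsets:
- law_judgement: legal document texts
- gov_xuexiqiangguo: Xuexi Qiangguo (学习强国) texts
- gov_report: government work report texts
- co_ann_report: corporate annual report texts
- code_metadata: code metadata
- qa_zhihu: Zhihu Q&A data
- qa_wikihow: Q&A data from wikiHow
- qa_mfa: Ministry of Foreign Affairs Q&A data
- news_peoples_daily: People's Daily text data
- wikipedia: Wikipedia text data
- qa_stackexchange: StackExchange Q&A data
- qa_chatgpt: Q&A corpus constructed with ChatGPT
- math_qa: Q&A data in the mathematics domain
- math_chat: dialogue data in the mathematics domain
- crawler_oscar: general text data cleaned from CommonCrawl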
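As a rough illustration of how one of the subsets listed above might be selected, here is a minimal Python sketch. The helper names `check_subset` and `stream_subset` are hypothetical, and the `"train"` split name is an assumption not stated in this card. The call to `load_dataset` from the Hugging Face `datasets` library uses its standard `path`, `name`, `split`, and `streaming` parameters. Whether it works as written depends on the installed library version and on how the `liwu/MNBVC` repository is packaged.

```python
# Subset names copied from the list above.
MNBVC_SUBSETS = (
    "law_judgement", "gov_xuexiqiangguo", "gov_report", "co_ann_report",
    "code_metadata", "qa_zhihu", "qa_wikihow", "qa_mfa",
    "news_peoples_daily", "wikipedia", "qa_stackexchange", "qa_chatgpt",
    "math_qa", "math_chat", "crawler_oscar",
)


def check_subset(name):
    """Return `name` unchanged if it is a listed subset, else raise ValueError."""
    if name not in MNBVC_SUBSETS:
        raise ValueError(
            f"unknown MNBVC subset {name!r}; expected one of {MNBVC_SUBSETS}"
        )
    return name


def stream_subset(name, split="train"):
    """Hypothetical helper: stream one subset without downloading it in full.

    Assumptions: the `datasets` library is installed and a "train" split
    exists for the requested subset.
    """
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset

    return load_dataset(
        "liwu/MNBVC", check_subset(name), split=split, streaming=True
    )
```

Streaming is used in the sketch because the corpus is described as very large; iterating over the returned object yields records one at a time.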
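Data Formats
The MNBVC dataset contains the following categories of data formats:
- General text
- Q&A corpus
- Code corpus
- Multi-turn dialogue
- Forum corpus
- Parallel corpus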
The early data format is shown below. It will be deprecated in the future and the data re-uploaded:
{
    "text": datasets.Value("string"),
    "meta": datasets.Value("string")
}
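The early format above declares two string-valued fields, `text` and `meta`. A minimal sketch of checking whether a record matches that shape, using only the Python standard library, could look like the following. The function name `is_early_format` and the sample record are hypothetical, and encoding `meta` as a JSON string is an assumption made for illustration only; the card does not say what the `meta` string contains.

```python
import json

# Field names taken from the early data format shown above.
EARLY_FORMAT_FIELDS = ("text", "meta")


def is_early_format(record):
    """Return True if `record` has exactly the string fields 'text' and 'meta'."""
    return (
        isinstance(record, dict)
        and set(record) == set(EARLY_FORMAT_FIELDS)
        and all(isinstance(record[key], str) for key in EARLY_FORMAT_FIELDS)
    )


# Hypothetical sample record; the JSON-encoded `meta` value is an assumption.
sample = {
    "text": "示例文本",
    "meta": json.dumps({"source": "hypothetical"}),
}
```

A check like this can help separate records in the early format from records in the newer formats once the data is re-uploaded.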
