TheBlueScrubs-v1-fixed

Name: TheBlueScrubs-v1-fixed
Creator: maas
Published: 2025-11-27 16:46:55
License: No description provided

ModelScope community; updated 2025-11-27; indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/openmed-community/TheBlueScrubs-v1-fixed

Description:

# openmed-community/TheBlueScrubs-v1-fixed

## What is this?

**TheBlueScrubs-v1-fixed** is a maintenance fork of the upstream [TheBlueScrubs/TheBlueScrubs-v1](https://huggingface.co/datasets/TheBlueScrubs/TheBlueScrubs-v1) *train split* that resolves a schema bug in the `meta` column. In the original train files, some rows serialized `meta` incorrectly (appearing as the literal string `"dict"`). This fork **re-exports the entire train split without the `meta` column**, preserving the `text` field and its values.

- **Document count:** 11,080,331 texts (train)
- **Tokens (upstream estimate across all splits):** ~20B tokens
- **Sources:** Curated from SlimPajama/RedPajama (Common Crawl, C4, GitHub, Books, arXiv, Wikipedia, StackExchange)
- **Quality signals:** per-text medical probability (0.8–1.0) + three 1–5 LLM-based scores (relevance, precision/factual detail, safety/ethics); oncology label covering ~11B tokens across the full corpus.

> Upstream details: The Blue Scrubs is a large, curated medical corpus designed for clinical LLMs, filtered via a logistic-regression screen and then Llama-3.1-70B evaluation; clinician and external checks reported high concordance. An oncology classifier adds cancer labels at scale.

---

## Why this fork?

- **Fix:** Removes the `meta` column, unblocking usage with `datasets` streaming and dataframe backends.
- **Scope:** Content is otherwise **unchanged** relative to the upstream train split (same rows, remaining fields, and values).
- **Goal:** Provide a drop-in train split that **loads cleanly** in `datasets` without ad-hoc parsing workarounds.

---

## Data fields (train)

| Field | Type | Description |
|---|---|---|
| `text` | string | Raw medical text extracted from SlimPajama/RedPajama sources. |

---

## Splits

This repository publishes the **train** split only (11,080,331 documents). For methods, scope, and aggregate corpus statistics (including validation/test in the upstream project), see the original dataset card and paper.
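To make the `meta` problem described under "Why this fork?" concrete, here is a minimal, standard-library sketch. It is an illustration only, not the procedure the maintainers used: the toy rows, their contents, and the helper names `column_types` and `drop_column` are all assumptions. It shows a column whose values mix a dict with the literal string `"dict"`, and what the rows look like once that column is dropped.

```python
from collections import Counter

# Toy rows imitating the described bug (contents are made up):
# `meta` is a dict in one row and the literal string "dict" in another.
rows = [
    {"text": "placeholder medical text A", "meta": {"source": "example"}},
    {"text": "placeholder medical text B", "meta": "dict"},
]

def column_types(records, column):
    """Count the Python types observed in one column across all records."""
    return Counter(type(r.get(column)).__name__ for r in records)

def drop_column(records, column):
    """Return copies of the records without the given column."""
    return [{k: v for k, v in r.items() if k != column} for r in records]

meta_types = column_types(rows, "meta")  # mixed types: one dict, one str
fixed_rows = drop_column(rows, "meta")   # only `text` remains in each row
```

A column with inconsistent types like this is the kind of input that typed, columnar loaders tend to reject, which is why removing it leaves a single, uniformly typed `text` field.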
---

## How to load

```python
from datasets import load_dataset

# streaming
ds = load_dataset("openmed-community/TheBlueScrubs-v1-fixed", split="train", streaming=True)
row = next(iter(ds))
row["text"]

# non-streaming (if you have local storage/network bandwidth)
ds = load_dataset("openmed-community/TheBlueScrubs-v1-fixed", split="train")
ds.features
```
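Because the train split holds about 11 million documents, it can be handy to look at a few rows before committing to a full download. The sketch below is one way to do that with streaming. The helper names `take_texts` and `preview_train_split` are hypothetical, and the second function assumes the `datasets` package is installed and network access is available.

```python
from itertools import islice

def take_texts(rows, n):
    """Return the `text` field of the first n rows from any iterable of dicts."""
    return [row["text"] for row in islice(rows, n)]

def preview_train_split(n=3):
    """Stream the train split and return the first n texts (needs network access)."""
    # Imported lazily so that `take_texts` can be used without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(
        "openmed-community/TheBlueScrubs-v1-fixed",
        split="train",
        streaming=True,
    )
    return take_texts(ds, n)
```

Calling `preview_train_split(3)` returns a list of three strings. `take_texts` works on any iterable of row dicts, so it can also be tried on a small in-memory list first.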


Provider: maas

Created: 2025-08-26
