TheBlueScrubs-v1-fixed
ModelScope community · Updated 2025-11-27 · Indexed 2025-08-30
Download link:
https://modelscope.cn/datasets/openmed-community/TheBlueScrubs-v1-fixed
Resource description:
# openmed-community/TheBlueScrubs-v1-fixed
## What is this?
**TheBlueScrubs-v1-fixed** is a maintenance fork of the upstream [TheBlueScrubs/TheBlueScrubs-v1](https://huggingface.co/datasets/TheBlueScrubs/TheBlueScrubs-v1) *train split* that resolves a schema bug in the `meta` column.
In the original train files, some rows serialized `meta` incorrectly, so the value appears as the literal string `"dict"` rather than a dictionary. This fork **re-exports the entire train split without the `meta` column**, preserving the `text` field and its values.
- **Document count:** 11,080,331 texts (train)
- **Tokens (upstream estimate across all splits):** ~20B
- **Sources:** Curated from SlimPajama/RedPajama (Common Crawl, C4, GitHub, Books, arXiv, Wikipedia, StackExchange)
- **Quality signals:** per-text medical probability (0.8–1.0) + three 1–5 LLM-based scores (relevance, precision/factual detail, safety/ethics); oncology label covering ~11B tokens across the full corpus.
> Upstream details: The Blue Scrubs is a large, curated medical corpus designed for clinical LLMs, filtered via a logistic-regression screen and then Llama-3.1-70B evaluation; clinician and external checks reported high concordance. An oncology classifier adds cancer labels at scale.
---
## Why this fork?
- **Fix:** Removes the `meta` column, unblocking usage with `datasets` streaming and dataframe backends.
- **Scope:** Apart from the dropped `meta` column, content is **unchanged** relative to the upstream train split (same rows and the same `text` values).
- **Goal:** Provide a drop-in train split that **loads cleanly** in `datasets` without ad-hoc parsing workarounds.
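
The effect of the fix can be illustrated with a minimal sketch. It is not the maintainers' export script: the two sample rows and the helper names `has_inconsistent_meta` and `drop_meta` are hypothetical, and plain Python dictionaries stand in for dataset rows. It shows why a `meta` column that is a dictionary in some rows and the string `"dict"` in others has no single type, and what dropping it leaves behind.

```python
# Hypothetical upstream-style rows (placeholder text, not real corpus data):
# `meta` is a dict in one row and the literal string "dict" in the other.
rows = [
    {"text": "first placeholder document", "meta": {"source": "example"}},
    {"text": "second placeholder document", "meta": "dict"},
]

def has_inconsistent_meta(rows):
    """Return True if the `meta` values do not all share one Python type."""
    return len({type(r["meta"]) for r in rows}) > 1

def drop_meta(rows):
    """Return copies of the rows without `meta`; other fields are untouched."""
    return [{k: v for k, v in r.items() if k != "meta"} for r in rows]

fixed = drop_meta(rows)
print(has_inconsistent_meta(rows))  # True
print(fixed)
```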
---
## Data fields (train)
| Field | Type | Description |
|---|---|---|
| `text` | string | Raw medical text extracted from SlimPajama/RedPajama sources. |
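
Because each row is expected to carry a single string field, a consumer may want a quick sanity check before training. The sketch below is one possible check under that assumption; `check_row` and the sample rows are hypothetical and not part of the dataset or the `datasets` library.

```python
def check_row(row):
    """Return True if the row has exactly one field, `text`, holding a string."""
    return set(row) == {"text"} and isinstance(row["text"], str)

# Placeholder rows: the second mimics an upstream-style row that still has `meta`.
sample_rows = [
    {"text": "placeholder document"},
    {"text": "another placeholder", "meta": "dict"},
]

results = [check_row(r) for r in sample_rows]
print(results)  # [True, False]
```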
---
## Splits
This repository publishes the **train** split only (11,080,331 documents). For methods, scope, and aggregate corpus statistics (including validation/test in the upstream project), see the original dataset card and paper.
---
## How to load
```python
from datasets import load_dataset

# Streaming: iterate over rows without downloading the full split first
ds = load_dataset("openmed-community/TheBlueScrubs-v1-fixed", split="train", streaming=True)
row = next(iter(ds))
print(row["text"])

# Non-streaming (needs enough local storage and network bandwidth for the full split)
ds = load_dataset("openmed-community/TheBlueScrubs-v1-fixed", split="train")
print(ds.features)
```
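
With roughly 11 million documents in the split, it can help to look at a handful of rows before committing to a full pass. The sketch below is one way to do that with any iterable of rows; `preview` and `fake_stream` are hypothetical helpers, and the stand-in generator exists only so the example runs offline. With network access, the streaming dataset returned by `load_dataset(..., streaming=True)` could be passed in its place.

```python
from itertools import islice

def preview(rows, n=3, width=80):
    """Return the first `n` texts from an iterable of rows, cut to `width` characters."""
    return [row["text"][:width] for row in islice(rows, n)]

def fake_stream():
    """Offline stand-in for a streaming dataset: yields placeholder rows forever."""
    i = 0
    while True:
        yield {"text": f"placeholder document {i}"}
        i += 1

# islice stops after n rows, so the endless stream is never exhausted.
print(preview(fake_stream(), n=2))
```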
Provider: maas

Created: 2025-08-26



