wisconsin-breast-cancer-diagnostic
Source: ModelScope community (魔搭社区) | Updated 2025-12-05 | Indexed 2025-10-04
Download link:
https://modelscope.cn/datasets/mnemoraorg/wisconsin-breast-cancer-diagnostic
Description:
This dataset, derived from the **Wisconsin Breast Cancer (Diagnostic)** dataset, supports developing and evaluating machine learning models for the binary classification of breast tumors as either **benign (B)** or **malignant (M)**. The data consists of features computed from digitized images of fine needle aspirates (FNA) of breast masses, giving quantitative measurements for computational pathology and diagnostic research.
The dataset is widely used in healthcare analytics, in teaching, and for benchmarking classification algorithms such as Support Vector Machines, Logistic Regression, and Neural Networks. Its main value is as a small, clean table of precomputed features for predictive modeling. Note that the classes are only moderately balanced (357 benign versus 212 malignant).
### Dataset Structure and Column Descriptions
The dataset contains **569 instances** (samples) and **32 columns**: an `id`, the `diagnosis` label, and 30 numerical features. Each instance represents a unique breast mass sample, and the features describe the characteristics of the cell nuclei in that sample.
#### Key Columns
* `id`: A unique numerical identifier assigned to each patient sample.
* `diagnosis`: The target variable, indicating the classification of the tumor. It is a categorical variable with two possible values: **'M'** for malignant and **'B'** for benign.
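As a minimal sketch of how the `diagnosis` labels might be turned into a numeric target, the snippet below maps `'M'` to 1 and `'B'` to 0. The `encode_diagnosis` helper and the sample labels are hypothetical, and treating malignant as the positive class is an assumption rather than something the dataset prescribes.

```python
# Hypothetical helper: encode the categorical target as integers.
# Assumption: malignant ('M') is treated as the positive class (1).
LABEL_MAP = {"M": 1, "B": 0}


def encode_diagnosis(labels):
    """Return a list of 0/1 targets; raise on any unexpected label."""
    encoded = []
    for label in labels:
        if label not in LABEL_MAP:
            raise ValueError(f"unexpected diagnosis value: {label!r}")
        encoded.append(LABEL_MAP[label])
    return encoded


# Made-up example labels, not taken from the dataset.
sample_labels = ["M", "B", "B", "M"]
targets = encode_diagnosis(sample_labels)
print(targets)
```

Failing loudly on an unknown label is a deliberate choice here: a silent default would hide data-entry problems in a diagnostic setting.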
#### Feature Columns
The core of the dataset comprises **30 numerical features** derived from image analysis of the cell nuclei. These features are organized into three categories: **Mean**, **Standard Error**, and **Worst**. Each category contains 10 specific measurements.
* **Mean**: For each of the 10 measurements, the average value across the cell nuclei in a sample. These features are labeled with a `_mean` suffix (e.g., `radius_mean`).
* **Standard Error (SE)**: For each of the 10 measurements, the standard error of its values across the cell nuclei in a sample. These features are labeled with an `_se` suffix (e.g., `area_se`).
* **Worst**: For each of the 10 measurements, the "worst" (largest) value, computed as the mean of the three largest values in a sample. These features are labeled with a `_worst` suffix (e.g., `perimeter_worst`).
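The three summary categories can be illustrated with a short sketch. It assumes that the standard error is the sample standard deviation divided by the square root of the number of nuclei, which the description above does not spell out, and it uses made-up per-nucleus radii; `summarize` is a hypothetical helper, not part of the dataset tooling.

```python
import math
import statistics


def summarize(values):
    """Return (mean, standard error, worst) for one measurement in one sample."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumed estimator: sample standard deviation / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(n)
    # "Worst": mean of the three largest values.
    worst = statistics.fmean(sorted(values, reverse=True)[:3])
    return mean, se, worst


# Made-up per-nucleus radii for a single sample (illustrative only).
radii = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]
radius_mean, radius_se, radius_worst = summarize(radii)
print(radius_mean, radius_se, radius_worst)
```

Applying the same function to each of the 10 measurements would yield the 30 feature values for one sample.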
#### The 10 Specific Measurements
1. `radius`: The mean of distances from the center to points on the perimeter.
2. `texture`: The standard deviation of grayscale values within the region of interest.
3. `perimeter`: The perimeter of the cell nuclei's contour.
4. `area`: The area of the cell nuclei's contour.
5. `smoothness`: The local variation in radius lengths.
6. `compactness`: Defined as `(perimeter^2 / area - 1.0)`.
7. `concavity`: The severity of concave portions of the contour.
8. `concave points`: The number of concave portions of the contour.
9. `symmetry`: The symmetry of the cell nuclei.
10. `fractal_dimension`: The fractal dimension of the cell nuclei's boundary, calculated as the "coastline approximation" minus 1.
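To make the compactness formula concrete, the sketch below evaluates `perimeter^2 / area - 1.0` for two idealized shapes. It illustrates the formula only; the shapes are not drawn from the dataset, and the stored column values may be on a different scale.

```python
import math


def compactness(perimeter, area):
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# Circle of radius r: perimeter = 2*pi*r, area = pi*r^2,
# so compactness = 4*pi - 1 regardless of r.
r = 5.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# Square of side s: perimeter = 4*s, area = s^2, so compactness = 15.
s = 3.0
square = compactness(4 * s, s ** 2)

print(circle, square)
```

The circle gives the smallest possible value for any closed contour, so under this formula larger values indicate a less circular, more irregular outline.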
Additionally, the file contains a redundant column, `Unnamed: 32`, which is entirely empty. It is not counted among the 32 columns described above and can be dropped on load.
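The following sketch shows one way to enumerate the 30 feature names and to discard an all-empty column when reading the file. It assumes the feature columns are named `<measurement>_<suffix>` as in the examples above; the two CSV rows are made up, and `drop_empty_columns` is a hypothetical helper.

```python
import csv
import io

MEASUREMENTS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
SUFFIXES = ["mean", "se", "worst"]

# Assumption: feature columns follow the "<measurement>_<suffix>" pattern.
feature_columns = [f"{m}_{s}" for s in SUFFIXES for m in MEASUREMENTS]


def drop_empty_columns(rows):
    """Remove columns whose value is blank in every row."""
    empty = {
        key for key in rows[0]
        if all(not (row[key] or "").strip() for row in rows)
    }
    return [{k: v for k, v in row.items() if k not in empty} for row in rows]


# Tiny made-up CSV; the trailing comma yields an empty last column.
raw = "id,diagnosis,radius_mean,Unnamed: 32\n1,M,17.5,\n2,B,12.1,\n"
rows = list(csv.DictReader(io.StringIO(raw)))
cleaned = drop_empty_columns(rows)
print(len(feature_columns), list(cleaned[0]))
```

Detecting empty columns by content rather than by name keeps the helper usable even if the stray column is labeled differently in another export.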
### Key Characteristics and Use Cases
* **Data Integrity**: The 32 documented columns contain no missing values, so no imputation is needed. The only nulls are in the empty `Unnamed: 32` column, which should be dropped before analysis.
* **Class Distribution**: With 357 benign and 212 malignant cases (about 63% versus 37%), the classes are moderately imbalanced. The skew is mild enough for most classifiers, but accuracy should be read against the roughly 63% majority-class baseline.
* **Dimensionality**: The 30 features give enough dimensions to exercise feature selection and dimensionality reduction techniques for identifying the most predictive variables.
* **Applications**: Ideal for building and evaluating classification models, conducting exploratory data analysis to visualize feature relationships, and supporting medical research to better understand the morphological characteristics of cancerous cells.
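The class counts quoted above translate into a simple baseline, sketched here as plain arithmetic. Only the counts 357 and 212 come from the description; the variable names are illustrative.

```python
# Class counts as stated in the description above.
counts = {"B": 357, "M": 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class reaches this accuracy.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

# Ratio of the larger class to the smaller one.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total, majority_label, round(baseline_accuracy, 3), round(imbalance_ratio, 2))
```

Any model evaluated on this data should beat the majority-class baseline by a clear margin, and metrics such as recall on the malignant class are worth reporting alongside accuracy.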
Provider:
maas
Created:
2025-09-08
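Background overview
This dataset is derived from the Wisconsin Breast Cancer (Diagnostic) data and contains 569 samples and 32 columns. Its 30 numerical features are computed from images of cell nuclei in fine needle aspirates and are used to classify tumors as benign or malignant. It serves as a benchmark for machine learning classification models and is widely used in healthcare analytics and education.
The summary above was collected and generated by 遇见数据集.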



