machinelearninglm-scm-synthetic-tabularml
Source: ModelScope community. Updated 2025-12-05; indexed 2025-09-13.
Download link:
https://modelscope.cn/datasets/MachineLearningLM/machinelearninglm-scm-synthetic-tabularml
Resource description:
# MachineLearningLM Pretraining Corpus
This repository contains the pretraining corpus for **MachineLearningLM**, a framework designed to equip large language models (LLMs) with robust in-context machine learning (ML) capabilities. The dataset consists of ML tasks synthesized from millions of structural causal models (SCMs), with shot counts ranging up to 1,024. It is designed so that LLMs learn standard ML tasks from many in-context examples purely via in-context learning (ICL), without gradient descent.
* **Paper**: [MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining](https://huggingface.co/papers/2509.06806) | [arXiv:2509.06806](https://arxiv.org/pdf/2509.06806)
* **Code**: [https://github.com/HaoAreYuDong/MachineLearningLM](https://github.com/HaoAreYuDong/MachineLearningLM)
* **Project Page**: [https://huggingface.co/MachineLearningLM](https://huggingface.co/MachineLearningLM)
## Dataset Description
The corpus is built from millions of structural causal models (SCMs), which are used to synthesize diverse machine learning tasks. Training on this synthetic data is intended to give LLMs strong in-context learning abilities, particularly for tabular classification tasks in domains such as finance, physics, biology, and healthcare. The goal is for LLMs to reach random-forest-level accuracy without any task-specific training and to follow a many-shot scaling law in which accuracy increases monotonically as more in-context demonstrations are provided.
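To make the idea of synthesizing a task from an SCM concrete, here is a minimal sketch in Python. It is not the authors' generator: the function name `sample_scm_task`, the linear-plus-`tanh` mechanism, the Gaussian noise, and the thresholded label are all assumptions chosen for illustration. Each feature depends only on earlier features in a fixed causal order, and the label is a function of the features.

```python
import math
import random


def sample_scm_task(n_rows, n_features=4, seed=0):
    """Draw a toy binary-classification table from a random SCM (illustrative only)."""
    rng = random.Random(seed)
    # Hypothetical mechanism: feature j depends only on features with index < j.
    weights = [[rng.uniform(-1, 1) for _ in range(j)] for j in range(n_features)]
    label_weights = [rng.uniform(-1, 1) for _ in range(n_features)]

    rows, labels = [], []
    for _ in range(n_rows):
        x = []
        for j in range(n_features):
            # Weighted sum of parent values, squashed, plus exogenous noise.
            parent_sum = sum(w * v for w, v in zip(weights[j], x))
            x.append(math.tanh(parent_sum) + rng.gauss(0, 1))
        # The label is a thresholded linear function of the features.
        score = sum(w * v for w, v in zip(label_weights, x))
        labels.append(1 if score > 0 else 0)
        rows.append([round(v, 3) for v in x])
    return rows, labels


rows, labels = sample_scm_task(n_rows=8)
```

Changing `seed` yields a different causal mechanism and therefore a different task, which is the sense in which many SCMs can produce many distinct tasks.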
## Full Dataset
All datasets have been open-sourced on Hugging Face. Because of the large file size, the dataset has been split into multiple parts. The complete datasets are hosted on Google Drive:
* **Warmup Dataset**: [https://drive.google.com/file/d/1OjD0jICy95lOFp52_2hJoO7KzSiFegLH/view?usp=sharing](https://drive.google.com/file/d/1OjD0jICy95lOFp52_2hJoO7KzSiFegLH/view?usp=sharing)
* **Full Dataset**: [https://drive.google.com/file/d/1TYsEMI1WNYDzzE_z83Ah-QAmcoaVHKQA/view?usp=sharing](https://drive.google.com/file/d/1TYsEMI1WNYDzzE_z83Ah-QAmcoaVHKQA/view?usp=sharing)
## Dataset Structure
The pretraining corpus consists of prompts in LLaMA Factory's Alpaca format. Each sample is one JSONL entry (a single JSON object per line) with the following structure:
```json
{
  "instruction": "The task instruction for the specific machine learning problem (e.g., 'Classify the following tabular data:').",
  "input": "The input data for the task, serialized as text, potentially including in-context examples.",
  "output": "The expected output or prediction for the machine learning task."
}
```
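Records in this shape can be read with the Python standard library alone. The sketch below assumes only what is stated above (one JSON object per line with the fields `instruction`, `input`, and `output`); the helper name `parse_alpaca_lines` is hypothetical and not part of the repository.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def parse_alpaca_lines(lines):
    """Parse an iterable of JSONL lines and check the three Alpaca fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [k for k in REQUIRED_KEYS if k not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


example = [
    '{"instruction": "Classify the following tabular data:", "input": "1,2|0", "output": "1"}',
]
records = parse_alpaca_lines(example)
```

An open file object can be passed in place of the list, since iterating over a text file yields one line at a time.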
This token-efficient prompt format is designed to enable LLMs to process a high density of examples within their context window, facilitating effective in-context learning.
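As an illustration of packing many demonstrations into one record, the sketch below serializes labeled rows and one query row into the three Alpaca fields. The exact serialization used in the corpus is not described here, so the comma-separated values, the `|` label separator, and the `?` query marker are assumptions, as is the helper name `build_record`.

```python
import json


def build_record(train_rows, train_labels, query_row, query_label):
    """Pack labeled demonstration rows and one query row into an Alpaca-style record."""
    # Hypothetical compact layout: "v1,v2,...|label", one row per line.
    demos = [
        ",".join(str(v) for v in row) + "|" + str(label)
        for row, label in zip(train_rows, train_labels)
    ]
    # The query row carries "?" where the label would be.
    query = ",".join(str(v) for v in query_row) + "|?"
    return {
        "instruction": "Classify the following tabular data:",
        "input": "\n".join(demos + [query]),
        "output": str(query_label),
    }


record = build_record([[1, 2], [3, 4]], [0, 1], [5, 6], 1)
jsonl_line = json.dumps(record)  # one line of a JSONL file
```

Keeping each row on a single short line is one way to fit more demonstrations into a fixed context window; the corpus may use a different layout.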
## Sample Usage
The associated [GitHub repository](https://github.com/HaoAreYuDong/MachineLearningLM) provides an evaluation framework for running and evaluating models with this dataset. Below are examples of local model inference and single-file evaluation.
### Installation
```bash
# Install Python dependencies
pip install -r requirements.txt
```
### Local Model Usage Example
To perform inference using a local model with an input JSONL file:
```bash
python ./src/evaluation/model_pred/dl_model_pred.py \
--input_dir ./demo_input.jsonl \
--output_dir ./demo_output.jsonl \
--model_name MachineLearningLM/MachineLearningLM-7B-v1
```
### Single File Evaluation
To evaluate the predictions generated from a single JSONL response file:
```bash
python ./src/evaluation/result_proc/evaluator.py \
--input_dir ./demo_response.jsonl \
--output_dir ./output_demo.txt # Can also be .jsonl
```
**Note**: The evaluation framework is specifically designed for results generated by its `dl_model_pred` inference pipeline. Please use outputs from this inference module as input for evaluation to ensure compatibility. For more details on batch processing, cloud model usage, or generating prior data, please refer to the [GitHub repository](https://github.com/HaoAreYuDong/MachineLearningLM).
Provider:
maas
Created:
2025-09-11



