cahlen/Convergent-7B-data
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/cahlen/Convergent-7B-data
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- number-theory
- computational-mathematics
- continued-fractions
- cuda
- tool-calling
- agentic
- research-companion
size_categories:
- 1K<n<10K
---
# Convergent-7B Training Data
<p align="center">
<img src="convergent-banner.jpg" alt="Convergent-7B — bigcompute.science research companion" width="800">
</p>
**Training data for the [bigcompute.science](https://bigcompute.science) research companion model.**
> **Early Preview** — This dataset is a work in progress. It is expressly designed to train a research assistant for the [bigcompute.science](https://bigcompute.science) MCP server as part of the Convergent conjecture-driven GPU research project. The dataset will be updated frequently as new experiments, findings, and tool definitions are added. Expect changes to schema, tool names, and content until we reach a GA release.
This repository contains the complete training dataset used to fine-tune [cahlen/Convergent-7B](https://huggingface.co/cahlen/Convergent-7B).
| Repository | Description |
|------------|-------------|
| **[cahlen/Convergent-7B](https://huggingface.co/cahlen/Convergent-7B)** | Trained model weights |
| **[cahlen/Convergent-7B-data](https://huggingface.co/datasets/cahlen/Convergent-7B-data)** | This repo — training dataset |
| **[cahlen/convergent](https://github.com/cahlen/convergent)** | Training code, eval, CLI toolkit |
## Dataset Description
The dataset contains 5,799 training entries in ChatML message format (cleaned and deduplicated), covering:
- **Computational number theory**: continued fractions, Zaremba's conjecture, Hausdorff dimensions, Kronecker coefficients, Ramsey numbers, Flint Hills series, Cohen-Lenstra heuristics
- **Agentic tool calling**: Hermes-format function calls to the bigcompute.science MCP server, including multi-turn ReAct trajectories
- **CUDA kernel development**: GPU programming for number theory with architecture-specific optimization
- **Research methodology**: proof strategies, experiment design, student guidance
- **Synthetic reasoning**: Deep mathematical Chain-of-Thought from Qwen2.5-Math-72B and creative synthesis from Gemma-4-26B
## Format
Each entry is a JSON object with a `messages` array in ChatML format:
```json
{
  "messages": [
    {"role": "system", "content": "You are Convergent, the bigcompute.science research companion..."},
    {"role": "user", "content": "How many Zaremba exceptions exist for digit set {1,2,3}?"},
    {"role": "assistant", "content": "<tool_call>\n{\"name\": \"get_zaremba_exceptions\", \"arguments\": {}}\n</tool_call>"}
  ]
}
```
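The tool call in the example above is a JSON payload wrapped in `<tool_call>` tags inside the assistant's `content` string. The following is a minimal sketch of how such a payload could be pulled back out; the helper name `extract_tool_calls` and the regular expression are illustrative assumptions, not part of the Convergent toolkit.

```python
import json
import re

# Matches the <tool_call>...</tool_call> wrapper shown in the example entry.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(content: str) -> list[dict]:
    """Return the parsed JSON payload of every <tool_call> block in a message."""
    calls = []
    for payload in TOOL_CALL_RE.findall(content):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            # Hypothetical policy: skip payloads that are not valid JSON.
            continue
    return calls


# The assistant turn from the example entry, written as a Python string.
assistant_content = (
    "<tool_call>\n"
    "{\"name\": \"get_zaremba_exceptions\", \"arguments\": {}}\n"
    "</tool_call>"
)

calls = extract_tool_calls(assistant_content)
print(calls)
```

Skipping malformed payloads instead of raising is a design choice for this sketch only; a stricter validator may prefer to fail loudly.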
Multi-turn entries include `tool` role messages for agentic ReAct trajectories:
```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "Check the Zaremba verification status"},
    {"role": "assistant", "content": "<tool_call>...</tool_call>"},
    {"role": "tool", "content": "{\"status\": \"completed\", \"exceptions\": 0}"},
    {"role": "assistant", "content": "The verification is complete with zero exceptions..."}
  ]
}
```
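A quick structural check can catch malformed entries before training. The sketch below encodes a few ordering rules inferred from the two examples above (system message first, `tool` messages only after an assistant or another tool message, assistant speaks last). These rules and the function name `check_entry` are assumptions for illustration; the dataset's actual validation logic is not documented here.

```python
VALID_ROLES = {"system", "user", "assistant", "tool"}


def check_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one entry; an empty list means it passed."""
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    prev_role = None
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message must come first")
        # Assumed rule: a tool result answers an assistant tool call
        # (or follows another tool result).
        if role == "tool" and prev_role not in {"assistant", "tool"}:
            problems.append(f"message {i}: tool message must follow an assistant turn")
        prev_role = role

    if messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


good = {
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "Check the Zaremba verification status"},
        {"role": "assistant", "content": "<tool_call>...</tool_call>"},
        {"role": "tool", "content": "{\"status\": \"completed\", \"exceptions\": 0}"},
        {"role": "assistant", "content": "The verification is complete with zero exceptions..."},
    ]
}
bad = {
    "messages": [
        {"role": "user", "content": "hi"},
        {"role": "tool", "content": "{}"},
    ]
}

print(check_entry(good))
print(check_entry(bad))
```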
## Composition
| Source | Entries | Description |
|--------|---------|-------------|
| Curated domain blocks (40+ modules) | ~1,150 | Identity, tool calling (23 MCP tools), nvcc-validated CUDA, number theory, error recovery, paper comprehension, student guidance |
| Qwen2.5-Math-72B (synthetic) | ~3,100 | Deep mathematical reasoning and Chain-of-Thought |
| Gemma-4-26B (synthetic) | ~1,200 | Creative synthesis, experiment design, long-form reasoning |
| Hermes FC (external) | 300 | Diverse tool-calling patterns from NousResearch |
| **Total (after dedup + cleaning)** | **5,799** | Off-topic entries and near-duplicates removed |
### Category Breakdown
| Category | Count | Percentage |
|----------|-------|------------|
| Mathematical reasoning (CoT) | ~3,500 | 60% |
| Tool-calling (agentic) | ~710 | 12% |
| Knowledge / factual | ~800 | 14% |
| Multi-turn conversations | ~520 | 9% |
| CUDA code generation | ~270 | 5% |
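As a sanity check, the percentages in the table can be reproduced from the approximate counts and the stated total of 5,799 entries. The snippet below is plain arithmetic on the numbers from the tables; because the counts are rounded, they sum to 5,800 rather than exactly 5,799.

```python
TOTAL = 5799  # total entries stated in the Composition table

# Approximate counts copied from the Category Breakdown table.
counts = {
    "Mathematical reasoning (CoT)": 3500,
    "Tool-calling (agentic)": 710,
    "Knowledge / factual": 800,
    "Multi-turn conversations": 520,
    "CUDA code generation": 270,
}

# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * n / TOTAL) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```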
## Data Sources
See [DATA_SOURCES.md](https://github.com/cahlen/convergent/blob/main/DATA_SOURCES.md) for complete documentation of all sources, including:
- Cahlen Humphreys' paper on prime convergents of continued fractions
- Boise State University and Florida Atlantic University number theory research
- Open Erdős problems
- NVIDIA GPU architecture specifications
- NousResearch/hermes-function-calling-v1
- bigcompute.science experimental findings
## Generation Pipeline
The training toolkit is open-source: [github.com/cahlen/convergent](https://github.com/cahlen/convergent)
```bash
./convergent generate-blocks # Generate curated domain training blocks
./convergent generate-synthetic # Generate synthetic data from remote LLMs
./convergent merge # Merge, deduplicate, remove eval leaks
./convergent validate # Validate format and quality
```
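To make the `merge` step more concrete, here is a minimal sketch of one way to deduplicate entries and drop anything that overlaps with an evaluation set. It is not the toolkit's implementation: the functions `fingerprint` and `merge_sources`, and the choice to hash normalized user turns, are assumptions for illustration. It only catches entries that are identical after lowercasing and whitespace normalization, so true near-duplicate detection would need a fuzzier comparison.

```python
import hashlib


def fingerprint(entry: dict) -> str:
    """Hash the user turns of an entry after lowercasing and collapsing whitespace."""
    user_text = " ".join(
        " ".join(m["content"].lower().split())
        for m in entry["messages"]
        if m["role"] == "user"
    )
    return hashlib.sha256(user_text.encode("utf-8")).hexdigest()


def merge_sources(sources: list[list[dict]], eval_set: list[dict]) -> list[dict]:
    """Concatenate sources, dropping normalized duplicates and eval overlaps."""
    # Seed the seen-set with eval prompts so matching training entries are excluded.
    seen = {fingerprint(e) for e in eval_set}
    merged = []
    for source in sources:
        for entry in source:
            fp = fingerprint(entry)
            if fp in seen:
                continue
            seen.add(fp)
            merged.append(entry)
    return merged


def make(question: str, answer: str) -> dict:
    """Build a two-message entry in the dataset's `messages` format."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


blocks = [
    make("What is Zaremba's conjecture?", "A1"),
    make("what is  zaremba's conjecture?", "A2"),  # duplicate after normalization
]
synthetic = [
    make("Explain continued fractions.", "A3"),
    make("Held-out question", "A4"),  # overlaps with the eval set
]
eval_set = [make("Held-out question", "reference")]

merged = merge_sources([blocks, synthetic], eval_set)
print(len(merged))
```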
## License
CC-BY-4.0 — You are free to share and adapt this dataset with attribution.
## Links
- [bigcompute.science](https://bigcompute.science) — Conjecture-driven GPU research in computational mathematics
- [MCP Server](https://mcp.bigcompute.science) — Model Context Protocol server for experimental data and tools
- [Convergent-7B Model](https://huggingface.co/cahlen/Convergent-7B) — Trained model weights on HuggingFace
- [Training Toolkit](https://github.com/cahlen/convergent) — Full pipeline source code on GitHub
- [guerrillamathematics.com](https://guerrillamathematics.com) — Mathematical research blog
## Citation
```bibtex
@misc{humphreys2026convergent,
author = {Humphreys, Cahlen},
title = {Convergent-7B Training Data: Computational Number Theory for Agentic Research},
year = {2026},
url = {https://huggingface.co/datasets/cahlen/Convergent-7B-data}
}
```
---
*This project is maintained by a single person. If you run into issues, please file them on [GitHub](https://github.com/cahlen/convergent/issues) or [HuggingFace](https://huggingface.co/cahlen/Convergent-7B/discussions) and I will do my best to address them. I apologize in advance for any delays in response time.*
Provider: cahlen



