cahlen/Convergent-7B-data

Name: cahlen/Convergent-7B-data
Creator: cahlen
Published: 2026-04-09 03:12:59
License: CC-BY-4.0

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/cahlen/Convergent-7B-data

Resource description:

---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- number-theory
- computational-mathematics
- continued-fractions
- cuda
- tool-calling
- agentic
- research-companion
size_categories:
- 1K<n<10K
---

# Convergent-7B Training Data

<p align="center">
  <img src="convergent-banner.jpg" alt="Convergent-7B: bigcompute.science research companion" width="800">
</p>

**Training data for the [bigcompute.science](https://bigcompute.science) research companion model.**

> **Early Preview**: This dataset is a work in progress. It is expressly designed to train a research assistant for the [bigcompute.science](https://bigcompute.science) MCP server as part of the Convergent conjecture-driven GPU research project. The dataset will be updated frequently as new experiments, findings, and tool definitions are added. Expect changes to schema, tool names, and content until we reach a GA release.

The complete training dataset used to fine-tune [cahlen/Convergent-7B](https://huggingface.co/cahlen/Convergent-7B).

| Repository | Description |
|------------|-------------|
| **[cahlen/Convergent-7B](https://huggingface.co/cahlen/Convergent-7B)** | Trained model weights |
| **[cahlen/Convergent-7B-data](https://huggingface.co/datasets/cahlen/Convergent-7B-data)** | This repo: training dataset |
| **[cahlen/convergent](https://github.com/cahlen/convergent)** | Training code, eval, CLI toolkit |

## Dataset Description

5,799 training entries in ChatML message format (cleaned and deduplicated), covering:

- **Computational number theory**: continued fractions, Zaremba's conjecture, Hausdorff dimensions, Kronecker coefficients, Ramsey numbers, Flint Hills series, Cohen-Lenstra heuristics
- **Agentic tool calling**: Hermes-format function calls to the bigcompute.science MCP server, including multi-turn ReAct trajectories
- **CUDA kernel development**: GPU programming for number theory with architecture-specific optimization
- **Research methodology**: proof strategies, experiment design, student guidance
- **Synthetic reasoning**: deep mathematical Chain-of-Thought from Qwen2.5-Math-72B and creative synthesis from Gemma-4-26B

## Format

Each entry is a JSON object with a `messages` array in ChatML format:

```json
{
  "messages": [
    {"role": "system", "content": "You are Convergent, the bigcompute.science research companion..."},
    {"role": "user", "content": "How many Zaremba exceptions exist for digit set {1,2,3}?"},
    {"role": "assistant", "content": "<tool_call>\n{\"name\": \"get_zaremba_exceptions\", \"arguments\": {}}\n</tool_call>"}
  ]
}
```

Multi-turn entries include `tool` role messages for agentic ReAct trajectories:

```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "Check the Zaremba verification status"},
    {"role": "assistant", "content": "<tool_call>...</tool_call>"},
    {"role": "tool", "content": "{\"status\": \"completed\", \"exceptions\": 0}"},
    {"role": "assistant", "content": "The verification is complete with zero exceptions..."}
  ]
}
```

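As a rough illustration of how entries in this format might be consumed, the following Python sketch pulls the JSON payload out of each `<tool_call>` block in the assistant turns. The helper name `extract_tool_calls`, the regular expression, and the in-memory sample entry are assumptions made for this example and are not part of the Convergent toolkit; the sample mirrors the first entry shown above.

```python
import json
import re

# Hypothetical pattern: matches the text between <tool_call> and </tool_call>,
# trimming the surrounding whitespace and newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(entry):
    """Return the parsed JSON payload of every tool call in assistant turns."""
    calls = []
    for message in entry["messages"]:
        if message["role"] != "assistant":
            continue
        for payload in TOOL_CALL_RE.findall(message["content"]):
            try:
                calls.append(json.loads(payload))
            except json.JSONDecodeError:
                # Skip payloads that are not valid JSON (e.g. an elided "...").
                continue
    return calls


# Sample entry copied from the format example; a real file would be read
# line by line or loaded with a dataset library instead.
entry = {
    "messages": [
        {"role": "system", "content": "You are Convergent, the bigcompute.science research companion..."},
        {"role": "user", "content": "How many Zaremba exceptions exist for digit set {1,2,3}?"},
        {"role": "assistant", "content": "<tool_call>\n{\"name\": \"get_zaremba_exceptions\", \"arguments\": {}}\n</tool_call>"},
    ]
}

calls = extract_tool_calls(entry)
print(calls)  # [{'name': 'get_zaremba_exceptions', 'arguments': {}}]
```

Parsing the payload with `json.loads` rather than string matching keeps the sketch tolerant of whitespace differences inside the tags, which seems prudent given that the schema and tool names may still change.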
## Composition

| Source | Entries | Description |
|--------|---------|-------------|
| Curated domain blocks (40+ modules) | ~1,150 | Identity, tool calling (23 MCP tools), nvcc-validated CUDA, number theory, error recovery, paper comprehension, student guidance |
| Qwen2.5-Math-72B (synthetic) | ~3,100 | Deep mathematical reasoning and Chain-of-Thought |
| Gemma-4-26B (synthetic) | ~1,200 | Creative synthesis, experiment design, long-form reasoning |
| Hermes FC (external) | 300 | Diverse tool-calling patterns from NousResearch |
| **Total (after dedup + cleaning)** | **5,799** | Off-topic entries and near-duplicates removed |

### Category Breakdown

| Category | Count | Percentage |
|----------|-------|------------|
| Mathematical reasoning (CoT) | ~3,500 | 60% |
| Tool-calling (agentic) | ~710 | 12% |
| Knowledge / factual | ~800 | 14% |
| Multi-turn conversations | ~520 | 9% |
| CUDA code generation | ~270 | 5% |

## Data Sources

See [DATA_SOURCES.md](https://github.com/cahlen/convergent/blob/main/DATA_SOURCES.md) for complete documentation of all sources, including:

- Cahlen Humphreys' paper on prime convergents of continued fractions
- Boise State University and Florida Atlantic University number theory research
- Open Erdős problems
- NVIDIA GPU architecture specifications
- NousResearch/hermes-function-calling-v1
- bigcompute.science experimental findings

## Generation Pipeline

The training toolkit is open-source: [github.com/cahlen/convergent](https://github.com/cahlen/convergent)

```bash
./convergent generate-blocks      # Generate curated domain training blocks
./convergent generate-synthetic   # Generate synthetic data from remote LLMs
./convergent merge                # Merge, deduplicate, remove eval leaks
./convergent validate             # Validate format and quality
```

## License

CC-BY-4.0: you are free to share and adapt this dataset with attribution.

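As a quick sanity check on the Category Breakdown table, the sketch below recomputes each percentage from the approximate counts and the stated total of 5,799 entries. The counts are the rounded figures from the table, so this is only a consistency check and not an exact tally; the variable names are illustrative.

```python
# Stated total after deduplication and cleaning.
TOTAL = 5799

# Approximate per-category counts, copied from the Category Breakdown table.
categories = {
    "Mathematical reasoning (CoT)": 3500,
    "Tool-calling (agentic)": 710,
    "Knowledge / factual": 800,
    "Multi-turn conversations": 520,
    "CUDA code generation": 270,
}

# Share of the total for each category, rounded to the nearest whole percent.
shares = {name: round(100 * count / TOTAL) for name, count in categories.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")

# The rounded counts add up to 5,800, one more than the stated total.
approx_sum = sum(categories.values())
print(approx_sum)  # 5800
```

The recomputed shares come out to 60%, 12%, 14%, 9%, and 5%, matching the table, and the one-entry gap between the summed counts and 5,799 is consistent with the counts being approximate.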
## Links

- [bigcompute.science](https://bigcompute.science): Conjecture-driven GPU research in computational mathematics
- [MCP Server](https://mcp.bigcompute.science): Model Context Protocol server for experimental data and tools
- [Convergent-7B Model](https://huggingface.co/cahlen/Convergent-7B): Trained model weights on HuggingFace
- [Training Toolkit](https://github.com/cahlen/convergent): Full pipeline source code on GitHub
- [guerrillamathematics.com](https://guerrillamathematics.com): Mathematical research blog

## Citation

```bibtex
@misc{humphreys2026convergent,
  author = {Humphreys, Cahlen},
  title = {Convergent-7B Training Data: Computational Number Theory for Agentic Research},
  year = {2026},
  url = {https://huggingface.co/datasets/cahlen/Convergent-7B-data}
}
```

---

*This project is maintained by a single person. If you run into issues, please file them on [GitHub](https://github.com/cahlen/convergent/issues) or [HuggingFace](https://huggingface.co/cahlen/Convergent-7B/discussions) and I will do my best to address them. I apologize in advance for any delays in response time.*
