
cahlen/Convergent-7B-data

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/cahlen/Convergent-7B-data
Description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- number-theory
- computational-mathematics
- continued-fractions
- cuda
- tool-calling
- agentic
- research-companion
size_categories:
- 1K<n<10K
---

# Convergent-7B Training Data

<p align="center">
  <img src="convergent-banner.jpg" alt="Convergent-7B — bigcompute.science research companion" width="800">
</p>

**Training data for the [bigcompute.science](https://bigcompute.science) research companion model.**

> **Early Preview:** This dataset is a work in progress. It is expressly designed to train a research assistant for the [bigcompute.science](https://bigcompute.science) MCP server as part of the Convergent conjecture-driven GPU research project. The dataset will be updated frequently as new experiments, findings, and tool definitions are added. Expect changes to schema, tool names, and content until we reach a GA release.

The complete training dataset used to fine-tune [cahlen/Convergent-7B](https://huggingface.co/cahlen/Convergent-7B).

| Repository | Description |
|------------|-------------|
| **[cahlen/Convergent-7B](https://huggingface.co/cahlen/Convergent-7B)** | Trained model weights |
| **[cahlen/Convergent-7B-data](https://huggingface.co/datasets/cahlen/Convergent-7B-data)** | This repo: training dataset |
| **[cahlen/convergent](https://github.com/cahlen/convergent)** | Training code, eval, CLI toolkit |

## Dataset Description

5,799 training entries in ChatML message format (cleaned and deduplicated), covering:

- **Computational number theory**: continued fractions, Zaremba's conjecture, Hausdorff dimensions, Kronecker coefficients, Ramsey numbers, Flint Hills series, Cohen-Lenstra heuristics
- **Agentic tool calling**: Hermes-format function calls to the bigcompute.science MCP server, including multi-turn ReAct trajectories
- **CUDA kernel development**: GPU programming for number theory with architecture-specific optimization
- **Research methodology**: proof strategies, experiment design, student guidance
- **Synthetic reasoning**: Deep mathematical Chain-of-Thought from Qwen2.5-Math-72B and creative synthesis from Gemma-4-26B

## Format

Each entry is a JSON object with a `messages` array in ChatML format:

```json
{
  "messages": [
    {"role": "system", "content": "You are Convergent, the bigcompute.science research companion..."},
    {"role": "user", "content": "How many Zaremba exceptions exist for digit set {1,2,3}?"},
    {"role": "assistant", "content": "<tool_call>\n{\"name\": \"get_zaremba_exceptions\", \"arguments\": {}}\n</tool_call>"}
  ]
}
```

Multi-turn entries include `tool` role messages for agentic ReAct trajectories:

```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "Check the Zaremba verification status"},
    {"role": "assistant", "content": "<tool_call>...</tool_call>"},
    {"role": "tool", "content": "{\"status\": \"completed\", \"exceptions\": 0}"},
    {"role": "assistant", "content": "The verification is complete with zero exceptions..."}
  ]
}
```

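The entry layout above can be checked and unpacked with a few lines of Python. The following is a minimal sketch, not part of the project's toolkit: the helper names `extract_tool_calls` and `validate_entry` are hypothetical, the set of roles is inferred only from the two examples, and the rule that a `tool` message follows an `assistant` turn is an assumption rather than a documented schema.

```python
import json
import re

# Roles seen in the two examples above (assumption: no other roles occur).
VALID_ROLES = {"system", "user", "assistant", "tool"}

# Matches one <tool_call>...</tool_call> block and captures its body.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(content):
    """Parse the JSON body of every <tool_call> block in an assistant message."""
    return [json.loads(body) for body in TOOL_CALL_RE.findall(content)]


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry looks well formed."""
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    prev_role = None
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            prev_role = None
            continue
        role = msg.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")
        # Assumed rule: a tool result answers the assistant turn just before it.
        if role == "tool" and prev_role != "assistant":
            problems.append(
                f"message {i}: tool message without a preceding assistant turn"
            )
        prev_role = role
    return problems


# Round-trip the single-turn example through JSON, as it would be read from disk.
raw = json.dumps({
    "messages": [
        {"role": "system", "content": "You are Convergent, the bigcompute.science research companion..."},
        {"role": "user", "content": "How many Zaremba exceptions exist for digit set {1,2,3}?"},
        {"role": "assistant", "content": '<tool_call>\n{"name": "get_zaremba_exceptions", "arguments": {}}\n</tool_call>'},
    ]
})
entry = json.loads(raw)
problems = validate_entry(entry)
calls = extract_tool_calls(entry["messages"][-1]["content"])
print(problems)
print(calls)
```

Note that the multi-turn example elides the tool-call body (`<tool_call>...</tool_call>`), so `extract_tool_calls` would raise `json.JSONDecodeError` on that placeholder; real entries are expected to carry valid JSON there.
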
## Composition

| Source | Entries | Description |
|--------|---------|-------------|
| Curated domain blocks (40+ modules) | ~1,150 | Identity, tool calling (23 MCP tools), nvcc-validated CUDA, number theory, error recovery, paper comprehension, student guidance |
| Qwen2.5-Math-72B (synthetic) | ~3,100 | Deep mathematical reasoning and Chain-of-Thought |
| Gemma-4-26B (synthetic) | ~1,200 | Creative synthesis, experiment design, long-form reasoning |
| Hermes FC (external) | 300 | Diverse tool-calling patterns from NousResearch |
| **Total (after dedup + cleaning)** | **5,799** | Off-topic entries and near-duplicates removed |

### Category Breakdown

| Category | Count | Percentage |
|----------|-------|------------|
| Mathematical reasoning (CoT) | ~3,500 | 60% |
| Tool-calling (agentic) | ~710 | 12% |
| Knowledge / factual | ~800 | 14% |
| Multi-turn conversations | ~520 | 9% |
| CUDA code generation | ~270 | 5% |

## Data Sources

See [DATA_SOURCES.md](https://github.com/cahlen/convergent/blob/main/DATA_SOURCES.md) for complete documentation of all sources, including:

- Cahlen Humphreys' paper on prime convergents of continued fractions
- Boise State University and Florida Atlantic University number theory research
- Open Erdős problems
- NVIDIA GPU architecture specifications
- NousResearch/hermes-function-calling-v1
- bigcompute.science experimental findings

## Generation Pipeline

The training toolkit is open-source: [github.com/cahlen/convergent](https://github.com/cahlen/convergent)

```bash
./convergent generate-blocks      # Generate curated domain training blocks
./convergent generate-synthetic   # Generate synthetic data from remote LLMs
./convergent merge                # Merge, deduplicate, remove eval leaks
./convergent validate             # Validate format and quality
```

## License

CC-BY-4.0: you are free to share and adapt this dataset with attribution.

## Links

- [bigcompute.science](https://bigcompute.science): Conjecture-driven GPU research in computational mathematics
- [MCP Server](https://mcp.bigcompute.science): Model Context Protocol server for experimental data and tools
- [Convergent-7B Model](https://huggingface.co/cahlen/Convergent-7B): Trained model weights on HuggingFace
- [Training Toolkit](https://github.com/cahlen/convergent): Full pipeline source code on GitHub
- [guerrillamathematics.com](https://guerrillamathematics.com): Mathematical research blog

## Citation

```bibtex
@misc{humphreys2026convergent,
  author = {Humphreys, Cahlen},
  title = {Convergent-7B Training Data: Computational Number Theory for Agentic Research},
  year = {2026},
  url = {https://huggingface.co/datasets/cahlen/Convergent-7B-data}
}
```

---

*This project is maintained by a single person. If you run into issues, please file them on [GitHub](https://github.com/cahlen/convergent/issues) or [HuggingFace](https://huggingface.co/cahlen/Convergent-7B/discussions) and I will do my best to address them. I apologize in advance for any delays in response time.*

Provider:
cahlen