melephant/2l-bilinear-attn

Name: melephant/2l-bilinear-attn
Creator: melephant
Published: 2026-04-18 12:02:04
License: No description available

Hugging Face | Updated 2026-04-18 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/melephant/2l-bilinear-attn

Resource description:

# pile

Quadratic/bilinear attention causal language model trained with the tensor-mars research stack. This repository packages the final checkpoint, configuration, and reference model code.

## Training configuration

```yaml
batch_size: 384
max_steps: 33333
warmup_steps: 200
lr: 0.0003
optimizer: Muon + AdamW
dtype: bfloat16
grad_clip: 1.0
```

## Data + tokenizer

- Context length: 512 | Vocab size: 4096

## Metrics

- **train_loss**: 3.9820
- **val_loss**: 3.9987

## Checkpoints

- Latest checkpoint exported as `pytorch_model.bin`.
- Full training log available in `metrics.jsonl`.

## Usage

```python
import json

import torch
from models.transformer import AttentionLM

# Load the exported checkpoint on CPU and rebuild the model from its config.
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
with open("config.json") as f:
    model = AttentionLM.from_config(json.load(f))
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # switch to evaluation mode
```

## Limitations

- This model is research-grade and not aligned for deployment.
- Quadratic/bilinear attention stacks can exhibit instability outside the training distribution.
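The training configuration and metrics above can be turned into two rough sanity numbers: an upper bound on the tokens seen during training, and the perplexity implied by the reported losses. The sketch below is only illustrative and rests on assumptions the card does not confirm: that `batch_size` counts sequences per optimizer step, that every sequence fills the full 512-token context, and that the losses are mean per-token cross-entropy in nats. The helper names `tokens_seen` and `perplexity` are hypothetical and not part of the repository.

```python
import math

# Values copied from the model card.
BATCH_SIZE = 384
MAX_STEPS = 33333
CONTEXT_LENGTH = 512
TRAIN_LOSS = 3.9820
VAL_LOSS = 3.9987


def tokens_seen(batch_size: int, context_length: int, steps: int) -> int:
    """Upper bound on tokens processed.

    Assumes batch_size counts full-length sequences per optimizer step.
    """
    return batch_size * context_length * steps


def perplexity(loss_nats: float) -> float:
    """Perplexity, assuming the loss is mean per-token cross-entropy in nats."""
    return math.exp(loss_nats)


if __name__ == "__main__":
    total = tokens_seen(BATCH_SIZE, CONTEXT_LENGTH, MAX_STEPS)
    print(f"tokens seen (upper bound): {total:,}")
    print(f"train perplexity: {perplexity(TRAIN_LOSS):.2f}")
    print(f"val perplexity:   {perplexity(VAL_LOSS):.2f}")
```

Under these assumptions the run covers roughly 6.55 billion tokens, and the validation loss corresponds to a perplexity of about 54.5 over the 4096-entry vocabulary. If the training stack packs or pads sequences differently, or reports loss in another unit, both figures would change.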

Provider:

melephant
