zeta
Source: ModelScope community. Updated 2025-12-04; indexed 2025-02-22.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/zeta
# Dataset for Zeta
This is the open dataset used to train Zeta, an edit prediction model that powers Zed's predictive coding feature. Zeta is derived from Qwen2.5-Coder-7B and predicts the developer's next code edit based on their recent programming patterns and cursor position, allowing for intelligent completion with a simple tab press.
This dataset is split into three parts:
- `train.jsonl`: Contains the training data for supervised fine-tuning.
- `dpo.jsonl`: Contains the data for direct preference optimization (DPO).
- `eval.jsonl`: Contains the evaluation data for Zeta.
These files are generated from the markdown files in the respective directories.
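Since the three files are JSON Lines, each line should parse as one standalone JSON value. The following is a minimal sketch for inspecting them, not part of the repository's scripts: the helper names `read_jsonl` and `summarize` are hypothetical, and no field names are assumed, because the record schema is not described here.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error


def summarize(path):
    """Return (record count, sorted field names seen) without assuming a schema."""
    count = 0
    fields = set()
    for record in read_jsonl(path):
        count += 1
        if isinstance(record, dict):
            fields.update(record)  # collects the top-level keys only
    return count, sorted(fields)


if __name__ == "__main__":
    # Assumes the files sit in the current working directory.
    for name in ("train.jsonl", "dpo.jsonl", "eval.jsonl"):
        if Path(name).exists():
            print(name, *summarize(name))
```

Printing the field names first is a cheap way to learn the actual record layout before writing any training or evaluation code against it.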
## Scripts
There are several scripts to help with data processing and evaluation:
- `script/pull-predictions`: Pulls predictions from Snowflake.
- `script/verify_server.py`: Simple web server for manually verifying predictions and adding them to the dataset.
- `script/gen-dataset`: Reads all the markdown files, validates them, and generates the dataset files.
- `script/sft.ipynb`: Jupyter notebook for supervised fine-tuning.
- `script/dpo.ipynb`: Jupyter notebook for direct preference optimization.
### Running Python Scripts
Set up a Python environment:
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn
```
Run the verification UI:
```bash
python script/verify_server.py predictions train --trash-dir trash
```
Open http://localhost:8000 and use:
- 'G' to accept (moves to `train/`)
- 'B' to reject (moves to `trash/`)
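The accept/reject step amounts to moving a prediction file into one of two directories. The sketch below illustrates that file-moving behavior only; it is not the code of `script/verify_server.py`. The function name `triage` and the refusal to overwrite an existing file are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def triage(prediction, accept, train_dir="train", trash_dir="trash"):
    """Move one prediction file to train_dir if accepted, otherwise to trash_dir.

    Hypothetical helper: mirrors the documented 'G' (accept) and 'B' (reject)
    outcomes, not the actual server implementation.
    """
    source = Path(prediction)
    target_dir = Path(train_dir if accept else trash_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    if target.exists():
        # Assumption: never silently replace an already-reviewed file.
        raise FileExistsError(f"{target} already exists; refusing to overwrite")
    shutil.move(str(source), str(target))
    return target
```

For example, `triage("predictions/0001.md", accept=True)` would move a (hypothetically named) file into `train/`, while `accept=False` would send it to `trash/`.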
## Labeling feedback
Set up a Python environment:
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install anthropic
```
Set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY=your_api_key
```
Run the `label-data` script:
```bash
python script/label-data
```
Some files may not have been labeled because the model did not reply with a comma-separated list of labels. To see which ones were left unlabeled, run:
```bash
python script/see-label-data
```
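To illustrate why a reply can fail to yield labels, here is a minimal sketch of a strict comma-separated parser. It is not the logic of `script/label-data`: the function name `parse_labels`, the character rule for a label, and the optional `allowed` set are all assumptions, and the label names in the usage note are invented.

```python
import re

# Hypothetical rule for one label: starts with a letter or digit, then
# letters, digits, spaces, underscores, or hyphens.
LABEL_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9 _-]*")


def parse_labels(reply, allowed=None):
    """Return the labels in a comma-separated reply, or None if it is not such a list."""
    text = reply.strip()
    if not text or "\n" in text:
        return None  # empty or multi-line replies are treated as unlabeled
    labels = [part.strip() for part in text.split(",")]
    if not all(LABEL_PATTERN.fullmatch(label) for label in labels):
        return None  # prose, punctuation, or empty items between commas
    if allowed is not None and not all(label in allowed for label in labels):
        return None  # reject labels outside the known vocabulary
    return labels
```

Under these assumptions, `parse_labels("no-op, complete-pattern")` yields two labels, while a prose reply such as `"I think this is a bug."` returns `None` and the file would stay unlabeled. Passing an `allowed` set is the tighter check, since short prose without punctuation can otherwise pass as a list.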
Provider: maas
Created: 2025-02-18



