zeta

ModelScope community · Updated 2025-12-04 · Indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/AI-ModelScope/zeta
Description:
# Dataset for Zeta

This is the open dataset used to train Zeta, an edit prediction model that powers Zed's predictive coding feature. Zeta is derived from Qwen2.5-Coder-7B and predicts the developer's next code edit based on their recent programming patterns and cursor position, allowing for intelligent completion with a simple tab press.

This dataset is split into three parts:

- `train.jsonl`: Contains the training data for supervised fine-tuning.
- `dpo.jsonl`: Contains the data for direct preference optimization.
- `eval.jsonl`: Contains the evaluation data for the Zeta dataset.

These files are generated from the markdown files in the respective directories.

## Scripts

There are several scripts to help with data processing and evaluation:

- `script/pull-predictions`: Pulls predictions from Snowflake.
- `script/verify_server.py`: Simple webserver to manually verify the predictions and add them to the dataset.
- `script/gen-dataset`: Reads all the markdown files, validates them, and generates the dataset files.
- `script/sft.ipynb`: Jupyter notebook for supervised fine-tuning.
- `script/dpo.ipynb`: Jupyter notebook for direct preference optimization.
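To get a quick look at the three generated files, a small loader like the one below may help. It is a minimal sketch that assumes each file is standard JSON Lines (one JSON object per line) and that the files sit in the current directory. The record fields are not documented here, so the script only reports whichever keys it finds. The helper names `load_jsonl` and `summarize` are illustrative and not part of the repository's scripts.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of parsed records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a broken record is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({err})") from err
    return records


def summarize(path):
    """Return (record count, sorted field names seen) for one JSONL file."""
    records = load_jsonl(path)
    keys = sorted({key for record in records if isinstance(record, dict) for key in record})
    return len(records), keys


if __name__ == "__main__":
    # File names come from the dataset description; their location is an assumption.
    for name in ("train.jsonl", "dpo.jsonl", "eval.jsonl"):
        if Path(name).exists():
            count, keys = summarize(name)
            print(f"{name}: {count} records, fields: {keys}")
        else:
            print(f"{name}: not found in the current directory")
```

Running the script from the directory that holds the files prints one summary line per file. It makes no assumptions about field names, so it should stay usable if the schema changes.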
### Running Python Scripts

Set up Python environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn
```

Run the verification UI:

```bash
python script/verify_server.py predictions train --trash-dir trash
```

Open http://localhost:8000 and use:

- 'G' to accept (moves to `train/`)
- 'B' to reject (moves to `trash/`)

# Labeling feedback

Set up Python environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install anthropic
```

Set Anthropic API key:

```bash
export ANTHROPIC_API_KEY=your_api_key
```

Run the `label-data` script:

```bash
python script/label-data
```

Some files may be left unlabeled because the model didn't reply with a comma-separated list of labels. To check for them:

```bash
python script/see-label-data
```
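The failure mode mentioned above, a reply that is not a clean comma-separated list, can be illustrated with a strict parser. This is only a sketch of the general idea and does not reproduce what `script/label-data` actually does. The function name `parse_labels` and the placeholder label set `EXAMPLE_LABELS` are hypothetical, since the real labels are not listed in this document.

```python
# Placeholder label set; the labels actually used by the project are not documented here.
EXAMPLE_LABELS = frozenset({"label-a", "label-b", "label-c"})


def parse_labels(reply, allowed_labels=EXAMPLE_LABELS):
    """Return the labels in a comma-separated reply, or None if the reply is malformed.

    A reply is treated as malformed when any item is empty or is not one of the
    allowed labels, for example when the model adds explanatory prose.
    """
    parts = [part.strip() for part in reply.split(",")]
    if any(part == "" for part in parts):
        return None
    if any(part not in allowed_labels for part in parts):
        return None
    return parts


if __name__ == "__main__":
    print(parse_labels("label-a, label-b"))
    print(parse_labels("Sure! The labels are: label-a"))
```

Returning `None` instead of raising lets a caller skip the file and revisit it later, which matches the situation of files being left unlabeled.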

Provider: maas
Created: 2025-02-18