Gkop/QuestCrafter
Source: Hugging Face. Updated 2026-02-03; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Gkop/QuestCrafter
# QuestCrafter-AI_Project
QuestCrafter is a lightweight generative AI project that builds a "Dungeon Master"
for RPG quests. It fine-tunes a small language model on curated prompt-to-quest
data, compares baseline vs fine-tuned outputs, evaluates quality with automatic
metrics and a human rubric, and delivers a simple interactive demo (Streamlit or
Gradio).
## Setup
- Python 3.10+
- Create a virtual environment and install dependencies:
  - `pip install -r requirements.txt`
## Repository structure (W1)
- `data/` raw and processed datasets
- `scripts/` data and training scripts (W1 starts with `download_data.py`)
- `models/` model checkpoints (later)
- `docs/` project docs (board, roles)
## Data pipeline (W1)
We use the Reddit Jokes dataset (CSV). The script cleans, filters, splits, and
exports JSONL files with a consistent schema.
### Download + preprocess
Example:
`python download_data.py --dataset redditjokes --local_csv path/to/reddit_jokes.csv`
Output:
`data/raw/redditjokes/train.jsonl`, `val.jsonl`, `test.jsonl`
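The split-and-export step could be sketched roughly as follows. This is a minimal illustration, not the actual contents of `download_data.py`: the helper name `split_and_export`, the 80/10/10 ratio, and the fixed seed are all assumptions.

```python
import json
import random
from pathlib import Path


def split_and_export(records, out_dir, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records and write train/val/test JSONL files (hypothetical helper)."""
    rows = list(records)
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    splits = {
        "test": rows[:n_test],
        "val": rows[n_test:n_test + n_val],
        "train": rows[n_test + n_val:],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, items in splits.items():
        # One JSON object per line, UTF-8, no ASCII escaping.
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as fh:
            for item in items:
                fh.write(json.dumps(item, ensure_ascii=False) + "\n")
    return {name: len(items) for name, items in splits.items()}
```

Calling `split_and_export(rows, "data/raw/redditjokes")` would write the three files listed above and return the number of records in each split.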
### JSONL schema
Each line has:
- `prompt`
- `response`
- `source`
- optional `metadata` (e.g., `score`, `author`)
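As an illustration of this schema, the sketch below builds one record and checks that a JSONL line carries the required keys. The helper names `make_record` and `is_valid_line` are hypothetical, and the check assumes the three required keys are always present (even when `prompt` is empty).

```python
import json

REQUIRED_FIELDS = ("prompt", "response", "source")


def make_record(prompt, response, source, metadata=None):
    """Build one schema-conforming record; metadata is added only if given."""
    record = {"prompt": prompt, "response": response, "source": source}
    if metadata:
        record["metadata"] = metadata
    return record


def is_valid_line(line):
    """Return True if a JSONL line parses and has every required field."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(key in obj for key in REQUIRED_FIELDS)


# Example: serialize one record as a single JSONL line.
line = json.dumps(
    make_record("Why did the chicken cross the road?",
                "To get to the other side.",
                "redditjokes",
                metadata={"score": 12, "author": "example_user"})
)
```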
### Filters (defaults)
- Prompt: 5 to 300 characters (if prompt exists)
- Response: 20 to 800 characters
- Drops `[deleted]` / `[removed]`
Override with:
`--min_prompt_chars`, `--max_prompt_chars`, `--min_response_chars`, `--max_response_chars`, `--keep_deleted`
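The default filters can be expressed as a single predicate, sketched below. The function name `keep_example` is hypothetical, and two details are assumptions rather than confirmed behavior of the script: the length bounds are treated as inclusive, and a row is dropped if either field is exactly a deleted marker.

```python
DELETED_MARKERS = {"[deleted]", "[removed]"}


def keep_example(prompt, response,
                 min_prompt_chars=5, max_prompt_chars=300,
                 min_response_chars=20, max_response_chars=800,
                 keep_deleted=False):
    """Return True if a (prompt, response) pair passes the length and deletion filters."""
    prompt = (prompt or "").strip()
    response = (response or "").strip()
    # Drop rows whose text was deleted or removed, unless explicitly kept.
    if not keep_deleted and (prompt in DELETED_MARKERS or response in DELETED_MARKERS):
        return False
    # The prompt length check applies only when a prompt exists.
    if prompt and not (min_prompt_chars <= len(prompt) <= max_prompt_chars):
        return False
    return min_response_chars <= len(response) <= max_response_chars
```

The keyword arguments mirror the command-line flags, so an override such as `--min_response_chars 1` would correspond to `keep_example(p, r, min_response_chars=1)` in this sketch.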
## Team docs
- GitHub board and issues: `docs/github_board.md`
- Roles, branches, and tasks: `docs/team_roles.md`
## Upload dataset to Hugging Face
Use `upload_to_hf.py` to upload `archive.zip` to your dataset repo.
1) Install deps:
`pip install -r requirements.txt`
2) Upload (choose one):
- With the token in an environment variable (Windows `cmd` syntax):
`set HF_TOKEN=your_token`
`python upload_to_hf.py --repo_id GemimaOndele/questcrafter-dataset --file "C:\Users\gemim\OneDrive\Bureau\M1-cours-Data engineer\MSC 1 AI\Semestre 2\Foundations of machine learning and datascience\Project\archive.zip"`
- With the token passed as an argument:
`python upload_to_hf.py --token your_token --repo_id GemimaOndele/questcrafter-dataset --file "C:\Users\gemim\OneDrive\Bureau\M1-cours-Data engineer\MSC 1 AI\Semestre 2\Foundations of machine learning and datascience\Project\archive.zip"`
If you have already logged in with `huggingface-cli login`, the script will use that cached token.
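For readers curious how such an upload works, here is a minimal sketch built on `HfApi.upload_file` from `huggingface_hub`. It is not the source of `upload_to_hf.py`: the function names are hypothetical, and the precedence order (argument first, then `HF_TOKEN`, then the cached login) is an assumption.

```python
import os


def resolve_token(cli_token=None):
    """Pick the token: explicit argument first, then HF_TOKEN, else None."""
    return cli_token or os.environ.get("HF_TOKEN") or None


def upload_archive(file_path, repo_id, cli_token=None):
    """Upload one file to a Hugging Face dataset repo (hypothetical helper)."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import HfApi

    # With token=None, HfApi falls back to the token cached by `huggingface-cli login`.
    api = HfApi(token=resolve_token(cli_token))
    return api.upload_file(
        path_or_fileobj=file_path,
        path_in_repo=os.path.basename(file_path),
        repo_id=repo_id,
        repo_type="dataset",
    )
```

A call such as `upload_archive("archive.zip", "GemimaOndele/questcrafter-dataset")` would place the archive at the root of the dataset repo.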
Provider: Gkop



