
Gkop/QuestCrafter

Source: Hugging Face. Updated 2026-02-03. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Gkop/QuestCrafter
Description:
# QuestCrafter-AI_Project

QuestCrafter is a lightweight generative AI project that builds a "Dungeon Master" for RPG quests. It fine-tunes a small language model on curated prompt-to-quest data, compares baseline vs fine-tuned outputs, evaluates quality with automatic metrics and a human rubric, and delivers a simple interactive demo (Streamlit or Gradio).

## Setup

- Python 3.10+
- Create a virtual env and install dependencies:
  - `pip install -r requirements.txt`

## Repository structure (W1)

- `data/` raw and processed datasets
- `scripts/` data and training scripts (W1 starts with `download_data.py`)
- `models/` model checkpoints (later)
- `docs/` project docs (board, roles)

## Data pipeline (W1)

We use the Reddit Jokes dataset (CSV). The script cleans, filters, splits, and exports JSONL files with a consistent schema.

### Download + preprocess

Example: `python download_data.py --dataset redditjokes --local_csv path/to/reddit_jokes.csv`

Output: `data/raw/redditjokes/train.jsonl`, `val.jsonl`, `test.jsonl`

### JSONL schema

Each line has:

- `prompt`
- `response`
- `source`
- optional `metadata` (e.g., `score`, `author`)

### Filters (defaults)

- Prompt: 5 to 300 characters (if prompt exists)
- Response: 20 to 800 characters
- Drops `[deleted]` / `[removed]`

Override with: `--min_prompt_chars`, `--max_prompt_chars`, `--min_response_chars`, `--max_response_chars`, `--keep_deleted`

## Team docs

- GitHub board and issues: `docs/github_board.md`
- Roles, branches, and tasks: `docs/team_roles.md`

## Upload dataset to Hugging Face

Use the script below to upload `archive.zip` to your dataset repo.

1) Install deps: `pip install -r requirements.txt`
2) Upload (choose one):
   - With token env:
     `set HF_TOKEN=your_token`
     `python upload_to_hf.py --repo_id GemimaOndele/questcrafter-dataset --file "C:\Users\gemim\OneDrive\Bureau\M1-cours-Data engineer\MSC 1 AI\Semestre 2\Foundations of machine learning and datascience\Project\archive.zip"`
   - With token argument:
     `python upload_to_hf.py --token your_token --repo_id GemimaOndele/questcrafter-dataset --file "C:\Users\gemim\OneDrive\Bureau\M1-cours-Data engineer\MSC 1 AI\Semestre 2\Foundations of machine learning and datascience\Project\archive.zip"`

If you already logged in with `huggingface-cli login`, the script will use that cached token.
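
The default filters, JSONL schema, and train/val/test split described under "Data pipeline" could be sketched roughly as follows. This is an illustrative sketch, not the contents of `download_data.py`: the function names `keep_record`, `to_jsonl_line`, and `split_records` are hypothetical, the length bounds are assumed to be inclusive, and the split fractions and seed are assumptions, since the README does not state them.

```python
import json
import random
from typing import Any, Dict, Iterable, List, Optional

# Placeholder texts the README says are dropped by default.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def keep_record(
    prompt: Optional[str],
    response: Optional[str],
    *,
    min_prompt_chars: int = 5,
    max_prompt_chars: int = 300,
    min_response_chars: int = 20,
    max_response_chars: int = 800,
    keep_deleted: bool = False,
) -> bool:
    """Return True if a prompt/response pair passes the documented default filters."""
    prompt = (prompt or "").strip()
    response = (response or "").strip()
    if not keep_deleted and (prompt in DELETED_MARKERS or response in DELETED_MARKERS):
        return False
    # The prompt length check applies only when a prompt exists.
    if prompt and not (min_prompt_chars <= len(prompt) <= max_prompt_chars):
        return False
    # Assumption: bounds are inclusive on both ends.
    return min_response_chars <= len(response) <= max_response_chars


def to_jsonl_line(
    prompt: str,
    response: str,
    source: str = "redditjokes",
    metadata: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize one record using the README's schema; `metadata` is optional."""
    record: Dict[str, Any] = {"prompt": prompt, "response": response, "source": source}
    if metadata:
        record["metadata"] = metadata
    return json.dumps(record, ensure_ascii=False)


def split_records(
    records: Iterable[Any],
    val_frac: float = 0.1,   # assumed fraction, not stated in the README
    test_frac: float = 0.1,  # assumed fraction, not stated in the README
    seed: int = 42,          # assumed seed for a reproducible shuffle
) -> Dict[str, List[Any]]:
    """Shuffle deterministically, then cut into val, test, and train portions."""
    items = list(records)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    return {
        "val": items[:n_val],
        "test": items[n_val:n_val + n_test],
        "train": items[n_val + n_test:],
    }
```

A caller would typically filter rows with `keep_record`, pass the survivors to `split_records`, and write each split with one `to_jsonl_line` result per line.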

Provider:
Gkop