xthor/Qwen3-Embedding-GraphQL-v1

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/xthor/Qwen3-Embedding-GraphQL-v1
Overview:
---
license: apache-2.0
task_categories:
- sentence-similarity
- feature-extraction
language:
- en
tags:
- graphql
- retrieval
- embeddings
- owner-disambiguation
size_categories:
- 1K<n<10K
---

# Qwen3-Embedding-GraphQL v1

Training and evaluation data for [`xthor/Qwen3-Embedding-0.6B-GraphQL`](https://huggingface.co/xthor/Qwen3-Embedding-0.6B-GraphQL), a model that maps natural-language questions to GraphQL field coordinates (`Type.field`). The training signal targets **owner-type disambiguation** across cross-type field-name collisions (e.g. `Issue.author` vs `PullRequest.author`).

## Splits

| split  | rows   | purpose                                              |
|--------|--------|------------------------------------------------------|
| train  | 4,788  | anchor + positive + hard negatives per row           |
| val    | 94     | held-out eval during training                        |
| test   | 223    | final retrieval eval                                 |
| corpus | 28,893 | every `Type.field` coordinate the model can retrieve |

30% of eval queries come from real-world SDLs (GitHub GHES, Saleor, Shopify, AniList). The rest come from ~60 synthetic "worlds"; see [How it was built](#how-it-was-built).

## Benchmarks

Ready-to-run retrieval eval sets under `benchmarks/`:

| benchmark                | what it stresses                                      |
|--------------------------|-------------------------------------------------------|
| `curated_challenge_eval` | hand-written realistic queries (release gate)         |
| `real_schema_eval`       | real SDLs: GitHub, Saleor, Shopify, AniList           |
| `realism_eval`           | natural phrasing vs. field-name-ish phrasing          |
| `adversarial_eval`       | cross-owner distractors sharing the same field name   |
| `ambiguity_eval`         | multiple coordinates are legitimately correct         |
| `synthetic_holdout`      | held-out synthetic queries from the same distribution |

## Record shapes

### Query row (`train.jsonl`, `val.jsonl`, `test.jsonl`, `benchmarks/*.jsonl`)

The fields you need:

| field                  | type      | what it is                                                               |
|------------------------|-----------|--------------------------------------------------------------------------|
| `query`                | string    | the natural-language question (the anchor)                               |
| `positive_coordinate`  | string    | correct answer in `Type.field` form                                      |
| `negative_coordinates` | list[str] | hard negatives; the top-1 base-model distractor is stamped first         |
| `owner_type`           | string    | `Type` portion of the coordinate                                         |
| `field_name`           | string    | `field` portion of the coordinate                                        |
| `world_id`             | string    | schema the row belongs to (`world_0000…`, or `github-ghes`, `saleor`, …) |
| `world_split`          | string    | whole-world split: `train` or `test` (a world is never in both)          |

Useful for slicing / diagnostics:

| field                            | what it is                                                                                        |
|----------------------------------|---------------------------------------------------------------------------------------------------|
| `relevant_coordinates`           | all coordinates judged correct (length >1 only in `ambiguity_eval`)                               |
| `family_id`                      | `{world_id}:{positive_coordinate}`; groups seed variants for dedup                                |
| `source`                         | provenance: `openai-world-seed`, `curated-challenge`, `manual-realism-seed`, `adversarial-ambiguity`, `deterministic-augment`, `bootstrap` |
| `quality_score`                  | provenance confidence; see [quality_score](#quality_score)                                        |
| `domain`, `intent`, `difficulty` | coarse tags for per-bucket eval                                                                   |
| `rationale_tags`                 | includes `base_{hard,medium,easy,unmined}` + `margin={value}`: base-model difficulty on this row  |

### Corpus row (`corpus.jsonl`)

One row per `Type.field`.
Four retrieval views are provided; pick whichever your setup needs:

| view                   | example                                                     |
|------------------------|-------------------------------------------------------------|
| `coordinate_text`      | `Room.priceCents`                                           |
| `field_signature_text` | `Room.priceCents: Int`                                      |
| `field_semantic_text`  | prose with description, domain, related types (**default**) |
| `sdl_snippet_text`     | valid GraphQL block: `type Room { priceCents: Int }`        |

`retrieval_text` is an alias for the view used during training (currently `field_semantic_text`).

Other fields: `coordinate`, `owner_type`, `field_name`, `return_type`, `description`, `aliases`, `path_to_root`, `metadata`.

## `quality_score`

Provenance confidence in `[0, 1]`, **assigned by rule based on where the row came from**. It is not a learned quality rating.

| value | source                                                      |
|-------|-------------------------------------------------------------|
| 1.00  | `curated-challenge` (hand-written release gate)             |
| 0.92  | `openai-world-seed` (successful LLM generation)             |
| 0.90+ | `manual-realism-seed` (hand-seeded cleanup)                 |
| 0.85  | `bootstrap` (first variant of an LLM seed)                  |
| 0.80  | `adversarial-ambiguity` (generated cross-owner distractors) |
| 0.65  | `deterministic-augment` (case/punct perturbations)          |

The builder drops rows below `0.25` before splitting (nothing released is near that threshold) and uses the score as a tie-breaker when two rows normalize to the same text. The score does **not** weight training batches; you can apply it as a weight yourself, and probably should.

For a per-row *difficulty* signal, use `rationale_tags`: `base_hard` means the base model got the row wrong, `base_medium` means it got the row right with a thin margin, and `base_easy` means it got the row right comfortably.
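As a rough illustration of the weighting suggested above, the sketch below derives a per-row loss weight from `quality_score` and the difficulty tag in `rationale_tags`. It assumes `rationale_tags` is a list of strings such as `base_hard` and `margin=0.05`; the helper names and the `DIFFICULTY_BOOST` multipliers are hypothetical choices for this example, not something the dataset or its builder defines.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical multipliers chosen for illustration; tune them for your setup.
DIFFICULTY_BOOST = {"hard": 1.5, "medium": 1.2, "easy": 1.0, "unmined": 1.0}


def parse_rationale_tags(tags: Iterable[str]) -> Tuple[Optional[str], Optional[float]]:
    """Extract (difficulty, margin) from tags like 'base_hard' and 'margin=0.05'.

    Assumes each tag is a plain string; unknown tags are ignored.
    """
    difficulty: Optional[str] = None
    margin: Optional[float] = None
    for tag in tags:
        if tag.startswith("base_"):
            difficulty = tag[len("base_"):]
        elif tag.startswith("margin="):
            try:
                margin = float(tag.split("=", 1)[1])
            except ValueError:
                margin = None  # leave margin unset if the value is not numeric
    return difficulty, margin


def row_weight(row: dict) -> float:
    """Combine provenance confidence with a difficulty boost into one weight."""
    quality = float(row.get("quality_score", 1.0))
    difficulty, _ = parse_rationale_tags(row.get("rationale_tags") or [])
    return quality * DIFFICULTY_BOOST.get(difficulty, 1.0)
```

The resulting weight could be multiplied into a per-example loss, or used as a sampling probability after normalization; which of the two works better is an empirical question.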
## Hard negatives

Each training row carries ~6 negatives in `negative_coordinates`, drawn from a mix of strategies (tagged on `confuser_tags`):

- `name_similarity`: same `field_name`, different `owner_type` (the core disambiguation task)
- `structural`: neighboring fields on the same owner
- `lexical`: tokens overlap with the query but the meaning is wrong
- `argument-shape`: same return type or arguments
- `semantic`: near neighbors from a base-model encoding pass

Position 0 in `negative_coordinates` is the coordinate the base model ranked top-1 (when the base model got the row wrong). Use it as your highest-priority negative.

## How it was built

The build is deterministic given the seeds:

1. **World generation.** ~60 synthetic worlds (domain + entity catalog + relationships → GraphQL schema). Real SDLs (GitHub, Saleor, Shopify, AniList) are ingested alongside.
2. **Corpus build.** Every `Type.field` becomes one corpus row with four views. Produces `corpus.jsonl` (28,893 rows).
3. **Seed queries.** An LLM proposes a few natural phrasings per field. Curated and adversarial variants are added. Produces 7,626 raw seeds.
4. **Negative mining.** Per seed, six or more negatives are drawn from the same world. The base model ranks each candidate set; the top distractor is promoted to `negative_coordinates[0]` and the row gets a `base_hard/medium/easy` tag.
5. **Filtering + splitting.**
   - **World leakage**: whole-world splits; no query's owner-type appears on both sides.
   - **Strict leakage**: rows with heavy token overlap between query and corpus are dropped.
   - **Semantic dedup**: cosine-similar queries inside a `family_id` collapse.

7,626 raw → train 4,788 / val 94 / test 223. The val/test shrink is aggressive on real-SDL queries by design: the model is graded on generalization to schemas and phrasings it has never seen.

`manifest.json` and `sanity_report.json` in the repo record the build config and per-stage counts.
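To make the hard-negative layout concrete, here is a minimal sketch of expanding one query row into (anchor, positive, negative) text triplets for a contrastive trainer. It assumes the query and corpus rows are already loaded as plain dicts and that `coordinate` is unique across the corpus; if coordinates repeat across worlds, the lookup key would need to include a world identifier. The function names and the `max_negatives` parameter are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def build_text_lookup(corpus_rows: List[dict], view: str = "retrieval_text") -> Dict[str, str]:
    """Map each coordinate to the text of the chosen retrieval view.

    Assumes `coordinate` is unique across corpus rows.
    """
    return {row["coordinate"]: row[view] for row in corpus_rows}


def iter_triplets(
    row: dict,
    text_by_coordinate: Dict[str, str],
    max_negatives: Optional[int] = None,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (query, positive_text, negative_text) triplets for one query row.

    Negatives keep their original order, so the promoted base-model
    distractor at position 0 survives any truncation by `max_negatives`.
    """
    positive = text_by_coordinate.get(row["positive_coordinate"])
    if positive is None:
        return  # positive not present in the lookup; skip the row
    negatives = row.get("negative_coordinates") or []
    if max_negatives is not None:
        negatives = negatives[:max_negatives]
    for coordinate in negatives:
        negative = text_by_coordinate.get(coordinate)
        if negative is not None:
            yield row["query"], positive, negative
```

Truncating from the front rather than sampling at random is a deliberate choice here: it keeps the highest-priority negative when the batch budget is tight.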
## Load it

```python
from datasets import load_dataset

# When data_files is a single path, `datasets` exposes it as the "train" split,
# so split="train" is correct for test.jsonl as well.
train = load_dataset("xthor/Qwen3-Embedding-GraphQL-v1", data_files="train.jsonl", split="train")
test = load_dataset("xthor/Qwen3-Embedding-GraphQL-v1", data_files="test.jsonl", split="train")
```

The corpus has a union-typed `metadata` struct that `datasets` can't auto-cast. Either drop it with an explicit schema:

```python
from datasets import load_dataset, Features, Value

corpus = load_dataset(
    "xthor/Qwen3-Embedding-GraphQL-v1",
    data_files="corpus.jsonl",
    split="train",
    # Only string-typed columns are declared; `metadata` is left out.
    features=Features({
        "coordinate": Value("string"),
        "owner_type": Value("string"),
        "field_name": Value("string"),
        "return_type": Value("string"),
        "description": Value("string"),
        "coordinate_text": Value("string"),
        "field_signature_text": Value("string"),
        "field_semantic_text": Value("string"),
        "sdl_snippet_text": Value("string"),
        "retrieval_text": Value("string"),
    }),
)
```

…or read it as plain JSONL:

```python
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download("xthor/Qwen3-Embedding-GraphQL-v1", "corpus.jsonl", repo_type="dataset")

# One JSON object per line; the context manager closes the file afterwards.
with open(path, encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]
```

## Citation

- Model: [xthor/Qwen3-Embedding-0.6B-GraphQL](https://huggingface.co/xthor/Qwen3-Embedding-0.6B-GraphQL)
- Base model: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
Provider:
xthor