xthor/Qwen3-Embedding-GraphQL-v1
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/xthor/Qwen3-Embedding-GraphQL-v1
Resource description:
---
license: apache-2.0
task_categories:
- sentence-similarity
- feature-extraction
language:
- en
tags:
- graphql
- retrieval
- embeddings
- owner-disambiguation
size_categories:
- 1K<n<10K
---
# Qwen3-Embedding-GraphQL v1
Training and evaluation data for [`xthor/Qwen3-Embedding-0.6B-GraphQL`](https://huggingface.co/xthor/Qwen3-Embedding-0.6B-GraphQL) — mapping natural-language questions to GraphQL field coordinates (`Type.field`). The training signal targets **owner-type disambiguation** across cross-type field-name collisions (e.g. `Issue.author` vs `PullRequest.author`).
## Splits
| split | rows | purpose |
|---------|--------|---------------------------------------------------------|
| train | 4,788 | anchor + positive + hard negatives per row |
| val | 94 | held-out eval during training |
| test | 223 | final retrieval eval |
| corpus | 28,893 | every `Type.field` coordinate the model can retrieve |
30% of eval queries come from real-world SDLs (GitHub GHES, Saleor, Shopify, AniList). The rest come from ~60 synthetic "worlds" — see [How it was built](#how-it-was-built).
## Benchmarks
Ready-to-run retrieval eval sets under `benchmarks/`:
| benchmark | what it stresses |
|----------------------------|------------------------------------------------------------|
| `curated_challenge_eval` | hand-written realistic queries (release gate) |
| `real_schema_eval` | real SDLs — GitHub, Saleor, Shopify, AniList |
| `realism_eval` | natural phrasing vs. field-name-ish phrasing |
| `adversarial_eval` | cross-owner distractors sharing the same field name |
| `ambiguity_eval` | multiple coordinates are legitimately correct |
| `synthetic_holdout` | held-out synthetic queries from the same distribution |
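
The card does not say which metric these benchmarks report, so the following is only a minimal sketch of one common choice, hit@k. It assumes you already have a ranked list of coordinates per query from your own retriever. The helper names `hit_at_k` and `mean_hit_at_k` and the sample rankings are hypothetical. Counting a hit when any relevant coordinate appears in the top k also covers `ambiguity_eval`, where several coordinates can be correct.

```python
from typing import Iterable, Sequence


def hit_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> bool:
    """True if any relevant coordinate appears among the top-k ranked coordinates."""
    relevant_set = set(relevant)
    return any(coord in relevant_set for coord in ranked[:k])


def mean_hit_at_k(results, k: int) -> float:
    """results: iterable of (ranked_coordinates, relevant_coordinates) pairs."""
    hits = [hit_at_k(ranked, relevant, k) for ranked, relevant in results]
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical rankings, for illustration only (not taken from the dataset).
results = [
    (["Issue.author", "PullRequest.author"], ["PullRequest.author"]),
    (["Room.priceCents", "Room.name"], ["Room.priceCents"]),
]
print(mean_hit_at_k(results, k=1))  # 0.5
print(mean_hit_at_k(results, k=2))  # 1.0
```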
## Record shapes
### Query row (`train.jsonl`, `val.jsonl`, `test.jsonl`, `benchmarks/*.jsonl`)
The fields you need:
| field | type | what it is |
|------------------------|------------|---------------------------------------------------------------------------|
| `query` | string | the natural-language question — the anchor |
| `positive_coordinate` | string | correct answer in `Type.field` form |
| `negative_coordinates` | list[str] | hard negatives; the top-1 base-model distractor is stamped first |
| `owner_type` | string | `Type` portion of the coordinate |
| `field_name` | string | `field` portion |
| `world_id` | string | schema the row belongs to (`world_0000…`, or `github-ghes`, `saleor`, …) |
| `world_split` | string | whole-world split — `train` or `test` (a world is never in both) |
Useful for slicing / diagnostics:
| field | what it is |
|------------------------|---------------------------------------------------------------------------|
| `relevant_coordinates` | all coordinates judged correct (length >1 only in `ambiguity_eval`) |
| `family_id` | `{world_id}:{positive_coordinate}` — groups seed variants for dedup |
| `source` | provenance: `openai-world-seed`, `curated-challenge`, `manual-realism-seed`, `adversarial-ambiguity`, `deterministic-augment`, `bootstrap` |
| `quality_score` | provenance confidence — see [quality_score](#quality_score) |
| `domain`, `intent`, `difficulty` | coarse tags for per-bucket eval |
| `rationale_tags` | includes `base_{hard,medium,easy,unmined}` + `margin={value}` — base-model difficulty on this row |
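
The relationships between these fields can be checked mechanically. The sketch below assumes only the field definitions in the tables above. The function name `check_query_row` and the sample row are hypothetical, not part of the dataset tooling.

```python
def check_query_row(row: dict) -> list[str]:
    """Return a list of problems found in a query row; an empty list means consistent."""
    problems = []

    # positive_coordinate should be "{owner_type}.{field_name}".
    expected_coord = f"{row['owner_type']}.{row['field_name']}"
    if row["positive_coordinate"] != expected_coord:
        problems.append(f"positive_coordinate != {expected_coord}")

    # A coordinate should not be both the positive and a hard negative.
    if row["positive_coordinate"] in row["negative_coordinates"]:
        problems.append("positive coordinate also listed as a negative")

    # family_id is documented as "{world_id}:{positive_coordinate}".
    expected_family = f"{row['world_id']}:{row['positive_coordinate']}"
    if row.get("family_id", expected_family) != expected_family:
        problems.append(f"family_id != {expected_family}")

    return problems


# Hypothetical row, for illustration only.
row = {
    "query": "who opened this pull request?",
    "positive_coordinate": "PullRequest.author",
    "negative_coordinates": ["Issue.author", "PullRequest.title"],
    "owner_type": "PullRequest",
    "field_name": "author",
    "world_id": "github-ghes",
    "family_id": "github-ghes:PullRequest.author",
}
print(check_query_row(row))  # []
```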
### Corpus row (`corpus.jsonl`)
One row per `Type.field`. Four retrieval views are provided — pick whichever your setup needs:
| view | example |
|------------------------|-------------------------------------------------------------------------|
| `coordinate_text` | `Room.priceCents` |
| `field_signature_text` | `Room.priceCents: Int` |
| `field_semantic_text` | prose with description, domain, related types (**default**) |
| `sdl_snippet_text` | valid GraphQL block: `type Room { priceCents: Int }` |
`retrieval_text` is an alias for the view used during training (currently `field_semantic_text`). Other fields: `coordinate`, `owner_type`, `field_name`, `return_type`, `description`, `aliases`, `path_to_root`, `metadata`.
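To make the three structural views concrete, here is a minimal sketch that rebuilds them from `owner_type`, `field_name`, and `return_type` so that they match the examples in the table. The function names are hypothetical, and this is not the dataset's actual builder. It ignores field arguments, and it leaves out `field_semantic_text` because the card does not document its prose template.

```python
def coordinate_text(owner_type: str, field_name: str) -> str:
    # Bare schema coordinate, e.g. "Room.priceCents".
    return f"{owner_type}.{field_name}"


def field_signature_text(owner_type: str, field_name: str, return_type: str) -> str:
    # Coordinate plus return type, e.g. "Room.priceCents: Int".
    return f"{owner_type}.{field_name}: {return_type}"


def sdl_snippet_text(owner_type: str, field_name: str, return_type: str) -> str:
    # Single-field SDL block, e.g. "type Room { priceCents: Int }".
    return f"type {owner_type} {{ {field_name}: {return_type} }}"


print(coordinate_text("Room", "priceCents"))
print(field_signature_text("Room", "priceCents", "Int"))
print(sdl_snippet_text("Room", "priceCents", "Int"))
```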
## `quality_score`
Provenance confidence in `[0, 1]` — **assigned by rule based on where the row came from**, not a learned quality rating.
| value | source |
|-------|-----------------------------------------------------------|
| 1.00 | `curated-challenge` (hand-written release gate) |
| 0.92 | `openai-world-seed` (successful LLM generation) |
| 0.90+ | `manual-realism-seed` (hand-seeded cleanup) |
| 0.85 | `bootstrap` (first variant of an LLM seed) |
| 0.80 | `adversarial-ambiguity` (generated cross-owner distractors) |
| 0.65 | `deterministic-augment` (case/punct perturbations) |
The builder drops rows below `0.25` before splitting (nothing released is near that threshold) and uses the score as a tie-breaker when two rows normalize to the same text. It does **not** weight training batches — you can, and probably should.
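As a rough sketch of both uses of the score, the code below drops rows under the threshold, keeps the higher-scoring row when two queries normalize to the same text, and then samples a batch with probability proportional to `quality_score`. The builder's real normalization is not documented here, so `normalize` (lowercase plus whitespace collapse) is an assumption. `dedup_by_quality`, `sample_batch`, and the sample rows are likewise hypothetical.

```python
import random


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def dedup_by_quality(rows, min_score: float = 0.25):
    """Drop rows below min_score; among rows with equal normalized text keep the best score."""
    best = {}
    for row in rows:
        if row["quality_score"] < min_score:
            continue
        key = normalize(row["query"])
        if key not in best or row["quality_score"] > best[key]["quality_score"]:
            best[key] = row
    return list(best.values())


def sample_batch(rows, batch_size: int, rng: random.Random):
    """Sample with replacement, with probability proportional to quality_score."""
    weights = [row["quality_score"] for row in rows]
    return rng.choices(rows, weights=weights, k=batch_size)


# Hypothetical rows, for illustration only.
rows = [
    {"query": "Who opened this pull request?", "quality_score": 0.92},
    {"query": "who opened this  pull request?", "quality_score": 0.65},
    {"query": "nightly price of a room", "quality_score": 0.85},
]
kept = dedup_by_quality(rows)
batch = sample_batch(kept, batch_size=4, rng=random.Random(0))
print(len(kept), len(batch))  # 2 4
```

Sampling weights are only one option. Scaling each row's loss by its score is an equally plausible way to apply the same signal.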
For a per-row *difficulty* signal, use `rationale_tags`: `base_hard` means the base model got it wrong, `base_medium` means it got it right with thin margin, `base_easy` means it got it right comfortably.
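A small parser makes these tags easier to slice on. This sketch assumes the tags are plain strings of the form `base_hard` and `margin=<number>`. The exact formatting and sign convention of the margin are not specified above, so treat both as assumptions. `parse_rationale_tags` is a hypothetical helper name.

```python
from typing import Optional


def parse_rationale_tags(tags) -> tuple[Optional[str], Optional[float]]:
    """Extract (difficulty, margin) from rationale_tags; either may be None if absent."""
    difficulty = None
    margin = None
    for tag in tags:
        if tag.startswith("base_"):
            # "base_hard" -> "hard", "base_unmined" -> "unmined"
            difficulty = tag[len("base_"):]
        elif tag.startswith("margin="):
            try:
                margin = float(tag.split("=", 1)[1])
            except ValueError:
                margin = None  # Unparseable margin: leave it unset rather than guess.
    return difficulty, margin


# Hypothetical tag values, for illustration only.
print(parse_rationale_tags(["base_hard", "margin=-0.031"]))  # ('hard', -0.031)
print(parse_rationale_tags(["base_unmined"]))                # ('unmined', None)
```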
## Hard negatives
Each training row carries about six negatives in `negative_coordinates`, drawn from a mix of strategies (recorded in `confuser_tags`):
- `name_similarity` — same `field_name`, different `owner_type` (the core disambiguation task)
- `structural` — neighboring fields on the same owner
- `lexical` — tokens overlap with the query but wrong meaning
- `argument-shape` — same return type or arguments
- `semantic` — near neighbors from a base-model encoding pass
Position 0 in `negative_coordinates` is the coordinate the base model ranked top-1 (when the base got it wrong) — use it as your highest-priority negative.
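One way to use this ordering is to truncate the negative list from the end, so the base-model top-1 distractor at position 0 is always kept. The sketch below joins a query row with corpus rows to produce text triplets. `build_training_example`, its parameters, and the sample data are hypothetical, and negatives missing from the lookup are skipped rather than treated as errors.

```python
def build_training_example(row, corpus_by_coord, view="field_semantic_text", max_negatives=4):
    """Turn a query row into anchor / positive / negatives text using one corpus view."""
    # Slicing from the front keeps position 0 (the highest-priority negative).
    negatives = row["negative_coordinates"][:max_negatives]
    return {
        "anchor": row["query"],
        "positive": corpus_by_coord[row["positive_coordinate"]][view],
        "negatives": [corpus_by_coord[c][view] for c in negatives if c in corpus_by_coord],
    }


# Hypothetical corpus and query rows, for illustration only.
corpus_rows = [
    {"coordinate": "PullRequest.author", "coordinate_text": "PullRequest.author"},
    {"coordinate": "Issue.author", "coordinate_text": "Issue.author"},
    {"coordinate": "PullRequest.title", "coordinate_text": "PullRequest.title"},
]
corpus_by_coord = {r["coordinate"]: r for r in corpus_rows}

row = {
    "query": "who opened this pull request?",
    "positive_coordinate": "PullRequest.author",
    "negative_coordinates": ["Issue.author", "PullRequest.title", "Commit.author"],
}
example = build_training_example(row, corpus_by_coord, view="coordinate_text", max_negatives=2)
print(example["negatives"])  # ['Issue.author', 'PullRequest.title']
```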
## How it was built
The build is deterministic given fixed seeds:
1. **World generation.** ~60 synthetic worlds (domain + entity catalog + relationships → GraphQL schema). Real SDLs (GitHub, Saleor, Shopify, AniList) are ingested alongside.
2. **Corpus build.** Every `Type.field` → one corpus row with four views. Produces `corpus.jsonl` (28,893 rows).
3. **Seed queries.** An LLM proposes a few natural phrasings per field. Curated and adversarial variants are added. Produces 7,626 raw seeds.
4. **Negative mining.** Per seed, six+ negatives from the same world. The base model ranks each candidate set; the top distractor is promoted to `negative_coordinates[0]` and the row gets a `base_hard/medium/easy` tag.
5. **Filtering + splitting.**
- **World leakage** — whole-world splits; no query's owner-type appears on both sides.
- **Strict leakage** — rows with heavy token overlap between query and corpus are dropped.
- **Semantic dedup** — cosine-similar queries inside a `family_id` collapse.
7,626 raw → train 4,788 / val 94 / test 223 (5,105 rows kept). Filtering is deliberately aggressive for val/test, especially on real-SDL queries: the model is graded on generalization to schemas and phrasings it has never seen.
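The whole-world split can be verified after download. The sketch below assumes only the `world_id` and `world_split` fields described under Record shapes, and reports any world tagged with more than one split. `worlds_with_mixed_split` and the sample rows are hypothetical, and this is not the builder's own leakage check.

```python
from collections import defaultdict


def worlds_with_mixed_split(rows):
    """Map each world_id to its set of world_split values; return only the offenders."""
    splits = defaultdict(set)
    for row in rows:
        splits[row["world_id"]].add(row["world_split"])
    return {world: seen for world, seen in splits.items() if len(seen) > 1}


# Hypothetical rows, for illustration only.
clean_rows = [
    {"world_id": "world_0000", "world_split": "train"},
    {"world_id": "world_0000", "world_split": "train"},
    {"world_id": "saleor", "world_split": "test"},
]
leaky_rows = clean_rows + [{"world_id": "saleor", "world_split": "train"}]

print(worlds_with_mixed_split(clean_rows))  # {}
print(sorted(worlds_with_mixed_split(leaky_rows)))  # ['saleor']
```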
`manifest.json` and `sanity_report.json` in the repo record the build config and per-stage counts.
## Load it
```python
from datasets import load_dataset
train = load_dataset("xthor/Qwen3-Embedding-GraphQL-v1", data_files="train.jsonl", split="train")
# A single data file is exposed under the default "train" split, hence split="train" here too.
test = load_dataset("xthor/Qwen3-Embedding-GraphQL-v1", data_files="test.jsonl", split="train")
```
The corpus has a union-typed `metadata` struct that `datasets` can't auto-cast. Either drop it with an explicit schema:
```python
from datasets import load_dataset, Features, Value
corpus = load_dataset(
    "xthor/Qwen3-Embedding-GraphQL-v1",
    data_files="corpus.jsonl",
    split="train",
    features=Features({
        "coordinate": Value("string"),
        "owner_type": Value("string"),
        "field_name": Value("string"),
        "return_type": Value("string"),
        "description": Value("string"),
        "coordinate_text": Value("string"),
        "field_signature_text": Value("string"),
        "field_semantic_text": Value("string"),
        "sdl_snippet_text": Value("string"),
        "retrieval_text": Value("string"),
    }),
)
```
…or read it as plain JSONL:
```python
import json
from huggingface_hub import hf_hub_download
path = hf_hub_download("xthor/Qwen3-Embedding-GraphQL-v1", "corpus.jsonl", repo_type="dataset")
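# Use a context manager so the file handle is closed, and read as UTF-8 explicitly.
with open(path, encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f if line.strip()]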
```
## Citation
- Model: [xthor/Qwen3-Embedding-0.6B-GraphQL](https://huggingface.co/xthor/Qwen3-Embedding-0.6B-GraphQL)
- Base model: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
Provider: xthor



