mjbommar/opengloss-v1.3-query-examples-flat
Source: Hugging Face. Updated 2026-04-12. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.3-query-examples-flat
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- feature-extraction
language:
- en
tags:
- query-generation
- search-intent
- information-retrieval
- question-generation
- lexicon
- synthetic
- education
- opengloss
- rag
size_categories:
- 10K<n<100K
---
# OpenGloss Query Examples v1.3 (Flattened)
## Dataset Summary
**OpenGloss Query Examples** is a synthetic dataset of search queries generated for vocabulary
terms. Each term has multiple query profiles covering different search intents and user personas,
which makes it suitable for training query-generation models, intent classifiers, and RAG retrieval systems.
This variant contains flattened profile records, one per query.
It is derived from the [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions)
encyclopedic dictionary.
### Key Statistics
- **70,498 vocabulary terms**
- **563,984 total query profiles**
- **70,485 unique words/phrases**
- **70,498 unique lexemes**
- **8.0 profiles per term** (average)
- **10 search intents** × **10 user personas**
- **2,743 unique content filters**
- **6,492 unique semantic tags**
- **Average query length: 60 chars**
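As a quick sanity check, the reported average follows directly from the two totals in the list above. The variable names in this sketch are illustrative only.

```python
# Totals copied from the Key Statistics list above.
total_profiles = 563_984
total_terms = 70_498

# Average number of query profiles per vocabulary term.
avg_profiles_per_term = total_profiles / total_terms
print(f"{avg_profiles_per_term:.1f} profiles per term")

# Upper bound on distinct (intent, persona) combinations.
max_combinations = 10 * 10
print(f"{max_combinations} possible intent/persona pairs")
```

With about 8 profiles per term, a single term can cover only a fraction of the possible intent/persona pairs.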
### Intent Categories
Each profile targets one of 10 search intents:
| Intent | Description |
|--------|-------------|
| core_definition | Basic definition and meaning |
| origin_history | Etymology and historical development |
| plain_explanation | Simple, accessible explanation |
| technical_detail | In-depth technical information |
| context_usage | Real-world usage contexts |
| examples_evidence | Examples and case studies |
| compare_nearby | Comparisons with related concepts |
| domain_specific | Domain-specific applications |
| how_to_or_practice | Practical guidance |
| risks_or_debates | Controversies and debates |
### User Personas
Each profile is written for one of 10 user personas:
| Persona | Description |
|---------|-------------|
| college_student | Undergraduate/graduate student |
| neutral_academic | Academic researcher |
| high_school_teacher | K-12 educator |
| curious_parent | Parent seeking understanding |
| historian | History specialist |
| practitioner_or_engineer | Applied professional |
| investigative_journalist | Reporter/journalist |
| policy_analyst | Policy researcher |
| skeptical_auditor | Critical verifier |
| product_manager | Business/product focus |
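The two label sets above can be written down as constants and used to validate profiles before training. This is a minimal sketch; `is_valid_profile` and `ALL_PAIRS` are hypothetical helper names, not part of the dataset or any library, and the card does not state that every intent/persona pair actually occurs.

```python
from itertools import product

# Label sets copied from the Intent Categories and User Personas tables.
INTENTS = [
    "core_definition", "origin_history", "plain_explanation", "technical_detail",
    "context_usage", "examples_evidence", "compare_nearby", "domain_specific",
    "how_to_or_practice", "risks_or_debates",
]
PERSONAS = [
    "college_student", "neutral_academic", "high_school_teacher", "curious_parent",
    "historian", "practitioner_or_engineer", "investigative_journalist",
    "policy_analyst", "skeptical_auditor", "product_manager",
]

def is_valid_profile(profile: dict) -> bool:
    """Return True if the profile uses a known intent and persona label."""
    return profile.get("intent") in INTENTS and profile.get("persona") in PERSONAS

# Every (intent, persona) pair a profile could carry in principle.
ALL_PAIRS = list(product(INTENTS, PERSONAS))

print(len(ALL_PAIRS))
print(is_valid_profile({"intent": "core_definition", "persona": "historian"}))
```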
### Intent Distribution
| Intent | Count |
|--------|-------|
| core_definition | 94,297 |
| plain_explanation | 87,411 |
| compare_nearby | 76,873 |
| context_usage | 67,863 |
| technical_detail | 67,127 |
| examples_evidence | 58,714 |
| origin_history | 50,688 |
| domain_specific | 39,731 |
| risks_or_debates | 11,126 |
| how_to_or_practice | 6,571 |
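The counts above are strongly imbalanced: `core_definition` appears more than ten times as often as `how_to_or_practice`. The sketch below turns the table into relative shares, which may help when choosing sampling or weighting strategies. Shares are computed relative to the sum of the listed counts.

```python
# Counts copied from the Intent Distribution table above.
intent_counts = {
    "core_definition": 94_297,
    "plain_explanation": 87_411,
    "compare_nearby": 76_873,
    "context_usage": 67_863,
    "technical_detail": 67_127,
    "examples_evidence": 58_714,
    "origin_history": 50_688,
    "domain_specific": 39_731,
    "risks_or_debates": 11_126,
    "how_to_or_practice": 6_571,
}

# Share of each intent relative to the sum of the listed counts.
listed_total = sum(intent_counts.values())
shares = {intent: count / listed_total for intent, count in intent_counts.items()}

for intent, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{intent:<20} {share:6.1%}")
```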
### Persona Distribution
| Persona | Count |
|---------|-------|
| neutral_academic | 90,588 |
| college_student | 79,955 |
| high_school_teacher | 60,819 |
| curious_parent | 58,150 |
| historian | 52,473 |
| practitioner_or_engineer | 51,179 |
| investigative_journalist | 47,809 |
| skeptical_auditor | 44,667 |
| policy_analyst | 44,113 |
| product_manager | 34,041 |
### POS Distribution
| Part of Speech | Count |
|----------------|-------|
| noun | 50,406 |
| adjective | 12,422 |
| verb | 5,683 |
| adverb | 1,595 |
| determiner | 188 |
| preposition | 101 |
| interjection | 49 |
| pronoun | 36 |
| conjunction | 15 |
| numeral | 2 |
| particle | 1 |
### Top Content Filters
| Filter | Count |
|--------|-------|
| general | 193,079 |
| history | 132,664 |
| academic | 96,539 |
| science | 86,384 |
| education | 44,692 |
| geography | 30,729 |
| linguistics | 30,317 |
| language | 29,213 |
| everyday | 20,799 |
| policy | 15,968 |
| health | 15,726 |
| encyclopedia | 12,608 |
| technology | 8,922 |
| religion | 8,500 |
| kids | 8,086 |
| examples | 7,729 |
| law | 7,197 |
| biology | 6,576 |
| art | 6,474 |
| literature | 6,174 |
## Loading the Dataset
```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.3-query-examples-flat")

# Access records (field names follow the Example Record section)
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Definition: {record['definition'][:100]}...")
    print(f"Profiles: {record['num_profiles']}")
    for profile in record["profiles"][:2]:
        print(f" - [{profile['intent']}] {profile['query']}")
```
## Example Record
```python
{
    "id": "growth_form",
    "word": "growth form",
    "lexeme_id": "growth_form",
    "pos": "noun",
    "definition": "The characteristic three-dimensional structure and architecture...",
    "encyclopedia": "# Growth Form\n\n**Growth form** refers to the characteristic...",
    "profiles": [
        {
            "intent": "core_definition",
            "persona": "college_student",
            "query": "What is a growth form in biology and ecology?",
            "alternates": ["growth form definition in plant biology"],
            "filters": ["academic", "biology"],
            "tags": ["definition", "morphology", "plants", "persona:college_student"]
        },
        {
            "intent": "origin_history",
            "persona": "historian",
            "query": "Historical use of the term growth form in botany and ecology",
            "alternates": [],
            "filters": ["history", "science"],
            "tags": ["history", "concept_origin", "botany", "persona:historian"]
        }
        # ... more profiles
    ],
    "num_profiles": 8,
    "intents_covered": ["core_definition", "origin_history", ...],
    "personas_covered": ["college_student", "historian", ...]
}
```
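To illustrate what "one record per query" can mean for a nested record like the one above, here is a minimal flattening sketch. The row layout and the helper name `flatten_record` are assumptions made for illustration; this card does not document the actual column names of the flat variant.

```python
from typing import Iterator

def flatten_record(record: dict) -> Iterator[dict]:
    """Yield one flat row per profile in a nested term record (assumed layout)."""
    for profile in record["profiles"]:
        yield {
            "word": record["word"],
            "lexeme_id": record["lexeme_id"],
            "pos": record["pos"],
            "intent": profile["intent"],
            "persona": profile["persona"],
            "query": profile["query"],
            "alternates": profile["alternates"],
            "filters": profile["filters"],
            "tags": profile["tags"],
        }

# Hypothetical two-profile record, abbreviated from the example above.
sample = {
    "word": "growth form",
    "lexeme_id": "growth_form",
    "pos": "noun",
    "profiles": [
        {"intent": "core_definition", "persona": "college_student",
         "query": "What is a growth form in biology and ecology?",
         "alternates": ["growth form definition in plant biology"],
         "filters": ["academic", "biology"],
         "tags": ["definition", "morphology", "plants", "persona:college_student"]},
        {"intent": "origin_history", "persona": "historian",
         "query": "Historical use of the term growth form in botany and ecology",
         "alternates": [], "filters": ["history", "science"],
         "tags": ["history", "concept_origin", "botany", "persona:historian"]},
    ],
}

rows = list(flatten_record(sample))
for row in rows:
    print(row["intent"], "|", row["query"])
```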
## Use Cases
### Query Generation Training
Train models to generate search queries for vocabulary terms:
```python
# Create query generation pairs
pairs = []
for record in dataset["train"]:
    context = f"Term: {record['word']}\nDefinition: {record['definition']}"
    for profile in record["profiles"]:
        query = profile["query"]
        # Train seq2seq: context → query
        pairs.append((context, query))
```
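One possible extension is to condition generation on the intent and persona, so a trained model can be steered toward a particular kind of query. The sketch below assumes records shaped like the Example Record; `make_generation_pairs` and the prompt layout are hypothetical choices, not something the dataset prescribes.

```python
def make_generation_pairs(records):
    """Build source/target text pairs: (intent, persona, term context) -> query."""
    pairs = []
    for record in records:
        context = f"Term: {record['word']}\nDefinition: {record['definition']}"
        for profile in record["profiles"]:
            # Prepend control fields so the model can be steered at inference time.
            prefix = f"Intent: {profile['intent']}\nPersona: {profile['persona']}\n"
            pairs.append({"source": prefix + context, "target": profile["query"]})
    return pairs

# Hypothetical records, abbreviated from the Example Record section.
sample_records = [{
    "word": "growth form",
    "definition": "The characteristic three-dimensional structure and architecture...",
    "profiles": [
        {"intent": "core_definition", "persona": "college_student",
         "query": "What is a growth form in biology and ecology?"},
        {"intent": "origin_history", "persona": "historian",
         "query": "Historical use of the term growth form in botany and ecology"},
    ],
}]

generation_pairs = make_generation_pairs(sample_records)
print(generation_pairs[0]["source"])
print("->", generation_pairs[0]["target"])
```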
### Intent Classification
Train classifiers to predict search intent:
```python
# Extract intent-labeled queries
intent_data = []
for record in dataset["train"]:
    for profile in record["profiles"]:
        intent_data.append({
            "query": profile["query"],
            "intent": profile["intent"],
            "persona": profile["persona"]
        })
```
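Before training a classifier, the string labels usually need integer ids, and imbalanced classes may benefit from weighting. The following sketch shows one common approach (inverse-frequency weights) on a tiny hypothetical sample shaped like `intent_data` above. The weighting scheme is an illustrative choice, not something the dataset card recommends.

```python
from collections import Counter

# Hypothetical rows in the shape produced by the extraction loop above.
intent_data = [
    {"query": "What is a growth form in biology and ecology?",
     "intent": "core_definition"},
    {"query": "Historical use of the term growth form in botany and ecology",
     "intent": "origin_history"},
    {"query": "growth form definition in plant biology",
     "intent": "core_definition"},
]

# Stable label <-> id mappings for a classifier head.
labels = sorted({row["intent"] for row in intent_data})
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

# Inverse-frequency class weights: n_samples / (n_classes * class_count).
counts = Counter(row["intent"] for row in intent_data)
class_weights = {
    label: len(intent_data) / (len(labels) * counts[label]) for label in labels
}

# (text, label id) pairs ready for a tokenizer or vectorizer.
encoded = [(row["query"], label2id[row["intent"]]) for row in intent_data]
print(label2id)
print(class_weights)
```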
### RAG Query Augmentation
Generate diverse queries for retrieval training:
```python
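# Get all queries for a term
def get_queries_for_term(word):
    # Note: filter() scans the whole split on every call
    matches = dataset["train"].filter(lambda x: x["word"] == word)
    if len(matches) == 0:
        return []  # unknown term
    record = matches[0]  # first match only, if a word appears more than once
    queries = [p["query"] for p in record["profiles"]]
    queries += [alt for p in record["profiles"] for alt in p["alternates"]]
    return queries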
```
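When many terms are looked up, scanning the dataset for every call becomes expensive. A sketch of an alternative is to build an in-memory index once and reuse it. `build_query_index` is a hypothetical helper, and the sample record is abbreviated from the Example Record section.

```python
from collections import defaultdict

def build_query_index(records):
    """Map each word to all of its primary and alternate queries."""
    index = defaultdict(list)
    for record in records:
        for profile in record["profiles"]:
            index[record["word"]].append(profile["query"])
            index[record["word"]].extend(profile["alternates"])
    return index

# Hypothetical records, abbreviated from the Example Record section.
sample_records = [{
    "word": "growth form",
    "profiles": [
        {"query": "What is a growth form in biology and ecology?",
         "alternates": ["growth form definition in plant biology"]},
        {"query": "Historical use of the term growth form in botany and ecology",
         "alternates": []},
    ],
}]

query_index = build_query_index(sample_records)
# Use .get() so unknown terms do not create empty entries in the defaultdict.
print(query_index.get("growth form", []))
print(query_index.get("unknown term", []))
```

Because the index merges by word, records that share the same `word` value contribute to a single list of queries.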
### Persona-Aware Search
Filter by user type:
```python
# Get queries for educators
educator_queries = []
for record in dataset["train"]:
    for profile in record["profiles"]:
        if profile["persona"] == "high_school_teacher":
            educator_queries.append(profile)
```
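The same pattern extends to the `filters` field, so queries can be selected by persona, by content filter, or by both. This is a minimal sketch on hypothetical sample data; `select_queries` is an illustrative helper name.

```python
def select_queries(records, persona=None, content_filter=None):
    """Return queries whose profile matches the given persona and/or filter."""
    selected = []
    for record in records:
        for profile in record["profiles"]:
            if persona is not None and profile["persona"] != persona:
                continue
            if content_filter is not None and content_filter not in profile["filters"]:
                continue
            selected.append(profile["query"])
    return selected

# Hypothetical records, abbreviated from the Example Record section.
sample_records = [{
    "word": "growth form",
    "profiles": [
        {"persona": "college_student", "filters": ["academic", "biology"],
         "query": "What is a growth form in biology and ecology?"},
        {"persona": "historian", "filters": ["history", "science"],
         "query": "Historical use of the term growth form in botany and ecology"},
    ],
}]

historian_queries = select_queries(sample_records, persona="historian")
academic_queries = select_queries(sample_records, content_filter="academic")
print(historian_queries)
print(academic_queries)
```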
## Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{bommarito2025opengloss_queries,
title={OpenGloss Query Examples: Multi-Intent Search Queries for Vocabulary Terms},
author={Bommarito, Michael J., II},
year={2025},
url={https://huggingface.co/datasets/mjbommar/opengloss-v1.3-query-examples-flat},
note={Dataset available under CC-BY 4.0}
}
```
## License
This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
## Related Datasets
- [OpenGloss v1.3 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-dictionary) - Word-level records
- [OpenGloss v1.3 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions) - Definition-level records
- [OpenGloss v1.3 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-contrastive-examples) - Semantic gradients
- [OpenGloss v1.3 Encyclopedia Variants](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-encyclopedia-variants) - Style variations
- [OpenGloss v1.3 Hard Negative Pairs](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-hard-negative-pairs) - Calibration pairs for embedding training
## Acknowledgments
This dataset was generated using:
- [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions) lexicon data
- OpenAI GPT models for query generation
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured generation
---
*Generated from the OpenGloss v1.3 lexicon.*
Provider: mjbommar



