michael0402/lakequest
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/michael0402/lakequest
Resource description:
---
configs:
- config_name: bank_corpus
  data_files:
  - split: train
    path: data/bank/train.jsonl
  default: true
- config_name: drug_corpus
  data_files:
  - split: train
    path: data/drug/train.jsonl
- config_name: aiml_questions
  data_files:
  - split: train
    path: data/questions/aiml/train.jsonl
- config_name: bank_questions
  data_files:
  - split: train
    path: data/questions/bank/train.jsonl
- config_name: drug_questions
  data_files:
  - split: train
    path: data/questions/drug/train.jsonl
- config_name: raw_assets_index
  data_files:
  - split: train
    path: data/raw_index/train.jsonl
- config_name: aiml
  data_files:
  - split: validation
    path: benchmark/v1/aiml/qa_records/validation.jsonl
  - split: test
    path: benchmark/v1/aiml/qa_records/test.jsonl
- config_name: bank
  data_files:
  - split: validation
    path: benchmark/v1/bank/qa_records/validation.jsonl
  - split: test
    path: benchmark/v1/bank/qa_records/test.jsonl
- config_name: drug
  data_files:
  - split: validation
    path: benchmark/v1/drug/qa_records/validation.jsonl
  - split: test
    path: benchmark/v1/drug/qa_records/test.jsonl
---
# LakeQuest
Unified Hugging Face dataset repository for LakeQuest source assets and benchmark release artifacts.
## Configurations
Source/intermediate subsets:
- `bank_corpus`: normalized bank corpus records
- `drug_corpus`: normalized drug corpus records
- `aiml_questions`: AI/ML question rows (normalized)
- `bank_questions`: bank question rows
- `drug_questions`: drug question rows
- `raw_assets_index`: index of raw corpus bundles stored in this repo
Final benchmark subsets:
- `aiml`: benchmark v1 QA records (validation/test)
- `bank`: benchmark v1 QA records (validation/test)
- `drug`: benchmark v1 QA records (validation/test)
## Raw Asset Layout
Raw source assets used by the release pipeline are stored as compressed bundles under:
- `raw/bundles/raw_corpus_bank.tar.gz`
- `raw/bundles/raw_corpus_drug.tar.gz`
- `raw/bundles/manifest.json`
`build_release.py` downloads these raw corpus bundle files and extracts them into a local cache.
Question inputs are loaded from `data/questions/*/train.jsonl`.
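The download-and-extract step described above could look roughly like the following. This is a minimal sketch, not the actual contents of `build_release.py`: the function names `extract_bundle` and `fetch_raw_bundles`, the default cache directory `lakequest_cache`, and the per-bundle subdirectory layout are all assumptions made for illustration. Only the repository id and bundle paths come from this card.

```python
import tarfile
from pathlib import Path

REPO_ID = "michael0402/lakequest"
BUNDLES = [
    "raw/bundles/raw_corpus_bank.tar.gz",
    "raw/bundles/raw_corpus_drug.tar.gz",
]


def extract_bundle(bundle_path, cache_dir):
    """Extract one .tar.gz bundle into cache_dir/<bundle name without suffix>.

    Hypothetical helper: the real pipeline may lay out its cache differently.
    """
    bundle_path = Path(bundle_path)
    target = Path(cache_dir) / bundle_path.name.removesuffix(".tar.gz")
    target.mkdir(parents=True, exist_ok=True)
    root = target.resolve()
    with tarfile.open(bundle_path, "r:gz") as tar:
        # Refuse members that would escape the target directory.
        for member in tar.getmembers():
            resolved = (root / member.name).resolve()
            if resolved != root and root not in resolved.parents:
                raise ValueError(f"unsafe path in bundle: {member.name}")
        tar.extractall(root)
    return target


def fetch_raw_bundles(cache_dir="lakequest_cache"):
    """Download every raw bundle from the Hub and extract it locally."""
    # Imported lazily so the extraction helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    extracted = []
    for name in BUNDLES:
        local = hf_hub_download(repo_id=REPO_ID, filename=name, repo_type="dataset")
        extracted.append(extract_bundle(local, cache_dir))
    return extracted
```

Calling `fetch_raw_bundles()` needs network access and the `huggingface_hub` package; `extract_bundle` can be used on its own with a bundle that is already on disk.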
## Benchmark Release Layout
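Final benchmark release files are stored under:
- `benchmark/v1/aiml/`
- `benchmark/v1/bank/`
- `benchmark/v1/drug/`
The `aiml`, `bank`, and `drug` configurations load `qa_records/{validation,test}.jsonl` from these directories; Parquet files remain available under `benchmark/v1/`.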
Each domain includes:
- `qa_records/{validation,test}.parquet`
- `provenance_records/{validation,test}.parquet`
- `corpus_objects.parquet`
- `split_entities.parquet`
- `manifest.json`
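The per-domain layout above can be enumerated programmatically, for example to check that a release is complete or to fetch a single Parquet table. The sketch below is illustrative only: `release_files` and `load_table` are hypothetical helper names, and `load_table` assumes `pandas` (with a Parquet engine) and `huggingface_hub` are installed.

```python
REPO_ID = "michael0402/lakequest"
DOMAINS = ("aiml", "bank", "drug")
SPLITS = ("validation", "test")


def release_files(domain, version="v1"):
    """Return the repository paths listed on this card for one domain."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    base = f"benchmark/{version}/{domain}"
    files = [
        f"{base}/{table}/{split}.parquet"
        for table in ("qa_records", "provenance_records")
        for split in SPLITS
    ]
    files += [
        f"{base}/corpus_objects.parquet",
        f"{base}/split_entities.parquet",
        f"{base}/manifest.json",
    ]
    return files


def load_table(repo_path):
    """Download one Parquet file from the dataset repo and read it with pandas."""
    # Lazy imports: only needed when a table is actually fetched.
    import pandas as pd
    from huggingface_hub import hf_hub_download

    local = hf_hub_download(repo_id=REPO_ID, filename=repo_path, repo_type="dataset")
    return pd.read_parquet(local)
```

For instance, `load_table("benchmark/v1/bank/corpus_objects.parquet")` would fetch the bank corpus objects, provided the file exists at that path in the repository.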
## Load Examples
```python
from datasets import load_dataset

# Source corpus and question rows expose a single "train" split.
bank_corpus = load_dataset("michael0402/lakequest", "bank_corpus", split="train")
bank_questions = load_dataset("michael0402/lakequest", "bank_questions", split="train")

# Benchmark configs expose "validation" and "test" splits.
bank_benchmark_test = load_dataset("michael0402/lakequest", "bank", split="test")
```
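To load several configurations at once, the split names from the `configs` header can be kept in a small table and iterated over. This is a sketch under assumptions: `CONFIG_SPLITS`, `load_requests`, and `load_all` are illustrative names rather than part of the repository, and the table must be updated by hand if the card's configs change.

```python
REPO_ID = "michael0402/lakequest"

# Split names per configuration, copied from this card's `configs` header.
CONFIG_SPLITS = {
    "bank_corpus": ("train",),
    "drug_corpus": ("train",),
    "aiml_questions": ("train",),
    "bank_questions": ("train",),
    "drug_questions": ("train",),
    "raw_assets_index": ("train",),
    "aiml": ("validation", "test"),
    "bank": ("validation", "test"),
    "drug": ("validation", "test"),
}


def load_requests(configs=None):
    """Yield (config_name, split) pairs for the requested configurations."""
    names = list(CONFIG_SPLITS) if configs is None else list(configs)
    for name in names:
        for split in CONFIG_SPLITS[name]:
            yield name, split


def load_all(configs=None):
    """Load every requested (config, split) pair; requires network access."""
    # Lazy import so the planning helper above works without `datasets`.
    from datasets import load_dataset

    return {
        (name, split): load_dataset(REPO_ID, name, split=split)
        for name, split in load_requests(configs)
    }
```

For example, `load_all(["aiml", "bank", "drug"])` would return the validation and test splits of all three benchmark domains, keyed by `(config_name, split)`.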
Provider:
michael0402



