quicktensor/NanoCrumb

Source: Hugging Face | Updated: 2026-03-18 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/quicktensor/NanoCrumb
Resource description:
---
license: apache-2.0
task_categories:
- text-retrieval
language:
- en
tags:
- information-retrieval
- benchmark
- clinical-trials
- code-search
- legal-qa
size_categories:
- 10K<n<100K
configs:
- config_name: clinical_trial
  data_files:
  - split: queries
    path: "clinical_trial/queries.jsonl"
  - split: documents
    path: "clinical_trial/documents.jsonl"
  - split: qrels
    path: "clinical_trial/qrels.jsonl"
- config_name: code_retrieval
  data_files:
  - split: queries
    path: "code_retrieval/queries.jsonl"
  - split: documents
    path: "code_retrieval/documents.jsonl"
  - split: qrels
    path: "code_retrieval/qrels.jsonl"
- config_name: legal_qa
  data_files:
  - split: queries
    path: "legal_qa/queries.jsonl"
  - split: documents
    path: "legal_qa/documents.jsonl"
  - split: qrels
    path: "legal_qa/qrels.jsonl"
- config_name: paper_retrieval
  data_files:
  - split: queries
    path: "paper_retrieval/queries.jsonl"
  - split: documents
    path: "paper_retrieval/documents.jsonl"
  - split: qrels
    path: "paper_retrieval/qrels.jsonl"
- config_name: set_operation_entity_retrieval
  data_files:
  - split: queries
    path: "set_operation_entity_retrieval/queries.jsonl"
  - split: documents
    path: "set_operation_entity_retrieval/documents.jsonl"
  - split: qrels
    path: "set_operation_entity_retrieval/qrels.jsonl"
- config_name: stack_exchange
  data_files:
  - split: queries
    path: "stack_exchange/queries.jsonl"
  - split: documents
    path: "stack_exchange/documents.jsonl"
  - split: qrels
    path: "stack_exchange/qrels.jsonl"
- config_name: theorem_retrieval
  data_files:
  - split: queries
    path: "theorem_retrieval/queries.jsonl"
  - split: documents
    path: "theorem_retrieval/documents.jsonl"
  - split: qrels
    path: "theorem_retrieval/qrels.jsonl"
- config_name: tip_of_the_tongue
  data_files:
  - split: queries
    path: "tip_of_the_tongue/queries.jsonl"
  - split: documents
    path: "tip_of_the_tongue/documents.jsonl"
  - split: qrels
    path: "tip_of_the_tongue/qrels.jsonl"
---

# NanoCrumb Dataset

A curated subset of the [Crumb](https://huggingface.co/datasets/jfkback/crumb) retrieval dataset, designed for rapid experimentation and evaluation of information retrieval systems.

## Dataset Summary

**NanoCrumb** distills the large Crumb dataset (10.5 GB, 6.36M rows) into a manageable benchmark while maintaining task diversity across 8 different retrieval domains.

- **Total Size**: ~125 MB (JSONL format)
- **Queries**: 400 (50 per task split)
- **Documents**: 30,040 unique passages
- **Query-Document Pairs**: 31,754
- **Configs**: 8 task-specific configs

## Configs (Task Splits)

Each config represents a different retrieval domain:

| Config Name | Queries | Documents | Docs/Query (avg) | Description |
|------------|---------|-----------|------------------|-------------|
| `clinical_trial` | 50 | 22,251 | 464 | Match patients to clinical trials |
| `paper_retrieval` | 50 | 4,402 | 102 | Find relevant academic papers |
| `set_operation_entity_retrieval` | 50 | 1,533 | 31 | Entity-based retrieval |
| `code_retrieval` | 50 | 1,206 | 24 | Find relevant code snippets |
| `tip_of_the_tongue` | 50 | 363 | 7 | Recall items from vague descriptions |
| `stack_exchange` | 50 | 125 | 3 | Find relevant Q&A posts |
| `legal_qa` | 50 | 86 | 2 | Legal question answering |
| `theorem_retrieval` | 50 | 74 | 2 | Find mathematical theorems |

## Dataset Structure

Each config contains three splits:

### `queries`

- `query_id`: Unique query identifier (string)
- `query_content`: The query text (string)
- `instruction`: Task-specific instructions (string)
- `passage_qrels`: List of relevant passages with graded relevance scores (list)
- `task_split`: Task domain name (string)
- `metadata`: Additional task-specific information (string)
- `use_max_p`: Boolean flag for MaxP aggregation (bool)

### `documents`

- `document_id`: Unique document identifier (string)
- `document_content`: The passage text (string)
- `parent_id`: Links passages to source documents (string)
- `task_split`: Task domain name (string)
- `metadata`: Document metadata (string)

### `qrels`

- `query_id`: Query identifier (string)
- `document_id`: Document identifier (string)
- `relevance_score`: Graded relevance, 0.0-2.0 (float)
- `binary_relevance`: Binary relevance, 0 or 1 (int)
- `task_split`: Task domain name (string)

## Usage

```python
from datasets import load_dataset

# Repository ID taken from this listing; the original card used the
# placeholder "YOUR_USERNAME/nanocrumb".
REPO_ID = "quicktensor/NanoCrumb"

# Load a specific config (task split)
clinical_data = load_dataset(REPO_ID, "clinical_trial")

# Access the splits
queries = clinical_data['queries']
documents = clinical_data['documents']
qrels = clinical_data['qrels']

# Load all configs
all_configs = [
    "clinical_trial", "code_retrieval", "legal_qa", "paper_retrieval",
    "set_operation_entity_retrieval", "stack_exchange",
    "theorem_retrieval", "tip_of_the_tongue"
]

for config_name in all_configs:
    data = load_dataset(REPO_ID, config_name)
    print(f"{config_name}: {len(data['queries'])} queries")
```

## Sampling Methodology

For each task split:

1. **Query Selection**: Randomly sampled 50 queries from the evaluation set (seed=42)
2. **Document Selection**:
   - Include ALL positive documents (binary_relevance=1)
   - Fill the remainder with hard negatives (relevance=0) to reach ~100 docs per query
   - Target: ~5,000 documents per task split
3. **Deduplication**: Documents shared across queries are deduplicated within each config

## Use Cases

- 🚀 **Rapid prototyping** of retrieval models
- 🧪 **Quick benchmarking** without downloading large datasets
- 📚 **Educational purposes** for learning IR techniques
- 🔬 **Ablation studies** across diverse domains

## Citation

If you use NanoCrumb, please cite the original Crumb dataset:

```bibtex
@misc{crumb2024,
  title={Crumb: A Comprehensive Retrieval Benchmark},
  author={[Original Crumb Authors]},
  year={2024},
  url={https://huggingface.co/datasets/jfkback/crumb}
}
```

## License

This dataset inherits the license from the original [Crumb dataset](https://huggingface.co/datasets/jfkback/crumb).
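
## Example: Scoring a Ranking Against `qrels`

The `qrels` fields listed under Dataset Structure (`query_id`, `document_id`, `relevance_score`) lend themselves to standard ranking metrics. The sketch below is not part of the original dataset card. It shows one common way to compute nDCG@k from rows shaped like the `qrels` split, assuming linear gains and a log2 rank discount. The helper names (`build_qrels`, `dcg`, `ndcg_at_k`) and the toy IDs are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def build_qrels(rows):
    """Group qrels rows into {query_id: {document_id: relevance_score}}."""
    qrels = defaultdict(dict)
    for row in rows:
        qrels[row["query_id"]][row["document_id"]] = float(row["relevance_score"])
    return dict(qrels)


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, judged, k=10):
    """nDCG@k for one query; unjudged documents count as relevance 0."""
    gains = [judged.get(doc_id, 0.0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy rows in the shape of the `qrels` split (made-up IDs, not real data).
toy_rows = [
    {"query_id": "q1", "document_id": "d1", "relevance_score": 2.0},
    {"query_id": "q1", "document_id": "d2", "relevance_score": 1.0},
    {"query_id": "q1", "document_id": "d3", "relevance_score": 0.0},
]
qrels = build_qrels(toy_rows)

# A ranking sorted by true relevance versus one with the top two swapped.
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels["q1"])
swapped = ndcg_at_k(["d2", "d1", "d3"], qrels["q1"])
print(f"perfect ranking: {perfect:.3f}")
print(f"top two swapped: {swapped:.3f}")
```

With the real data, the rows would presumably come from iterating over the `qrels` split of a loaded config, and `ranked_doc_ids` from the retrieval system under test. Other nDCG variants (for example exponential gains) would give different values, so the choice should match whatever baseline is being compared against.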
Provider:
quicktensor