racineai/VDR_Hydrogen
Source: Hugging Face. Updated 2025-11-20. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/racineai/VDR_Hydrogen
Resource description:
---
license: apache-2.0
language:
- en
- fr
task_categories:
- visual-document-retrieval
- text-retrieval
tags:
- retrieval
- RAG
- DSE
- hydrogen
configs:
- config_name: train
  data_files: "train-*.parquet"
- config_name: filtered
  data_files: "filtered-*.parquet"
---
# VDR - Organized, Grouped, Cleaned
# Hydrogen Vision DSE
> **Intended for image/text to vector (DSE)**
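The card metadata declares two configurations, `train` and `filtered`, each backed by Parquet shards. A minimal sketch of loading one of them with the Hugging Face `datasets` library follows. The helper name `load_vdr_hydrogen` is hypothetical and not part of the dataset; the card does not document split names or column names, so the sketch returns whatever `load_dataset` yields for the chosen configuration and makes no assumption about the columns.

```python
REPO_ID = "racineai/VDR_Hydrogen"
CONFIGS = ("train", "filtered")  # config names taken from the card's YAML metadata


def load_vdr_hydrogen(config: str = "train"):
    """Load one configuration of the dataset from the Hugging Face Hub.

    Hypothetical helper for illustration; requires `pip install datasets`
    and network access when called.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Without a `split` argument, load_dataset returns a DatasetDict
    # keyed by whatever split names the repository exposes.
    return load_dataset(REPO_ID, name=config)
```

Calling `load_vdr_hydrogen("filtered")` and printing the result should list the available splits, column names, and row counts, which is a quick way to inspect the schema before building a retrieval pipeline on top of it.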
## Dataset Composition
Built with the VDR_pdf-to-parquet pipeline: https://github.com/RacineAIOS/VDR_pdf-to-parquet
This dataset was created by scraping PDF documents from online sources and generating relevant synthetic queries.
We used Google's Gemini 2.0 Flash Lite model in our custom pipeline to produce the queries, allowing us to create a diverse set of questions based on the document content.
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total rows | 38,748 |
## Language Distribution
| Language | Ratio (%) |
|--------|-------|
| English (en) | ≈ 69 |
| French (fr) | ≈ 31 |
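For a rough sense of scale, the approximate ratios above can be turned into estimated row counts. This is a back-of-the-envelope sketch only: the percentages are rounded, and the card does not say which configuration the 38,748 total refers to, so the results are estimates rather than exact counts.

```python
TOTAL_ROWS = 38_748  # "Total rows" from the Dataset Statistics table
LANGUAGE_SHARE = {"en": 0.69, "fr": 0.31}  # approximate ratios from the table


def estimate_rows(total: int, shares: dict[str, float]) -> dict[str, int]:
    """Multiply the total by each share and round to the nearest row."""
    return {lang: round(total * share) for lang, share in shares.items()}


estimates = estimate_rows(TOTAL_ROWS, LANGUAGE_SHARE)
print(estimates)  # {'en': 26736, 'fr': 12012}
```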
## Creators
Dataset curated by:
- **Paul Lemaistre**
- **Léo Appourchaux**
Provider:
racineai



