lumees/wikipedia-turkish-synthetic-query
Source: Hugging Face. Updated 2025-11-28; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/lumees/wikipedia-turkish-synthetic-query
Resource description:
---
readme: "main"
language:
- tr
license: cc-by-sa-4.0
tags:
- custom-dataset
- information-retrieval
- synthetic
size_categories:
- 10K<n<100K
---
# Turkish Wikipedia Synthetic Query Dataset
## Dataset Description
Each record pairs a `query` with a `pos` (a positive document or passage). The data was processed from raw JSONL files so that neither of these key fields has missing values.
### Data Fields
- `query`: The search query or input text.
- `pos`: The positive label, relevant document, or passage.
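The card does not say how the "no missing values" guarantee was enforced. The following is a minimal sketch of one way such a check could look, assuming that rows with an absent or empty `query` or `pos` are simply dropped. The helper name `iter_valid_pairs` and the sample rows are hypothetical and are not taken from the dataset or its pipeline.

```python
import json

# Field names come from this card; everything else here is illustrative.
REQUIRED_FIELDS = ("query", "pos")


def iter_valid_pairs(lines):
    """Yield records whose required fields are all non-empty strings."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip rows that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        values = [record.get(field) for field in REQUIRED_FIELDS]
        if all(isinstance(v, str) and v.strip() for v in values):
            yield dict(zip(REQUIRED_FIELDS, values))


# Toy in-memory JSONL lines (made-up rows, assumed layout: one object per line).
sample_jsonl = [
    json.dumps({"query": "Ankara hangi ülkenin başkentidir?",
                "pos": "Ankara, Türkiye'nin başkentidir."}, ensure_ascii=False),
    json.dumps({"query": "", "pos": "Boş sorgu içeren satır."}, ensure_ascii=False),
    json.dumps({"query": "Eksik pasaj"}, ensure_ascii=False),
    "not json",
]

valid_pairs = list(iter_valid_pairs(sample_jsonl))
print(len(valid_pairs))
```

Only the first toy row survives the filter, since the others have an empty `query`, lack `pos`, or are not valid JSON.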
## Dataset Creation & Methodology
### Source Data
The positive passages (`pos`) were collected from the **Finewiki** dataset, ensuring high-quality, information-rich text as the foundation for retrieval tasks.
### Synthetic Generation
The corresponding queries (`query`) were synthetically generated to match the positive passages.
- **Model:** Generation was performed using **unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF**.
- **Compute:** The generation pipeline was executed on **NVIDIA L40** GPUs.
### Dataset State
This release contains a random sample of 10K pairs from the collected data. Work is ongoing to generate queries for the remaining positives and to publish a version with hard negatives.
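How the random sample was drawn is not documented here. As a rough sketch, a reproducible uniform sample without replacement could be taken as below. The function name `sample_pairs`, the seed value, and the toy pool are assumptions made for illustration only.

```python
import random


def sample_pairs(pairs, k, seed=42):
    """Return up to k pairs chosen uniformly at random without replacement."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(pairs, min(k, len(pairs)))


# Toy pool standing in for the full collection of (query, pos) pairs.
pool = [{"query": f"soru {i}", "pos": f"pasaj {i}"} for i in range(100)]

subset = sample_pairs(pool, k=10)
print(len(subset))
```

Fixing the seed makes the subset repeatable across runs, which helps when later releases need to extend the same sample rather than draw a new one.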
## Citation
If you use this dataset, please cite Lumees AI:
```bibtex
@misc{lumees2024hardnegatives,
author = {Hasan KURŞUN and Kerem Berkay YANIK},
title = {Turkish Wikipedia Synthetic Query Dataset},
year = {2025},
publisher = {Lumees AI},
howpublished = {\url{https://lumees.io}},
email = {hello@lumees.io}
}
```
Provider:
lumees



