
lumees/wikipedia-turkish-synthetic-query

Source: Hugging Face. Updated: 2025-11-28. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/lumees/wikipedia-turkish-synthetic-query
Resource overview:
---
readme: "main"
language:
- tr
license: cc-by-sa-4.0
tags:
- custom-dataset
- information-retrieval
- synthetic
size_categories:
- 10K<n<100K
---

# Turkish Wikipedia Synthetic Query Dataset

## Dataset Description

This dataset consists of pairs, each containing a `query` and a `pos` (positive document/passage). It was processed from raw JSONL files so that neither key field has missing values.

### Data Fields

- `query`: The search query or input text.
- `pos`: The positive label, relevant document, or passage.

## Dataset Creation & Methodology

### Source Data

The positive passages (`pos`) were collected from the **Finewiki** dataset, which provides high-quality, information-rich text as the foundation for retrieval tasks.

### Synthetic Generation

The corresponding queries (`query`) were synthetically generated to match the positive passages.

- **Model:** Generation was performed using **unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF**.
- **Compute:** The generation pipeline was executed on **NVIDIA L40** GPUs.

### Dataset State

This release contains a random sample of 10K pairs from the collected data. Work is ongoing to generate queries for the remaining positives and to publish a version with hard negatives.

## Citation

If you use this dataset, please cite Lumees AI:

```bibtex
@misc{lumees2024hardnegatives,
  author = {Hasan KURŞUN and Kerem Berkay YANIK},
  title = {CodeSearchNet Hard Negatives (Filtered)},
  year = {2025},
  publisher = {Lumees AI},
  howpublished = {\url{https://lumees.io}},
  email = {hello@lumees.io}
}
```
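## Example: Validating `query`/`pos` Pairs

The Dataset Description says the records come from raw JSONL files with no missing values in the key fields. The following is a minimal sketch of what such a check could look like; it is not the authors' actual pipeline. The function name `iter_valid_pairs`, the choice to drop malformed or incomplete lines silently, and the sample records are all assumptions made for illustration. Only the field names `query` and `pos` come from the card.

```python
import json
from typing import Iterable, Iterator

# Field names taken from the dataset card.
REQUIRED_FIELDS = ("query", "pos")


def iter_valid_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield records whose `query` and `pos` are non-empty strings.

    Hypothetical helper: lines that are blank, not valid JSON, or missing
    a key field are skipped (an assumed policy, not the authors').
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        values = [record.get(field) for field in REQUIRED_FIELDS]
        if all(isinstance(v, str) and v.strip() for v in values):
            yield {field: record[field].strip() for field in REQUIRED_FIELDS}


# Illustrative (made-up) JSONL lines: one valid pair and three rejects.
sample_lines = [
    json.dumps(
        {"query": "Ankara hangi ülkenin başkentidir?",
         "pos": "Ankara, Türkiye'nin başkentidir."},
        ensure_ascii=False,
    ),
    json.dumps({"query": "", "pos": "Sorgusu boş olan kayıt."}, ensure_ascii=False),
    json.dumps({"query": "Yalnızca sorgu içeren kayıt"}, ensure_ascii=False),
    "not json",
]

pairs = list(iter_valid_pairs(sample_lines))
print(f"kept {len(pairs)} of {len(sample_lines)} lines")
```

To apply the same check to a local file, pass an open file handle, for example `iter_valid_pairs(open(path, encoding="utf-8"))`, where `path` points to a JSONL export of the dataset. The published dataset can presumably also be loaded with the Hugging Face `datasets` library via `load_dataset("lumees/wikipedia-turkish-synthetic-query")`, although the card does not state the available split names.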
Provider:
lumees