dclm_synthetic_queries
Dataset Overview
Data Features
- passage: string
- queries: sequence of strings
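As a minimal sketch of how this schema might be consumed, the snippet below flattens one record into (query, passage) pairs, a shape often used for retrieval training. The record contents and the helper name `to_pairs` are illustrative assumptions and are not part of the dataset.

```python
from typing import Dict, List, Tuple


def to_pairs(record: Dict) -> List[Tuple[str, str]]:
    """Expand one {passage, queries} record into (query, passage) pairs."""
    passage = record["passage"]
    return [(query, passage) for query in record["queries"]]


# Hypothetical record that follows the documented schema.
example = {
    "passage": "A sourdough starter should be fed with flour and water on a regular schedule.",
    "queries": ["how to feed sourdough starter", "sourdough starter schedule"],
}

pairs = to_pairs(example)
for query, passage in pairs:
    print(query, "->", passage[:40])
```

Each passage is repeated once per query, so the flattened form is larger than the original record.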
Data Splits
- train:
  - Bytes: 563252225
  - Examples: 225000
Data Size
- Download size: 356761086 bytes
- Dataset size: 563252225 bytes
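The figures above can be turned into two rough planning numbers: the average size of one example and the ratio of download size to dataset size. The sketch below only does arithmetic on the values listed in this card; the variable names are illustrative.

```python
# Values copied from the dataset card.
NUM_BYTES = 563_252_225       # dataset size of the train split, in bytes
NUM_EXAMPLES = 225_000        # examples in the train split
DOWNLOAD_BYTES = 356_761_086  # size of the downloaded files, in bytes

avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

print(f"{avg_bytes_per_example:.0f} bytes per example on average")
print(f"download is {download_ratio:.1%} of the dataset size")
```

This works out to roughly 2.5 KB per example (passage plus queries), with the download at about 63% of the dataset size.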
Configurations
- default:
  - Data files:
    - Split: train
    - Path: data/train-*
Dataset Creation
- Input texts were collected from the dclm-baseline dataset
- Queries were generated by GPT-4o-mini
Generation Prompt
```python
# initial_texts: list of passage strings taken from dclm-baseline
prompt = (
    "You will be given the contents of a web page. Your job is to generate 8-12 Google search queries where "
    "the page would be a good match. Observe the following guidelines: "
    " - Respond with just the queries, no preamble or commentary. "
    " - Each query should be on a new line. "
    " - If possible, each query should focus on a different part/aspect of the passage, and cover it well from beginning to end. "
    " - Queries should be diverse in length & format (some questions, some phrases, some jumbles of keywords)\n"
    "Here is the content: {}\n"
    "Now provide your queries, making sure theyre all different and cover all important parts of the passage:"
)
prompts = [prompt.format(x) for x in initial_texts]
```
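Because the prompt asks for one query per line with no preamble, a model response can be split into the `queries` list with a few lines of code. The sketch below is an assumption about how such post-processing could look, not the authors' pipeline; `parse_queries` and the sample response are hypothetical.

```python
import re
from typing import List

# Matches a leading bullet ("-", "*", "•") or numbering ("1.", "2)") that a
# model may add even though the prompt asks for bare queries.
_LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def parse_queries(response: str) -> List[str]:
    """Split a response into unique, non-empty queries, keeping their order."""
    queries: List[str] = []
    seen = set()
    for line in response.splitlines():
        query = _LIST_MARKER.sub("", line).strip()
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


# Hypothetical response containing a blank line, a bullet, and a duplicate.
sample_response = (
    "how to feed a sourdough starter\n"
    "\n"
    "- sourdough hydration ratio\n"
    "how to feed a sourdough starter\n"
)
parsed = parse_queries(sample_response)
print(parsed)
```

Dropping duplicates and blank lines is a design choice here; whether the published `queries` field was cleaned this way is not stated in the card.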




