
sql-questions

ModelScope · Updated 2025-12-04 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/sql-questions
Description:
# Dataset card for SQL Questions

This dataset is a reformatting of the [`sql_questions_triplets`](https://huggingface.co/datasets/sergeyvi4ev/sql_questions_triplets) dataset by [@sergeyvi4ev](https://huggingface.co/sergeyvi4ev), such that the dataset can be directly used to train Sentence Transformer models.

## Dataset Subsets

### `pair` subset

* Columns: "query", "positive"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'query': 'How many zip codes are under Barre, VT?',
      'positive': '"Barre, VT" is the CBSA_name',
    }
    ```
* Collection strategy: Reading the SQL Questions dataset and selecting all query-positive pairs.
* Deduplicated: Yes

### `triplet` subset

* Columns: "query", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
    ```python
    {
      'query': 'How many zip codes are under Barre, VT?',
      'positive': '"Barre, VT" is the CBSA_name',
      'negative': "coordinates refers to latitude, longitude; latitude = '18.090875; longitude = '-66.867756'"
    }
    ```
* Collection strategy: Reading the SQL Questions dataset and selecting all possible triplets.
* Deduplicated: No

### `mined-negative` subset

* Columns: "query", "positive", "negative_1", "negative_2", "negative_3", "negative_4", "negative_5", "negative_6", "negative_7", "negative_8", "negative_9", "negative_10"
* Column types: `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`
* Examples:
    ```python
    {
      "query": "How many zip codes are under Barre, VT?",
      "positive": "\"Barre, VT\" is the CBSA_name",
      "negative_1": "coordinates refers to latitude, longitude; latitude = '18.090875; longitude = '-66.867756'",
      "negative_2": "name of county refers to county",
      "negative_3": "median age over 40 refers to median_age > 40",
      "negative_4": "\"PHILLIPS\" is the county; 'Montana' is the name of state",
      "negative_5": "name of the CBSA officer refers to CBSA_name; position of the CBSA officer refers to CBSA_type;",
      "negative_6": "population greater than 10000 in 2010 refers to population_2010 > 10000;",
      "negative_7": "postal points refer to zip_code; under New York-Newark-Jersey City, NY-NJ-PA refers to CBSA_name = 'New York-Newark-Jersey City, NY-NJ-PA';",
      "negative_8": "the largest water area refers to MAX(water_area);",
      "negative_9": "\"Wisconsin\" is the state; largest land area refers to Max(land_area); full name refers to first_name, last_name; postal code refers to zip_code",
      "negative_10": "\"Alabama\" and \"Illinois\" are both state; Ratio = Divide (Count(state = 'Alabama'), Count(state = 'Illinois'))"
    }
    ```
* Collection strategy: Reading the SQL Questions dataset, filtering out the 15 samples that did not have 10 negatives, and formatting the rest in the described columns.
* Deduplicated: No
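
The relationship between the `mined-negative` and `triplet` column layouts can be illustrated with a small sketch. The helper `expand_to_triplets` below is hypothetical and is not part of the dataset or of any library; it only shows how one row with ten `negative_N` columns maps onto ten rows in the `query` / `positive` / `negative` layout. It is not a claim about how the dataset author actually built the `triplet` subset.

```python
from typing import Dict, List

# The card lists the columns negative_1 .. negative_10 for the mined-negative subset.
NUM_NEGATIVES = 10


def expand_to_triplets(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Hypothetical helper: turn one mined-negative row into one triplet per negative column."""
    triplets = []
    for i in range(1, NUM_NEGATIVES + 1):
        key = f"negative_{i}"
        if key not in row:
            raise KeyError(f"row is missing column {key!r}")
        triplets.append(
            {"query": row["query"], "positive": row["positive"], "negative": row[key]}
        )
    return triplets


# Example row copied from the card's mined-negative example.
example_row = {
    "query": "How many zip codes are under Barre, VT?",
    "positive": "\"Barre, VT\" is the CBSA_name",
    "negative_1": "coordinates refers to latitude, longitude; latitude = '18.090875; longitude = '-66.867756'",
    "negative_2": "name of county refers to county",
    "negative_3": "median age over 40 refers to median_age > 40",
    "negative_4": "\"PHILLIPS\" is the county; 'Montana' is the name of state",
    "negative_5": "name of the CBSA officer refers to CBSA_name; position of the CBSA officer refers to CBSA_type;",
    "negative_6": "population greater than 10000 in 2010 refers to population_2010 > 10000;",
    "negative_7": "postal points refer to zip_code; under New York-Newark-Jersey City, NY-NJ-PA refers to CBSA_name = 'New York-Newark-Jersey City, NY-NJ-PA';",
    "negative_8": "the largest water area refers to MAX(water_area);",
    "negative_9": "\"Wisconsin\" is the state; largest land area refers to Max(land_area); full name refers to first_name, last_name; postal code refers to zip_code",
    "negative_10": "\"Alabama\" and \"Illinois\" are both state; Ratio = Divide (Count(state = 'Alabama'), Count(state = 'Illinois'))",
}

triplets = expand_to_triplets(example_row)
for t in triplets[:2]:
    print(t["query"], "|", t["positive"], "|", t["negative"])
```

Each resulting dictionary has exactly the three columns described for the `triplet` subset, so the same row can feed either a single-negative or a multi-negative training setup.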

Provider: maas
Created: 2025-01-06