sql-questions
ModelScope community. Updated 2025-12-04; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/sql-questions
Resource description:
# Dataset card for SQL Questions
This dataset is a reformatting of the [`sql_questions_triplets`](https://huggingface.co/datasets/sergeyvi4ev/sql_questions_triplets) dataset by [@sergeyvi4ev](https://huggingface.co/sergeyvi4ev), so that it can be used directly to train Sentence Transformer models.
## Dataset Subsets
### `pair` subset
* Columns: "query", "positive"
* Column types: `str`, `str`
* Examples:
```python
{
'query': 'How many zip codes are under Barre, VT?',
'positive': '"Barre, VT" is the CBSA_name',
}
```
* Collection strategy: Reading the SQL Questions dataset and selecting all query-positive pairs.
* Deduplicated: Yes
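The collection strategy above can be sketched in a few lines. This is a minimal illustration, not the script used to build the dataset: the helper name `unique_pairs` and the in-memory list of row dictionaries are assumptions made for the example.

```python
def unique_pairs(rows):
    """Return the distinct (query, positive) pairs, keeping first-seen order."""
    seen = set()
    pairs = []
    for row in rows:
        key = (row["query"], row["positive"])
        if key not in seen:  # drop exact repeats of a pair
            seen.add(key)
            pairs.append({"query": row["query"], "positive": row["positive"]})
    return pairs


# Two source rows share the same query and positive, so only one pair survives.
rows = [
    {"query": "How many zip codes are under Barre, VT?",
     "positive": '"Barre, VT" is the CBSA_name',
     "negative": "name of county refers to county"},
    {"query": "How many zip codes are under Barre, VT?",
     "positive": '"Barre, VT" is the CBSA_name',
     "negative": "median age over 40 refers to median_age > 40"},
]
pairs = unique_pairs(rows)
```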
### `triplet` subset
* Columns: "query", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
```python
{
'query': 'How many zip codes are under Barre, VT?',
'positive': '"Barre, VT" is the CBSA_name',
'negative': "coordinates refers to latitude, longitude; latitude = '18.090875; longitude = '-66.867756'"
}
```
* Collection strategy: Reading the SQL Questions dataset and selecting all possible triplets.
* Deduplicated: No
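One plausible reading of the collection strategy is that each query-positive pair is combined with every negative available for it. The sketch below illustrates that reading only: `expand_triplets` and its grouped input format are hypothetical and are not taken from the original build script.

```python
def expand_triplets(query, positive, negatives):
    """Emit one {"query", "positive", "negative"} row per negative."""
    return [
        {"query": query, "positive": positive, "negative": negative}
        for negative in negatives
    ]


triplets = expand_triplets(
    "How many zip codes are under Barre, VT?",
    '"Barre, VT" is the CBSA_name',
    ["name of county refers to county",
     "median age over 40 refers to median_age > 40"],
)
```

Under this reading the same query and positive repeat across rows, which is consistent with the subset not being deduplicated.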
### `mined-negative` subset
* Columns: "query", "positive", "negative_1", "negative_2", "negative_3", "negative_4", "negative_5", "negative_6", "negative_7", "negative_8", "negative_9", "negative_10"
* Column types: `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`
* Examples:
```python
{
"query": "How many zip codes are under Barre, VT?",
"positive": "\"Barre, VT\" is the CBSA_name",
"negative_1": "coordinates refers to latitude, longitude; latitude = '18.090875; longitude = '-66.867756'",
"negative_2": "name of county refers to county",
"negative_3": "median age over 40 refers to median_age > 40",
"negative_4": "\"PHILLIPS\" is the county; 'Montana' is the name of state",
"negative_5": "name of the CBSA officer refers to CBSA_name; position of the CBSA officer refers to CBSA_type;",
"negative_6": "population greater than 10000 in 2010 refers to population_2010 > 10000;",
"negative_7": "postal points refer to zip_code; under New York-Newark-Jersey City, NY-NJ-PA refers to CBSA_name = 'New York-Newark-Jersey City, NY-NJ-PA';",
"negative_8": "the largest water area refers to MAX(water_area);",
"negative_9": "\"Wisconsin\" is the state; largest land area refers to Max(land_area); full name refers to first_name, last_name; postal code refers to zip_code",
"negative_10": "\"Alabama\" and \"Illinois\" are both state; Ratio = Divide (Count(state = 'Alabama'), Count(state = 'Illinois'))"
}
```
* Collection strategy: Reading the SQL Questions dataset, filtering out the 15 samples that did not have 10 negatives, and formatting the remaining samples in the columns described above.
* Deduplicated: No
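The filter-and-format step can be sketched as follows. This is an illustration under assumptions, not the original script: `to_mined_negative_row` is a hypothetical helper, and it treats any sample without exactly 10 negatives as filtered out.

```python
NUM_NEGATIVES = 10  # the subset has columns negative_1 .. negative_10


def to_mined_negative_row(query, positive, negatives):
    """Return a wide row, or None when there are not exactly 10 negatives."""
    if len(negatives) != NUM_NEGATIVES:
        return None  # assumption: such samples are dropped
    row = {"query": query, "positive": positive}
    for i, negative in enumerate(negatives, start=1):
        row[f"negative_{i}"] = negative
    return row


# Placeholder strings stand in for real queries, positives, and negatives.
full = to_mined_negative_row("q", "p", [f"n{i}" for i in range(1, 11)])
short = to_mined_negative_row("q", "p", ["n1", "n2"])
```

Here `full` has the twelve columns listed above, while `short` is `None` and would be skipped.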
Provider:
maas
Created:
2025-01-06



