google_wellformed_query
Source: ModelScope community (updated 2025-12-05, indexed 2025-07-12)
Download link:
https://modelscope.cn/datasets/google-research-datasets/google_wellformed_query
Resource description:
# Dataset Card for Google Query-wellformedness Dataset
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [GitHub](https://github.com/google-research-datasets/query-wellformedness)
- **Repository:** [GitHub](https://github.com/google-research-datasets/query-wellformedness)
- **Paper:** [ARXIV](https://arxiv.org/abs/1808.09419)
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
Google's query well-formedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus. Each query was rated by five annotators, each giving a binary rating (1 or 0) of whether the query is well-formed.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
English
## Dataset Structure
### Data Instances
```
{'rating': 0.2, 'content': 'The European Union includes how many ?'}
```
### Data Fields
- `rating`: a `float` between 0 and 1, the average of the five annotators' binary ratings
- `content`: the query text being rated
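As a minimal sketch of how a record could be sanity-checked against the fields above, the snippet below validates the example from Data Instances. It assumes the query text is stored under the key `content`, as in that example; the helper names `is_valid_record` and `positive_votes` are hypothetical and not part of the dataset or any library.

```python
def is_valid_record(record: dict) -> bool:
    """Return True if the record has a numeric rating in [0, 1] and a string query."""
    rating = record.get("rating")
    content = record.get("content")  # assumption: text key is `content`
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, (int, float)):
        return False
    if not isinstance(content, str):
        return False
    return 0.0 <= rating <= 1.0


def positive_votes(rating: float, num_annotators: int = 5) -> int:
    """Recover how many annotators voted 'well-formed' from the averaged rating."""
    return round(rating * num_annotators)


example = {"rating": 0.2, "content": "The European Union includes how many ?"}
print(is_valid_record(example))
print(positive_votes(example["rating"]))
```

Because each rating is an average of five binary votes, it should take one of the values 0, 0.2, 0.4, 0.6, 0.8, or 1.0, which is why the vote count can be recovered by rounding.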
### Data Splits
| | Train | Valid | Test |
| ----- | ------ | ----- | ---- |
| Input Sentences | 17500 | 3750 | 3850 |
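The split sizes can be cross-checked against the 25,100 queries mentioned in the Dataset Summary. The short sketch below only redoes that arithmetic; the dictionary keys are illustrative labels taken from the table header, not necessarily the split names used by any loader.

```python
# Split sizes copied from the table above.
splits = {"train": 17500, "valid": 3750, "test": 3850}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} queries ({shares[name]:.1%} of the data)")
print(f"total: {total}")
```

The three splits add up to 25,100, matching the number of annotated queries, with roughly 70% of the data in the training split.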
## Dataset Creation
### Curation Rationale
Understanding search queries is a hard problem because it involves dealing with the “word salad” text that users commonly issue. However, if a query resembles a well-formed question, a natural language processing pipeline can interpret it more accurately, reducing compounding errors downstream. Hence, identifying whether or not a query is well-formed can enhance query understanding. This dataset introduces a new task: identifying well-formed natural language questions.
### Source Data
The source is the Paralex corpus (Fader et al., 2013), which contains pairs of noisy paraphrase questions. These questions were issued by users on WikiAnswers (a question-answer forum) and include both web-search-query-like constructs (“5 parts of chloroplast?”) and well-formed questions (“What is the punishment for grand theft?”).
#### Initial Data Collection and Normalization
25,100 queries were selected from the unique list of queries extracted from the corpus, such that no two queries in the selected set are paraphrases.
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
A query is annotated as a well-formed question if it satisfies the following criteria; otherwise it is annotated as non-wellformed:
1. Query is grammatical.
2. Query is an explicit question.
3. Query does not contain spelling errors.
#### Who are the annotators?
Every query was labeled by five different crowdworkers, each assigning a binary label indicating whether the query is well-formed. The ratings of the five annotators were averaged to give the probability of a query being well-formed.
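The averaging step described above can be sketched in a few lines. The function names and the `min_votes` cutoff are hypothetical: the card only states that the five binary labels are averaged, so any threshold for turning the rating into a yes/no decision is an illustrative choice, not something the card specifies.

```python
from typing import Sequence


def aggregate_rating(labels: Sequence[int]) -> float:
    """Average binary annotator labels (1 = well-formed, 0 = not) into a rating."""
    if not labels:
        raise ValueError("at least one label is required")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("labels must be 0 or 1")
    return sum(labels) / len(labels)


def is_well_formed(labels: Sequence[int], min_votes: int = 4) -> bool:
    """Binarize by vote count; `min_votes` is an assumed, adjustable cutoff."""
    return sum(labels) >= min_votes


# One of five annotators judged the query well-formed.
labels = [0, 0, 1, 0, 0]
print(aggregate_rating(labels))
print(is_well_formed(labels))
```

Comparing vote counts rather than the averaged float avoids floating-point edge cases when a rating sits exactly on the cutoff.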
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The Query-wellformedness dataset is licensed under CC BY-SA 4.0. Any third party content or data is provided “As Is” without any warranty, express or implied.
### Citation Information
```
@InProceedings{FaruquiDas2018,
title = {{Identifying Well-formed Natural Language Questions}},
author = {Faruqui, Manaal and Das, Dipanjan},
booktitle = {Proc. of EMNLP},
year = {2018}
}
```
### Contributions
Thanks to [@vasudevgupta7](https://github.com/vasudevgupta7) for adding this dataset.
Provider:
maas
Created:
2025-07-07



