google_wellformed_query

Name: google_wellformed_query
Creator: maas
Published: 2025-12-05 16:41:04
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-07-12.

Download link:

https://modelscope.cn/datasets/google-research-datasets/google_wellformed_query

Resource overview:

# Dataset Card for Google Query-wellformedness Dataset

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [GitHub](https://github.com/google-research-datasets/query-wellformedness)
- **Repository:** [GitHub](https://github.com/google-research-datasets/query-wellformedness)
- **Paper:** [ARXIV](https://arxiv.org/abs/1808.09419)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Google's query well-formedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus. Each query was annotated by five raters, each of whom gave a binary (1/0) rating of whether the query is well-formed.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

English

## Dataset Structure

### Data Instances

```
{'rating': 0.2, 'content': 'The European Union includes how many ?'}
```

### Data Fields

- `rating`: a `float` between 0 and 1
- `content`: the query to be rated

### Data Splits

|                 | Train | Valid | Test |
| --------------- | ----- | ----- | ---- |
| Input Sentences | 17500 | 3750  | 3850 |

## Dataset Creation

### Curation Rationale

Understanding search queries is a hard problem because it involves dealing with the “word salad” text that users ubiquitously issue. However, if a query resembles a well-formed question, a natural language processing pipeline can interpret it more accurately, which reduces compounding errors downstream. Hence, identifying whether or not a query is well-formed can enhance query understanding. This dataset introduces a new task: identifying well-formed natural language questions.

### Source Data

The dataset draws on the Paralex corpus (Fader et al., 2013), which contains pairs of noisy paraphrase questions. These questions were issued by users of WikiAnswers (a question-answer forum) and consist of both web-search-query-like constructs (“5 parts of chloroplast?”) and well-formed questions (“What is the punishment for grand theft?”).

#### Initial Data Collection and Normalization

25,100 queries were selected from the list of unique queries extracted from the corpus, such that no two queries in the selected set are paraphrases of each other.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

A query is annotated as well-formed if it satisfies the following criteria, and as non-wellformed otherwise:

1. Query is grammatical.
2. Query is an explicit question.
3. Query does not contain spelling errors.

#### Who are the annotators?

Every query was labeled by five different crowdworkers with a binary label indicating whether the query is well-formed or not.
The average of the five annotators' ratings is reported as the probability that the query is well-formed.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The Query-wellformedness dataset is licensed under CC BY-SA 4.0. Any third-party content or data is provided “As Is” without any warranty, express or implied.

### Citation Information

```
@InProceedings{FaruquiDas2018,
  title = {{Identifying Well-formed Natural Language Questions}},
  author = {Faruqui, Manaal and Das, Dipanjan},
  booktitle = {Proc. of EMNLP},
  year = {2018}
}
```

### Contributions

Thanks to [@vasudevgupta7](https://github.com/vasudevgupta7) for adding this dataset.
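
### Example: how a rating is derived

The annotation scheme described in this card (five binary labels per query, averaged into a single `rating`) can be illustrated with a minimal sketch. The function names, the sample annotator labels, and the 0.8 cutoff in `is_well_formed` are assumptions made for illustration; the card itself does not prescribe a threshold or publish per-annotator labels for the example query.

```python
def wellformedness_rating(labels):
    """Average five binary (0/1) annotator labels into a rating in [0, 1]."""
    if len(labels) != 5 or any(label not in (0, 1) for label in labels):
        raise ValueError("expected exactly five binary labels")
    return sum(labels) / len(labels)


def is_well_formed(rating, threshold=0.8):
    """Binarize a rating. The 0.8 default (at least four of five raters
    agreeing) is an illustrative assumption, not part of the card."""
    return rating >= threshold


# Hypothetical labels: one of the five raters judged the query well-formed,
# which yields the 0.2 rating shown under "Data Instances".
example = {
    "content": "The European Union includes how many ?",
    "rating": wellformedness_rating([0, 0, 1, 0, 0]),
}

# Split sizes from "Data Splits"; they add up to the 25,100 annotated queries.
SPLITS = {"train": 17500, "valid": 3750, "test": 3850}
total = sum(SPLITS.values())
```

Because each rating is the mean of five binary votes, it can only take the values 0, 0.2, 0.4, 0.6, 0.8, and 1.0, so any threshold effectively selects a minimum number of agreeing raters.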


Provider: maas

Created: 2025-07-07
