
Kamaljp/text2sql_argilla

Source: Hugging Face | Updated: 2024-04-06 | Indexed: 2024-06-11
Download link:
https://hf-mirror.com/datasets/Kamaljp/text2sql_argilla
Resource description:
---
size_categories: n<1K
tags:
- rlfh
- argilla
- human-feedback
---

# Dataset Card for text2sql_argilla

This dataset has been created with [Argilla](https://docs.argilla.io).

As shown in the sections below, this dataset can be loaded into Argilla as explained in [Load with Argilla](#load-with-argilla), or used directly with the `datasets` library in [Load with `datasets`](#load-with-datasets).

## Dataset Description

- **Homepage:** https://argilla.io
- **Repository:** https://github.com/argilla-io/argilla
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains:

* A dataset configuration file conforming to the Argilla dataset format named `argilla.yaml`. This configuration file will be used to configure the dataset when using the `FeedbackDataset.from_huggingface` method in Argilla.
* Dataset records in a format compatible with HuggingFace `datasets`. These records will be loaded automatically when using `FeedbackDataset.from_huggingface` and can be loaded independently using the `datasets` library via `load_dataset`.
* The [annotation guidelines](#annotation-guidelines) that have been used for building and curating the dataset, if they've been defined in Argilla.

### Load with Argilla

To load with Argilla, install Argilla with `pip install argilla --upgrade` and then use the following code:

```python
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("Kamaljp/text2sql_argilla")
```

### Load with `datasets`

To load this dataset with `datasets`, install `datasets` with `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset("Kamaljp/text2sql_argilla")
```

### Supported Tasks and Leaderboards

This dataset can contain [multiple fields, questions and responses](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset), so it can be used for different NLP tasks, depending on the configuration. The dataset structure is described in the [Dataset Structure section](#dataset-structure).

There are no leaderboards associated with this dataset.

### Languages

[More Information Needed]

## Dataset Structure

### Data in Argilla

The dataset is created in Argilla with: **fields**, **questions**, **suggestions**, **metadata**, **vectors**, and **guidelines**.

The **fields** are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.

| Field Name | Title | Type | Required | Markdown |
| ---------- | ----- | ---- | -------- | -------- |
| sql_complexity_description | Sql Complexity | text | True | True |
| sql_task_type_description | Task Description | text | True | True |
| sql_prompt | prompt | text | True | True |
| sql_context | context | text | True | True |
| sql | SQL Query | text | True | True |

The **questions** are the questions that will be asked to the annotators. They can be of different types, such as rating, text, label_selection, multi_label_selection, or ranking.

| Question Name | Title | Type | Required | Description | Values/Labels |
| ------------- | ----- | ---- | -------- | ----------- | ------------- |
| sqltext | Checking output of the sql query, sql explanation | text | True | Review the SQL query field and provide feedback | N/A |

The **suggestions** are human or machine generated recommendations for each question to assist the annotator during the annotation process. They are always linked to the existing questions and are named by appending "-suggestion" and "-suggestion-metadata" to the question name, containing the value/s of the suggestion and its metadata, respectively. Thus, the possible values are the same as in the table above, but the column name is appended with "-suggestion" and the metadata is appended with "-suggestion-metadata".

The **metadata** is a dictionary that can be used to provide additional information about the dataset record. This can be useful to provide additional context to the annotators, for example a link to the original source of the dataset record, or details such as the author, the date, or the source. The metadata is always optional, and can be potentially linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.

**✨ NEW** The **vectors** are different columns that contain a vector in floating point, which is constrained to the pre-defined dimensions in the **vectors_settings** when configuring the vectors within the dataset itself; the dimensions will always be 1-dimensional. The **vectors** are optional and identified by the pre-defined vector name in the dataset configuration file in `argilla.yaml`.

| Vector Name | Title | Dimensions |
|-------------|-------|------------|
| domain | domain | [1, 384] |
| domain_description | domain_description | [1, 384] |

| Metadata Name | Title | Type | Values | Visible for Annotators |
| ------------- | ----- | ---- | ------ | ---------------------- |
| domain | domain of prompt | terms | - | True |
| domain_description | domain explanation | terms | - | True |
| sql_complexity | Complexity level of SQL query | terms | - | True |
| sql_task_type | type of sql query task | terms | - | True |

The **guidelines** are optional as well, and are just a plain string that can be used to provide instructions to the annotators. Find those in the [annotation guidelines](#annotation-guidelines) section.

### Data Instances

An example of a dataset instance in Argilla looks as follows:

```json
{
    "external_id": null,
    "fields": {
        "sql": "SELECT name AS team, MAX(home_team_wins + away_team_wins) AS highest_wins FROM (SELECT name, CASE WHEN home_team = team_id AND home_team_score \u003e away_team_score THEN 1 ELSE 0 END + CASE WHEN away_team = team_id AND away_team_score \u003e home_team_score THEN 1 ELSE 0 END AS home_team_wins, CASE WHEN home_team = team_id AND home_team_score \u003c away_team_score THEN 1 ELSE 0 END + CASE WHEN away_team = team_id AND away_team_score \u003c home_team_score THEN 1 ELSE 0 END AS away_team_wins FROM basketball_teams JOIN basketball_games ON basketball_teams.team_id = basketball_games.home_team OR basketball_teams.team_id = basketball_games.away_team) AS subquery GROUP BY name;",
        "sql_complexity_description": "subqueries, including correlated and nested subqueries",
        "sql_context": "CREATE TABLE basketball_teams (team_id INT, name VARCHAR(50)); CREATE TABLE basketball_games (game_id INT, home_team INT, away_team INT, home_team_score INT, away_team_score INT); INSERT INTO basketball_teams (team_id, name) VALUES (1, \u0027Boston Celtics\u0027), (2, \u0027Los Angeles Lakers\u0027), (3, \u0027Chicago Bulls\u0027); INSERT INTO basketball_games (game_id, home_team, away_team, home_team_score, away_team_score) VALUES (1, 1, 2, 85, 80), (2, 2, 3, 95, 90), (3, 3, 1, 75, 85);",
        "sql_prompt": "Which team has the highest number of wins in the \u0027basketball_games\u0027 table?",
        "sql_task_type_description": "generating reports, dashboards, and analytical insights"
    },
    "metadata": {
        "domain": "sports",
        "domain_description": "Extensive data on athlete performance, team management, fan engagement, facility operations, and event planning in sports.",
        "sql_complexity": "subqueries",
        "sql_task_type": "analytics and reporting"
    },
    "responses": [],
    "suggestions": [],
    "vectors": {}
}
```

While the same record in HuggingFace `datasets` looks as follows:

```json
{
    "external_id": null,
    "metadata": "{\"domain\": \"sports\", \"domain_description\": \"Extensive data on athlete performance, team management, fan engagement, facility operations, and event planning in sports.\", \"sql_complexity\": \"subqueries\", \"sql_task_type\": \"analytics and reporting\"}",
    "sql": "SELECT name AS team, MAX(home_team_wins + away_team_wins) AS highest_wins FROM (SELECT name, CASE WHEN home_team = team_id AND home_team_score \u003e away_team_score THEN 1 ELSE 0 END + CASE WHEN away_team = team_id AND away_team_score \u003e home_team_score THEN 1 ELSE 0 END AS home_team_wins, CASE WHEN home_team = team_id AND home_team_score \u003c away_team_score THEN 1 ELSE 0 END + CASE WHEN away_team = team_id AND away_team_score \u003c home_team_score THEN 1 ELSE 0 END AS away_team_wins FROM basketball_teams JOIN basketball_games ON basketball_teams.team_id = basketball_games.home_team OR basketball_teams.team_id = basketball_games.away_team) AS subquery GROUP BY name;",
    "sql_complexity_description": "subqueries, including correlated and nested subqueries",
    "sql_context": "CREATE TABLE basketball_teams (team_id INT, name VARCHAR(50)); CREATE TABLE basketball_games (game_id INT, home_team INT, away_team INT, home_team_score INT, away_team_score INT); INSERT INTO basketball_teams (team_id, name) VALUES (1, \u0027Boston Celtics\u0027), (2, \u0027Los Angeles Lakers\u0027), (3, \u0027Chicago Bulls\u0027); INSERT INTO basketball_games (game_id, home_team, away_team, home_team_score, away_team_score) VALUES (1, 1, 2, 85, 80), (2, 2, 3, 95, 90), (3, 3, 1, 75, 85);",
    "sql_prompt": "Which team has the highest number of wins in the \u0027basketball_games\u0027 table?",
    "sql_task_type_description": "generating reports, dashboards, and analytical insights",
    "sqltext": [],
    "sqltext-suggestion": null,
    "sqltext-suggestion-metadata": {
        "agent": null,
        "score": null,
        "type": null
    },
    "vectors": {
        "domain": null,
        "domain_description": null
    }
}
```

### Data Fields

Among the dataset fields, we differentiate between the following:

* **Fields:** These are the dataset records themselves; for the moment, only text fields are supported. These are the ones that will be used to provide responses to the questions.
    * **sql_complexity_description** is of type `text`.
    * **sql_task_type_description** is of type `text`.
    * **sql_prompt** is of type `text`.
    * **sql_context** is of type `text`.
    * **sql** is of type `text`.
* **Questions:** These are the questions that will be asked to the annotators. They can be of different types, such as `RatingQuestion`, `TextQuestion`, `LabelQuestion`, `MultiLabelQuestion`, and `RankingQuestion`.
    * **sqltext** is of type `text`, with the description "Review the SQL query field and provide feedback".
* **Suggestions:** As of Argilla 1.13.0, suggestions are included to assist the annotators during the annotation process. Suggestions are linked to the existing questions, are always optional, and contain not just the suggestion itself, but also the metadata linked to it, if applicable.
    * (optional) **sqltext-suggestion** is of type `text`.
* **✨ NEW** **Vectors**: As of Argilla 1.19.0, vectors are included to support similarity search, so that similar records can be explored through vector search powered by the configured search engine. The vectors cannot be seen within the UI; they are uploaded and used internally. The vectors are always optional and are restricted to the dimensions previously defined in their settings.
    * (optional) **domain** is of type `float32` and has a dimension of (1, `384`).
    * (optional) **domain_description** is of type `float32` and has a dimension of (1, `384`).

Additionally, there are two more optional fields:

* **metadata:** This is an optional field that can be used to provide additional information about the dataset record. This can be useful to provide additional context to the annotators, for example a link to the original source of the dataset record, or details such as the author, the date, or the source. The metadata is always optional, and can be potentially linked to the `metadata_properties` defined in the dataset configuration file in `argilla.yaml`.
* **external_id:** This is an optional field that can be used to provide an external ID for the dataset record. This can be useful if you want to link the dataset record to an external resource, such as a database or a file.

### Data Splits

The dataset contains a single split, which is `train`.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation guidelines

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
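
The two record layouts shown under Data Instances differ mainly in that the `datasets` row is flat and stores `metadata` as a JSON-encoded string, while the Argilla view nests the fields and holds the metadata as a dictionary. The following is a minimal sketch of regrouping a flat row, not part of either library: `to_nested` is a hypothetical helper, the per-question dictionaries it returns are a simplification of Argilla's own layout, and the `sql` and `sql_context` values are replaced by placeholders for brevity.

```python
import json

# Field and question names taken from the tables in this card.
FIELD_NAMES = [
    "sql_complexity_description",
    "sql_task_type_description",
    "sql_prompt",
    "sql_context",
    "sql",
]
QUESTION_NAMES = ["sqltext"]


def to_nested(flat):
    """Regroup a flat `datasets` row (hypothetical helper, not a library API)."""
    raw_metadata = flat.get("metadata")
    # In the flat row the metadata is a JSON string, so it has to be decoded.
    metadata = json.loads(raw_metadata) if raw_metadata else {}

    suggestions = {}
    for question in QUESTION_NAMES:
        # Suggestion columns are named "<question>-suggestion" and
        # "<question>-suggestion-metadata", as described in this card.
        value = flat.get(f"{question}-suggestion")
        if value is not None:
            suggestions[question] = {
                "value": value,
                "metadata": flat.get(f"{question}-suggestion-metadata"),
            }

    return {
        "external_id": flat.get("external_id"),
        "fields": {name: flat[name] for name in FIELD_NAMES},
        "metadata": metadata,
        "responses": {q: flat.get(q, []) for q in QUESTION_NAMES},
        "suggestions": suggestions,
    }


# Sample row based on the record above; long SQL strings are placeholders.
sample_row = {
    "external_id": None,
    "metadata": '{"domain": "sports", "domain_description": "Extensive data on athlete performance, team management, fan engagement, facility operations, and event planning in sports.", "sql_complexity": "subqueries", "sql_task_type": "analytics and reporting"}',
    "sql": "<omitted for brevity>",
    "sql_complexity_description": "subqueries, including correlated and nested subqueries",
    "sql_context": "<omitted for brevity>",
    "sql_prompt": "Which team has the highest number of wins in the 'basketball_games' table?",
    "sql_task_type_description": "generating reports, dashboards, and analytical insights",
    "sqltext": [],
    "sqltext-suggestion": None,
    "sqltext-suggestion-metadata": {"agent": None, "score": None, "type": None},
    "vectors": {"domain": None, "domain_description": None},
}

nested = to_nested(sample_row)
print(nested["metadata"]["sql_complexity"])
```

Decoding the metadata once up front keeps later filtering (for example by `domain` or `sql_complexity`) a plain dictionary lookup.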
Provider:
Kamaljp
Summary of Original Information

Dataset Overview

Name: text2sql_argilla

Created with: Argilla

Dataset size: n<1K

Tags:

  • rlfh
  • argilla
  • human-feedback

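Dataset Contents

  • Configuration file: includes a configuration file named argilla.yaml that conforms to the Argilla dataset format.
  • Dataset records: stored in a format compatible with HuggingFace datasets; they are loaded automatically by FeedbackDataset.from_huggingface, or can be loaded independently with load_dataset from the datasets library.
  • Annotation guidelines: the guidelines used to build and curate the dataset, if they have been defined in Argilla.

Loading the Dataset

  • Load with Argilla: `import argilla as rg`, then `ds = rg.FeedbackDataset.from_huggingface("Kamaljp/text2sql_argilla")`

  • Load with datasets: `from datasets import load_dataset`, then `ds = load_dataset("Kamaljp/text2sql_argilla")`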

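Dataset Structure

  • Fields: only text fields are supported; they are used to provide responses to the questions.

    • sql_complexity_description (text)
    • sql_task_type_description (text)
    • sql_prompt (text)
    • sql_context (text)
    • sql (text)
  • Questions: the questions asked to the annotators; types include rating, text, label selection, multi-label selection, or ranking.

    • sqltext (text): described as "Review the SQL query field and provide feedback"
  • Suggestions: available since Argilla 1.13.0; offered to annotators to assist the annotation process.

    • (optional) sqltext-suggestion (text)
  • ✨ New: Vectors: available since Argilla 1.19.0; they support similarity search for exploring similar records through vector search.

    • (optional) domain (float32): dimensions (1, 384)
    • (optional) domain_description (float32): dimensions (1, 384)
  • Metadata: additional information about a dataset record, such as a source link, author, or date.

  • External ID: an external ID for the dataset record, used to link it to external resources.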
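
Because the sqltext question asks annotators to review the SQL query, one way to prepare feedback is to run a record's sql_context and sql in a scratch database and look at what the query returns. The sketch below does this for the example record from the dataset card. It rests on assumptions: the dataset does not name a SQL dialect, so SQLite (via Python's standard `sqlite3` module) is used here only because this particular record happens to be valid SQLite, and `run_record` is a hypothetical helper.

```python
import sqlite3

# sql_context and sql copied from the example record, with the JSON
# escapes (\u003e, \u003c, \u0027) decoded to >, < and '.
sql_context = (
    "CREATE TABLE basketball_teams (team_id INT, name VARCHAR(50)); "
    "CREATE TABLE basketball_games (game_id INT, home_team INT, away_team INT, "
    "home_team_score INT, away_team_score INT); "
    "INSERT INTO basketball_teams (team_id, name) VALUES (1, 'Boston Celtics'), "
    "(2, 'Los Angeles Lakers'), (3, 'Chicago Bulls'); "
    "INSERT INTO basketball_games (game_id, home_team, away_team, home_team_score, "
    "away_team_score) VALUES (1, 1, 2, 85, 80), (2, 2, 3, 95, 90), (3, 3, 1, 75, 85);"
)
sql = (
    "SELECT name AS team, MAX(home_team_wins + away_team_wins) AS highest_wins FROM "
    "(SELECT name, CASE WHEN home_team = team_id AND home_team_score > away_team_score "
    "THEN 1 ELSE 0 END + CASE WHEN away_team = team_id AND away_team_score > "
    "home_team_score THEN 1 ELSE 0 END AS home_team_wins, CASE WHEN home_team = team_id "
    "AND home_team_score < away_team_score THEN 1 ELSE 0 END + CASE WHEN away_team = "
    "team_id AND away_team_score < home_team_score THEN 1 ELSE 0 END AS away_team_wins "
    "FROM basketball_teams JOIN basketball_games ON basketball_teams.team_id = "
    "basketball_games.home_team OR basketball_teams.team_id = basketball_games.away_team) "
    "AS subquery GROUP BY name;"
)


def run_record(context, query):
    """Run `context` (DDL + inserts) and then `query` in a fresh in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context)  # may contain several statements
        return conn.execute(query).fetchall()  # single SELECT statement
    finally:
        conn.close()


# Sort in Python because GROUP BY does not guarantee an output order.
rows = sorted(run_record(sql_context, sql))
for team, highest_wins in rows:
    print(team, highest_wins)
```

Comparing the printed rows against the question in sql_prompt gives the reviewer something concrete to cite in the sqltext response. Records written for other dialects may need a different engine.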

Data Splits

  • Split: contains only the train split.