
shehab44/bird23-train-filtered

Hugging Face, updated 2026-04-05, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/shehab44/bird23-train-filtered
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- table-question-answering
- question-answering
language:
- en
size_categories:
- 1K<n<10K
---

# BIRD-SQL Train (Filtered)

A high-quality subset of the original BIRD train split for text-to-SQL finetuning.

## Overview

Over the past year the community has shared many observations about data quality in BIRD. We performed a rigorous data quality check to retain examples that are **consistent with the schema** and **faithfully answer the question**. The resulting set keeps **6,601** instances out of **9,428** (≈70%) and serves as a drop-in replacement for training.

- **Original Train:** 9,428
- **Filtered Train (this release):** 6,601

Example code for training and inference can be found [here](https://github.com/bird-bench/mini_dev/tree/main).

### For New Users

If you are new to the BIRD project, you can download the complete databases for the training set using the following link: [Download BIRD Train](https://bird-bench.oss-cn-beijing.aliyuncs.com/train.zip)

### For Existing Users

If you have already downloaded the BIRD training databases, you can pull the latest filtered data through Hugging Face using the following script:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("birdsql/bird23-train-filtered")

# Access the dataset
print(dataset["train"][0])
```

You can find the column meaning JSON file [here](https://huggingface.co/datasets/birdsql/bird23-train-filtered/resolve/main/train_column_meaning.json). Each key is composed of `database_id|table_name|column_name`, and each value is the key information about that column and its values, summarized from the raw CSVs.

## Training Quality

We validate by finetuning a single open model with a standard SFT recipe and evaluating on the official BIRD Mini Dev and Dev sets.
We use [Qwen/Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) as the base model and follow the data processing of the [Arctic-Text2SQL-R1 project](https://www.snowflake.com/en/product/ai/ai-research/). You can find the original [repo](https://github.com/snowflakedb/ArcticTraining/tree/main/projects/arctic_text2sql_r1) and [paper](https://arxiv.org/abs/2505.20315) here.

### Performance Comparison (EX)

| Setting            | Mini-Dev | Dev      |
| ------------------ | -------- | -------- |
| **Baseline**       | 26.2     | 31.88    |
| **Original Train** | 45.4     | 50.46    |
| **Filtered Train** | **46.0** | **50.0** |

Takeaway: with **~30%** fewer training items, the filtered set performs on par with the full set on both Mini-Dev (46.0 vs. 45.4) and Dev (50.0 vs. 50.46).

### Data scaling on the filtered set

<p align="left">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/653693cb8ee17cfd44eed8ce/6_0KJzy4o1GfMDnqP3nA1.png" width="520">
</p>

## Dataset Introduction

The dataset contains the following main resources:

- `database`: The databases should be stored under [`./train_databases/`](./train_databases/). Each database folder has two components:
  - `database_description`: CSV files that describe the database schema and its values, for models to explore or reference.
  - `sqlite`: The database contents in BIRD.
- `data`: The text-to-SQL pairs, together with their oracle knowledge evidence, are stored as a JSONL file, i.e., `train.jsonl`. Each record has four main parts:
  - `db_id`: the name of the database.
  - `question`: the question, curated through human crowdsourcing according to the database descriptions and database contents.
  - `evidence`: the external knowledge evidence annotated by experts to assist models or SQL annotators.
  - `SQL`: the SQL query annotated by crowdsourced workers, referring to the database descriptions and database contents, to answer the question accurately.

## Acknowledgements

This work builds on the BIRD benchmark and the efforts of its creators and contributors.
We thank the community for continuous feedback that helped shape this release.

## Citation

Please cite the repo if you think our work is helpful to you.

```
@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
```
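
As a minimal sketch of how the column meaning keys (`database_id|table_name|column_name`) relate to a `train.jsonl` record, the snippet below regroups a flat key-to-description mapping by database and table, then lists the column notes for the database named in a record's `db_id`. The helper names (`split_key`, `group_by_database`, `schema_notes`) and the sample keys, descriptions, and record are all hypothetical and chosen only for illustration; they are not taken from the dataset or from the BIRD tooling.

```python
from __future__ import annotations

import json
from collections import defaultdict


def split_key(key: str) -> tuple[str, str, str]:
    """Split a 'database_id|table_name|column_name' key into its three parts."""
    db_id, table, column = key.split("|", 2)
    return db_id, table, column


def group_by_database(column_meaning: dict) -> dict:
    """Regroup the flat mapping as {db_id: {table: {column: description}}}."""
    grouped = defaultdict(lambda: defaultdict(dict))
    for key, description in column_meaning.items():
        db_id, table, column = split_key(key)
        grouped[db_id][table][column] = description
    # Convert nested defaultdicts to plain dicts so missing keys are not auto-created.
    return {db: {t: dict(cols) for t, cols in tables.items()}
            for db, tables in grouped.items()}


def schema_notes(record: dict, grouped: dict) -> str:
    """List 'table.column: description' lines for the record's database."""
    lines = []
    for table, columns in grouped.get(record["db_id"], {}).items():
        for column, description in columns.items():
            lines.append(f"{table}.{column}: {description}")
    return "\n".join(lines)


# Hypothetical sample data (not real dataset content), shaped like the
# column meaning file and a single line of train.jsonl.
sample_meaning = {
    "toy_db|singer|name": "full name of the singer",
    "toy_db|singer|age": "age in years",
    "other_db|city|population": "number of residents",
}
sample_line = json.dumps({
    "db_id": "toy_db",
    "question": "How many singers are older than 30?",
    "evidence": "",
    "SQL": "SELECT COUNT(*) FROM singer WHERE age > 30",
})

grouped = group_by_database(sample_meaning)
record = json.loads(sample_line)
print(schema_notes(record, grouped))
```

With the real files, `sample_meaning` would be replaced by the parsed contents of `train_column_meaning.json` and `sample_line` by a line read from `train.jsonl`; a prompt builder could then place the resulting notes next to the `question` and `evidence` fields.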
Provider:
shehab44