recastai/sql-create-context-chatml
Source: Hugging Face. Updated 2024-03-13; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/recastai/sql-create-context-chatml
Resource description:
---
license: cc-by-4.0
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 78885727
    num_examples: 78577
  download_size: 7507566
  dataset_size: 78885727
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text2text-generation
language:
- en
tags:
- text-to-sql
- chatml
pretty_name: 'sql-create-context-chatml '
size_categories:
- 10K<n<100K
---
## Dataset Summary
This dataset was created by **Re:cast AI** to convert the existing dataset [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) into a [ChatML](https://huggingface.co/docs/transformers/main/en/chat_templating)-friendly format for supervised fine-tuning (SFT) of pretrained models.
## Dataset Structure
```python
messages = [
{'content': "You are a powerful text-to-SQL AI assistant that helps users ... etc.", 'role': 'system'},
{'content': '(Optional) Context information is below ... etc.', 'role': 'user'},
{'content': 'SELECT COUNT(*) FROM head WHERE age > 56', 'role': 'assistant'}
]
```
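To make the target format concrete, here is a minimal sketch of how a `messages` list of this shape could be rendered as ChatML-style text. It assumes the commonly used `<|im_start|>` and `<|im_end|>` delimiters; the function name `to_chatml` and the shortened example contents are illustrative and not part of the dataset. In practice the exact template depends on the tokenizer, and `tokenizer.apply_chat_template` from `transformers` would normally do this step.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts as ChatML-style text.

    Hypothetical helper: the delimiters below are an assumption about the
    template in use, not something the dataset itself prescribes.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Shortened, illustrative conversation in the same shape as a dataset row
example_messages = [
    {"role": "system", "content": "You are a powerful text-to-SQL AI assistant."},
    {"role": "user", "content": "How many heads of the departments are older than 56?"},
    {"role": "assistant", "content": "SELECT COUNT(*) FROM head WHERE age > 56"},
]

rendered = to_chatml(example_messages)
print(rendered)
```

For inference, passing only the system and user turns with `add_generation_prompt=True` leaves the assistant turn open for the model to fill in.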
## Annotation Process
The code below shows how the dataset was created. You can adapt it to convert the author's original dataset into a form suited to your needs.
```python
from datasets import load_dataset

INSTRUCTIONS = """You are a powerful text-to-SQL AI assistant that helps users interact with SQL databases. Your job is to answer questions about a database. You are given a user question or command and (optional) context regarding one or more tables.
You must output the SQL query that answers the question.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or 'The answer to the user's query...' or anything along those lines.
3. You only respond with valid SQL to the user's query."""


def process_chatml_fn(example):
    # Wrap the table schema (context) and the question into a single user turn
    user_content = (
        "(Optional) Context information is below.\n"
        "----------------\n"
        f"{example['context']}\n"
        "----------------\n"
        "Given the context information and not prior knowledge, answer the following query.\n"
        f"{example['question']}\n"
    )
    # The reference SQL query becomes the assistant turn
    assistant_content = f"{example['answer']}"
    message = [
        {"role": "system", "content": INSTRUCTIONS},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": assistant_content},
    ]
    return message


ds = load_dataset("b-mc2/sql-create-context", split="train")
# Conform to chatml format: keep only the new "messages" column
ds = ds.map(lambda x: {"messages": process_chatml_fn(x)}, remove_columns=ds.column_names)
```
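As one illustration of adapting the conversion, the sketch below makes the system prompt a parameter and omits the context block when no schema is wanted. The function `build_messages`, its `include_context` flag, the shortened prompt wording, and the sample record are all hypothetical; only the field names `question`, `context`, and `answer` come from the conversion code above.

```python
def build_messages(example, system_prompt, include_context=True):
    """Hypothetical variant of the conversion step with an optional context block."""
    if include_context and example.get("context"):
        user_content = (
            "Context information is below.\n"
            "----------------\n"
            f"{example['context']}\n"
            "----------------\n"
            f"{example['question']}\n"
        )
    else:
        # No schema available or wanted: send the bare question
        user_content = f"{example['question']}\n"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["answer"]},
    ]


# Illustrative record shaped like a row of the source dataset
record = {
    "question": "How many heads of the departments are older than 56 ?",
    "context": "CREATE TABLE head (age INTEGER)",
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}

messages = build_messages(record, "You translate questions into SQL.")
for message in messages:
    print(message["role"], "->", repr(message["content"]))
```

A function like this could be passed to `ds.map` in the same way as `process_chatml_fn`, wrapped in a lambda that returns `{"messages": ...}`.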
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("recastai/sql-create-context-chatml")
```
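Because `load_dataset` is called without a `split` argument here, it returns a `DatasetDict`, so individual rows are reached through `dataset["train"]`. The sketch below shows one way a row's `messages` could be separated into the prompt turns and the reference SQL, for example when evaluating generated queries. The helper `split_prompt_and_target` is hypothetical, and the row is a hand-written stand-in with shortened contents so the snippet runs without downloading anything.

```python
def split_prompt_and_target(messages):
    """Return (prompt_messages, target_sql) for one conversation.

    Hypothetical helper: assumes the conversation ends with the assistant
    turn that holds the reference SQL, as in the structure shown earlier.
    """
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the conversation to end with an assistant turn")
    return messages[:-1], messages[-1]["content"]


# Stand-in for dataset["train"][0]; contents are shortened for illustration
row = {
    "messages": [
        {"role": "system", "content": "You are a powerful text-to-SQL AI assistant."},
        {"role": "user", "content": "How many heads of the departments are older than 56?"},
        {"role": "assistant", "content": "SELECT COUNT(*) FROM head WHERE age > 56"},
    ]
}

prompt_messages, target_sql = split_prompt_and_target(row["messages"])
print([m["role"] for m in prompt_messages])
print(target_sql)
```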
Provider:
recastai
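Summary of original information
Dataset overview
Basic information
- License: cc-by-4.0
- Dataset name: sql-create-context-chatml
- Dataset size: 78885727 bytes
- Download size: 7507566 bytes
- Training set size: 78577 examples, 78885727 bytes
Data structure
- Features:
  - messages:
    - content (string)
    - role (string)
- Splits:
  - train: 78577 examples
Configuration
- Default configuration:
  - Data file path: data/train-*
Task and language
- Task category: text2text-generation
- Language: English
- Tags: text-to-sql, chatml
- Pretty name: sql-create-context-chatml
- Size category: 10K<n<100K
Dataset creation
- Purpose: extend the b-mc2/sql-create-context dataset so that it suits SFT tasks in chatml format.
- Method: process the original dataset with a fixed set of rules to convert it into chatml format.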



