asoria/datasets_features_outputs
Dataset Card for datasets_features_outputs
Dataset Summary
This dataset contains a pipeline.yaml file that can be used to reproduce, in distilabel, the pipeline that generated it:
distilabel pipeline run --config "https://huggingface.co/datasets/asoria/datasets_features_outputs/raw/main/pipeline.yaml"
or explore the configuration:
distilabel pipeline info --config "https://huggingface.co/datasets/asoria/datasets_features_outputs/raw/main/pipeline.yaml"
Dataset Structure
The examples have the following structure per configuration:
<details><summary> Configuration: default </summary><hr>
{ "columns": "{"text": {"dtype": "string", "_type": "Value"}}", "dataset": "huggingartists/bushido-zho", "generation": "
Question: Which words appear most frequently in the text column of the dataset? {"question": "Which words appear most frequently in the text column of the dataset?", "sql_query": "SELECT word, COUNT(*) as frequency FROM (SELECT TRIM(REGEXP_SPLIT_TO_TABLE(text, \s+)) as word FROM data) words GROUP BY word ORDER BY frequency DESC LIMIT 10"}", "generation_model": "mistralai/Mistral-7B-Instruct-v0.2", "instruction": "You are a data analyst tasked with exploring a dataset named huggingartists/bushido-zho. Below is the dataset schema in SQL format along with a sample of 5 rows: CREATE TABLE "data"("text" VARCHAR); Sample rows: {text: } {text: ...тян
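
In the record above, the `columns` field is itself a JSON-encoded string describing the dataset features, and the `instruction` field carries a matching `CREATE TABLE "data"(...)` schema. The following is a minimal sketch of how one might derive such a schema from `columns`; it is not taken from the pipeline itself. The helper name `columns_to_create_table` and the `DTYPE_TO_SQL` mapping are hypothetical, and only the `string` to `VARCHAR` pair is visible in the example record.

```python
import json

# Hypothetical mapping from feature dtypes to SQL column types. Only
# "string" -> VARCHAR appears in the example record; the rest are assumptions.
DTYPE_TO_SQL = {
    "string": "VARCHAR",
    "int32": "INTEGER",
    "int64": "BIGINT",
    "float32": "FLOAT",
    "float64": "DOUBLE",
    "bool": "BOOLEAN",
}


def columns_to_create_table(columns_json: str, table: str = "data") -> str:
    """Build a CREATE TABLE statement from a JSON-encoded `columns` value."""
    features = json.loads(columns_json)
    parts = []
    for name, spec in features.items():
        # Fall back to VARCHAR for nested or unknown feature types.
        dtype = spec.get("dtype") if isinstance(spec, dict) else None
        sql_type = DTYPE_TO_SQL.get(dtype, "VARCHAR")
        parts.append(f'"{name}" {sql_type}')
    return f'CREATE TABLE "{table}"({", ".join(parts)});'


# The `columns` value from the example record above.
columns = '{"text": {"dtype": "string", "_type": "Value"}}'
print(columns_to_create_table(columns))
```

For the example record this reproduces the schema string embedded in the `instruction` field. Treating unknown or nested feature types as `VARCHAR` is a simplification chosen for the sketch.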



