matlok/python-text-copilot-training-instruct-ai-research-2024-02-11

Name: matlok/python-text-copilot-training-instruct-ai-research-2024-02-11
Creator: matlok
Published: 2024-02-12 04:48:34
License: No description provided

Hugging Face | Updated 2024-02-12 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/matlok/python-text-copilot-training-instruct-ai-research-2024-02-11

Resource description:

---
license:
- other
pretty_name: >-
  2024-02-11 - python copilot instructions on how to code using alpaca and yaml
dataset_info:
- config_name: autogen
  splits:
  - name: view_schema
configs:
- config_name: autogen
  data_files:
  - split: view_schema
    path: schema/train-0001-autogen-autogen.parquet
size_categories:
- 1M<n<10M
tags:
- python-copilot
- python-coding
- python-architecture
- knowledge-graphs
- multimodal
- text-image-audio
- fine-tuning
- training
- question-answering
- image-knowledge-graph
- alpaca
- mp3
- png
- text
- instruct
- coding
- task
- prompt
- response
- yaml
# supported task_categories
# text-classification, token-classification, table-question-answering, question-answering, zero-shot-classification, translation, summarization, conversational, feature-extraction, text-generation, text2text-generation, fill-mask, sentence-similarity, text-to-speech, text-to-audio, automatic-speech-recognition, audio-to-audio, audio-classification, voice-activity-detection, depth-estimation, image-classification, object-detection, image-segmentation, text-to-image, image-to-text, image-to-image, image-to-video, unconditional-image-generation, video-classification, reinforcement-learning, robotics, tabular-classification, tabular-regression, tabular-to-text, table-to-text, multiple-choice, text-retrieval, time-series-forecasting, text-to-video, visual-question-answering, document-question-answering, zero-shot-image-classification, graph-ml, mask-generation, zero-shot-object-detection, text-to-3d, image-to-3d, other
task_categories:
- text-generation
- question-answering
# supported task_ids
# acceptability-classification, entity-linking-classification, fact-checking, intent-classification, language-identification, multi-class-classification, multi-label-classification, multi-input-text-classification, natural-language-inference, semantic-similarity-classification, sentiment-classification, topic-classification, semantic-similarity-scoring, sentiment-scoring, sentiment-analysis, hate-speech-detection, text-scoring, named-entity-recognition, part-of-speech, parsing, lemmatization, word-sense-disambiguation, coreference-resolution, extractive-qa, open-domain-qa, closed-domain-qa, news-articles-summarization, news-articles-headline-generation, dialogue-generation, dialogue-modeling, language-modeling, text-simplification, explanation-generation, abstractive-qa, open-domain-abstractive-qa, closed-domain-qa, open-book-qa, closed-book-qa, slot-filling, masked-language-modeling, keyword-spotting, speaker-identification, audio-intent-classification, audio-emotion-recognition, audio-language-identification, multi-label-image-classification, multi-class-image-classification, face-detection, vehicle-detection, instance-segmentation, semantic-segmentation, panoptic-segmentation, image-captioning, image-inpainting, image-colorization, super-resolution, grasping, task-planning, tabular-multi-class-classification, tabular-multi-label-classification, tabular-single-column-regression, rdf-to-text, multiple-choice-qa, multiple-choice-coreference-resolution, document-retrieval, utterance-retrieval, entity-linking-retrieval, fact-checking-retrieval, univariate-time-series-forecasting, multivariate-time-series-forecasting, visual-question-answering, document-question-answering
task_ids:
- parsing
---

## Python Copilot Instructions on How to Code using Alpaca and Yaml

Training and test datasets for building coding multimodal models that understand how to use the open source GitHub projects for the [Autogen](https://github.com/microsoft/autogen/tree/main) and multimodal **Qwen AI** project:

- [Qwen](https://github.com/QwenLM/Qwen)
- [Qwen Agent](https://github.com/QwenLM/Qwen-Agent)
- [Qwen VL Chat](https://github.com/QwenLM/Qwen-VL)
- [Qwen Audio](https://github.com/QwenLM/Qwen-Audio)

This dataset is the 2024-02-11 update for the matlok python copilot datasets.

Please refer to the [Multimodal Python Copilot Training Overview](https://huggingface.co/datasets/matlok/multimodal-python-copilot-training-overview) for more details on how to use this dataset.

### Details

Each row contains python code, either a class method or a global function, imported modules, base classes (if any), exceptions (ordered based off the code), returns (ordered based off the code), arguments (ordered based off the code), and more.

- Rows: 1075795
- Size: 1.8 GB
- Data type: instruct
- Format: Introduction on code usage using alpaca and yaml response
- Number of python repos: 1275

### How to use the datasets

#### Load Autogen Schema Dataset

```python
from datasets import load_dataset

ds_name = (
    "matlok"
    "/"
    "python-text-copilot-training-"
    "instruct-ai-research-"
    "2024-02-11"
)
dc = "autogen"
ds = load_dataset(ds_name, dc, verification_mode="no_checks")
print(f"ds={ds_name} dataset_config={dc} has {len(ds['view_schema']['file_path'])} unique python modules")
```

```
dataset_config=autogen has 130 unique python modules
```

### Schema

The instruction alpaca text with yaml response is in the **desc** column:

```json
{
    "active": "bool",
    "args": "string",
    "args_len": "float64",
    "audio_file": "string",
    "audio_path": "string",
    "class_bases": "string",
    "class_name": "string",
    "code": "string",
    "code_len": "float64",
    "desc": "string",
    "desc_docstr": "string",
    "desc_docstr_len": "float64",
    "desc_len": "int64",
    "docstr": "string",
    "docstr_len": "int64",
    "file_path": "string",
    "file_type": "string",
    "function_names": "string",
    "gen_bytes": "int64",
    "gen_data_type": "string",
    "gen_mode": "string",
    "gen_size": "int64",
    "gen_valid": "bool",
    "height": "int64",
    "image_file": "string",
    "image_path": "string",
    "method_names": "string",
    "name": "string",
    "num_all_bases": "int64",
    "num_bases": "int64",
    "num_classes": "int64",
    "num_functions": "float64",
    "num_imports": "int64",
    "num_methods": "float64",
    "prompts": "string",
    "raises": "string",
    "raises_len": "float64",
    "recsize": "int64",
    "repo": "string",
    "returns": "string",
    "returns_len": "float64",
    "size": "int64",
    "src_object": "string",
    "total_objects": "int64",
    "usage": "string",
    "usages": "string",
    "width": "int64"
}
```

Provided by:

matlok

Summary of original information

Dataset overview

Dataset information

Dataset name: 2024-02-11 - python copilot instructions on how to code using alpaca and yaml
Config name: autogen
Splits:
- Name: view_schema
Data files:
- Split: view_schema
  Path: schema/train-0001-autogen-autogen.parquet
Size category: 1M<n<10M

Supported task categories

text-generation
question-answering

Supported task IDs

parsing

Dataset details

Rows: 1075795
Size: 1.8 GB
Data type: instruct
Format: Introduction on code usage using alpaca and yaml response
Number of Python repos: 1275

How to use the dataset

Load the Autogen Schema dataset

    from datasets import load_dataset

    # Build the dataset name from its parts.
    ds_name = (
        "matlok"
        "/"
        "python-text-copilot-training-"
        "instruct-ai-research-"
        "2024-02-11"
    )
    dc = "autogen"  # dataset config name
    ds = load_dataset(ds_name, dc, verification_mode="no_checks")
    # The split and column names must be quoted string keys.
    print(f"ds={ds_name} dataset_config={dc} has {len(ds['view_schema']['file_path'])} unique python modules")
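The loading snippet reports `len(...)` of the `file_path` column, which is the number of rows in the split. If a path can appear on more than one row (an assumption; the card does not say either way), a set gives the distinct count instead. The sketch below uses a plain list of made-up paths as a hypothetical stand-in for `ds["view_schema"]["file_path"]`, so it runs without downloading anything; `count_unique_modules` is an illustrative helper, not part of the dataset tooling.

```python
def count_unique_modules(file_paths):
    """Return (row_count, unique_count) for an iterable of file_path values."""
    paths = list(file_paths)
    return len(paths), len(set(paths))


# Hypothetical stand-in for ds["view_schema"]["file_path"]; not real dataset rows.
sample_paths = [
    "pkg/module_a.py",
    "pkg/module_a.py",
    "pkg/module_b.py",
]

rows, unique = count_unique_modules(sample_paths)
print(f"rows={rows} unique python modules={unique}")
```

With the real dataset, passing `ds["view_schema"]["file_path"]` to the helper should show whether the two counts differ.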

Dataset schema

desc column: contains the alpaca instruction text and the yaml response

Dataset fields

    {
        "active": "bool",
        "args": "string",
        "args_len": "float64",
        "audio_file": "string",
        "audio_path": "string",
        "class_bases": "string",
        "class_name": "string",
        "code": "string",
        "code_len": "float64",
        "desc": "string",
        "desc_docstr": "string",
        "desc_docstr_len": "float64",
        "desc_len": "int64",
        "docstr": "string",
        "docstr_len": "int64",
        "file_path": "string",
        "file_type": "string",
        "function_names": "string",
        "gen_bytes": "int64",
        "gen_data_type": "string",
        "gen_mode": "string",
        "gen_size": "int64",
        "gen_valid": "bool",
        "height": "int64",
        "image_file": "string",
        "image_path": "string",
        "method_names": "string",
        "name": "string",
        "num_all_bases": "int64",
        "num_bases": "int64",
        "num_classes": "int64",
        "num_functions": "float64",
        "num_imports": "int64",
        "num_methods": "float64",
        "prompts": "string",
        "raises": "string",
        "raises_len": "float64",
        "recsize": "int64",
        "repo": "string",
        "returns": "string",
        "returns_len": "float64",
        "size": "int64",
        "src_object": "string",
        "total_objects": "int64",
        "usage": "string",
        "usages": "string",
        "width": "int64"
    }
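The field list above is a mapping from column name to dtype string, so it can be inspected like any dictionary. As a minimal sketch, the snippet below groups a subset of those columns by dtype, which is one way to separate the text columns (such as `desc` and `code`) from the numeric length and count columns before training. `SCHEMA` copies ten entries from the list above; `columns_by_dtype` is a hypothetical helper name introduced here for illustration.

```python
from collections import defaultdict

# Subset of the column -> dtype mapping copied from the field list above.
SCHEMA = {
    "active": "bool",
    "args": "string",
    "args_len": "float64",
    "code": "string",
    "code_len": "float64",
    "desc": "string",
    "desc_len": "int64",
    "file_path": "string",
    "gen_valid": "bool",
    "num_imports": "int64",
}


def columns_by_dtype(schema):
    """Group column names by their declared dtype string."""
    groups = defaultdict(list)
    for column, dtype in schema.items():
        groups[dtype].append(column)
    # Sort dtypes and column names so the output is stable.
    return {dtype: sorted(cols) for dtype, cols in sorted(groups.items())}


groups = columns_by_dtype(SCHEMA)
for dtype, cols in groups.items():
    print(f"{dtype}: {', '.join(cols)}")
```

The same function should work unchanged on the full 47-column mapping.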