
matlok/python-text-copilot-training-instruct-ai-research-2024-02-11

Hugging Face | Updated 2024-02-12 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/matlok/python-text-copilot-training-instruct-ai-research-2024-02-11
Resource description:
---
license:
- other
pretty_name: >-
  2024-02-11 - python copilot instructions on how to code using alpaca and yaml
dataset_info:
- config_name: autogen
  splits:
  - name: view_schema
configs:
- config_name: autogen
  data_files:
  - split: view_schema
    path: schema/train-0001-autogen-autogen.parquet
size_categories:
- 1M<n<10M
tags:
- python-copilot
- python-coding
- python-architecture
- knowledge-graphs
- multimodal
- text-image-audio
- fine-tuning
- training
- question-answering
- image-knowledge-graph
- alpaca
- mp3
- png
- text
- instruct
- coding
- task
- prompt
- response
- yaml
# supported task_categories
# text-classification, token-classification, table-question-answering, question-answering, zero-shot-classification, translation, summarization, conversational, feature-extraction, text-generation, text2text-generation, fill-mask, sentence-similarity, text-to-speech, text-to-audio, automatic-speech-recognition, audio-to-audio, audio-classification, voice-activity-detection, depth-estimation, image-classification, object-detection, image-segmentation, text-to-image, image-to-text, image-to-image, image-to-video, unconditional-image-generation, video-classification, reinforcement-learning, robotics, tabular-classification, tabular-regression, tabular-to-text, table-to-text, multiple-choice, text-retrieval, time-series-forecasting, text-to-video, visual-question-answering, document-question-answering, zero-shot-image-classification, graph-ml, mask-generation, zero-shot-object-detection, text-to-3d, image-to-3d, other
task_categories:
- text-generation
- question-answering
# supported task_ids
# acceptability-classification, entity-linking-classification, fact-checking, intent-classification, language-identification, multi-class-classification, multi-label-classification, multi-input-text-classification, natural-language-inference, semantic-similarity-classification, sentiment-classification, topic-classification, semantic-similarity-scoring, sentiment-scoring, sentiment-analysis, hate-speech-detection, text-scoring, named-entity-recognition, part-of-speech, parsing, lemmatization, word-sense-disambiguation, coreference-resolution, extractive-qa, open-domain-qa, closed-domain-qa, news-articles-summarization, news-articles-headline-generation, dialogue-generation, dialogue-modeling, language-modeling, text-simplification, explanation-generation, abstractive-qa, open-domain-abstractive-qa, closed-domain-qa, open-book-qa, closed-book-qa, slot-filling, masked-language-modeling, keyword-spotting, speaker-identification, audio-intent-classification, audio-emotion-recognition, audio-language-identification, multi-label-image-classification, multi-class-image-classification, face-detection, vehicle-detection, instance-segmentation, semantic-segmentation, panoptic-segmentation, image-captioning, image-inpainting, image-colorization, super-resolution, grasping, task-planning, tabular-multi-class-classification, tabular-multi-label-classification, tabular-single-column-regression, rdf-to-text, multiple-choice-qa, multiple-choice-coreference-resolution, document-retrieval, utterance-retrieval, entity-linking-retrieval, fact-checking-retrieval, univariate-time-series-forecasting, multivariate-time-series-forecasting, visual-question-answering, document-question-answering
task_ids:
- parsing
---

## Python Copilot Instructions on How to Code using Alpaca and Yaml

Training and test datasets for building coding multimodal models that understand how to use the open source GitHub projects for [Autogen](https://github.com/microsoft/autogen/tree/main) and the multimodal **Qwen AI** projects:

- [Qwen](https://github.com/QwenLM/Qwen)
- [Qwen Agent](https://github.com/QwenLM/Qwen-Agent)
- [Qwen VL Chat](https://github.com/QwenLM/Qwen-VL)
- [Qwen Audio](https://github.com/QwenLM/Qwen-Audio)

This dataset is the 2024-02-11 update for the matlok python copilot datasets. Please refer to the [Multimodal Python Copilot Training Overview](https://huggingface.co/datasets/matlok/multimodal-python-copilot-training-overview) for more details on how to use this dataset.

### Details

Each row contains Python code (either a class method or a global function), imported modules, base classes (if any), exceptions, returns, and arguments (each ordered as they appear in the code), and more.

- Rows: 1075795
- Size: 1.8 GB
- Data type: instruct
- Format: Introduction on code usage using alpaca and yaml response
- Number of python repos: 1275

### How to use the datasets

#### Load Autogen Schema Dataset

```python
from datasets import load_dataset

# Dataset repository id, split across adjacent string literals for readability.
ds_name = (
    "matlok"
    "/"
    "python-text-copilot-training-"
    "instruct-ai-research-"
    "2024-02-11"
)
# Dataset config to load.
dc = "autogen"
# Skip checksum and split-size verification when loading.
ds = load_dataset(ds_name, dc, verification_mode="no_checks")
print(f"ds={ds_name} dataset_config={dc} has {len(ds['view_schema']['file_path'])} unique python modules")
```

```
dataset_config=autogen has 130 unique python modules
```

### Schema

The instruction alpaca text with yaml response is in the **desc** column:

```json
{
    "active": "bool",
    "args": "string",
    "args_len": "float64",
    "audio_file": "string",
    "audio_path": "string",
    "class_bases": "string",
    "class_name": "string",
    "code": "string",
    "code_len": "float64",
    "desc": "string",
    "desc_docstr": "string",
    "desc_docstr_len": "float64",
    "desc_len": "int64",
    "docstr": "string",
    "docstr_len": "int64",
    "file_path": "string",
    "file_type": "string",
    "function_names": "string",
    "gen_bytes": "int64",
    "gen_data_type": "string",
    "gen_mode": "string",
    "gen_size": "int64",
    "gen_valid": "bool",
    "height": "int64",
    "image_file": "string",
    "image_path": "string",
    "method_names": "string",
    "name": "string",
    "num_all_bases": "int64",
    "num_bases": "int64",
    "num_classes": "int64",
    "num_functions": "float64",
    "num_imports": "int64",
    "num_methods": "float64",
    "prompts": "string",
    "raises": "string",
    "raises_len": "float64",
    "recsize": "int64",
    "repo": "string",
    "returns": "string",
    "returns_len": "float64",
    "size": "int64",
    "src_object": "string",
    "total_objects": "int64",
    "usage": "string",
    "usages": "string",
    "width": "int64"
}
```
Provider:
matlok
Original information summary

Dataset overview

Dataset information

  • Dataset name: 2024-02-11 - python copilot instructions on how to code using alpaca and yaml
  • Config name: autogen
  • Splits:
    • Name: view_schema
  • Data files:
    • Split: view_schema
    • Path: schema/train-0001-autogen-autogen.parquet
  • Size category: 1M<n<10M
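
The data file path listed above can be turned into a direct-download URL. The following is a minimal sketch, not something taken from the dataset card: `build_parquet_url` is a hypothetical helper, and it assumes the Hugging Face Hub's `/datasets/<repo>/resolve/<revision>/<path>` URL layout and that the file lives on the `main` revision.

```python
from urllib.parse import quote

# Values copied from the dataset information above.
REPO_ID = "matlok/python-text-copilot-training-instruct-ai-research-2024-02-11"
DATA_FILE = "schema/train-0001-autogen-autogen.parquet"


def build_parquet_url(repo_id, data_file, revision="main",
                      endpoint="https://huggingface.co"):
    """Build a direct-download URL for one file in a Hub dataset repo.

    Hypothetical helper: assumes the /datasets/<repo>/resolve/<revision>/<path>
    layout and that the file exists on the given revision ("main" is a guess).
    """
    # quote() leaves "/" untouched, so the path structure is preserved.
    return f"{endpoint}/datasets/{repo_id}/resolve/{quote(revision)}/{quote(data_file)}"


url = build_parquet_url(REPO_ID, DATA_FILE)
print(url)
```

Passing `endpoint="https://hf-mirror.com"` would point the same helper at the mirror named in the download link, assuming the mirror follows the same path layout.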

Tags

  • python-copilot
  • python-coding
  • python-architecture
  • knowledge-graphs
  • multimodal
  • text-image-audio
  • fine-tuning
  • training
  • question-answering
  • image-knowledge-graph
  • alpaca
  • mp3
  • png
  • text
  • instruct
  • coding
  • task
  • prompt
  • response
  • yaml

Supported task categories

  • text-generation
  • question-answering

Supported task IDs

  • parsing

Dataset details

  • Rows: 1075795
  • Size: 1.8 GB
  • Data type: instruct
  • Format: Introduction on code usage using alpaca and yaml response
  • Number of Python repos: 1275

How to use the datasets

Load Autogen Schema Dataset

    from datasets import load_dataset

    ds_name = (
        "matlok"
        "/"
        "python-text-copilot-training-"
        "instruct-ai-research-"
        "2024-02-11"
    )
    dc = "autogen"
    ds = load_dataset(ds_name, dc, verification_mode="no_checks")
    print(f"ds={ds_name} dataset_config={dc} has {len(ds['view_schema']['file_path'])} unique python modules")
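
The loading snippet above reports the length of the `file_path` column, which equals the number of rows. If that column can repeat a path (an assumption; the source does not say either way), a distinct count may be closer to "unique python modules". A minimal sketch, using made-up paths rather than the real dataset and a hypothetical helper name:

```python
def count_unique_modules(file_paths):
    """Count distinct, non-empty file paths in an iterable of strings."""
    return len({path for path in file_paths if path})


# Hypothetical sample values standing in for ds["view_schema"]["file_path"].
sample_file_paths = [
    "pkg/module_a.py",
    "pkg/module_a.py",  # repeated path, counted once
    "pkg/module_b.py",
    "",                 # empty entry, ignored
]

print(count_unique_modules(sample_file_paths))  # prints 2
```

With the real dataset loaded, the same helper could be called on `ds["view_schema"]["file_path"]`.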

Dataset schema

  • desc column: contains the instruction alpaca text with the yaml response

Dataset fields

    {
        "active": "bool",
        "args": "string",
        "args_len": "float64",
        "audio_file": "string",
        "audio_path": "string",
        "class_bases": "string",
        "class_name": "string",
        "code": "string",
        "code_len": "float64",
        "desc": "string",
        "desc_docstr": "string",
        "desc_docstr_len": "float64",
        "desc_len": "int64",
        "docstr": "string",
        "docstr_len": "int64",
        "file_path": "string",
        "file_type": "string",
        "function_names": "string",
        "gen_bytes": "int64",
        "gen_data_type": "string",
        "gen_mode": "string",
        "gen_size": "int64",
        "gen_valid": "bool",
        "height": "int64",
        "image_file": "string",
        "image_path": "string",
        "method_names": "string",
        "name": "string",
        "num_all_bases": "int64",
        "num_bases": "int64",
        "num_classes": "int64",
        "num_functions": "float64",
        "num_imports": "int64",
        "num_methods": "float64",
        "prompts": "string",
        "raises": "string",
        "raises_len": "float64",
        "recsize": "int64",
        "repo": "string",
        "returns": "string",
        "returns_len": "float64",
        "size": "int64",
        "src_object": "string",
        "total_objects": "int64",
        "usage": "string",
        "usages": "string",
        "width": "int64"
    }
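
Because the field list is a plain name-to-dtype mapping, it is easy to inspect programmatically, for example to see which columns are text and which are numeric before building a training pipeline. The sketch below works on a short excerpt of the fields above; `group_columns_by_dtype` is a hypothetical helper, not part of the dataset tooling.

```python
import json
from collections import defaultdict

# A short excerpt of the field list above (column name -> dtype).
SCHEMA_EXCERPT = json.loads("""
{
    "active": "bool",
    "args": "string",
    "args_len": "float64",
    "code": "string",
    "code_len": "float64",
    "desc": "string",
    "desc_len": "int64",
    "file_path": "string",
    "gen_valid": "bool"
}
""")


def group_columns_by_dtype(schema):
    """Return {dtype: sorted column names} for a name -> dtype mapping."""
    grouped = defaultdict(list)
    for column, dtype in schema.items():
        grouped[dtype].append(column)
    return {dtype: sorted(columns) for dtype, columns in grouped.items()}


columns_by_dtype = group_columns_by_dtype(SCHEMA_EXCERPT)
for dtype, columns in sorted(columns_by_dtype.items()):
    print(f"{dtype}: {', '.join(columns)}")
```

Feeding the full field list to the same helper would group all columns the same way.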
