
matlok/python-text-copilot-training-instruct

Source: Hugging Face. Updated 2024-01-25. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/matlok/python-text-copilot-training-instruct
Resource description:
---
license:
- other
pretty_name: >-
  python copilot instructions on how to code using alpaca and yaml
dataset_info:
- config_name: view_01_transformers_src
  splits:
  - name: view_01_transformers_src
- config_name: view_02_pytorch_fsdp
  splits:
  - name: view_02_pytorch_fsdp
- config_name: view_03_deepspeed_runtime
  splits:
  - name: view_03_deepspeed_runtime
- config_name: view_schema
  splits:
  - name: view_schema
configs:
- config_name: view_01_transformers_src
  data_files:
  - split: view_01_transformers_src
    path: files/lok-python-copilot-text.instruct-v1_00000053.parquet
- config_name: view_02_pytorch_fsdp
  data_files:
  - split: view_02_pytorch_fsdp
    path: files/lok-python-copilot-text.instruct-v1_00000040.parquet
- config_name: view_03_deepspeed_runtime
  data_files:
  - split: view_03_deepspeed_runtime
    path: files/lok-python-copilot-text.instruct-v1_00000019.parquet
- config_name: view_schema
  data_files:
  - split: view_schema
    path: files/lok-python-copilot-text.instruct-v1_00000002.parquet
size_categories:
- 1M<n<10M
tags:
- python-copilot
- python-coding
- python-architecture
- knowledge-graphs
- multimodal
- text-image-audio
- fine-tuning
- training
- question-answering
- image-knowledge-graph
- alpaca
- mp3
- png
- text
- instruct
- coding
- task
- prompt
- response
- yaml
# supported task_categories
# text-classification, token-classification, table-question-answering, question-answering, zero-shot-classification, translation, summarization, conversational, feature-extraction, text-generation, text2text-generation, fill-mask, sentence-similarity, text-to-speech, text-to-audio, automatic-speech-recognition, audio-to-audio, audio-classification, voice-activity-detection, depth-estimation, image-classification, object-detection, image-segmentation, text-to-image, image-to-text, image-to-image, image-to-video, unconditional-image-generation, video-classification, reinforcement-learning, robotics, tabular-classification, tabular-regression, tabular-to-text, table-to-text, multiple-choice, text-retrieval, time-series-forecasting, text-to-video, visual-question-answering, document-question-answering, zero-shot-image-classification, graph-ml, mask-generation, zero-shot-object-detection, text-to-3d, image-to-3d, other
task_categories:
- text-generation
- question-answering
# supported task_ids
# acceptability-classification, entity-linking-classification, fact-checking, intent-classification, language-identification, multi-class-classification, multi-label-classification, multi-input-text-classification, natural-language-inference, semantic-similarity-classification, sentiment-classification, topic-classification, semantic-similarity-scoring, sentiment-scoring, sentiment-analysis, hate-speech-detection, text-scoring, named-entity-recognition, part-of-speech, parsing, lemmatization, word-sense-disambiguation, coreference-resolution, extractive-qa, open-domain-qa, closed-domain-qa, news-articles-summarization, news-articles-headline-generation, dialogue-generation, dialogue-modeling, language-modeling, text-simplification, explanation-generation, abstractive-qa, open-domain-abstractive-qa, closed-domain-qa, open-book-qa, closed-book-qa, slot-filling, masked-language-modeling, keyword-spotting, speaker-identification, audio-intent-classification, audio-emotion-recognition, audio-language-identification, multi-label-image-classification, multi-class-image-classification, face-detection, vehicle-detection, instance-segmentation, semantic-segmentation, panoptic-segmentation, image-captioning, image-inpainting, image-colorization, super-resolution, grasping, task-planning, tabular-multi-class-classification, tabular-multi-label-classification, tabular-single-column-regression, rdf-to-text, multiple-choice-qa, multiple-choice-coreference-resolution, document-retrieval, utterance-retrieval, entity-linking-retrieval, fact-checking-retrieval, univariate-time-series-forecasting, multivariate-time-series-forecasting, visual-question-answering, document-question-answering
task_ids:
- parsing
---

## Python Copilot Instructions on How to Code using Alpaca and Yaml

This dataset is a subset of the matlok python copilot datasets. Please refer to the [Multimodal Python Copilot Training Overview](https://huggingface.co/datasets/matlok/multimodal-python-copilot-training-overview) for more details on how to use this dataset.

### Details

Each row contains python code, either a class method or a global function, imported modules, base classes (if any), exceptions (ordered based off the code), returns (ordered based off the code), arguments (ordered based off the code), and more.

- Rows: 1737704
- Size: 28.6 GB
- Data type: text
- Format: Introduction on code usage using alpaca and yaml response

### Schema

The instruction alpaca text with yaml response is in the **desc** column:

```json
{
    "active": "bool",
    "args": "string",
    "args_len": "float64",
    "audio_file": "string",
    "audio_path": "string",
    "class_bases": "string",
    "class_name": "string",
    "code": "string",
    "code_len": "float64",
    "desc": "string",
    "desc_docstr": "string",
    "desc_docstr_len": "float64",
    "desc_len": "int64",
    "docstr": "string",
    "docstr_len": "int64",
    "file_path": "string",
    "file_type": "string",
    "function_names": "string",
    "gen_bytes": "int64",
    "gen_data_type": "string",
    "gen_mode": "string",
    "gen_size": "int64",
    "gen_valid": "string",
    "height": "int64",
    "image_file": "string",
    "image_path": "string",
    "method_names": "string",
    "name": "string",
    "num_all_bases": "int64",
    "num_bases": "int64",
    "num_classes": "int64",
    "num_functions": "float64",
    "num_imports": "int64",
    "num_methods": "float64",
    "prompts": "string",
    "raises": "string",
    "raises_len": "float64",
    "recsize": "int64",
    "repo": "string",
    "returns": "string",
    "returns_len": "float64",
    "size": "int64",
    "src_object": "string",
    "sub_file": "string",
    "total_objects": "int64",
    "usage": "string",
    "usages": "string",
    "width": "int64"
}
```

### How to use the dataset

```python
from datasets import load_dataset

ds = load_dataset("matlok/python-text-copilot-training-instruct", data_dir="files")
```
Provider:
matlok
Summary of original information

Dataset overview

Dataset information

  • Config names and splits
    • view_01_transformers_src
      • Split name: view_01_transformers_src
    • view_02_pytorch_fsdp
      • Split name: view_02_pytorch_fsdp
    • view_03_deepspeed_runtime
      • Split name: view_03_deepspeed_runtime
    • view_schema
      • Split name: view_schema

Data files

  • Config names and data file paths
    • view_01_transformers_src
      • Split: view_01_transformers_src
      • Path: files/lok-python-copilot-text.instruct-v1_00000053.parquet
    • view_02_pytorch_fsdp
      • Split: view_02_pytorch_fsdp
      • Path: files/lok-python-copilot-text.instruct-v1_00000040.parquet
    • view_03_deepspeed_runtime
      • Split: view_03_deepspeed_runtime
      • Path: files/lok-python-copilot-text.instruct-v1_00000019.parquet
    • view_schema
      • Split: view_schema
      • Path: files/lok-python-copilot-text.instruct-v1_00000002.parquet
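
The config-to-file mapping listed here can be written down directly in code. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable; `VIEW_FILES`, `parquet_path`, and `load_view` are hypothetical helper names introduced for illustration and are not part of the dataset.

```python
REPO_ID = "matlok/python-text-copilot-training-instruct"

# Config name -> parquet path, copied from the listing above.
VIEW_FILES = {
    "view_01_transformers_src": "files/lok-python-copilot-text.instruct-v1_00000053.parquet",
    "view_02_pytorch_fsdp": "files/lok-python-copilot-text.instruct-v1_00000040.parquet",
    "view_03_deepspeed_runtime": "files/lok-python-copilot-text.instruct-v1_00000019.parquet",
    "view_schema": "files/lok-python-copilot-text.instruct-v1_00000002.parquet",
}


def parquet_path(config: str) -> str:
    """Return the parquet path for a config, or raise ValueError if it is unknown."""
    try:
        return VIEW_FILES[config]
    except KeyError:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(VIEW_FILES)}"
        ) from None


def load_view(config: str):
    """Load one view. Each config exposes a single split named after the config."""
    parquet_path(config)  # fail early on a typo, before any download starts
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=config)
```

Calling `load_view("view_schema")` would fetch only the file behind that config, which is likely the lightest way to inspect the columns before pulling a larger view.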

Dataset size

  • Size category
    • 1M<n<10M

Tags

  • python-copilot
  • python-coding
  • python-architecture
  • knowledge-graphs
  • multimodal
  • text-image-audio
  • fine-tuning
  • training
  • question-answering
  • image-knowledge-graph
  • alpaca
  • mp3
  • png
  • text
  • instruct
  • coding
  • task
  • prompt
  • response
  • yaml

Supported task categories

  • text-generation
  • question-answering

Supported task IDs

  • parsing

Dataset details

  • Rows: 1737704
  • Size: 28.6 GB
  • Data type: text
  • Format: Introduction on code usage using alpaca and yaml response
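
As a quick sanity check on these figures, the arithmetic below estimates the average stored size per row and confirms that the row count falls inside the declared `1M<n<10M` size category. It is only a back-of-the-envelope sketch: it assumes "GB" means decimal gigabytes, and the on-disk parquet size need not match the in-memory size of a row. The function names are illustrative.

```python
ROWS = 1_737_704   # row count reported above
SIZE_GB = 28.6     # size reported above; assumed to be decimal gigabytes (10**9 bytes)


def average_row_bytes(rows: int = ROWS, size_gb: float = SIZE_GB) -> float:
    """Average stored bytes per row, given a total size in decimal gigabytes."""
    return size_gb * 10**9 / rows


def in_declared_bucket(rows: int = ROWS) -> bool:
    """Check the row count against the declared 1M<n<10M size category."""
    return 1_000_000 < rows < 10_000_000


print(f"{average_row_bytes() / 1000:.1f} KB per row on average")
print("within 1M<n<10M:", in_declared_bucket())
```

Under those assumptions the average works out to roughly 16.5 KB per row.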

Dataset schema

  • Description column: desc (holds the Alpaca instruction text with YAML response)
  • Field types
    • active: bool
    • args: string
    • args_len: float64
    • audio_file: string
    • audio_path: string
    • class_bases: string
    • class_name: string
    • code: string
    • code_len: float64
    • desc: string
    • desc_docstr: string
    • desc_docstr_len: float64
    • desc_len: int64
    • docstr: string
    • docstr_len: int64
    • file_path: string
    • file_type: string
    • function_names: string
    • gen_bytes: int64
    • gen_data_type: string
    • gen_mode: string
    • gen_size: int64
    • gen_valid: string
    • height: int64
    • image_file: string
    • image_path: string
    • method_names: string
    • name: string
    • num_all_bases: int64
    • num_bases: int64
    • num_classes: int64
    • num_functions: float64
    • num_imports: int64
    • num_methods: float64
    • prompts: string
    • raises: string
    • raises_len: float64
    • recsize: int64
    • repo: string
    • returns: string
    • returns_len: float64
    • size: int64
    • src_object: string
    • sub_file: string
    • total_objects: int64
    • usage: string
    • usages: string
    • width: int64
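
The field types above can double as a lightweight check on rows read from the dataset. The sketch below covers only a subset of the columns and uses a hypothetical mapping from the listed type names to Python types. It also assumes that null values may occur in any column, which the source does not state.

```python
# Subset of the field types listed above (the full schema has more columns).
SCHEMA_SUBSET = {
    "active": "bool",
    "args": "string",
    "args_len": "float64",
    "code": "string",
    "code_len": "float64",
    "desc": "string",
    "desc_len": "int64",
    "file_path": "string",
    "name": "string",
    "repo": "string",
}

# Hypothetical mapping from the listed type names to Python types.
PYTHON_TYPES = {
    "bool": (bool,),
    "string": (str,),
    "int64": (int,),
    "float64": (int, float),
}


def find_mismatches(row, schema=SCHEMA_SUBSET):
    """Return the sorted names of fields that are missing or have an unexpected type."""
    bad = []
    for field, type_name in schema.items():
        if field not in row:
            bad.append(field)
            continue
        value = row[field]
        if value is None:
            continue  # assumption: nulls are allowed in any column
        if isinstance(value, bool) and type_name != "bool":
            bad.append(field)  # bool is a subclass of int, so reject it explicitly
        elif not isinstance(value, PYTHON_TYPES[type_name]):
            bad.append(field)
    return sorted(bad)
```

A row that matches the subset yields an empty list. Any returned names point at columns worth inspecting before training on the data.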

How to use the dataset

from datasets import load_dataset

ds = load_dataset("matlok/python-text-copilot-training-instruct", data_dir="files")
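
Because the full dataset is reported as 28.6 GB, it may be more practical to stream a single view and look at a few `desc` values before downloading everything. The sketch below assumes the `datasets` library's `streaming=True` mode and assumes `desc` holds non-null strings; `preview_desc` and `stream_view` are hypothetical helpers, not part of the dataset.

```python
from itertools import islice


def preview_desc(rows, limit=3, width=80):
    """Return the first `limit` values of the `desc` column, cut to `width` characters."""
    return [row["desc"][:width] for row in islice(rows, limit)]


def stream_view(config="view_schema"):
    """Open one view in streaming mode (needs network access and the `datasets` package)."""
    # Imported lazily so preview_desc can be used on any iterable of dict rows.
    from datasets import load_dataset

    return load_dataset(
        "matlok/python-text-copilot-training-instruct",
        config,
        split=config,  # each config exposes one split named after the config
        streaming=True,
    )
```

For example, `for text in preview_desc(stream_view()): print(text)` would print the start of the first three instruction texts without materializing the whole split.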
