
matlok/python-text-copilot-training-instruct-ai-research

Source: Hugging Face. Updated 2024-01-26. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/matlok/python-text-copilot-training-instruct-ai-research
Resource description:
---
license:
- other
pretty_name: >-
  instruct dataset for training ai coding with leading ai research
dataset_info:
- config_name: train_01_transformers_src
  splits:
  - name: train_01_transformers_src
- config_name: test_01_how_to_code_from_ai_repos
  splits:
  - name: test_01_how_to_code_from_ai_repos
- config_name: view_schema
  splits:
  - name: view_schema
configs:
- config_name: train_01_transformers_src
  data_files:
  - split: train_01_transformers_src
    path: files/lok-python-copilot-text.instruct-v1_00000086.parquet
- config_name: test_01_how_to_code_from_ai_repos
  data_files:
  - split: test_01_how_to_code_from_ai_repos
    path: test/how_to_code_from_ai_repos_v1.parquet
- config_name: view_schema
  data_files:
  - split: view_schema
    path: files/lok-python-copilot-text.instruct-v1_00000148.parquet
size_categories:
- 1M<n<10M
tags:
- python-copilot
- python-coding
- python-architecture
- knowledge-graphs
- multimodal
- text-image-audio
- fine-tuning
- training
- question-answering
- image-knowledge-graph
- alpaca
- mp3
- png
- text
- instruct
- coding
- task
- prompt
- response
- yaml
# supported task_categories
# text-classification, token-classification, table-question-answering, question-answering, zero-shot-classification, translation, summarization, conversational, feature-extraction, text-generation, text2text-generation, fill-mask, sentence-similarity, text-to-speech, text-to-audio, automatic-speech-recognition, audio-to-audio, audio-classification, voice-activity-detection, depth-estimation, image-classification, object-detection, image-segmentation, text-to-image, image-to-text, image-to-image, image-to-video, unconditional-image-generation, video-classification, reinforcement-learning, robotics, tabular-classification, tabular-regression, tabular-to-text, table-to-text, multiple-choice, text-retrieval, time-series-forecasting, text-to-video, visual-question-answering, document-question-answering, zero-shot-image-classification, graph-ml, mask-generation, zero-shot-object-detection, text-to-3d, image-to-3d, other
task_categories:
- text-generation
- question-answering
# supported task_ids
# acceptability-classification, entity-linking-classification, fact-checking, intent-classification, language-identification, multi-class-classification, multi-label-classification, multi-input-text-classification, natural-language-inference, semantic-similarity-classification, sentiment-classification, topic-classification, semantic-similarity-scoring, sentiment-scoring, sentiment-analysis, hate-speech-detection, text-scoring, named-entity-recognition, part-of-speech, parsing, lemmatization, word-sense-disambiguation, coreference-resolution, extractive-qa, open-domain-qa, closed-domain-qa, news-articles-summarization, news-articles-headline-generation, dialogue-generation, dialogue-modeling, language-modeling, text-simplification, explanation-generation, abstractive-qa, open-domain-abstractive-qa, closed-domain-qa, open-book-qa, closed-book-qa, slot-filling, masked-language-modeling, keyword-spotting, speaker-identification, audio-intent-classification, audio-emotion-recognition, audio-language-identification, multi-label-image-classification, multi-class-image-classification, face-detection, vehicle-detection, instance-segmentation, semantic-segmentation, panoptic-segmentation, image-captioning, image-inpainting, image-colorization, super-resolution, grasping, task-planning, tabular-multi-class-classification, tabular-multi-label-classification, tabular-single-column-regression, rdf-to-text, multiple-choice-qa, multiple-choice-coreference-resolution, document-retrieval, utterance-retrieval, entity-linking-retrieval, fact-checking-retrieval, univariate-time-series-forecasting, multivariate-time-series-forecasting, visual-question-answering, document-question-answering
task_ids:
- parsing
---

## Building an AI Copilot Dataset to help keep up with Leading AI Research

This is a specialized instruction dataset for training Python coding assistants on how to code from leading AI/ML open source repositories (2.3M coding samples).

This dataset is a subset of the matlok python copilot datasets. Please refer to the [Multimodal Python Copilot Training Overview](https://huggingface.co/datasets/matlok/multimodal-python-copilot-training-overview) for more details on how to use this dataset.

### Details

This dataset holds the latest coding changes from more than 1159 GitHub repositories, in contrast to the static [v1 instruct dataset prototype](https://huggingface.co/datasets/matlok/python-text-copilot-training-instruct).

Each row contains Python coding samples extracted from either a class method or a global function. Each row also includes additional feature columns that are used for decorating the dataset downstream: imported modules, base classes (if any), exceptions (ordered based off the code), returns (ordered based off the code), arguments (ordered based off the code), and more.

- Rows: 2329824
- Size: 27.0 GB
- Data type: text
- Format: Introduction on code usage using alpaca and yaml response

### Schema

The instruction alpaca text with yaml response is in the **desc** column:

```json
{
    "active": "bool",
    "args": "string",
    "args_len": "float64",
    "audio_file": "string",
    "audio_path": "string",
    "class_bases": "string",
    "class_name": "string",
    "code": "string",
    "code_len": "float64",
    "desc": "string",
    "desc_docstr": "string",
    "desc_docstr_len": "float64",
    "desc_len": "int64",
    "docstr": "string",
    "docstr_len": "int64",
    "file_path": "string",
    "file_type": "string",
    "function_names": "string",
    "gen_bytes": "int64",
    "gen_data_type": "string",
    "gen_mode": "string",
    "gen_size": "int64",
    "gen_valid": "string",
    "height": "int64",
    "image_file": "string",
    "image_path": "string",
    "method_names": "string",
    "name": "string",
    "num_all_bases": "int64",
    "num_bases": "int64",
    "num_classes": "int64",
    "num_functions": "float64",
    "num_imports": "int64",
    "num_methods": "float64",
    "prompts": "string",
    "raises": "string",
    "raises_len": "float64",
    "recsize": "int64",
    "repo": "string",
    "returns": "string",
    "returns_len": "float64",
    "size": "int64",
    "src_object": "string",
    "sub_file": "string",
    "total_objects": "int64",
    "usage": "string",
    "usages": "string",
    "width": "int64"
}
```

### How to use the dataset

```python
from datasets import load_dataset

# Load the parquet files stored under the repository's "files" directory.
ds = load_dataset("matlok/python-text-copilot-training-instruct-ai-research", data_dir="files")
```
Provider:
matlok
Original information summary

Dataset overview

Basic information

  • Dataset name: instruct dataset for training ai coding with leading ai research
  • License: other

Configuration

  • Config names:
    • train_01_transformers_src
    • test_01_how_to_code_from_ai_repos
    • view_schema
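
Each config listed above can be requested by name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that each config exposes a single split with the same name as the config (as the dataset card metadata suggests). `load_config` is a hypothetical helper, not part of the dataset tooling, and streaming is chosen here only to avoid downloading a whole file up front.

```python
REPO_ID = "matlok/python-text-copilot-training-instruct-ai-research"

# Config names as listed on the dataset card.
CONFIGS = (
    "train_01_transformers_src",
    "test_01_how_to_code_from_ai_repos",
    "view_schema",
)


def load_config(name, streaming=True):
    """Load one named config; the split name is assumed to equal the config name."""
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=name, streaming=streaming)
```

Calling `load_config("view_schema")` should return an iterable dataset whose rows can be inspected one at a time, for example by reading the `desc` field of the first row.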

Data files

  • Config name: train_01_transformers_src
    • Split: train_01_transformers_src
    • Path: files/lok-python-copilot-text.instruct-v1_00000086.parquet
  • Config name: test_01_how_to_code_from_ai_repos
    • Split: test_01_how_to_code_from_ai_repos
    • Path: test/how_to_code_from_ai_repos_v1.parquet
  • Config name: view_schema
    • Split: view_schema
    • Path: files/lok-python-copilot-text.instruct-v1_00000148.parquet
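
The config-to-file mapping above can also be used to fetch a single parquet file directly. The sketch below builds a download URL using the Hugging Face Hub `resolve` URL layout; the `main` revision and the idea that a mirror endpoint serves the same path layout are assumptions, and `resolve_url` is a hypothetical helper name.

```python
REPO_ID = "matlok/python-text-copilot-training-instruct-ai-research"

# Config name -> parquet path inside the repository, copied from the listing.
DATA_FILES = {
    "train_01_transformers_src": "files/lok-python-copilot-text.instruct-v1_00000086.parquet",
    "test_01_how_to_code_from_ai_repos": "test/how_to_code_from_ai_repos_v1.parquet",
    "view_schema": "files/lok-python-copilot-text.instruct-v1_00000148.parquet",
}


def resolve_url(config, endpoint="https://huggingface.co", revision="main"):
    """Build a direct file URL for one config (revision 'main' is an assumption)."""
    path = DATA_FILES[config]  # raises KeyError for an unknown config
    return f"{endpoint}/datasets/{REPO_ID}/resolve/{revision}/{path}"
```

Passing `endpoint="https://hf-mirror.com"` would target the mirror named in the download link, assuming it mirrors the same paths.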

Data scale

  • Size category: 1M<n<10M

Tags

  • python-copilot
  • python-coding
  • python-architecture
  • knowledge-graphs
  • multimodal
  • text-image-audio
  • fine-tuning
  • training
  • question-answering
  • image-knowledge-graph
  • alpaca
  • mp3
  • png
  • text
  • instruct
  • coding
  • task
  • prompt
  • response
  • yaml

Supported task categories

  • text-generation
  • question-answering

Supported task IDs

  • parsing

Details

  • Rows: 2329824
  • Size: 27.0 GB
  • Data type: text
  • Format: Introduction on code usage using alpaca and yaml response
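
As a rough sanity check on these figures, the row count and total size imply an average row size of about 11.6 kB, assuming "GB" means 10^9 bytes (the listing does not say whether decimal or binary units are meant). The helper names below are hypothetical.

```python
ROWS = 2_329_824   # row count from the listing
SIZE_GB = 27.0     # total size from the listing


def avg_bytes_per_row(size_gb, rows, bytes_per_gb=10**9):
    """Average bytes per row; bytes_per_gb=10**9 is an assumption about the unit."""
    return size_gb * bytes_per_gb / rows


def in_size_category(rows, low=1_000_000, high=10_000_000):
    """Check the row count against the 1M<n<10M size category."""
    return low < rows < high
```

With binary units (`bytes_per_gb=2**30`) the same calculation gives a somewhat larger average, so the figure is best read as an order-of-magnitude estimate.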

Schema

  • Instruction text column (alpaca text with yaml response): desc
    • Fields:
      • active: bool
      • args: string
      • args_len: float64
      • audio_file: string
      • audio_path: string
      • class_bases: string
      • class_name: string
      • code: string
      • code_len: float64
      • desc: string
      • desc_docstr: string
      • desc_docstr_len: float64
      • desc_len: int64
      • docstr: string
      • docstr_len: int64
      • file_path: string
      • file_type: string
      • function_names: string
      • gen_bytes: int64
      • gen_data_type: string
      • gen_mode: string
      • gen_size: int64
      • gen_valid: string
      • height: int64
      • image_file: string
      • image_path: string
      • method_names: string
      • name: string
      • num_all_bases: int64
      • num_bases: int64
      • num_classes: int64
      • num_functions: float64
      • num_imports: int64
      • num_methods: float64
      • prompts: string
      • raises: string
      • raises_len: float64
      • recsize: int64
      • repo: string
      • returns: string
      • returns_len: float64
      • size: int64
      • src_object: string
      • sub_file: string
      • total_objects: int64
      • usage: string
      • usages: string
      • width: int64
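
The field list above can be turned into a small row check. This is a minimal sketch with hypothetical helper names, not part of the dataset tooling; it assumes rows arrive as plain Python dicts (for example when iterating a loaded split) and that a null value is acceptable in any column.

```python
# Column names grouped by the dtype given in the schema listing.
COLUMNS_BY_DTYPE = {
    "bool": ["active"],
    "float64": [
        "args_len", "code_len", "desc_docstr_len", "num_functions",
        "num_methods", "raises_len", "returns_len",
    ],
    "int64": [
        "desc_len", "docstr_len", "gen_bytes", "gen_size", "height",
        "num_all_bases", "num_bases", "num_classes", "num_imports",
        "recsize", "size", "total_objects", "width",
    ],
    "string": [
        "args", "audio_file", "audio_path", "class_bases", "class_name",
        "code", "desc", "desc_docstr", "docstr", "file_path", "file_type",
        "function_names", "gen_data_type", "gen_mode", "gen_valid",
        "image_file", "image_path", "method_names", "name", "prompts",
        "raises", "repo", "returns", "src_object", "sub_file", "usage",
        "usages",
    ],
}

# Flat column -> dtype mapping.
SCHEMA = {col: dtype for dtype, cols in COLUMNS_BY_DTYPE.items() for col in cols}


def matches(value, dtype):
    """Return True if a Python value is compatible with the listed dtype."""
    if value is None:  # assumption: nulls are allowed everywhere
        return True
    if dtype == "bool":
        return isinstance(value, bool)
    if dtype == "string":
        return isinstance(value, str)
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if dtype == "int64":
        return is_int
    if dtype == "float64":
        return is_int or isinstance(value, float)
    raise ValueError(f"unknown dtype: {dtype}")


def check_row(row):
    """Return a list of human-readable problems found in one row dict."""
    problems = []
    for column, dtype in SCHEMA.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not matches(row[column], dtype):
            got = type(row[column]).__name__
            problems.append(f"{column}: expected {dtype}, got {got}")
    return problems
```

An empty list from `check_row` means every listed column is present with a compatible type; extra columns in the row are ignored by this sketch.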