
davanstrien/similarity-dataset-sc2-8b

Hugging Face | Updated 2024-05-30 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/davanstrien/similarity-dataset-sc2-8b
Resource description:
---
language:
- en
size_categories: n<1K
dataset_info:
  features:
  - name: anchor
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 1061977
    num_examples: 2324
  download_size: 488823
  dataset_size: 1061977
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- synthetic
- distilabel
- sentence-transformers
- DistilSimData
---

<p align="left">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

# Dataset Card for similarity-dataset-sc2-8b

This dataset has been created with [distilabel](https://distilabel.argilla.io/) and the pipeline outlined [here](https://github.com/davanstrien/awesome-synthetic-datasets/tree/main/examples/embedding-datasets). It is a synthetic dataset for training Sentence Transformers models, providing structured examples that help models learn fine-grained semantic distinctions.

## Dataset Summary

`similarity-dataset-sc2-8b` was generated to serve as training data for models that need to understand subtle differences and similarities between sentences. It uses a custom pipeline to generate a positive and a negative example for each anchor sentence. The sentences relate to programming tasks, specifically prompts for writing Python functions. The dataset is based on [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k).

## Reproducing the pipeline

This dataset contains a `pipeline.yaml` which can be used to reproduce the pipeline that generated it in distilabel using the `distilabel` CLI:

```console
distilabel pipeline run --config "https://huggingface.co/datasets/davanstrien/similarity-dataset-sc2-8b/raw/main/pipeline.yaml"
```

or explore the configuration:

```console
distilabel pipeline info --config "https://huggingface.co/datasets/davanstrien/similarity-dataset-sc2-8b/raw/main/pipeline.yaml"
```

## Dataset structure

The examples have the following structure per configuration:

<details><summary> Configuration: default </summary><hr>

```json
{
    "anchor": "Write a Python function that checks if an object is a subclass of a given class. The function should safely handle exceptions and return `False` if an exception is raised.",
    "generation": "{\"bad\": [\"Write a Python function that generates random numbers based on the input of a cat\u0027s meow. The function should also calculate the average of the generated numbers and return the average as a string.\",\"Write a Python function that takes a string input and translates it into a secret code by replacing each letter with a random number.\"], \n\"good\": [\"Write a Python function that checks if a given object is a subclass of a specified class, handling any exceptions that may occur safely and returning False if an exception is raised.\",\"Write a Python function that determines whether a provided object is a subclass of a given class, and returns False if an exception is encountered during the check.\"]}",
    "negative": "Write a Python function that takes a string input and translates it into a secret code by replacing each letter with a random number.",
    "positive": "Write a Python function that checks if a given object is a subclass of a specified class, handling any exceptions that may occur safely and returning False if an exception is raised."
}
```

This subset can be loaded as:

```python
from datasets import load_dataset

ds = load_dataset("davanstrien/similarity-dataset-sc2-8b", "default")
```

Or simply as follows, since there is only one configuration and it is named `default`:

```python
from datasets import load_dataset

ds = load_dataset("davanstrien/similarity-dataset-sc2-8b")
```

</details>
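
In the example record, `generation` is a JSON-encoded string holding `good` and `bad` candidate lists, and the `positive` and `negative` values each appear in one of those lists. The following is a minimal sketch of decoding that field, using an abridged, made-up record with the same shape. The helper name `parse_generation` is hypothetical, and because the feature list names only `anchor`, `positive`, and `negative`, the `generation` column may not be present in the published split.

```python
import json

# Abridged stand-in for the example record (strings shortened for illustration).
record = {
    "anchor": "Write a Python function that checks if an object is a subclass of a given class.",
    "generation": json.dumps({
        "bad": ["Prompt about a cat's meow.", "Prompt about a secret code."],
        "good": ["First paraphrase of the anchor.", "Second paraphrase of the anchor."],
    }),
    "positive": "First paraphrase of the anchor.",
    "negative": "Prompt about a secret code.",
}


def parse_generation(raw):
    """Return (good, bad) candidate lists, or two empty lists if decoding fails."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return [], []
    if not isinstance(parsed, dict):
        return [], []
    return list(parsed.get("good", [])), list(parsed.get("bad", []))


good, bad = parse_generation(record["generation"])

# Check that the selected pair really comes from the candidate lists.
positive_in_good = record["positive"] in good
negative_in_bad = record["negative"] in bad
print(positive_in_good, negative_in_bad)
```

Returning empty lists on malformed output is one possible design choice: generated JSON is not guaranteed to be valid, and this lets a caller skip such rows instead of aborting.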
Provider:
davanstrien
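Summary of original information

Dataset Card for similarity-dataset-sc2-8b

Dataset overview

similarity-dataset-sc2-8b is a training dataset for models that need to understand subtle differences and similarities between sentences. It uses a custom pipeline to generate a positive and a negative example for each anchor sentence; the sentences relate to programming tasks, specifically prompts for Python functions.

The dataset is based on bigcode/self-oss-instruct-sc2-exec-filter-50k.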

Dataset structure

Features

  • anchor: string
  • positive: string
  • negative: string

Splits

  • train: 2324 examples, 1061977 bytes in total
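
As a rough sanity check on these figures, the sketch below derives the average record size and the download-to-dataset size ratio. The download size of 488823 bytes is taken from the card's metadata; the variable names are illustrative only.

```python
num_examples = 2324
dataset_size = 1_061_977  # bytes, the train split's total size
download_size = 488_823   # bytes, download_size from the card's metadata

avg_bytes = dataset_size / num_examples   # average size of one triplet record
ratio = download_size / dataset_size      # share of the dataset size that is downloaded

print(f"{avg_bytes:.0f} bytes per example; download is {ratio:.0%} of the dataset size")
```

This works out to roughly 457 bytes per example, with the download at about 46% of the in-memory size.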

Configurations

  • default:
    • Data file path: data/train-*

Example

```json
{
    "anchor": "Write a Python function that checks if an object is a subclass of a given class. The function should safely handle exceptions and return False if an exception is raised.",
    "negative": "Write a Python function that takes a string input and translates it into a secret code by replacing each letter with a random number.",
    "positive": "Write a Python function that checks if a given object is a subclass of a specified class, handling any exceptions that may occur safely and returning False if an exception is raised."
}
```
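
To make the triplet idea concrete, here is a toy check that the anchor is closer to the positive than to the negative. It uses Jaccard token overlap as a stand-in similarity, which is an assumption for illustration only: a Sentence Transformers model would compare learned embeddings rather than shared words, and the helper names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two texts: shared tokens divided by all tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


anchor = ("Write a Python function that checks if an object is a subclass of a given class. "
          "The function should safely handle exceptions and return False if an exception is raised.")
positive = ("Write a Python function that checks if a given object is a subclass of a specified class, "
            "handling any exceptions that may occur safely and returning False if an exception is raised.")
negative = ("Write a Python function that takes a string input and translates it into a secret code "
            "by replacing each letter with a random number.")

pos_sim = jaccard(anchor, positive)
neg_sim = jaccard(anchor, negative)

# A well-formed triplet should rank the positive above the negative.
print(pos_sim > neg_sim)
```

For this record the word-overlap baseline already separates the pair, since the negative shares little beyond the "Write a Python function that" prefix.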

Loading the dataset

```python
from datasets import load_dataset

ds = load_dataset("davanstrien/similarity-dataset-sc2-8b")
```
