davanstrien/similarity-dataset-sc2-8b

Name: davanstrien/similarity-dataset-sc2-8b
Creator: davanstrien
Published: 2024-05-30 12:12:52
License: No description available

Hugging Face | Updated 2024-05-30 | Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/davanstrien/similarity-dataset-sc2-8b

Resource description:

---
language:
- en
size_categories: n<1K
dataset_info:
  features:
  - name: anchor
    dtype: string
  - name: positive
    dtype: string
  - name: negative
    dtype: string
  splits:
  - name: train
    num_bytes: 1061977
    num_examples: 2324
  download_size: 488823
  dataset_size: 1061977
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- synthetic
- distilabel
- sentence-transformers
- DistilSimData
---

<p align="left">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

# Dataset Card for similarity-dataset-sc2-8b

This dataset has been created with [distilabel](https://distilabel.argilla.io/) and the pipeline outlined [here](https://github.com/davanstrien/awesome-synthetic-datasets/tree/main/examples/embedding-datasets). It is designed as a synthetic dataset for training Sentence Transformers models, providing structured examples to help models learn fine-grained semantic distinctions in various domains.

## Dataset Summary

The `similarity-dataset-sc2-8b` dataset was generated to serve as training data for models that need to understand subtle differences and similarities between sentences. It uses a custom pipeline to generate positive and negative example sentences related to programming tasks, particularly prompts for Python functions. The dataset is based on [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k).

## Dataset Summary

This dataset contains a `pipeline.yaml` which can be used to reproduce the pipeline that generated it in distilabel using the `distilabel` CLI:

```console
distilabel pipeline run --config "https://huggingface.co/datasets/davanstrien/similarity-dataset-sc2-8b/raw/main/pipeline.yaml"
```

or explore the configuration:

```console
distilabel pipeline info --config "https://huggingface.co/datasets/davanstrien/similarity-dataset-sc2-8b/raw/main/pipeline.yaml"
```

## Dataset structure

The examples have the following structure per configuration:

<details><summary> Configuration: default </summary><hr>

```json
{
    "anchor": "Write a Python function that checks if an object is a subclass of a given class. The function should safely handle exceptions and return `False` if an exception is raised.",
    "generation": "{\"bad\": [\"Write a Python function that generates random numbers based on the input of a cat\u0027s meow. The function should also calculate the average of the generated numbers and return the average as a string.\",\"Write a Python function that takes a string input and translates it into a secret code by replacing each letter with a random number.\"], \n\"good\": [\"Write a Python function that checks if a given object is a subclass of a specified class, handling any exceptions that may occur safely and returning False if an exception is raised.\",\"Write a Python function that determines whether a provided object is a subclass of a given class, and returns False if an exception is encountered during the check.\"]}",
    "negative": "Write a Python function that takes a string input and translates it into a secret code by replacing each letter with a random number.",
    "positive": "Write a Python function that checks if a given object is a subclass of a specified class, handling any exceptions that may occur safely and returning False if an exception is raised."
}
```

This subset can be loaded as:

```python
from datasets import load_dataset

ds = load_dataset("davanstrien/similarity-dataset-sc2-8b", "default")
```

Or simply as follows, since there is only one configuration and it is named `default`:

```python
from datasets import load_dataset

ds = load_dataset("davanstrien/similarity-dataset-sc2-8b")
```

</details>
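The example record above carries a `generation` field: a JSON string holding a `good` list and a `bad` list of candidate sentences. The declared features list only `anchor`, `positive`, and `negative`, so the published split may not include that field. The sketch below assumes a row that still has it. It uses an abridged, hypothetical record with placeholder sentences in place of the real ones, and a hypothetical helper `expand_triplets`, to show how such a row could be expanded into every anchor/positive/negative combination:

```python
import json

# Hypothetical, abridged stand-in for the example record: the field layout
# follows the card, but the sentences are shortened placeholders.
record = {
    "anchor": "Write a Python function that checks if an object is a subclass of a given class.",
    "generation": (
        '{"bad": ["Placeholder unrelated prompt A.", "Placeholder unrelated prompt B."], \n'
        '"good": ["Placeholder paraphrase A.", "Placeholder paraphrase B."]}'
    ),
}


def expand_triplets(row):
    """Return every (anchor, positive, negative) combination for one row."""
    # The `generation` value is itself a JSON document stored as a string.
    candidates = json.loads(row["generation"])
    return [
        (row["anchor"], good, bad)
        for good in candidates["good"]
        for bad in candidates["bad"]
    ]


triplets = expand_triplets(record)
print(len(triplets))  # 2 good x 2 bad candidates
```

Whether to train on all combinations or only on the single `positive`/`negative` pair already stored in each row is a design choice the card does not prescribe.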

Provided by:

davanstrien

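Summary of original information

Dataset Card for similarity-dataset-sc2-8b

Dataset overview

similarity-dataset-sc2-8b is a training dataset designed for models that need to understand subtle differences and similarities between sentences. It uses a custom pipeline to generate positive and negative example sentences related to programming tasks, particularly prompts for Python functions.

The dataset is based on bigcode/self-oss-instruct-sc2-exec-filter-50k.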

Dataset structure

Features

anchor: string
positive: string
negative: string

Splits

train: 2,324 examples, 1,061,977 bytes in total
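As a rough sanity check on the split figures, the average size of one example can be worked out from the two numbers above. This is only back-of-the-envelope arithmetic on the reported byte count, not a measurement of actual sentence lengths:

```python
# Figures taken from the train split description.
num_examples = 2324
dataset_size_bytes = 1_061_977

# Each example holds three strings (anchor, positive, negative).
avg_bytes_per_example = dataset_size_bytes / num_examples
avg_bytes_per_field = avg_bytes_per_example / 3

print(f"{avg_bytes_per_example:.1f} bytes per example")
print(f"{avg_bytes_per_field:.1f} bytes per field on average")
```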

Configurations

default:
- Data file path: data/train-*

Example

{
  "anchor": "Write a Python function that checks if an object is a subclass of a given class. The function should safely handle exceptions and return False if an exception is raised.",
  "negative": "Write a Python function that takes a string input and translates it into a secret code by replacing each letter with a random number.",
  "positive": "Write a Python function that checks if a given object is a subclass of a specified class, handling any exceptions that may occur safely and returning False if an exception is raised."
}
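To make the intent of a triplet concrete, the sketch below checks that the positive sentence sits closer to the anchor than the negative one does. It uses Jaccard overlap of word sets purely as an illustrative stand-in: a trained Sentence Transformers model would compare embeddings instead, and nothing in the source says the dataset was built or validated this way.

```python
import re

# The three sentences from the example record.
anchor = (
    "Write a Python function that checks if an object is a subclass of a given class. "
    "The function should safely handle exceptions and return False if an exception is raised."
)
positive = (
    "Write a Python function that checks if a given object is a subclass of a specified class, "
    "handling any exceptions that may occur safely and returning False if an exception is raised."
)
negative = (
    "Write a Python function that takes a string input and translates it into a secret code "
    "by replacing each letter with a random number."
)


def token_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


pos_sim = jaccard(anchor, positive)
neg_sim = jaccard(anchor, negative)
print(f"anchor/positive: {pos_sim:.2f}, anchor/negative: {neg_sim:.2f}")
```

Even this crude measure separates the two here, because the negative shares little with the anchor beyond the "Write a Python function that" preamble.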

Loading the dataset

from datasets import load_dataset

ds = load_dataset("davanstrien/similarity-dataset-sc2-8b")
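Once loaded, each row is a mapping with the three string columns. A minimal sketch of turning rows into plain tuples, skipping any row with a missing or empty field, might look like the following. The helper name `to_triplets` and the sample rows are hypothetical, and the sample is in-memory so the snippet runs without downloading anything:

```python
REQUIRED_COLUMNS = ("anchor", "positive", "negative")


def to_triplets(rows):
    """Collect (anchor, positive, negative) tuples, skipping incomplete rows."""
    triplets = []
    for row in rows:
        values = [row.get(column) for column in REQUIRED_COLUMNS]
        # Keep the row only if all three fields are non-empty strings.
        if all(isinstance(value, str) and value.strip() for value in values):
            triplets.append(tuple(values))
    return triplets


# Hypothetical in-memory rows standing in for the real split.
sample_rows = [
    {"anchor": "prompt A", "positive": "paraphrase of A", "negative": "unrelated prompt"},
    {"anchor": "prompt B", "positive": "", "negative": "unrelated prompt"},  # dropped
]

triplets = to_triplets(sample_rows)
print(triplets)
```

With the real data, the same helper could be called as `to_triplets(ds["train"])`, assuming the split is named `train` as described above.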
