UniverseTBD/arxiv-bit-flip-cs.LG

Name: UniverseTBD/arxiv-bit-flip-cs.LG
Creator: UniverseTBD
Published: 2023-09-24 00:11:18
License: No description available

Hugging Face | Updated 2023-09-24 | Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/UniverseTBD/arxiv-bit-flip-cs.LG


Resource description:

---
dataset_info:
  features:
  - name: bit
    dtype: string
  - name: flip
    dtype: string
  - name: title
    dtype: string
  - name: categories
    dtype: string
  - name: abstract
    dtype: string
  - name: authors
    dtype: string
  - name: doi
    dtype: string
  - name: id
    dtype: string
  splits:
  - name: train
    num_bytes: 229044314
    num_examples: 100039
  download_size: 127335112
  dataset_size: 229044314
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for "arxiv-bit-flip-cs.LG"

This dataset contains "Bit-Flips," structured representations extracted from the abstracts of ArXiv papers, specifically in the category of `cs.LG` (Machine Learning). These Bit-Flips aim to encapsulate the essence of the research by identifying the conventional belief or 'status quo' the abstract challenges (Bit) and the counterargument or innovative approach that flips the Bit (Flip).

## Bit-Flip Concept

A Bit-Flip serves as a two-part schema:

* _Bit_: It identifies the conventional belief or assumption that the research implicitly or explicitly challenges. It is composed of three sentences that are logically connected.
* _Flip_: It formulates the counterargument or innovative approach that flips the conventional belief or Bit. It also consists of three logically connected sentences.

## Data Collection

The dataset focuses on the ArXiv category of `cs.LG` (Machine Learning). The dataset was created to understand the paradigm shifts or challenges to conventional wisdom that are presented in new research, encapsulated through the Bit-Flip schema.

## Methodology

The data was processed using a Python script that performs the following steps:

1. The script generates a custom prompt based on each abstract, using a predefined template that explains the Bit-Flip concept.
2. An Azure model is used to generate a response to the custom prompt.
3. The response is parsed to extract a JSON-like structure containing the Bit and the Flip.
4. Each Bit and Flip is saved along with the title of the paper.
5. The script uses multithreading to speed up the data processing and can handle a batch of abstracts in each run.

The processed data is saved in a CSV file.
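The methodology steps can be illustrated with a minimal sketch. This is not the authors' script: the prompt wording, the `bit`/`flip` JSON keys in the model reply, the function names, and the `fake_model` stand-in for the Azure-hosted model call are all assumptions made for illustration. It uses only the Python standard library.

```python
import csv
import json
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical template; the card does not give the actual prompt wording.
PROMPT_TEMPLATE = (
    "A Bit is the conventional belief an abstract challenges; "
    "a Flip is the approach that overturns it. "
    'Reply with JSON of the form {{"bit": "...", "flip": "..."}}.\n\n'
    "Abstract: {abstract}"
)


def build_prompt(abstract):
    """Step 1: build a custom prompt from one abstract."""
    return PROMPT_TEMPLATE.format(abstract=abstract)


def fake_model(prompt):
    """Step 2 stand-in: a real run would call the hosted model here."""
    return 'Sure. {"bit": "Status quo.", "flip": "New approach."}'


def parse_bit_flip(response):
    """Step 3: take the span from the first '{' to the last '}' and parse it."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or "bit" not in payload or "flip" not in payload:
        return None
    return {"bit": payload["bit"], "flip": payload["flip"]}


def process(paper, call_model=fake_model):
    """Step 4: pair the parsed Bit and Flip with the paper title."""
    parsed = parse_bit_flip(call_model(build_prompt(paper["abstract"])))
    if parsed is None:
        return None
    return {"title": paper["title"], **parsed}


def process_batch(papers, call_model=fake_model, max_workers=4):
    """Step 5: process a batch of abstracts with a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(lambda p: process(p, call_model), papers))
    # Drop papers whose response could not be parsed.
    return [row for row in rows if row is not None]


def write_csv(rows, path):
    """Save the processed rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "bit", "flip"])
        writer.writeheader()
        writer.writerows(rows)
```

A thread pool suits this workload because each item spends most of its time waiting on a remote model call rather than computing. `pool.map` returns results in input order, so rows stay aligned with the batch of papers.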

Provided by:

UniverseTBD
