pacovaldez/stackoverflow-questions

Name: pacovaldez/stackoverflow-questions
Creator: pacovaldez
Published: 2022-11-10 00:14:37
License: Apache-2.0 (as declared in the dataset card)

Source: Hugging Face. Updated 2022-11-10; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/pacovaldez/stackoverflow-questions

Resource description:

---
annotations_creators:
- machine-generated
language:
- en
language_creators:
- found
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: stackoverflow_post_questions
size_categories:
- 1M<n<10M
source_datasets:
- original
tags:
- stackoverflow
- technical questions
task_categories:
- text-classification
task_ids:
- multi-class-classification
---

# Dataset Card for [Stackoverflow Post Questions]

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
  - [Contributions](#contributions)

## Dataset Description

Companies that sell open-source software tools usually hire an army of customer representatives to try to answer every question asked about their tool. The first step in this process is prioritizing the question. The classification scale usually consists of four values, P0, P1, P2, and P3, whose meanings differ from one company to the next.

On the other hand, every software developer in the world has dealt with Stack Overflow (SO); the amount of knowledge shared there is unmatched by any other website. Questions on SO are usually annotated and curated by thousands of people, which provides metadata about the quality of each question. This dataset aims to provide an accurate prioritization for programming questions.

### Dataset Summary

The dataset contains the title and body of Stack Overflow questions and a label value (0, 1, 2, or 3) that was calculated using thresholds defined by SO badges.

### Languages

English

## Dataset Structure

- title: string
- body: string
- label: int

### Data Splits

The split is 40/40/20, where classes have been balanced to be around the same size.

## Dataset Creation

The dataset was extracted and labeled with the following query in BigQuery:

```
SELECT
  title,
  body,
  CASE
    WHEN score >= 100 OR favorite_count >= 100 OR view_count >= 10000 THEN 0
    WHEN score >= 25 OR favorite_count >= 25 OR view_count >= 2500 THEN 1
    WHEN score >= 10 OR favorite_count >= 10 OR view_count >= 1000 THEN 2
    ELSE 3
  END AS label
FROM `bigquery-public-data`.stackoverflow.posts_questions
```

### Source Data

The data was extracted from the BigQuery public dataset `bigquery-public-data.stackoverflow.posts_questions`.

#### Initial Data Collection and Normalization

The original dataset had a high class imbalance:

| label | count |
| --- | --- |
| 0 | 977424 |
| 1 | 2401534 |
| 2 | 3418179 |
| 3 | 16222990 |
| Grand Total | 23020127 |

The data was sampled from each class so that every class has around the same number of records.

### Contributions

Thanks to [@pacofvf](https://github.com/pacofvf) for adding this dataset.
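The labeling rule in the BigQuery query from the dataset card can be restated outside SQL. The following is a minimal Python sketch of that `CASE` expression, not the author's code: the function name `priority_label` and the helper `_meets` are hypothetical, and treating a missing (`None`) value as "threshold not met" is an assumption meant to mirror how a SQL `NULL` comparison behaves inside `CASE WHEN`.

```python
from typing import Optional

# (label, min score, min favorite_count, min view_count), checked in order,
# copied from the thresholds in the dataset card's query.
TIERS = [
    (0, 100, 100, 10_000),
    (1, 25, 25, 2_500),
    (2, 10, 10, 1_000),
]


def _meets(value: Optional[int], threshold: int) -> bool:
    """Hypothetical helper: a missing value never satisfies a threshold."""
    return value is not None and value >= threshold


def priority_label(
    score: Optional[int],
    favorite_count: Optional[int],
    view_count: Optional[int],
) -> int:
    """Return 0..3 following the CASE expression in the dataset card."""
    for label, min_score, min_favorites, min_views in TIERS:
        if (
            _meets(score, min_score)
            or _meets(favorite_count, min_favorites)
            or _meets(view_count, min_views)
        ):
            return label
    return 3  # ELSE branch: no threshold reached
```

Because the tiers are checked from the highest thresholds down, a question that satisfies several tiers receives the lowest label number, as in the SQL. For example, `priority_label(5, None, 3000)` falls into label 1 on view count alone.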

Provided by:

pacovaldez

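Summary of Original Information

Dataset Overview

Basic Information

Name: Stackoverflow Post Questions
Language: English
License: Apache-2.0
Multilinguality: monolingual
Size: 1M<n<10M
Source: original data
Tags:
- Stackoverflow
- technical questions
Task category: text classification
Task ID: multi-class classification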

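Dataset Description

Dataset Summary

Content: the title and body of Stack Overflow questions, plus a label value (0, 1, 2, or 3).
Label calculation: computed using thresholds defined by Stack Overflow badges.

Languages

Language: English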

Dataset Structure

Fields:
- title: string
- body: string
- label: integer

Data Splits

Split ratio: 40/40/20
Balance: classes are of similar size
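The source gives only the 40/40/20 ratio; it does not name the three splits or say how records were assigned to them. As an illustration only, the sketch below shuffles a list of records and cuts it at 40% and 80%. The function name `split_40_40_20` and the fixed seed are assumptions, not part of the dataset.

```python
import random


def split_40_40_20(records, seed=0):
    """Shuffle records and cut them into 40% / 40% / 20% parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Integer arithmetic avoids floating-point rounding at the cut points.
    first_cut = len(shuffled) * 2 // 5
    second_cut = len(shuffled) * 4 // 5
    return (
        shuffled[:first_cut],
        shuffled[first_cut:second_cut],
        shuffled[second_cut:],
    )
```

Calling `split_40_40_20(range(100))` returns three lists of 40, 40, and 20 items that together contain every input exactly once.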

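Dataset Creation

Source Data

Data source: BigQuery public dataset bigquery-public-data.stackoverflow.posts_questions

Initial Data Collection and Normalization

Issue with the original data: high class imbalance
Handling: records were sampled from each class so that every class has roughly the same number of records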
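The source does not state which sampling procedure was used. One common way to get classes of roughly equal size is to downsample every class to the size of the smallest one; the sketch below shows that approach under that assumption. The function name `balance_by_label`, the `label_key` parameter, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_label(records, label_key="label", seed=0):
    """Downsample each class to the size of the smallest class."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)
    if not by_label:
        return []

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # seeded so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        # Sample without replacement so no record is duplicated.
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced
```

Applied to the class counts reported in the dataset card, this strategy would cap every class at 977,424 records (the size of label 0), for at most 3,909,696 records in total, which is consistent with the stated 1M<n<10M size category.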
