alex-atelo/datasets-github-issues
Source: Hugging Face. Updated 2024-02-23. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/alex-atelo/datasets-github-issues
Resource description:
---
dataset_info:
features:
- name: url
dtype: string
- name: repository_url
dtype: string
- name: labels_url
dtype: string
- name: comments_url
dtype: string
- name: events_url
dtype: string
- name: html_url
dtype: string
- name: id
dtype: int64
- name: node_id
dtype: string
- name: number
dtype: int64
- name: title
dtype: string
- name: user
struct:
- name: avatar_url
dtype: string
- name: events_url
dtype: string
- name: followers_url
dtype: string
- name: following_url
dtype: string
- name: gists_url
dtype: string
- name: gravatar_id
dtype: string
- name: html_url
dtype: string
- name: id
dtype: int64
- name: login
dtype: string
- name: node_id
dtype: string
- name: organizations_url
dtype: string
- name: received_events_url
dtype: string
- name: repos_url
dtype: string
- name: site_admin
dtype: bool
- name: starred_url
dtype: string
- name: subscriptions_url
dtype: string
- name: type
dtype: string
- name: url
dtype: string
- name: labels
list:
- name: color
dtype: string
- name: default
dtype: bool
- name: description
dtype: string
- name: id
dtype: int64
- name: name
dtype: string
- name: node_id
dtype: string
- name: url
dtype: string
- name: state
dtype: string
- name: locked
dtype: bool
- name: assignee
struct:
- name: avatar_url
dtype: string
- name: events_url
dtype: string
- name: followers_url
dtype: string
- name: following_url
dtype: string
- name: gists_url
dtype: string
- name: gravatar_id
dtype: string
- name: html_url
dtype: string
- name: id
dtype: int64
- name: login
dtype: string
- name: node_id
dtype: string
- name: organizations_url
dtype: string
- name: received_events_url
dtype: string
- name: repos_url
dtype: string
- name: site_admin
dtype: bool
- name: starred_url
dtype: string
- name: subscriptions_url
dtype: string
- name: type
dtype: string
- name: url
dtype: string
- name: assignees
list:
- name: avatar_url
dtype: string
- name: events_url
dtype: string
- name: followers_url
dtype: string
- name: following_url
dtype: string
- name: gists_url
dtype: string
- name: gravatar_id
dtype: string
- name: html_url
dtype: string
- name: id
dtype: int64
- name: login
dtype: string
- name: node_id
dtype: string
- name: organizations_url
dtype: string
- name: received_events_url
dtype: string
- name: repos_url
dtype: string
- name: site_admin
dtype: bool
- name: starred_url
dtype: string
- name: subscriptions_url
dtype: string
- name: type
dtype: string
- name: url
dtype: string
- name: milestone
struct:
- name: closed_at
dtype: string
- name: closed_issues
dtype: int64
- name: created_at
dtype: string
- name: creator
struct:
- name: avatar_url
dtype: string
- name: events_url
dtype: string
- name: followers_url
dtype: string
- name: following_url
dtype: string
- name: gists_url
dtype: string
- name: gravatar_id
dtype: string
- name: html_url
dtype: string
- name: id
dtype: int64
- name: login
dtype: string
- name: node_id
dtype: string
- name: organizations_url
dtype: string
- name: received_events_url
dtype: string
- name: repos_url
dtype: string
- name: site_admin
dtype: bool
- name: starred_url
dtype: string
- name: subscriptions_url
dtype: string
- name: type
dtype: string
- name: url
dtype: string
- name: description
dtype: string
- name: due_on
dtype: string
- name: html_url
dtype: string
- name: id
dtype: int64
- name: labels_url
dtype: string
- name: node_id
dtype: string
- name: number
dtype: int64
- name: open_issues
dtype: int64
- name: state
dtype: string
- name: title
dtype: string
- name: updated_at
dtype: string
- name: url
dtype: string
- name: num_comments
dtype: int64
- name: created_at
dtype: timestamp[ns, tz=UTC]
- name: updated_at
dtype: timestamp[ns, tz=UTC]
- name: closed_at
dtype: timestamp[ns, tz=UTC]
- name: author_association
dtype: string
- name: active_lock_reason
dtype: float64
- name: draft
dtype: float64
- name: pull_request
struct:
- name: diff_url
dtype: string
- name: html_url
dtype: string
- name: merged_at
dtype: string
- name: patch_url
dtype: string
- name: url
dtype: string
- name: body
dtype: string
- name: reactions
struct:
- name: '+1'
dtype: int64
- name: '-1'
dtype: int64
- name: confused
dtype: int64
- name: eyes
dtype: int64
- name: heart
dtype: int64
- name: hooray
dtype: int64
- name: laugh
dtype: int64
- name: rocket
dtype: int64
- name: total_count
dtype: int64
- name: url
dtype: string
- name: timeline_url
dtype: string
- name: performed_via_github_app
dtype: float64
- name: state_reason
dtype: string
- name: __index_level_0__
dtype: int64
- name: is_pr
dtype: bool
- name: comments
sequence: string
splits:
- name: train
num_bytes: 36763529
num_examples: 6650
download_size: 10752010
dataset_size: 36763529
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
annotations_creators:
- no-annotation
language:
- en
language_creators:
- found
license:
- wtfpl
multilinguality:
- monolingual
pretty_name: HuggingFace Datasets GitHub Issues
size_categories:
- unknown
source_datasets:
- original
tags: []
task_categories:
- text-classification
- text-retrieval
task_ids:
- multi-class-classification
- multi-label-classification
- document-retrieval
---
# Dataset Card for GitHub Issues
## Dataset Description
- **Point of Contact:** [Alex](https://huggingface.co/alex-atelo)
### Dataset Summary
GitHub Issues is a dataset consisting of GitHub issues and pull requests associated with the 🤗 Datasets [repository](https://github.com/huggingface/datasets). It is intended for educational purposes and can be used for semantic search or multilabel text classification. The contents of each GitHub issue are in English and concern the domain of datasets for NLP, computer vision, and beyond.
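As a rough illustration of the semantic-search use mentioned above, the sketch below separates plain issues from pull requests and joins each issue's text fields into one document. It relies only on the `title`, `body`, `comments`, and `is_pr` fields listed in the metadata; the helper names and the two sample rows are hypothetical and are not taken from the dataset.

```python
from __future__ import annotations

from typing import Iterable


def issue_text(row: dict) -> str:
    """Join title, body, and comments into one searchable string."""
    parts = [row.get("title") or "", row.get("body") or ""]
    parts.extend(row.get("comments") or [])
    # Skip empty pieces so a missing body does not leave blank lines.
    return "\n".join(p for p in parts if p)


def issues_only(rows: Iterable[dict]) -> list[dict]:
    """Keep rows whose is_pr flag is false, i.e. drop pull requests."""
    return [r for r in rows if not r.get("is_pr")]


# Made-up rows for illustration only. In practice the rows would probably come
# from datasets.load_dataset("alex-atelo/datasets-github-issues", split="train").
rows = [
    {"title": "Loading fails", "body": "Traceback ...",
     "comments": ["Same here"], "is_pr": False},
    {"title": "Fix typo", "body": None, "comments": [], "is_pr": True},
]
docs = [issue_text(r) for r in issues_only(rows)]
```

Whether to drop pull requests depends on the task; they are removed here only because their bodies tend to describe code changes rather than problems.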
### Supported Tasks and Leaderboards
For each of the tasks tagged for this dataset, give a brief description of the tag, metrics, and suggested models (with a link to their HuggingFace implementation if available). Give a similar description of tasks that were not covered by the structured tag set (replace the `task-category-tag` with an appropriate `other:other-task-name`).
- `task-category-tag`: The dataset can be used to train a model for [TASK NAME], which consists in [TASK DESCRIPTION]. Success on this task is typically measured by achieving a *high/low* [metric name](https://huggingface.co/metrics/metric_name). The ([model name](https://huggingface.co/model_name) or [model class](https://huggingface.co/transformers/model_doc/model_class.html)) model currently achieves the following score. *[IF A LEADERBOARD IS AVAILABLE]:* This task has an active leaderboard which can be found at [leaderboard url]() and ranks models based on [metric name](https://huggingface.co/metrics/metric_name) while also reporting [other metric name](https://huggingface.co/metrics/other_metric_name).
### Languages
Provide a brief overview of the languages represented in the dataset. Describe relevant details about specifics of the language such as whether it is social media text, African American English,...
When relevant, please provide [BCP-47 codes](https://tools.ietf.org/html/bcp47), which consist of a [primary language subtag](https://tools.ietf.org/html/bcp47#section-2.2.1), with a [script subtag](https://tools.ietf.org/html/bcp47#section-2.2.3) and/or [region subtag](https://tools.ietf.org/html/bcp47#section-2.2.4) if available.
## Dataset Structure
### Data Instances
Provide a JSON-formatted example and brief description of a typical instance in the dataset. If available, provide a link to further examples.
```
{
'example_field': ...,
...
}
```
Provide any additional information that is not covered in the other sections about the data here. In particular describe any relationships between data points and if these relationships are made explicit.
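Since the card does not include a real instance, the sketch below checks a row against a handful of the top-level fields and types declared in the metadata (`int64` as `int`, `string` as `str`, `bool` as `bool`, `sequence` as `list`). The `check_instance` helper and every value in `example` are placeholders invented for illustration, not data from the dataset.

```python
from __future__ import annotations

# A subset of top-level fields from the metadata, mapped to Python types.
EXPECTED_TYPES = {
    "url": str, "id": int, "number": int, "title": str,
    "state": str, "locked": bool, "is_pr": bool, "comments": list,
}


def check_instance(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks fine."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it for int64 fields.
        if expected is int and isinstance(value, bool):
            problems.append(f"wrong type for {name}: bool")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems


# Placeholder instance: all values are invented, only the field names are real.
example = {
    "url": "https://api.github.com/repos/huggingface/datasets/issues/1",
    "id": 1,
    "number": 1,
    "title": "Example issue title",
    "state": "open",
    "locked": False,
    "is_pr": False,
    "comments": ["Example comment"],
}
problems = check_instance(example)
```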
### Data Fields
List and describe the fields present in the dataset. Mention their data type, and whether they are used as input or output in any of the tasks the dataset currently supports. If the data has span indices, describe their attributes, such as whether they are at the character level or word level, whether they are contiguous or not, etc. If the dataset contains example IDs, state whether they have an inherent meaning, such as a mapping to other datasets or pointing to relationships between data points.
- `example_field`: description of `example_field`
Note that the descriptions can be initialized with the **Show Markdown Data Fields** output of the [tagging app](https://github.com/huggingface/datasets-tagging), you will then only need to refine the generated descriptions.
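For the multi-label classification task tagged in the metadata, the `labels` field (a list of structs with a `name` key) would probably serve as the target. A minimal sketch of turning it into multi-hot vectors follows; the helper names and the label names `bug` and `enhancement` are illustrative assumptions, not a statement about which labels the dataset contains.

```python
from __future__ import annotations


def label_names(row: dict) -> list[str]:
    """Extract label names from the list of label structs (may be empty)."""
    return [label["name"] for label in (row.get("labels") or [])]


def build_vocab(rows: list[dict]) -> dict[str, int]:
    """Map every label name seen in the rows to a stable column index."""
    names = sorted({name for row in rows for name in label_names(row)})
    return {name: index for index, name in enumerate(names)}


def multi_hot(row: dict, vocab: dict[str, int]) -> list[int]:
    """Encode one row's labels as a 0/1 vector over the vocabulary."""
    vector = [0] * len(vocab)
    for name in label_names(row):
        if name in vocab:  # ignore labels unseen when the vocab was built
            vector[vocab[name]] = 1
    return vector


# Made-up rows; only the "labels" / "name" structure follows the metadata.
rows = [
    {"labels": [{"name": "bug"}, {"name": "enhancement"}]},
    {"labels": []},
    {"labels": [{"name": "bug"}]},
]
vocab = build_vocab(rows)
targets = [multi_hot(row, vocab) for row in rows]
```

Rows with an empty `labels` list encode to an all-zero vector, which a multi-label model can treat as "no label".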
### Data Splits
Describe and name the splits in the dataset if there are more than one.
Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g. if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.
Provide the sizes of each split. As appropriate, provide any descriptive statistics for the features, such as average length. For example:
| | Train | Valid | Test |
| ----- | ------ | ----- | ---- |
| Input Sentences | | | |
| Average Sentence Length | | | |
## Dataset Creation
### Curation Rationale
What need motivated the creation of this dataset? What are some of the reasons underlying the major choices involved in putting it together?
### Source Data
This section describes the source data (e.g. news text and headlines, social media posts, translated sentences,...)
#### Initial Data Collection and Normalization
Describe the data collection process. Describe any criteria for data selection or filtering. List any key words or search terms used. If possible, include runtime information for the collection process.
If data was collected from other pre-existing datasets, link to source here and to their [Hugging Face version](https://huggingface.co/datasets/dataset_name).
If the data was modified or normalized after being collected (e.g. if the data is word-tokenized), describe the process and the tools used.
#### Who are the source language producers?
State whether the data was produced by humans or machine generated. Describe the people or systems who originally created the data.
If available, include self-reported demographic or identity information for the source data creators, but avoid inferring this information. Instead state that this information is unknown. See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender.
Describe the conditions under which the data was created (for example, if the producers were crowdworkers, state what platform was used, or if the data was found, what website the data was found on). If compensation was provided, include that information here.
Describe other people represented or mentioned in the data. Where possible, link to references for the information.
### Annotations
If the dataset contains annotations which are not part of the initial data collection, describe them in the following paragraphs.
#### Annotation process
If applicable, describe the annotation process and any tools used, or state otherwise. Describe the amount of data annotated, if not all. Describe or reference annotation guidelines provided to the annotators. If available, provide interannotator statistics. Describe any annotation validation processes.
#### Who are the annotators?
If annotations were collected for the source data (such as class labels or syntactic parses), state whether the annotations were produced by humans or machine generated.
Describe the people or systems who originally created the annotations and their selection criteria if applicable.
If available, include self-reported demographic or identity information for the annotators, but avoid inferring this information. Instead state that this information is unknown. See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender.
Describe the conditions under which the data was annotated (for example, if the annotators were crowdworkers, state what platform was used, or if the data was found, what website the data was found on). If compensation was provided, include that information here.
### Personal and Sensitive Information
State whether the dataset uses identity categories and, if so, how the information is used. Describe where this information comes from (i.e. self-reporting, collecting from profiles, inferring, etc.). See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender. State whether the data is linked to individuals and whether those individuals can be identified in the dataset, either directly or indirectly (i.e., in combination with other data).
State whether the dataset contains other data that might be considered sensitive (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history).
If efforts were made to anonymize the data, describe the anonymization process.
## Considerations for Using the Data
### Social Impact of Dataset
Please discuss some of the ways you believe the use of this dataset will impact society.
The statement should include both positive outlooks, such as outlining how technologies developed through its use may improve people's lives, and discuss the accompanying risks. These risks may range from making important decisions more opaque to people who are affected by the technology, to reinforcing existing harmful biases (whose specifics should be discussed in the next section), among other considerations.
Also describe in this section if the proposed dataset contains a low-resource or under-represented language. If this is the case or if this task has any impact on underserved communities, please elaborate here.
### Discussion of Biases
Provide descriptions of specific biases that are likely to be reflected in the data, and state whether any steps were taken to reduce their impact.
For Wikipedia text, see for example [Dinan et al 2020 on biases in Wikipedia (esp. Table 1)](https://arxiv.org/abs/2005.00614), or [Blodgett et al 2020](https://www.aclweb.org/anthology/2020.acl-main.485/) for a more general discussion of the topic.
If analyses have been run quantifying these biases, please add brief summaries and links to the studies here.
### Other Known Limitations
If studies of the datasets have outlined other limitations of the dataset, such as annotation artifacts, please outline and cite them here.
## Additional Information
### Dataset Curators
List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.
### Licensing Information
Provide the license and link to the license webpage if available.
### Citation Information
Provide the [BibTeX](http://www.bibtex.org/)-formatted reference for the dataset. For example:
```
@article{article_id,
author = {Author List},
title = {Dataset Paper Title},
journal = {Publication Venue},
year = {2525}
}
```
If the dataset has a [DOI](https://www.doi.org/), please provide it here.
```
@misc{huggingfacecourse,
author = {Hugging Face},
title = {The Hugging Face Course, 2022},
howpublished = "\url{https://huggingface.co/course}",
year = {2022},
note = "[Online; accessed <today>]"
}
```
### Contributions
Thanks to [@alex-atelo](https://huggingface.co/alex-atelo) for adding this dataset.
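Provider:
alex-atelo
Summary of original information
Dataset overview
Dataset description
Dataset summary
The GitHub Issues dataset contains GitHub issues and pull requests associated with the 🤗 Datasets repository. It is intended for educational purposes and can be used for semantic search or multi-label text classification. The contents of each GitHub issue are in English and concern datasets for NLP, computer vision, and other fields.
Supported tasks and leaderboards
The dataset can be used for the following tasks:
- Text classification: training multi-class or multi-label classification models.
- Text retrieval: training document retrieval models.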
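To make the retrieval task concrete, here is a minimal sketch that ranks documents against a query by cosine similarity over word counts. This is a deliberately simple stand-in for the embedding-based semantic search the card alludes to, not the method used by the dataset authors; the function names and the three sample documents are invented for illustration.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, docs: list[str], k: int = 3) -> list[tuple[int, float]]:
    """Return up to k (document index, score) pairs, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties keep input order
    return [(i, score) for score, i in scored[:k]]


# Invented issue titles, used only to exercise the functions.
docs = [
    "How to load a dataset offline",
    "Add a new metric for translation",
    "Loading a dataset from local files fails",
]
results = search("load dataset offline", docs)
```

Exact word matching misses paraphrases ("load" does not match "Loading" here), which is the gap that dense sentence embeddings are meant to close.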
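Languages
The text in the dataset is in English.
Dataset structure
Data instances
Each instance in the dataset contains the following fields:
- `url`: string, URL of the issue or pull request.
- `repository_url`: string, URL of the repository.
- `labels_url`: string, URL of the labels.
- `comments_url`: string, URL of the comments.
- `events_url`: string, URL of the events.
- `html_url`: string, URL of the HTML page.
- `id`: integer, unique identifier of the issue or pull request.
- `node_id`: string, node identifier.
- `number`: integer, number of the issue or pull request.
- `title`: string, title of the issue or pull request.
- `user`: struct, user information such as the avatar URL and events URL.
- `labels`: list, label information such as color, default flag, and description.
- `state`: string, state of the issue or pull request.
- `locked`: boolean, whether the item is locked.
- `assignee`: struct, assignee information such as the avatar URL and events URL.
- `assignees`: list, information on multiple assignees.
- `milestone`: struct, milestone information such as closing time and creation time.
- `num_comments`: integer, number of comments.
- `created_at`: timestamp, creation time.
- `updated_at`: timestamp, last update time.
- `closed_at`: timestamp, closing time.
- `author_association`: string, the author's association with the repository.
- `active_lock_reason`: float, lock reason.
- `draft`: float, whether the item is a draft.
- `pull_request`: struct, pull request information such as the diff URL and merge time.
- `body`: string, body of the issue or pull request.
- `reactions`: struct, reaction information such as the number of +1 and confused reactions.
- `timeline_url`: string, URL of the timeline.
- `performed_via_github_app`: float, whether the action was performed via a GitHub App.
- `state_reason`: string, reason for the state.
- `__index_level_0__`: integer, index level (a leftover pandas index column).
- `is_pr`: boolean, whether the item is a pull request.
- `comments`: sequence of strings, the comment texts.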
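The timestamp fields lend themselves to simple derived features. The sketch below computes how long an item stayed open from `created_at` and `closed_at`, assuming both arrive as timezone-aware `datetime` objects and that `closed_at` is `None` for items still open; both are assumptions about how the rows are materialised, and the helper name and sample values are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional


def time_to_close(row: dict) -> Optional[timedelta]:
    """Return closed_at - created_at, or None if either value is missing."""
    created = row.get("created_at")
    closed = row.get("closed_at")
    if created is None or closed is None:
        return None  # still open, or timestamp unavailable
    return closed - created


# Made-up timestamps in UTC, matching the tz=UTC declared in the metadata.
closed_row = {
    "created_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "closed_at": datetime(2024, 1, 3, 18, 0, tzinfo=timezone.utc),
}
open_row = {
    "created_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "closed_at": None,
}
elapsed = time_to_close(closed_row)
```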
Data splits
The dataset contains a single training split:
train: 6650 instances, 36763529 bytes in total.
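The size figures in the metadata can be turned into two quick sanity numbers: the average size of one example and the ratio of download size to dataset size. The arithmetic below uses only the `num_examples`, `dataset_size`, and `download_size` values from the metadata; reading the ratio as a compression factor is an assumption about how those two sizes are measured.

```python
# Values copied from the dataset_info block of the card metadata.
num_examples = 6650
dataset_size = 36_763_529   # bytes, same value as num_bytes for the train split
download_size = 10_752_010  # bytes

# Average number of bytes per example (title, body, comments, and metadata).
bytes_per_example = dataset_size / num_examples

# Fraction of the dataset size that has to be downloaded.
download_ratio = download_size / dataset_size

print(f"{bytes_per_example:.0f} bytes per example")
print(f"download is {download_ratio:.1%} of the dataset size")
```

This works out to roughly 5.5 kB per example, with the download being a little under 30% of the stated dataset size.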
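Dataset creation
Curation rationale
The dataset was created to provide a collection of GitHub issues and pull requests for educational and research purposes, in particular for semantic search and multi-label text classification tasks.
Source data
The source data comes from the 🤗 Datasets repository on GitHub.
Annotations
The dataset contains no additional annotations.
Personal and sensitive information
The dataset does not contain personally identifiable information or other sensitive information.
Considerations for using the data
Social impact
Use of this dataset may help develop new text classification and retrieval techniques, but attention should be paid to potential biases and data privacy issues.
Discussion of biases
The dataset may contain biases associated with the GitHub community and should be used with care.
Other known limitations
The dataset may have limitations specific to GitHub issues and pull requests, such as how frequently the data is updated.
Additional information
Dataset curators
The dataset was added by Alex.
Licensing information
The dataset is licensed under the WTFPL.
Citation information
The citation for the dataset is as follows:
@misc{huggingfacecourse,
  author = {Hugging Face},
  title = {The Hugging Face Course, 2022},
  howpublished = "\url{https://huggingface.co/course}",
  year = {2022},
  note = "[Online; accessed <today>]"
}
Contributions
Thanks to @alex-atelo for adding this dataset.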



