alex-atelo/datasets-github-issues

Name: alex-atelo/datasets-github-issues
Creator: alex-atelo
Published: 2024-02-23 02:10:13
License: none listed on this page (the dataset card below declares WTFPL)

Source: Hugging Face. Updated 2024-02-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/alex-atelo/datasets-github-issues
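
A minimal sketch of fetching the data programmatically, assuming the `datasets` Python package is installed and the machine has network access. The helper names (`dataset_page_url`, `load_issues`) are hypothetical, and routing traffic through the mirror via the `HF_ENDPOINT` environment variable is an assumption about how the client is configured, not something this page documents.

```python
import os

DATASET_ID = "alex-atelo/datasets-github-issues"
MIRROR_ENDPOINT = "https://hf-mirror.com"


def dataset_page_url(dataset_id: str, endpoint: str = MIRROR_ENDPOINT) -> str:
    """Build a dataset page URL in the same form as the link above."""
    return f"{endpoint.rstrip('/')}/datasets/{dataset_id}"


def load_issues(use_mirror: bool = False):
    """Load the single `train` split (requires `datasets` and network access)."""
    if use_mirror:
        # Assumption: the Hub client reads HF_ENDPOINT, and the variable has
        # to be set before `datasets` is imported for the first time.
        os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train")


print(dataset_page_url(DATASET_ID))
```

Calling `load_issues()` should return the `train` split; nothing is downloaded until that function is called.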


Resource description:

---
dataset_info:
  features:
  - name: url
    dtype: string
  - name: repository_url
    dtype: string
  - name: labels_url
    dtype: string
  - name: comments_url
    dtype: string
  - name: events_url
    dtype: string
  - name: html_url
    dtype: string
  - name: id
    dtype: int64
  - name: node_id
    dtype: string
  - name: number
    dtype: int64
  - name: title
    dtype: string
  - name: user
    struct:
    - name: avatar_url
      dtype: string
    - name: events_url
      dtype: string
    - name: followers_url
      dtype: string
    - name: following_url
      dtype: string
    - name: gists_url
      dtype: string
    - name: gravatar_id
      dtype: string
    - name: html_url
      dtype: string
    - name: id
      dtype: int64
    - name: login
      dtype: string
    - name: node_id
      dtype: string
    - name: organizations_url
      dtype: string
    - name: received_events_url
      dtype: string
    - name: repos_url
      dtype: string
    - name: site_admin
      dtype: bool
    - name: starred_url
      dtype: string
    - name: subscriptions_url
      dtype: string
    - name: type
      dtype: string
    - name: url
      dtype: string
  - name: labels
    list:
    - name: color
      dtype: string
    - name: default
      dtype: bool
    - name: description
      dtype: string
    - name: id
      dtype: int64
    - name: name
      dtype: string
    - name: node_id
      dtype: string
    - name: url
      dtype: string
  - name: state
    dtype: string
  - name: locked
    dtype: bool
  - name: assignee
    struct:
    - name: avatar_url
      dtype: string
    - name: events_url
      dtype: string
    - name: followers_url
      dtype: string
    - name: following_url
      dtype: string
    - name: gists_url
      dtype: string
    - name: gravatar_id
      dtype: string
    - name: html_url
      dtype: string
    - name: id
      dtype: int64
    - name: login
      dtype: string
    - name: node_id
      dtype: string
    - name: organizations_url
      dtype: string
    - name: received_events_url
      dtype: string
    - name: repos_url
      dtype: string
    - name: site_admin
      dtype: bool
    - name: starred_url
      dtype: string
    - name: subscriptions_url
      dtype: string
    - name: type
      dtype: string
    - name: url
      dtype: string
  - name: assignees
    list:
    - name: avatar_url
      dtype: string
    - name: events_url
      dtype: string
    - name: followers_url
      dtype: string
    - name: following_url
      dtype: string
    - name: gists_url
      dtype: string
    - name: gravatar_id
      dtype: string
    - name: html_url
      dtype: string
    - name: id
      dtype: int64
    - name: login
      dtype: string
    - name: node_id
      dtype: string
    - name: organizations_url
      dtype: string
    - name: received_events_url
      dtype: string
    - name: repos_url
      dtype: string
    - name: site_admin
      dtype: bool
    - name: starred_url
      dtype: string
    - name: subscriptions_url
      dtype: string
    - name: type
      dtype: string
    - name: url
      dtype: string
  - name: milestone
    struct:
    - name: closed_at
      dtype: string
    - name: closed_issues
      dtype: int64
    - name: created_at
      dtype: string
    - name: creator
      struct:
      - name: avatar_url
        dtype: string
      - name: events_url
        dtype: string
      - name: followers_url
        dtype: string
      - name: following_url
        dtype: string
      - name: gists_url
        dtype: string
      - name: gravatar_id
        dtype: string
      - name: html_url
        dtype: string
      - name: id
        dtype: int64
      - name: login
        dtype: string
      - name: node_id
        dtype: string
      - name: organizations_url
        dtype: string
      - name: received_events_url
        dtype: string
      - name: repos_url
        dtype: string
      - name: site_admin
        dtype: bool
      - name: starred_url
        dtype: string
      - name: subscriptions_url
        dtype: string
      - name: type
        dtype: string
      - name: url
        dtype: string
    - name: description
      dtype: string
    - name: due_on
      dtype: string
    - name: html_url
      dtype: string
    - name: id
      dtype: int64
    - name: labels_url
      dtype: string
    - name: node_id
      dtype: string
    - name: number
      dtype: int64
    - name: open_issues
      dtype: int64
    - name: state
      dtype: string
    - name: title
      dtype: string
    - name: updated_at
      dtype: string
    - name: url
      dtype: string
  - name: num_comments
    dtype: int64
  - name: created_at
    dtype: timestamp[ns, tz=UTC]
  - name: updated_at
    dtype: timestamp[ns, tz=UTC]
  - name: closed_at
    dtype: timestamp[ns, tz=UTC]
  - name: author_association
    dtype: string
  - name: active_lock_reason
    dtype: float64
  - name: draft
    dtype: float64
  - name: pull_request
    struct:
    - name: diff_url
      dtype: string
    - name: html_url
      dtype: string
    - name: merged_at
      dtype: string
    - name: patch_url
      dtype: string
    - name: url
      dtype: string
  - name: body
    dtype: string
  - name: reactions
    struct:
    - name: '+1'
      dtype: int64
    - name: '-1'
      dtype: int64
    - name: confused
      dtype: int64
    - name: eyes
      dtype: int64
    - name: heart
      dtype: int64
    - name: hooray
      dtype: int64
    - name: laugh
      dtype: int64
    - name: rocket
      dtype: int64
    - name: total_count
      dtype: int64
    - name: url
      dtype: string
  - name: timeline_url
    dtype: string
  - name: performed_via_github_app
    dtype: float64
  - name: state_reason
    dtype: string
  - name: __index_level_0__
    dtype: int64
  - name: is_pr
    dtype: bool
  - name: comments
    sequence: string
  splits:
  - name: train
    num_bytes: 36763529
    num_examples: 6650
  download_size: 10752010
  dataset_size: 36763529
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
annotations_creators:
- no-annotation
language:
- en
language_creators:
- found
license:
- wtfpl
multilinguality:
- monolingual
pretty_name: HuggingFace Datasets GitHub Issues
size_categories:
- unknown
source_datasets:
- original
tags: []
task_categories:
- text-classification
- text-retrieval
task_ids:
- multi-class-classification
- multi-label-classification
- document-retrieval
---

# Dataset Card for GitHub Issues

## Dataset Description

- **Point of Contact:** [Alex](https://huggingface.co/alex-atelo)

### Dataset Summary

GitHub Issues is a dataset consisting of GitHub issues and pull requests associated with the 🤗 Datasets [repository](https://github.com/huggingface/datasets). It is intended for educational purposes and can be used for semantic search or multilabel text classification. The contents of each GitHub issue are in English and concern the domain of datasets for NLP, computer vision, and beyond.

### Supported Tasks and Leaderboards

For each of the tasks tagged for this dataset, give a brief description of the tag, metrics, and suggested models (with a link to their HuggingFace implementation if available). Give a similar description of tasks that were not covered by the structured tag set (replace the `task-category-tag` with an appropriate `other:other-task-name`).

- `task-category-tag`: The dataset can be used to train a model for [TASK NAME], which consists in [TASK DESCRIPTION]. Success on this task is typically measured by achieving a *high/low* [metric name](https://huggingface.co/metrics/metric_name). The ([model name](https://huggingface.co/model_name) or [model class](https://huggingface.co/transformers/model_doc/model_class.html)) model currently achieves the following score. *[IF A LEADERBOARD IS AVAILABLE]:* This task has an active leaderboard which can be found at [leaderboard url]() and ranks models based on [metric name](https://huggingface.co/metrics/metric_name) while also reporting [other metric name](https://huggingface.co/metrics/other_metric_name).

### Languages

Provide a brief overview of the languages represented in the dataset. Describe relevant details about specifics of the language such as whether it is social media text, African American English,...

When relevant, please provide [BCP-47 codes](https://tools.ietf.org/html/bcp47), which consist of a [primary language subtag](https://tools.ietf.org/html/bcp47#section-2.2.1), with a [script subtag](https://tools.ietf.org/html/bcp47#section-2.2.3) and/or [region subtag](https://tools.ietf.org/html/bcp47#section-2.2.4) if available.

## Dataset Structure

### Data Instances

Provide a JSON-formatted example and brief description of a typical instance in the dataset. If available, provide a link to further examples.

```
{
  'example_field': ...,
  ...
}
```

Provide any additional information that is not covered in the other sections about the data here. In particular describe any relationships between data points and if these relationships are made explicit.

### Data Fields

List and describe the fields present in the dataset. Mention their data type, and whether they are used as input or output in any of the tasks the dataset currently supports. If the data has span indices, describe their attributes, such as whether they are at the character level or word level, whether they are contiguous or not, etc. If the dataset contains example IDs, state whether they have an inherent meaning, such as a mapping to other datasets or pointing to relationships between data points.

- `example_field`: description of `example_field`

Note that the descriptions can be initialized with the **Show Markdown Data Fields** output of the [tagging app](https://github.com/huggingface/datasets-tagging), you will then only need to refine the generated descriptions.

### Data Splits

Describe and name the splits in the dataset if there are more than one.

Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g. if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.

Provide the sizes of each split. As appropriate, provide any descriptive statistics for the features, such as average length. For example:

|                         | Train | Valid | Test |
| ----------------------- | ----- | ----- | ---- |
| Input Sentences         |       |       |      |
| Average Sentence Length |       |       |      |

## Dataset Creation

### Curation Rationale

What need motivated the creation of this dataset? What are some of the reasons underlying the major choices involved in putting it together?

### Source Data

This section describes the source data (e.g. news text and headlines, social media posts, translated sentences,...)

#### Initial Data Collection and Normalization

Describe the data collection process. Describe any criteria for data selection or filtering. List any key words or search terms used. If possible, include runtime information for the collection process.

If data was collected from other pre-existing datasets, link to source here and to their [Hugging Face version](https://huggingface.co/datasets/dataset_name).

If the data was modified or normalized after being collected (e.g. if the data is word-tokenized), describe the process and the tools used.

#### Who are the source language producers?

State whether the data was produced by humans or machine generated. Describe the people or systems who originally created the data.

If available, include self-reported demographic or identity information for the source data creators, but avoid inferring this information. Instead state that this information is unknown. See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender.

Describe the conditions under which the data was created (for example, if the producers were crowdworkers, state what platform was used, or if the data was found, what website the data was found on). If compensation was provided, include that information here.

Describe other people represented or mentioned in the data. Where possible, link to references for the information.

### Annotations

If the dataset contains annotations which are not part of the initial data collection, describe them in the following paragraphs.

#### Annotation process

If applicable, describe the annotation process and any tools used, or state otherwise. Describe the amount of data annotated, if not all. Describe or reference annotation guidelines provided to the annotators. If available, provide interannotator statistics. Describe any annotation validation processes.

#### Who are the annotators?

If annotations were collected for the source data (such as class labels or syntactic parses), state whether the annotations were produced by humans or machine generated.

Describe the people or systems who originally created the annotations and their selection criteria if applicable.

If available, include self-reported demographic or identity information for the annotators, but avoid inferring this information. Instead state that this information is unknown. See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender.

Describe the conditions under which the data was annotated (for example, if the annotators were crowdworkers, state what platform was used, or if the data was found, what website the data was found on). If compensation was provided, include that information here.

### Personal and Sensitive Information

State whether the dataset uses identity categories and, if so, how the information is used. Describe where this information comes from (i.e. self-reporting, collecting from profiles, inferring, etc.). See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender. State whether the data is linked to individuals and whether those individuals can be identified in the dataset, either directly or indirectly (i.e., in combination with other data).

State whether the dataset contains other data that might be considered sensitive (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history).

If efforts were made to anonymize the data, describe the anonymization process.

## Considerations for Using the Data

### Social Impact of Dataset

Please discuss some of the ways you believe the use of this dataset will impact society.

The statement should include both positive outlooks, such as outlining how technologies developed through its use may improve people's lives, and discuss the accompanying risks. These risks may range from making important decisions more opaque to people who are affected by the technology, to reinforcing existing harmful biases (whose specifics should be discussed in the next section), among other considerations.

Also describe in this section if the proposed dataset contains a low-resource or under-represented language. If this is the case or if this task has any impact on underserved communities, please elaborate here.

### Discussion of Biases

Provide descriptions of specific biases that are likely to be reflected in the data, and state whether any steps were taken to reduce their impact.

For Wikipedia text, see for example [Dinan et al 2020 on biases in Wikipedia (esp. Table 1)](https://arxiv.org/abs/2005.00614), or [Blodgett et al 2020](https://www.aclweb.org/anthology/2020.acl-main.485/) for a more general discussion of the topic.

If analyses have been run quantifying these biases, please add brief summaries and links to the studies here.

### Other Known Limitations

If studies of the datasets have outlined other limitations of the dataset, such as annotation artifacts, please outline and cite them here.

## Additional Information

### Dataset Curators

List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.

### Licensing Information

Provide the license and link to the license webpage if available.

### Citation Information

Provide the [BibTex](http://www.bibtex.org/)-formatted reference for the dataset. For example:

```
@article{article_id,
  author = {Author List},
  title = {Dataset Paper Title},
  journal = {Publication Venue},
  year = {2525}
}
```

If the dataset has a [DOI](https://www.doi.org/), please provide it here.

```
@misc{huggingfacecourse,
  author = {Hugging Face},
  title = {The Hugging Face Course, 2022},
  howpublished = "\url{https://huggingface.co/course}",
  year = {2022},
  note = "[Online; accessed <today>]"
}
```

### Contributions

Thanks to [@alex-atelo](https://huggingface.co/alex-atelo) for adding this dataset.

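Provided by:

alex-atelo

Summary of original information

Dataset overview

Dataset description

Dataset summary

The GitHub Issues dataset contains GitHub issues and pull requests associated with the 🤗 Datasets repository. It is intended for educational purposes and can be used for semantic search or multi-label text classification. The content of each GitHub issue is in English and concerns datasets for NLP, computer vision, and other domains.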

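Supported tasks and leaderboards

The dataset can be used for the following tasks:

Text classification: training multi-class or multi-label classification models.
Text retrieval: training document retrieval models.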
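
For the multi-label classification task, each issue's `labels` list has to be turned into a fixed-length target vector. The following is a minimal sketch of that encoding, assuming each record is a plain dict whose `labels` field is a list of dicts with a `name` key, as the schema suggests. The helper names and the three sample records are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Sequence


def label_names(issue: Dict) -> List[str]:
    """Return the `name` of every entry in the issue's `labels` list."""
    return [label["name"] for label in issue.get("labels") or []]


def build_vocabulary(issues: Sequence[Dict]) -> List[str]:
    """Collect the distinct label names in a stable (sorted) order."""
    return sorted({name for issue in issues for name in label_names(issue)})


def multi_hot(issue: Dict, vocabulary: Sequence[str]) -> List[int]:
    """Encode an issue's labels as a 0/1 vector aligned with `vocabulary`."""
    present = set(label_names(issue))
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical records, for illustration only.
sample = [
    {"title": "Loading fails on Windows", "labels": [{"name": "bug"}]},
    {
        "title": "Add a new loader",
        "labels": [{"name": "enhancement"}, {"name": "good first issue"}],
    },
    {"title": "Question about caching", "labels": []},
]

vocab = build_vocabulary(sample)
targets = [multi_hot(issue, vocab) for issue in sample]
```

An issue without labels maps to an all-zero vector, which a multi-label model can treat as "no label applies".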

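Languages

The text in the dataset is in English.

Dataset structure

Data instances

Each instance in the dataset contains the following fields:

url: string, URL of the issue or pull request.
repository_url: string, URL of the repository.
labels_url: string, URL of the labels.
comments_url: string, URL of the comments.
events_url: string, URL of the events.
html_url: string, URL of the HTML page.
id: integer, unique identifier of the issue or pull request.
node_id: string, node identifier.
number: integer, number of the issue or pull request.
title: string, title of the issue or pull request.
user: struct, user information such as avatar URL and events URL.
labels: list, label information such as color, default flag, and description.
state: string, state of the issue or pull request.
locked: boolean, whether the item is locked.
assignee: struct, assignee information such as avatar URL and events URL.
assignees: list, information on multiple assignees.
milestone: struct, milestone information such as closing time and creation time.
num_comments: integer, number of comments.
created_at: timestamp, creation time.
updated_at: timestamp, last update time.
closed_at: timestamp, closing time.
author_association: string, the author's association with the repository.
active_lock_reason: float, reason for the lock.
draft: float, whether the item is a draft.
pull_request: struct, pull request information such as diff URL and merge time.
body: string, body text of the issue or pull request.
reactions: struct, reaction information such as thumbs-up and confused counts.
timeline_url: string, URL of the timeline.
performed_via_github_app: float, whether the action was performed via a GitHub App.
state_reason: string, reason for the state.
__index_level_0__: integer, index level.
is_pr: boolean, whether the item is a pull request.
comments: sequence, the comment texts.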
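
As an illustration of how these fields fit together, the sketch below separates issues from pull requests with `is_pr`, measures time to close from `created_at` and `closed_at`, and joins `title`, `body`, and `comments` into one text for semantic search. It assumes records are plain dicts and that the timestamp fields arrive as timezone-aware `datetime` objects. The helper names and both sample records are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def is_pull_request(record: Dict) -> bool:
    """True when the record is flagged as a pull request via `is_pr`."""
    return bool(record.get("is_pr"))


def hours_to_close(record: Dict) -> Optional[float]:
    """Hours between `created_at` and `closed_at`, or None if still open."""
    created, closed = record.get("created_at"), record.get("closed_at")
    if created is None or closed is None:
        return None
    return (closed - created).total_seconds() / 3600.0


def searchable_text(record: Dict) -> str:
    """Join title, body, and comments into one string, skipping empty parts."""
    parts = [record.get("title") or "", record.get("body") or ""]
    parts.extend(record.get("comments") or [])
    return "\n".join(part for part in parts if part)


# Hypothetical records, for illustration only.
closed_issue = {
    "title": "Example issue",
    "body": "Something is broken.",
    "is_pr": False,
    "created_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "closed_at": datetime(2024, 1, 2, 18, 0, tzinfo=timezone.utc),
    "comments": ["Thanks, fixed."],
}
open_pr = {
    "title": "Example pull request",
    "body": None,
    "is_pr": True,
    "created_at": datetime(2024, 1, 3, 9, 0, tzinfo=timezone.utc),
    "closed_at": None,
    "comments": [],
}

issues_only = [r for r in (closed_issue, open_pr) if not is_pull_request(r)]
```

Treating a missing `closed_at` as "still open" is a design choice of this sketch; filtering on `state` would be an alternative.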

Data splits

The dataset has a single training split:

train: 6,650 instances, 36,763,529 bytes in total.
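
The split figures can be put in perspective with a little arithmetic. This sketch uses only the numbers reported on this page: the example count and byte size above, plus the `download_size` of 10752010 bytes from the card metadata. The variable names are illustrative.

```python
NUM_EXAMPLES = 6650
DATASET_SIZE_BYTES = 36_763_529   # num_bytes / dataset_size in the metadata
DOWNLOAD_SIZE_BYTES = 10_752_010  # download_size in the metadata

# Average size of one record, and how small the download is by comparison.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"{avg_bytes_per_example:.0f} bytes per example on average")
print(f"download is {download_ratio:.1%} of the reported dataset size")
```

Each record averages roughly 5.5 KB, and the downloadable files are a little under a third of the reported dataset size.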

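Dataset creation

Curation rationale

The dataset was created to provide GitHub issues and pull requests for educational and research use, in particular for semantic search and multi-label text classification.

Source data

The source data comes from the 🤗 Datasets repository on GitHub.

Annotations

The dataset contains no additional annotations.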

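Personal and sensitive information

This summary states that the dataset contains no personally identifiable or other sensitive information. Note, however, that the schema includes public GitHub account fields such as the `login` and `html_url` entries under `user`, `assignee`, and `assignees`.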

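Considerations for using the data

Social impact

Use of this dataset may help in developing new text classification and retrieval techniques, but potential biases and data privacy issues should be kept in mind.

Discussion of biases

The dataset may reflect biases associated with the GitHub community and should be used with care.

Other known limitations

The dataset may have limitations specific to GitHub issues and pull requests, such as how often the data is updated.

Additional information

Dataset curators

The dataset was added by Alex.

Licensing information

The dataset is licensed under WTFPL.

Citation information

The citation for the dataset is:

@misc{huggingfacecourse, author = {Hugging Face}, title = {The Hugging Face Course, 2022}, howpublished = "\url{https://huggingface.co/course}", year = {2022}, note = "[Online; accessed <today>]" }

Contributions

Thanks to @alex-atelo for adding this dataset.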
