
mbateman/github-issues

Hugging Face, updated 2021-12-09, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/mbateman/github-issues
Resource description:
# Dataset Card for GitHub Issues

## Dataset Description

- **Point of Contact:** [Michael Bateman](michael.bateman.com@gmail.com)

### Dataset Summary

GitHub Issues is a dataset consisting of GitHub issues and pull requests associated with the 🤗 Datasets [repository](https://github.com/huggingface/datasets). It is intended for educational purposes and can be used for semantic search or multilabel text classification. The contents of each GitHub issue are in English and concern the domain of datasets for NLP, computer vision, and beyond.

### Supported Tasks and Leaderboards

For each of the tasks tagged for this dataset, give a brief description of the tag, metrics, and suggested models (with a link to their HuggingFace implementation if available). Give a similar description of tasks that were not covered by the structured tag set (replace the `task-category-tag` with an appropriate `other:other-task-name`).

- `task-category-tag`: The dataset can be used to train a model for [TASK NAME], which consists in [TASK DESCRIPTION]. Success on this task is typically measured by achieving a *high/low* [metric name](https://huggingface.co/metrics/metric_name). The ([model name](https://huggingface.co/model_name) or [model class](https://huggingface.co/transformers/model_doc/model_class.html)) model currently achieves the following score. *[IF A LEADERBOARD IS AVAILABLE]:* This task has an active leaderboard which can be found at [leaderboard url]() and ranks models based on [metric name](https://huggingface.co/metrics/metric_name) while also reporting [other metric name](https://huggingface.co/metrics/other_metric_name).

### Languages

Provide a brief overview of the languages represented in the dataset. Describe relevant details about specifics of the language such as whether it is social media text, African American English,...
When relevant, please provide [BCP-47 codes](https://tools.ietf.org/html/bcp47), which consist of a [primary language subtag](https://tools.ietf.org/html/bcp47#section-2.2.1), with a [script subtag](https://tools.ietf.org/html/bcp47#section-2.2.3) and/or [region subtag](https://tools.ietf.org/html/bcp47#section-2.2.4) if available.

## Dataset Structure

### Data Instances

Provide a JSON-formatted example and brief description of a typical instance in the dataset. If available, provide a link to further examples.

```
{
  'example_field': ...,
  ...
}
```

Provide any additional information that is not covered in the other sections about the data here. In particular describe any relationships between data points and if these relationships are made explicit.

### Data Fields

List and describe the fields present in the dataset. Mention their data type, and whether they are used as input or output in any of the tasks the dataset currently supports. If the data has span indices, describe their attributes, such as whether they are at the character level or word level, whether they are contiguous or not, etc. If the dataset contains example IDs, state whether they have an inherent meaning, such as a mapping to other datasets or pointing to relationships between data points.

- `example_field`: description of `example_field`

Note that the descriptions can be initialized with the **Show Markdown Data Fields** output of the [tagging app](https://github.com/huggingface/datasets-tagging), you will then only need to refine the generated descriptions.

### Data Splits

Describe and name the splits in the dataset if there are more than one.

Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g. if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.

Provide the sizes of each split.
As appropriate, provide any descriptive statistics for the features, such as average length. For example:

|                         | Train | Valid | Test |
| ----------------------- | ----- | ----- | ---- |
| Input Sentences         |       |       |      |
| Average Sentence Length |       |       |      |

## Dataset Creation

### Curation Rationale

What need motivated the creation of this dataset? What are some of the reasons underlying the major choices involved in putting it together?

### Source Data

This section describes the source data (e.g. news text and headlines, social media posts, translated sentences,...)

#### Initial Data Collection and Normalization

Describe the data collection process. Describe any criteria for data selection or filtering. List any key words or search terms used. If possible, include runtime information for the collection process.

If data was collected from other pre-existing datasets, link to source here and to their [Hugging Face version](https://huggingface.co/datasets/dataset_name).

If the data was modified or normalized after being collected (e.g. if the data is word-tokenized), describe the process and the tools used.

#### Who are the source language producers?

State whether the data was produced by humans or machine generated. Describe the people or systems who originally created the data.

If available, include self-reported demographic or identity information for the source data creators, but avoid inferring this information. Instead state that this information is unknown. See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender.

Describe the conditions under which the data was created (for example, if the producers were crowdworkers, state what platform was used, or if the data was found, what website the data was found on). If compensation was provided, include that information here.

Describe other people represented or mentioned in the data. Where possible, link to references for the information.
### Annotations

If the dataset contains annotations which are not part of the initial data collection, describe them in the following paragraphs.

#### Annotation process

If applicable, describe the annotation process and any tools used, or state otherwise. Describe the amount of data annotated, if not all. Describe or reference annotation guidelines provided to the annotators. If available, provide interannotator statistics. Describe any annotation validation processes.

#### Who are the annotators?

If annotations were collected for the source data (such as class labels or syntactic parses), state whether the annotations were produced by humans or machine generated.

Describe the people or systems who originally created the annotations and their selection criteria if applicable.

If available, include self-reported demographic or identity information for the annotators, but avoid inferring this information. Instead state that this information is unknown. See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender.

Describe the conditions under which the data was annotated (for example, if the annotators were crowdworkers, state what platform was used, or if the data was found, what website the data was found on). If compensation was provided, include that information here.

### Personal and Sensitive Information

State whether the dataset uses identity categories and, if so, how the information is used. Describe where this information comes from (i.e. self-reporting, collecting from profiles, inferring, etc.). See [Larson 2017](https://www.aclweb.org/anthology/W17-1601.pdf) for using identity categories as variables, particularly gender. State whether the data is linked to individuals and whether those individuals can be identified in the dataset, either directly or indirectly (i.e., in combination with other data).
State whether the dataset contains other data that might be considered sensitive (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history).

If efforts were made to anonymize the data, describe the anonymization process.

## Considerations for Using the Data

### Social Impact of Dataset

Please discuss some of the ways you believe the use of this dataset will impact society.

The statement should include both positive outlooks, such as outlining how technologies developed through its use may improve people's lives, and discuss the accompanying risks. These risks may range from making important decisions more opaque to people who are affected by the technology, to reinforcing existing harmful biases (whose specifics should be discussed in the next section), among other considerations.

Also describe in this section if the proposed dataset contains a low-resource or under-represented language. If this is the case or if this task has any impact on underserved communities, please elaborate here.

### Discussion of Biases

Provide descriptions of specific biases that are likely to be reflected in the data, and state whether any steps were taken to reduce their impact.

For Wikipedia text, see for example [Dinan et al 2020 on biases in Wikipedia (esp. Table 1)](https://arxiv.org/abs/2005.00614), or [Blodgett et al 2020](https://www.aclweb.org/anthology/2020.acl-main.485/) for a more general discussion of the topic.

If analyses have been run quantifying these biases, please add brief summaries and links to the studies here.

### Other Known Limitations

If studies of the datasets have outlined other limitations of the dataset, such as annotation artifacts, please outline and cite them here.
## Additional Information

### Dataset Curators

List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.

### Licensing Information

Provide the license and link to the license webpage if available.

### Citation Information

Provide the [BibTex](http://www.bibtex.org/)-formatted reference for the dataset. For example:

```
@article{article_id,
  author  = {Author List},
  title   = {Dataset Paper Title},
  journal = {Publication Venue},
  year    = {2525}
}
```

If the dataset has a [DOI](https://www.doi.org/), please provide it here.

### Contributions

Thanks to [@mbateman](https://github.com/mbateman) for adding this dataset.
Provider:
mbateman
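Summary of the original information

Dataset overview

Dataset name

GitHub Issues

Dataset description

GitHub Issues is a dataset of GitHub issues and pull requests associated with the Hugging Face Datasets repository. It is intended primarily for educational purposes and is suited to semantic search or multilabel text classification. All content is in English and covers discussions of datasets for natural language processing, computer vision, and other domains.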
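To make the semantic-search use case concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the dataset author's method: the card does not document the dataset's fields, so the `title` and `body` keys and the three sample records are hypothetical. Plain bag-of-words cosine similarity stands in for the learned sentence embeddings a real semantic-search system would more likely use.

```python
import math
import re
from collections import Counter

# Hypothetical issue records: the card does not document the real field
# names, so "title" and "body" are assumptions made for illustration.
ISSUES = [
    {"title": "load_dataset fails on large CSV files",
     "body": "Memory error when loading a CSV dataset."},
    {"title": "Add a new image classification dataset",
     "body": "Request to add a computer vision dataset."},
    {"title": "Typo in documentation",
     "body": "The README has a spelling mistake."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def vectorize(text):
    """Return a bag-of-words term-count vector for the text."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, issues, top_k=2):
    """Rank issues by similarity to the query; return (score, title) pairs."""
    q = vectorize(query)
    scored = [
        (cosine(q, vectorize(issue["title"] + " " + issue["body"])), issue["title"])
        for issue in issues
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


results = search("csv loading memory error", ISSUES)
```

Swapping `vectorize` for an embedding model would turn this keyword matcher into semantic search proper; the ranking logic would stay the same.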

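Supported tasks and metrics

The dataset can be used to train a model for a specific task such as [task name], which consists of [task description]. Success on the task is typically measured by achieving a high/low [metric name]. The [model name] or [model class] model currently achieves the following score. If an active leaderboard exists, it can be found at [leaderboard url]; it ranks models by [metric name] and also reports [other metric name].

Language information

All content in the dataset is in English and does not involve any specific dialect or variety.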
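The card's Languages section asks for BCP-47 codes built from a primary language subtag plus optional script and region subtags; for English-only content the primary subtag is `en`. The following sketch splits such a tag into its parts. It is a deliberately simplified illustration: the helper name `parse_tag` is hypothetical, and the pattern covers only the `language[-script][-region]` shape, ignoring variants, extensions, and private-use subtags that the full specification allows.

```python
import re

# Simplified pattern: language[-script][-region].
# Assumption: variants, extensions, and private-use subtags are out of scope.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag):
    """Split a simple BCP-47 tag into language, script, and region subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return {
        # Conventional casing: language lowercase, script title case,
        # region uppercase.
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


english = parse_tag("en")
american_english = parse_tag("en-US")
```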

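Dataset structure

Data instances

A typical instance in the dataset is represented in JSON format, for example: `{ 'example_field': ..., ... }`

Data fields

The dataset contains multiple fields, each with its own data type and purpose. For example:

  • example_field: description of `example_field`.

Data splits

The dataset may contain multiple splits, such as training, validation, and test sets. The size and characteristics of each split are to be provided in the corresponding section.

Dataset creation

Source data

The source data consists of GitHub issues and pull requests associated with the Hugging Face Datasets repository. The collection process involves specific selection criteria and keywords.

Annotations

If the dataset contains annotations that were not part of the initial data collection, this section is to describe the annotation process, the tools used, and the annotators.

Considerations for using the data

Social impact

Possible effects of using this dataset on society include technological progress, greater transparency in decision-making, and the potential reinforcement of biases.

Discussion of biases

Specific biases that may be present in the dataset, and any measures taken to reduce them, are to be discussed in this section.

Additional information

Dataset maintainers

The dataset's maintainers include Michael Bateman, among others.

Licensing information

The dataset's license information is to be provided in this section.

Citation information

The dataset's citation information, including a BibTex entry and a DOI (if any), is to be provided in this section.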

The summary above was collected and generated by 遇见数据集.