
the-stack-github-issues

ModelScope Community. Updated 2025-12-05; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/bigcode/the-stack-github-issues
Resource description:
## Dataset Description

This dataset contains conversations from GitHub issues and pull requests. Each conversation consists of a series of events, such as opening an issue, creating a comment, or closing the issue, and includes the author's username, text, action, and identifiers such as the issue ID and number. The dataset, which is mostly in English, has a total size of 54GB and 30.9M files.

## Dataset Structure

```python
from datasets import load_dataset

dataset = load_dataset("bigcode/the-stack-github-issues")
dataset
```

```
Dataset({
    features: ['repo', 'issue_id', 'issue_number', 'pull_request', 'events', 'text_size', 'content', 'usernames'],
    num_rows: 30982955
})
```

- `content` contains the full text of the conversation, concatenated with special tokens: `<issue_start>` for the beginning of the issue, `<issue_comment>` before each comment, and `<issue_closed>` if the conversation is closed. Each comment is prefixed with `username_{i}:`, where `username_{i}` is the mask for author `i`. This column is intended for model training, so that models learn the structure of the conversation without memorizing usernames.
- `events` contains the detailed events from which we built `content`. It also includes each author's username and the mask used for it.

Below is an example:

````
{'content': '<issue_start><issue_comment>Title: Click Save: Sorry, Cannot Write\n'
            'username_0: Hi all, Edit a file in Ice, click Save Icon\n'
            'Get error message: Sorry, cannot write /var/www/index.html ... Edit: Also getting error: Cannot Zip Files up.\n'
            '<issue_comment>username_1: hi there i have a similar problem. I cant save the files...',
 'events': [{'action': 'opened',
             'author': 'LaZyLion-ca',
             'comment_id': None,
             'datetime': '2013-06-06T13:30:31Z',
             'masked_author': 'username_0',
             'text': 'Hi all, Edit a file in Ice, click Save Icon...',
             'title': 'Click Save: Sorry, Cannot Write',
             'type': 'issue'},
            ...],
 'issue_id': 15222443,
 'issue_number': 264,
 'pull_request': None,
 'repo': 'icecoder/ICEcoder',
 'text_size': 525,
 'usernames': '["LaZyLion-ca", "seyo-IV"]'}
````

### Dataset pre-processing

This dataset was collected as part of [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset, and the curation rationale can be found at this [link](https://huggingface.co/datasets/bigcode/the-stack#source-data).

To improve the quality of the dataset and remove personally identifiable information (PII), we performed the following cleaning steps, which reduced the dataset's size from 180GB to 54GB:

- We first used regex matching to remove the automated text generated when users reply by email. We also deleted issues with little text (fewer than 200 characters in total) and truncated long comments in the middle (to a maximum of 100 lines while keeping the last 20 lines). This step removed 18% of the volume.
- We deleted comments from bots by looking for keywords in the author's username. If an issue became empty after this filtering, we removed it. We also removed comments that triggered a bot's reply, which we identified by looking for the bot's username inside the comment text. This step removed 61% of the remaining volume and 22% of the conversations, as bot-generated comments tend to be very long.
- We then used the number of users in the conversation as a proxy for quality. We kept all conversations with two or more users. If a conversation had only one user, we kept it only if the total text was longer than 200 characters and shorter than 7000 characters. We also removed issues with more than 10 events, as we noticed that they were of low quality or came from bots we had missed in the previous filtering. This filtering removed 4% of the volume and 30% of the conversations.
- To redact PII, we masked IP addresses, email addresses, and secret keys in the text using regexes. We also masked the usernames of the authors in the comments and replaced them with `username_{i}`, where `i` is the order of the author in the conversation.
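
The `content` format described under Dataset Structure can be split back into individual turns with a few lines of Python. The sketch below is only an illustration: the function name `split_conversation`, the returned dictionary layout, and the assumption that the first chunk starts with a `Title: ` line (as in the example record) are ours, not part of the dataset's tooling.

```python
import re

# Special tokens named in the dataset card.
TOKEN_RE = re.compile(r"<issue_start>|<issue_comment>|<issue_closed>")
# A turn begins with the masked author, e.g. "username_0: ".
SPEAKER_RE = re.compile(r"(username_\d+):\s?(.*)", re.DOTALL)


def split_conversation(content):
    """Split a `content` string into a title, a closed flag, and (speaker, text) turns.

    Hypothetical helper: assumes the layout shown in the example record.
    """
    closed = "<issue_closed>" in content
    title = None
    turns = []
    for chunk in TOKEN_RE.split(content):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Assumption: only the first non-empty chunk carries a "Title: " line.
        if title is None and chunk.startswith("Title: "):
            first_line, _, chunk = chunk.partition("\n")
            title = first_line[len("Title: "):]
            chunk = chunk.strip()
        match = SPEAKER_RE.match(chunk)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif chunk:
            # Text with no recognizable speaker prefix is kept unattributed.
            turns.append((None, chunk))
    return {"title": title, "closed": closed, "turns": turns}


sample = (
    "<issue_start><issue_comment>Title: Click Save: Sorry, Cannot Write\n"
    "username_0: Hi all, Edit a file in Ice, click Save Icon\n"
    "<issue_comment>username_1: hi there i have a similar problem."
)
parsed = split_conversation(sample)
for speaker, text in parsed["turns"]:
    print(speaker, "->", text)
```

Splitting on the tokens rather than on newlines keeps multi-line comments intact, which matters because comment bodies frequently contain line breaks of their own.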
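
Some of the cleaning steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' pipeline: the card does not give the regexes, so `EMAIL_RE` and `IPV4_RE` are simplified stand-ins, the `<EMAIL>` and `<IP>` replacement strings are hypothetical, and "truncated in the middle" is read here as keeping the first 80 and last 20 lines. Bot filtering and secret-key detection are omitted.

```python
import re

# Simplified stand-in patterns; the regexes actually used are not given in the card.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def truncate_middle(text, max_lines=100, keep_last=20):
    """Cut the middle of a long comment, keeping the head and the last `keep_last` lines."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    head = lines[: max_lines - keep_last]
    return "\n".join(head + lines[-keep_last:])


def keep_conversation(events):
    """Apply the user-count, length, and event-count heuristics to a list of event dicts."""
    if len(events) > 10:
        return False
    authors = {event["author"] for event in events}
    if len(authors) >= 2:
        return True
    total_chars = sum(len(event["text"] or "") for event in events)
    return 200 < total_chars < 7000


def mask_events(events):
    """Assign username_{i} by order of first appearance and redact emails and IPv4 addresses."""
    masks = {}
    masked = []
    for event in events:
        mask = masks.setdefault(event["author"], f"username_{len(masks)}")
        text = EMAIL_RE.sub("<EMAIL>", event["text"] or "")
        text = IPV4_RE.sub("<IP>", text)
        masked.append({**event, "masked_author": mask, "text": text})
    return masks, masked


# Made-up events for illustration only.
events = [
    {"author": "alice", "text": "Contact me at alice@example.com from 192.168.0.1"},
    {"author": "bob", "text": "Thanks"},
    {"author": "alice", "text": "Done"},
]
masks, masked = mask_events(events)
truncated = truncate_middle("\n".join(str(i) for i in range(150))).splitlines()
print(masks)
print(masked[0]["text"])
```

Numbering authors by first appearance keeps the masks stable within a conversation, so a model can still follow who is replying to whom.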
Provider:
maas
Created:
2025-10-11