jaigouk/coding-dataset

Name: jaigouk/coding-dataset
Creator: jaigouk
Published: 2024-02-26 20:35:05
License: no description available

Source: Hugging Face. Updated 2024-02-26; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/jaigouk/coding-dataset

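Overview:

The rspec_dataset dataset contains test cases collected from the app/services directories of several GitHub repositories: diaspora, mastodon, gitlabhq, discourse, chatwoot, and openproject. For each repository it reports the average number of source lines, the average number of test lines, and the number of test cases, and it calculates the average number of tokens per test case. The Bigcode datasets comprise ruby-dataset, shell-dataset, python-dataset, and sql-dataset, but no details are provided for them.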

The dataset includes rspec specs collected exclusively from the app/services directories of the specified GitHub repositories, and it reports average source lines, average test lines, and test-case counts for each project. It also calculates the average number of tokens per test case, which matters for checking that each example fits within the context window when training or running inference with LLMs.

Provider:

jaigouk

Summary of original information

Ruby dataset

Custom Ruby datasets

rspec_dataset

Bigcode datasets

ruby-dataset
shell-dataset
python-dataset
sql-dataset

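rspec dataset

Specs are collected only from the app/services directory of the specified repositories. This approach was chosen because most of the business logic is encapsulated in these services.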
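The source does not show how the specs were gathered. As a minimal sketch, assuming each repository is already checked out locally and follows the common Rails convention that app/services/foo.rb is tested by spec/services/foo_spec.rb, source files could be paired with their specs like this. The function name find_service_specs and the spec/services location are assumptions, not part of the dataset description.

```python
from pathlib import Path


def find_service_specs(repo_root):
    """Pair each app/services source file with its spec, if one exists.

    Assumption: specs mirror the source tree under spec/services and use
    the Rails naming convention foo.rb -> foo_spec.rb.
    """
    root = Path(repo_root)
    services_dir = root / "app" / "services"
    specs_dir = root / "spec" / "services"
    pairs = []
    for source in sorted(services_dir.rglob("*.rb")):
        # Keep any subdirectory (e.g. admin/) and swap only the file name.
        relative = source.relative_to(services_dir)
        spec = specs_dir / relative.with_name(relative.stem + "_spec.rb")
        if spec.is_file():
            pairs.append((source, spec))
    return pairs
```

Source files without a matching spec are skipped, which keeps every returned pair usable as a (source, test) example.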

Repository URLs

REPO_URLS = [
    "https://github.com/diaspora/diaspora.git",
    "https://github.com/mastodon/mastodon.git",
    "https://github.com/gitlabhq/gitlabhq.git",
    "https://github.com/discourse/discourse.git",
    "https://github.com/chatwoot/chatwoot.git",
    "https://github.com/opf/openproject.git",
]
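The source lists the repositories but not how they were fetched. One possible approach, sketched here under that caveat, is a shallow clone of each URL. The helper names repo_name and shallow_clone_command and the repos destination folder are hypothetical. The sketch only builds the command and does not run it.

```python
from urllib.parse import urlparse


def repo_name(url):
    """Return the repository name from a clone URL, e.g. 'openproject'."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    return last_segment[:-4] if last_segment.endswith(".git") else last_segment


def shallow_clone_command(url, dest_root="repos"):
    """Build (but do not run) a shallow `git clone` command for one repository."""
    return ["git", "clone", "--depth", "1", url, f"{dest_root}/{repo_name(url)}"]


if __name__ == "__main__":
    # Print the command for one of the listed repositories.
    print(shallow_clone_command("https://github.com/opf/openproject.git"))
```

A shallow clone (--depth 1) is enough when only the current contents of app/services are needed, since it skips the commit history.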

Statistics

Repository     Avg Source Lines   Avg Test Lines   Test Cases
diaspora       62                 156              12
mastodon       97                 131              59
gitlabhq       66                 154              952
discourse      188                303              49
chatwoot       63                 107              50
openproject    86                 178              98

Total          74                 159              1220
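The source does not say how the Total row was derived. The figures are consistent with line averages weighted by each repository's test-case count, which the following sketch checks. The weighting is an inference from the numbers, and the helper weighted_average is a name introduced here for illustration.

```python
avg_source_lines = [62, 97, 66, 188, 63, 86]
avg_test_lines = [156, 131, 154, 303, 107, 178]
test_cases = [12, 59, 952, 49, 50, 98]


def weighted_average(values, weights):
    """Average of `values`, each weighted by the matching entry in `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


total_cases = sum(test_cases)
weighted_source = weighted_average(avg_source_lines, test_cases)
weighted_test = weighted_average(avg_test_lines, test_cases)

# Rounded to whole lines, these match the Total row: 1220, 74, 159.
print(total_cases, round(weighted_source), round(weighted_test))
```

Because gitlabhq contributes 952 of the 1220 test cases, the weighted averages sit close to its own values (66 and 154).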

Calculation details

avg_source_lines = [62, 97, 66, 188, 63, 86]
avg_test_lines = [156, 131, 154, 303, 107, 178]
test_cases = [12, 59, 952, 49, 50, 98]

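Assume that each line of code contains 10 tokens on average, a rough average for programming languages.

total_source_tokens = total source code tokens
total_test_tokens = total test code tokens
total_tokens = total tokens
avg_tokens_per_test_case = average tokens per test case

Results: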

total_tokens = 15910
avg_tokens_per_test_case = 13.040983606557377
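The exact expressions behind these results are not given in the source. The formulas in the sketch below are an assumption, chosen because they reproduce both stated figures from the three lists and the rate of 10 tokens per line.

```python
avg_source_lines = [62, 97, 66, 188, 63, 86]
avg_test_lines = [156, 131, 154, 303, 107, 178]
test_cases = [12, 59, 952, 49, 50, 98]

TOKENS_PER_LINE = 10  # rough assumption stated in the text

# Assumed formulas: sum the per-repository averages, then convert to tokens.
total_source_tokens = sum(avg_source_lines) * TOKENS_PER_LINE
total_test_tokens = sum(avg_test_lines) * TOKENS_PER_LINE
total_tokens = total_source_tokens + total_test_tokens

# Divide by the total number of test cases across all repositories.
avg_tokens_per_test_case = total_tokens / sum(test_cases)

print(total_tokens)
print(avg_tokens_per_test_case)
```

Under these formulas total_tokens is 15910 and the average is about 13.04. The calculation adds up six per-repository averages and divides by all 1220 test cases, so the result is better read as a rough indicator than as the measured length of a typical spec.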

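When preparing training or inference data for an LLM, each example (here, each test case or code snippet) must fit within the context window. The average number of tokens per test case calculated above (about 13.04 tokens) is far below the limits of LLMs.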
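A per-example fit check could look like the following sketch. It reuses the flat rate of 10 tokens per line from the text. The 4096-token window and the helper names estimate_tokens and fits_context are hypothetical. A real pipeline would count tokens with the target model's tokenizer rather than estimating from line counts.

```python
TOKENS_PER_LINE = 10  # rough assumption taken from the text


def estimate_tokens(line_count, tokens_per_line=TOKENS_PER_LINE):
    """Estimate a token count from a line count using a flat per-line rate."""
    return line_count * tokens_per_line


def fits_context(line_count, context_window):
    """True if the estimated token count does not exceed the context window."""
    return estimate_tokens(line_count) <= context_window


if __name__ == "__main__":
    HYPOTHETICAL_WINDOW = 4096  # not a limit named in the source

    # A 107-line spec (chatwoot's average test length in the statistics).
    print(estimate_tokens(107), fits_context(107, HYPOTHETICAL_WINDOW))

    # A source file plus its spec at discourse's averages (188 + 303 lines).
    print(estimate_tokens(188 + 303), fits_context(188 + 303, HYPOTHETICAL_WINDOW))
```

Checking each example individually, rather than relying on a single average, shows which examples would need truncation or a larger window.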
