dfalbel/github-r-repos
Source: Hugging Face. Updated 2023-07-11; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/dfalbel/github-r-repos
Resource description:
---
license: other
task_categories:
- text-generation
language:
- code
pretty_name: github-r-repos
size_categories:
- 100K<n<1M
---
## GitHub R repositories dataset
R source files from GitHub.
This dataset was created from the public GitHub datasets on Google BigQuery.
The following query was used to export the data:
```sql
-- Export every distinct *.R file, joined with its contents, as Parquet.
EXPORT DATA
OPTIONS (
  uri = 'gs://your-bucket/gh-r/*.parquet',
  format = 'PARQUET') as
(
  select
    f.id, f.repo_name, f.path,
    c.content, c.size
  from (
    SELECT distinct
      id, repo_name, path
    FROM `bigquery-public-data.github_repos.files`
    where ends_with(path, ".R")
  ) as f
  left join `bigquery-public-data.github_repos.contents` as c on f.id = c.id
);

-- Export the per-repository license table.
EXPORT DATA
OPTIONS (
  uri = 'gs://your-bucket/licenses.parquet',
  format = 'PARQUET') as
(select * from `bigquery-public-data.github_repos.licenses`);
```
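To make the logic of the first export statement concrete, here is a minimal Python sketch of the same filter, de-duplication, and left join, applied to in-memory rows. The helper name `select_r_files` and the dict-based row layout are hypothetical; only the column names come from the query. It also assumes at most one `contents` row per `id`.

```python
def select_r_files(files, contents):
    """Mimic the export query on lists of dicts.

    Keeps distinct (id, repo_name, path) rows whose path ends with ".R"
    and left-joins them to `contents` on id.
    """
    # Assumption: one contents row per id, so a plain dict lookup suffices.
    content_by_id = {c["id"]: c for c in contents}
    seen = set()
    rows = []
    for f in files:
        key = (f["id"], f["repo_name"], f["path"])
        # The suffix match is case-sensitive, so "script.r" is skipped.
        if not f["path"].endswith(".R") or key in seen:
            continue
        seen.add(key)
        c = content_by_id.get(f["id"])  # left join: keep rows without content
        rows.append({
            "id": f["id"],
            "repo_name": f["repo_name"],
            "path": f["path"],
            "content": c["content"] if c else None,
            "size": c["size"] if c else None,
        })
    return rows
```

One consequence worth noting is that files using a lowercase `.r` extension are not matched by the filter, and that a file with no matching `contents` row is kept with empty `content` and `size`.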
The exported files were then processed locally using the files in the root of this repository.
The datasets in this repository contain data from repositories with different licenses.
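The local processing step is not shown here, so the following is only a sketch of how a `license` column could be attached and then used for filtering. It assumes the license export has one row per repository with `repo_name` and `license` columns; the helper names `attach_license` and `keep_licenses` are hypothetical.

```python
def attach_license(rows, licenses):
    """Add a `license` field to each file row by looking up its repo_name."""
    # Assumption: the license table has `repo_name` and `license` columns.
    license_by_repo = {l["repo_name"]: l["license"] for l in licenses}
    return [
        {**row, "license": license_by_repo.get(row["repo_name"])}
        for row in rows
    ]


def keep_licenses(rows, allowed):
    """Keep only rows whose license is in the `allowed` set."""
    return [row for row in rows if row["license"] in allowed]
```

Because the licenses differ across repositories, a filter like `keep_licenses(rows, {"mit", "apache-2.0"})` is one way to restrict the data to licenses acceptable for a given use; the identifiers in that call are illustrative.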
The data schema is:
```
id: string
repo_name: string
path: string
content: string
size: int32
license: string
```
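For readers who consume the rows in Python, the schema above could be mirrored by a small typed record. This is a sketch only: the class name `RFile` and the helper `parse_row` are hypothetical, and treating `content`, `size`, and `license` as nullable is an assumption based on the left join in the export query.

```python
from dataclasses import dataclass
from typing import Optional

INT32_MAX = 2**31 - 1  # `size` is declared as int32 in the schema


@dataclass(frozen=True)
class RFile:
    id: str
    repo_name: str
    path: str
    content: Optional[str]   # assumed nullable
    size: Optional[int]      # assumed nullable
    license: Optional[str]   # assumed nullable


def parse_row(row: dict) -> RFile:
    """Build an RFile from a dict, checking that size fits in int32."""
    size = row.get("size")
    if size is not None and not (0 <= size <= INT32_MAX):
        raise ValueError(f"size out of int32 range: {size}")
    return RFile(
        id=str(row["id"]),
        repo_name=str(row["repo_name"]),
        path=str(row["path"]),
        content=row.get("content"),
        size=size,
        license=row.get("license"),
    )
```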
Last updated: Jun 6th 2023
Provider:
dfalbel
Summary of the original information
GitHub R repositories dataset overview
Basic dataset information
- License: other
- Task category: text-generation
- Language: code
- Dataset size: 100K<n<1M
Data source and processing
- The dataset was created from the public GitHub datasets on Google BigQuery.
- The data was exported with a BigQuery `EXPORT DATA` query that selects the distinct `id`, `repo_name`, and `path` values from `bigquery-public-data.github_repos.files` for paths ending in `.R`, left-joins `bigquery-public-data.github_repos.contents` on `id` to obtain `content` and `size`, and writes Parquet files to `gs://your-bucket/gh-r/*.parquet`.
- The dataset contains data from repositories with different licenses.
Data structure
- id: string
- repo_name: string
- path: string
- content: string
- size: int32
- license: string
Update date
- Last updated: Jun 6th 2023



