
Puzer/github-repo-embeddings

Source: Hugging Face. Updated 2026-01-01. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Puzer/github-repo-embeddings
Resource description:
---
language:
- en
pretty_name: GitHub Repo Embeddings
license: cc-by-4.0
tags:
- github
- embeddings
- repositories
- recommender-system
task_categories:
- graph-ml
size_categories:
- 1M<n<10M
---

# GitHub Repo Embeddings (Dataset)

This dataset contains:
- GitHub repository embeddings learned from star co-occurrence.
- Raw data for training such embeddings, covering 2016 to 2025.

It is generated by the same pipeline as [this repo](https://github.com/Puzer/github-repo-embeddings) and is intended for offline analysis, research, and downstream search/indexing.

[See the demo, which uses the trained embeddings](http://puzer.github.io/)

## Summary

- Source: GitHub Archive (BigQuery) WatchEvent + repo metadata.
- Signal: repositories starred together by the same user.
- Model: `torch.nn.EmbeddingBag` trained with MultiSimilarityLoss.
- Embedding size: 128 dims.

## Files

### `starred_repos.parquet`

User-level training data.
- `repo_ids`: list[int], repo ids starred by a user (order preserved from events).

### `repos_meta.parquet`

Repository metadata aligned with the training data.
- `repo_id`: int
- `repo_name`: str (owner/name)
- `stars`: int, number of star events for the repo in this dataset
- `created_at`: datetime, repo creation date (first push event)
- `last_updated`: datetime, last push event

### `repo_embeddings_with_meta.parquet`

Repository metadata + learned embeddings aligned by `repo_id`.
- Includes columns from `repos_meta.parquet`
- `embedding`: list[float], 128-dim vector

## Notes

- The dataset is derived from public GitHub Archive data and is intended for research and demo purposes.
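## Example: similar-repository lookup (sketch)

Since the card lists downstream search as an intended use, here is a minimal sketch of a nearest-neighbour lookup over the `embedding` column. It assumes cosine similarity is a reasonable metric for these vectors, which the card does not state. The helper name `top_k_similar` and the toy data are hypothetical and only stand in for the real `repo_name` and `embedding` columns.

```python
import numpy as np


def top_k_similar(embeddings, repo_names, query_name, k=5):
    """Return up to k (repo_name, score) pairs closest to query_name.

    Hypothetical helper: cosine similarity is an assumption, not a metric
    confirmed by the dataset card.
    """
    vectors = np.asarray(embeddings, dtype=np.float32)
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)

    names = list(repo_names)
    query_idx = names.index(query_name)  # raises ValueError if the repo is absent
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # exclude the query repo itself

    k = min(k, len(names) - 1)
    best = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in best]


# Toy stand-in for the real columns (4 dims here instead of 128).
toy_names = ["a/web", "b/web-utils", "c/ml", "d/ml-tools"]
toy_vectors = [
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.2, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.1, 0.9, 0.2],
]
result = top_k_similar(toy_vectors, toy_names, "a/web", k=2)
```

With a local copy of `repo_embeddings_with_meta.parquet`, the same function could be fed from `pandas.read_parquet(...)`, passing `np.stack(df["embedding"].to_numpy())` and `df["repo_name"]`. For the full dataset a dedicated vector index would likely be a better fit than this brute-force scan.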
Provider:
Puzer