Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd
Source: Hugging Face. Updated 2026-03-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd
Resource description:
---
language:
- code
license: apache-2.0
task_categories:
- feature-extraction
- sentence-similarity
tags:
- code-search
- hard-negatives
- knowledge-distillation
- contrastive-learning
- sentence-transformers
- colbert
pretty_name: "Owl Code Search Hard Negative Datasets (Pre-KD)"
size_categories:
- 1M<n<10M
dataset_info:
- config_name: documents_go
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 788961766
num_examples: 1361475
download_size: 234362060
dataset_size: 788961766
- config_name: documents_java
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 595749922
num_examples: 1281018
download_size: 157335988
dataset_size: 595749922
- config_name: documents_javascript
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 103608078
num_examples: 129007
download_size: 36381974
dataset_size: 103608078
- config_name: documents_php
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 223866454
num_examples: 424463
download_size: 63038942
dataset_size: 223866454
- config_name: documents_python
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 1076037918
num_examples: 776900
download_size: 335083048
dataset_size: 1076037918
- config_name: documents_ruby
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 63811504
num_examples: 104899
download_size: 15337714
dataset_size: 63811504
- config_name: documents_rust
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 239736870
num_examples: 381521
download_size: 76499102
dataset_size: 239736870
- config_name: documents_typescript
features:
- name: document_id
dtype: string
- name: document
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 265760164
num_examples: 328457
download_size: 77031340
dataset_size: 265760164
- config_name: queries_go
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 244424686
num_examples: 1361475
download_size: 79920936
dataset_size: 244424686
- config_name: queries_java
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 392045510
num_examples: 1281018
download_size: 101975044
dataset_size: 392045510
- config_name: queries_javascript
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 40990466
num_examples: 129007
download_size: 14437070
dataset_size: 40990466
- config_name: queries_php
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 101841366
num_examples: 424463
download_size: 31103178
dataset_size: 101841366
- config_name: queries_python
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 333616210
num_examples: 776900
download_size: 102401118
dataset_size: 333616210
- config_name: queries_ruby
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 67053734
num_examples: 104899
download_size: 16664920
dataset_size: 67053734
- config_name: queries_rust
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 78330830
num_examples: 381521
download_size: 30435878
dataset_size: 78330830
- config_name: queries_typescript
features:
- name: query_id
dtype: string
- name: query
dtype: string
- name: split
dtype: string
splits:
- name: train
num_bytes: 96598646
num_examples: 328457
download_size: 30449400
dataset_size: 96598646
- config_name: scores_go
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 1049496498
num_examples: 1361475
download_size: 556865396
dataset_size: 1049496498
- config_name: scores_java
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 1072717102
num_examples: 1281018
download_size: 525421824
dataset_size: 1072717102
- config_name: scores_javascript
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 129963078
num_examples: 129007
download_size: 50592770
dataset_size: 129963078
- config_name: scores_php
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 335188010
num_examples: 424463
download_size: 172108596
dataset_size: 335188010
- config_name: scores_python
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 696396830
num_examples: 776900
download_size: 324380260
dataset_size: 696396830
- config_name: scores_ruby
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 83655522
num_examples: 104899
download_size: 40260810
dataset_size: 83655522
- config_name: scores_rust
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 314038494
num_examples: 381521
download_size: 163389490
dataset_size: 314038494
- config_name: scores_typescript
features:
- name: query_id
dtype: string
- name: document_ids
sequence: string
- name: scores
sequence: float64
- name: split
dtype: string
splits:
- name: train
num_bytes: 337032714
num_examples: 328457
download_size: 132654824
dataset_size: 337032714
configs:
- config_name: documents_go
data_files:
- split: train
path: documents_go/train-*
- config_name: documents_java
data_files:
- split: train
path: documents_java/train-*
- config_name: documents_javascript
data_files:
- split: train
path: documents_javascript/train-*
- config_name: documents_php
data_files:
- split: train
path: documents_php/train-*
- config_name: documents_python
data_files:
- split: train
path: documents_python/train-*
- config_name: documents_ruby
data_files:
- split: train
path: documents_ruby/train-*
- config_name: documents_rust
data_files:
- split: train
path: documents_rust/train-*
- config_name: documents_typescript
data_files:
- split: train
path: documents_typescript/train-*
- config_name: queries_go
data_files:
- split: train
path: queries_go/train-*
- config_name: queries_java
data_files:
- split: train
path: queries_java/train-*
- config_name: queries_javascript
data_files:
- split: train
path: queries_javascript/train-*
- config_name: queries_php
data_files:
- split: train
path: queries_php/train-*
- config_name: queries_python
data_files:
- split: train
path: queries_python/train-*
- config_name: queries_ruby
data_files:
- split: train
path: queries_ruby/train-*
- config_name: queries_rust
data_files:
- split: train
path: queries_rust/train-*
- config_name: queries_typescript
data_files:
- split: train
path: queries_typescript/train-*
- config_name: scores_go
data_files:
- split: train
path: scores_go/train-*
- config_name: scores_java
data_files:
- split: train
path: scores_java/train-*
- config_name: scores_javascript
data_files:
- split: train
path: scores_javascript/train-*
- config_name: scores_php
data_files:
- split: train
path: scores_php/train-*
- config_name: scores_python
data_files:
- split: train
path: scores_python/train-*
- config_name: scores_ruby
data_files:
- split: train
path: scores_ruby/train-*
- config_name: scores_rust
data_files:
- split: train
path: scores_rust/train-*
- config_name: scores_typescript
data_files:
- split: train
path: scores_typescript/train-*
---
# Owl Code Search Hard Negative Datasets
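A code search dataset with hard negatives, built for knowledge distillation (KD).
The code search model [Shuu12121/CodeSearch-ModernBERT-Crow-v3-large-len1024-Plus](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Crow-v3-large-len1024-Plus) serves as the teacher. For each query in the [paired code and descriptive-comment datasets](https://huggingface.co/collections/Shuu12121/codesearch-datasets), the teacher computes similarity scores against candidate functions, and the top-scoring candidates are attached as hard negatives (documents that resemble the correct answer but are not it).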
## Overview
- **Purpose**: fine-tuning code search models with contrastive learning and knowledge distillation
- **Languages**: Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript (8 languages)
- **Total samples**: 4,787,740
- **Data size**: 8.73 GB (uncompressed) / 3.37 GB (download)
- **Format**: per-language configs (`scores_{lang}`, `queries_{lang}`, `documents_{lang}`)
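The per-language naming scheme in the last bullet can be enumerated programmatically. The following is a minimal sketch: the helper name `config_names` is hypothetical, and the lowercase language suffixes are taken from the config names listed in the dataset metadata.

```python
# Lowercase language suffixes as they appear in the dataset's config names.
LANGS = ["go", "java", "javascript", "php", "python", "ruby", "rust", "typescript"]
KINDS = ["scores", "queries", "documents"]


def config_names(lang: str) -> dict[str, str]:
    """Return the three config names for one language (hypothetical helper)."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language: {lang}")
    return {kind: f"{kind}_{lang}" for kind in KINDS}


# 8 languages x 3 kinds of config
all_configs = [name for lang in LANGS for name in config_names(lang).values()]
print(len(all_configs))  # 24
```

Each returned name can be passed as the `name` argument of `load_dataset`.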
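## Data structure
Each language has three configs:
### `queries_{lang}`
Stores the queries (natural-language search text).
| Column | Type | Description |
|--------|------|------|
| `query_id` | `string` | Unique identifier of the query |
| `query` | `string` | Natural-language query text (docstring / comment) |
| `split` | `string` | Split of the source dataset the row came from |
### `documents_{lang}`
Stores the documents (source code).
| Column | Type | Description |
|--------|------|------|
| `document_id` | `string` | Unique identifier of the document |
| `document` | `string` | Source code body |
| `split` | `string` | Split of the source dataset the row came from |
### `scores_{lang}`
Stores the similarity scores produced by the teacher model. For each query, it holds a list of document IDs sorted by score and the matching list of scores.
| Column | Type | Description |
|--------|------|------|
| `query_id` | `string` | ID of the corresponding query |
| `document_ids` | `list[string]` | Document IDs sorted by score |
| `scores` | `list[float64]` | Similarity scores aligned with `document_ids` |
| `split` | `string` | Split of the source dataset the row came from |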
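> **Interpreting the scores**:
> - `scores[0]` / `document_ids[0]` is the positive (the document originally paired with the query)
> - `scores[0] = -1` means the positive did not appear in the top 32 retrieved results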
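The two rules above can be applied to a single row as in the sketch below. The function name `split_scores_row` and the example row are hypothetical. Treating `-1` at later positions as a placeholder to skip is also an assumption, mirroring the `score != -1` check in the usage example further down.

```python
def split_scores_row(row: dict) -> tuple:
    """Split one `scores_{lang}` row into (positive, negatives).

    positive  -> (document_id, score), or None when scores[0] == -1,
                 i.e. the paired document was not retrieved in the top 32.
    negatives -> list of (document_id, score) for the remaining candidates,
                 skipping any -1 placeholders (assumption).
    """
    doc_ids, scores = row["document_ids"], row["scores"]
    if not doc_ids:
        return None, []
    positive = None if scores[0] == -1 else (doc_ids[0], scores[0])
    negatives = [(d, s) for d, s in zip(doc_ids[1:], scores[1:]) if s != -1]
    return positive, negatives


# Made-up row for illustration only; real IDs and scores will differ.
example = {
    "query_id": "q1",
    "document_ids": ["d1", "d2", "d3"],
    "scores": [0.91, 0.88, 0.42],
}
positive, negatives = split_scores_row(example)
print(positive, len(negatives))
```

Rows whose positive is `None` carry no usable teacher score for the correct document, so a training pipeline may want to drop them or handle them separately.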
## Statistics by language
| Language | Queries | Documents | Score rows |
|------|-------:|-------:|-------:|
| Go | 1,361,475 | 1,361,475 | 1,361,475 |
| Java | 1,281,018 | 1,281,018 | 1,281,018 |
| JavaScript | 129,007 | 129,007 | 129,007 |
| PHP | 424,463 | 424,463 | 424,463 |
| Python | 776,900 | 776,900 | 776,900 |
| Ruby | 104,899 | 104,899 | 104,899 |
| Rust | 381,521 | 381,521 | 381,521 |
| TypeScript | 328,457 | 328,457 | 328,457 |
| **Total** | **4,787,740** | **4,787,740** | **4,787,740** |
## Notes
Loading the entire dataset into memory at once may cause an out-of-memory (OOM) error.
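One way to keep memory bounded is to stream the configs and keep only the documents a batch of score rows actually references. This is a sketch under assumptions, not the author's pipeline: the helper names are hypothetical, and it relies on the `streaming=True` option of `datasets.load_dataset`, which yields rows lazily instead of materializing the whole split.

```python
from itertools import islice

REPO = "Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd"


def stream_config(name: str):
    """Open one config lazily; rows are fetched as they are iterated."""
    # Imported here so the pure-Python helpers below work without `datasets`.
    from datasets import load_dataset
    return load_dataset(REPO, name=name, split="train", streaming=True)


def needed_doc_ids(score_rows, max_rows: int) -> set:
    """Collect the document IDs referenced by the first `max_rows` score rows."""
    ids = set()
    for row in islice(score_rows, max_rows):
        ids.update(row["document_ids"])
    return ids


def lookup_documents(doc_rows, wanted: set) -> dict:
    """Scan the documents once and keep only the wanted ones."""
    found = {}
    for row in doc_rows:
        if row["document_id"] in wanted:
            found[row["document_id"]] = row["document"]
            if len(found) == len(wanted):
                break  # everything needed has been found
    return found


# Example usage (requires network access and the `datasets` package):
# wanted = needed_doc_ids(stream_config("scores_ruby"), max_rows=1000)
# docs = lookup_documents(stream_config("documents_ruby"), wanted)
```

The trade-off is a full sequential scan of the documents config per batch, so larger batches amortize the cost better.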
## Usage
### Basic loading
```python
from datasets import load_dataset

# Load the teacher scores for Python
scores = load_dataset(
    "Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd",
    name="scores_python",
    split="train",
)

# Load the Python queries
queries = load_dataset(
    "Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd",
    name="queries_python",
    split="train",
)

# Load the Python documents
documents = load_dataset(
    "Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd",
    name="documents_python",
    split="train",
)
```
### Extracting hard negatives
```python
# Build lookup dictionaries from ID to text for queries and documents.
# Note: this pulls every text of the language into memory.
query_texts = dict(zip(queries["query_id"], queries["query"]))
doc_texts = dict(zip(documents["document_id"], documents["document"]))

# Threshold: candidates scoring below 99% of the positive score count as negatives
nv_threshold = 0.99

# Process a single sample
sample = scores[0]
query_text = query_texts[sample["query_id"]]
positive_doc = doc_texts[sample["document_ids"][0]]  # index 0 is the positive
positive_score = sample["scores"][0]

hard_negatives = []
if positive_score != -1:  # -1: the positive was not in the top 32, so no valid threshold
    for doc_id, score in zip(sample["document_ids"][1:], sample["scores"][1:]):
        if score != -1 and score < nv_threshold * positive_score:
            hard_negatives.append(doc_texts[doc_id])

print(f"Query: {query_text[:100]}...")
print(f"Positive: {positive_doc[:100]}...")
print(f"Hard negatives: {len(hard_negatives)}")
```
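For knowledge distillation, the teacher scores of one row are often turned into a soft target distribution over the candidates. The sketch below is a generic illustration, not the training recipe of this dataset's author: the function name `soft_targets` and the temperature value are assumptions, and it presumes higher scores mean higher similarity and that `-1` placeholders have already been filtered out.

```python
import math


def soft_targets(scores: list[float], temperature: float = 0.05) -> list[float]:
    """Temperature-scaled softmax over teacher scores (hypothetical helper).

    The temperature is an arbitrary illustrative value; lower values
    concentrate more probability mass on the top-scoring candidates.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores: positive first, then two hard negatives.
probs = soft_targets([0.90, 0.85, 0.40])
print([round(p, 3) for p in probs])
```

A student model's own softmax over the same candidates can then be trained against these probabilities, for example with a KL-divergence loss.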
## Program used to build the dataset
[The repository is available here](https://github.com/Shun0212/hard-negatives-ranking-datasets-maker)
Provider:
Shuu12121



