Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd

Hugging Face, updated 2026-03-08, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd
Description:
---
language:
- code
license: apache-2.0
task_categories:
- feature-extraction
- sentence-similarity
tags:
- code-search
- hard-negatives
- knowledge-distillation
- contrastive-learning
- sentence-transformers
- colbert
pretty_name: "Owl Code Search Hard Negative Datasets (Pre-KD)"
size_categories:
- 1M<n<10M
dataset_info:
- config_name: documents_go
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 788961766
    num_examples: 1361475
  download_size: 234362060
  dataset_size: 788961766
- config_name: documents_java
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 595749922
    num_examples: 1281018
  download_size: 157335988
  dataset_size: 595749922
- config_name: documents_javascript
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 103608078
    num_examples: 129007
  download_size: 36381974
  dataset_size: 103608078
- config_name: documents_php
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 223866454
    num_examples: 424463
  download_size: 63038942
  dataset_size: 223866454
- config_name: documents_python
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 1076037918
    num_examples: 776900
  download_size: 335083048
  dataset_size: 1076037918
- config_name: documents_ruby
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 63811504
    num_examples: 104899
  download_size: 15337714
  dataset_size: 63811504
- config_name: documents_rust
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 239736870
    num_examples: 381521
  download_size: 76499102
  dataset_size: 239736870
- config_name: documents_typescript
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 265760164
    num_examples: 328457
  download_size: 77031340
  dataset_size: 265760164
- config_name: queries_go
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 244424686
    num_examples: 1361475
  download_size: 79920936
  dataset_size: 244424686
- config_name: queries_java
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 392045510
    num_examples: 1281018
  download_size: 101975044
  dataset_size: 392045510
- config_name: queries_javascript
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 40990466
    num_examples: 129007
  download_size: 14437070
  dataset_size: 40990466
- config_name: queries_php
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 101841366
    num_examples: 424463
  download_size: 31103178
  dataset_size: 101841366
- config_name: queries_python
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 333616210
    num_examples: 776900
  download_size: 102401118
  dataset_size: 333616210
- config_name: queries_ruby
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 67053734
    num_examples: 104899
  download_size: 16664920
  dataset_size: 67053734
- config_name: queries_rust
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 78330830
    num_examples: 381521
  download_size: 30435878
  dataset_size: 78330830
- config_name: queries_typescript
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 96598646
    num_examples: 328457
  download_size: 30449400
  dataset_size: 96598646
- config_name: scores_go
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 1049496498
    num_examples: 1361475
  download_size: 556865396
  dataset_size: 1049496498
- config_name: scores_java
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 1072717102
    num_examples: 1281018
  download_size: 525421824
  dataset_size: 1072717102
- config_name: scores_javascript
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 129963078
    num_examples: 129007
  download_size: 50592770
  dataset_size: 129963078
- config_name: scores_php
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 335188010
    num_examples: 424463
  download_size: 172108596
  dataset_size: 335188010
- config_name: scores_python
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 696396830
    num_examples: 776900
  download_size: 324380260
  dataset_size: 696396830
- config_name: scores_ruby
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 83655522
    num_examples: 104899
  download_size: 40260810
  dataset_size: 83655522
- config_name: scores_rust
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 314038494
    num_examples: 381521
  download_size: 163389490
  dataset_size: 314038494
- config_name: scores_typescript
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 337032714
    num_examples: 328457
  download_size: 132654824
  dataset_size: 337032714
configs:
- config_name: documents_go
  data_files:
  - split: train
    path: documents_go/train-*
- config_name: documents_java
  data_files:
  - split: train
    path: documents_java/train-*
- config_name: documents_javascript
  data_files:
  - split: train
    path: documents_javascript/train-*
- config_name: documents_php
  data_files:
  - split: train
    path: documents_php/train-*
- config_name: documents_python
  data_files:
  - split: train
    path: documents_python/train-*
- config_name: documents_ruby
  data_files:
  - split: train
    path: documents_ruby/train-*
- config_name: documents_rust
  data_files:
  - split: train
    path: documents_rust/train-*
- config_name: documents_typescript
  data_files:
  - split: train
    path: documents_typescript/train-*
- config_name: queries_go
  data_files:
  - split: train
    path: queries_go/train-*
- config_name: queries_java
  data_files:
  - split: train
    path: queries_java/train-*
- config_name: queries_javascript
  data_files:
  - split: train
    path: queries_javascript/train-*
- config_name: queries_php
  data_files:
  - split: train
    path: queries_php/train-*
- config_name: queries_python
  data_files:
  - split: train
    path: queries_python/train-*
- config_name: queries_ruby
  data_files:
  - split: train
    path: queries_ruby/train-*
- config_name: queries_rust
  data_files:
  - split: train
    path: queries_rust/train-*
- config_name: queries_typescript
  data_files:
  - split: train
    path: queries_typescript/train-*
- config_name: scores_go
  data_files:
  - split: train
    path: scores_go/train-*
- config_name: scores_java
  data_files:
  - split: train
    path: scores_java/train-*
- config_name: scores_javascript
  data_files:
  - split: train
    path: scores_javascript/train-*
- config_name: scores_php
  data_files:
  - split: train
    path: scores_php/train-*
- config_name: scores_python
  data_files:
  - split: train
    path: scores_python/train-*
- config_name: scores_ruby
  data_files:
  - split: train
    path: scores_ruby/train-*
- config_name: scores_rust
  data_files:
  - split: train
    path: scores_rust/train-*
- config_name: scores_typescript
  data_files:
  - split: train
    path: scores_typescript/train-*
---

# Owl Code Search Hard Negative Datasets

A code search dataset with hard negatives, built for Knowledge Distillation (KD).

Using the code search model [Shuu12121/CodeSearch-ModernBERT-Crow-v3-large-len1024-Plus](https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Crow-v3-large-len1024-Plus) as the teacher model, similarity scores between each query and the candidate functions were computed over [datasets of code and explanatory-comment pairs](https://huggingface.co/collections/Shuu12121/codesearch-datasets), and hard negatives (documents that resemble the correct answer but are incorrect) were attached to each query.

## Overview

- **Purpose**: Fine-tuning code search models with Contrastive Learning / Knowledge Distillation
- **Languages**: Go, Java, JavaScript, PHP, Python, Ruby, Rust, TypeScript (8 languages)
- **Total samples**: 4,787,740
- **Data size**: 8.73 GB (uncompressed) / 3.37 GB (download)
- **Format**: Per-language configs (`scores_{lang}`, `queries_{lang}`, `documents_{lang}`)

## Data structure

Each language has three configs:

### `queries_{lang}`

Stores the queries (natural-language search text).

| Column | Type | Description |
|--------|------|------|
| `query_id` | `string` | Unique identifier of the query |
| `query` | `string` | Natural-language query text (docstring / comment) |
| `split` | `string` | Split of the source data |

### `documents_{lang}`

Stores the documents (source code).

| Column | Type | Description |
|--------|------|------|
| `document_id` | `string` | Unique identifier of the document |
| `document` | `string` | Source code body |
| `split` | `string` | Split of the source data |

### `scores_{lang}`

Stores the similarity scores produced by the teacher model. Each query has a list of document IDs sorted by score and the matching list of scores.

| Column | Type | Description |
|--------|------|------|
| `query_id` | `string` | ID of the corresponding query |
| `document_ids` | `list[string]` | Document IDs sorted by score |
| `scores` | `list[float64]` | Similarity scores for the corresponding document IDs |
| `split` | `string` | Split of the source data |

> **Interpreting the scores**:
> - `scores[0]` / `document_ids[0]` is the positive example (the document actually paired with the query)
> - `scores[0] = -1` means the correct document did not appear in the top 32 search results

## Per-language statistics

| Language | Queries | Documents | Scores |
|------|-------:|-------:|-------:|
| Go | 1,361,475 | 1,361,475 | 1,361,475 |
| Java | 1,281,018 | 1,281,018 | 1,281,018 |
| JavaScript | 129,007 | 129,007 | 129,007 |
| PHP | 424,463 | 424,463 | 424,463 |
| Python | 776,900 | 776,900 | 776,900 |
| Ruby | 104,899 | 104,899 | 104,899 |
| Rust | 381,521 | 381,521 | 381,521 |
| TypeScript | 328,457 | 328,457 | 328,457 |
| **Total** | **4,787,740** | **4,787,740** | **4,787,740** |

## Caveats

Loading the entire dataset into memory may cause an out-of-memory (OOM) error.

## Usage

### Basic loading

```python
from datasets import load_dataset

# Load the Python scores
scores = load_dataset(
    "Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd",
    name="scores_python",
    split="train",
)

# Load the Python queries
queries = load_dataset(
    "Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd",
    name="queries_python",
    split="train",
)

# Load the Python documents
documents = load_dataset(
    "Shuu12121/owl_code_search_hard_negative_datasets-Pre_kd",
    name="documents_python",
    split="train",
)
```

### Extracting hard negatives

```python
# Build ID -> text lookup dictionaries for queries and documents
# (this materializes every text in memory; see the caveat above)
query_texts = dict(zip(queries["query_id"], queries["query"]))
doc_texts = dict(zip(documents["document_id"], documents["document"]))

# Threshold: candidates scoring below 99% of the positive score count as negatives
nv_threshold = 0.99

# Process a single sample
sample = scores[0]
query_text = query_texts[sample["query_id"]]
positive_doc = doc_texts[sample["document_ids"][0]]  # index 0 is the positive
positive_score = sample["scores"][0]

hard_negatives = []
for doc_id, score in zip(sample["document_ids"][1:], sample["scores"][1:]):
    if score < nv_threshold * positive_score and score != -1:
        hard_negatives.append(doc_texts[doc_id])

print(f"Query: {query_text[:100]}...")
print(f"Positive: {positive_doc[:100]}...")
print(f"Hard negatives: {len(hard_negatives)}")
```

## Program used to create the dataset

[The repository is here](https://github.com/Shun0212/hard-negatives-ranking-datasets-maker)
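
The score notes earlier in this card imply two special cases when turning a `scores_{lang}` row into training data: the positive sits at index 0, and `-1` marks a positive that was not retrieved. The sketch below is one possible way to handle both lazily, one row at a time, so it can consume an iterator instead of a fully loaded table. The names `select_hard_negative_ids`, `iter_training_triples`, and `MISSING_POSITIVE`, as well as the choice to skip rows whose positive is missing, are assumptions made for illustration and are not part of the dataset's tooling. The 0.99 ratio mirrors the threshold used in the extraction example.

```python
from typing import Iterable, Iterator, List, Mapping, Sequence, Tuple

# Sentinel value described in the score notes (assumed constant name).
MISSING_POSITIVE = -1


def select_hard_negative_ids(
    document_ids: Sequence[str],
    scores: Sequence[float],
    ratio: float = 0.99,
) -> List[str]:
    """Return IDs of candidates scoring below ratio * positive score.

    Index 0 is treated as the positive. Rows whose positive score is the
    sentinel are skipped entirely (an assumption, not a documented rule).
    """
    if not scores or scores[0] == MISSING_POSITIVE:
        return []
    cutoff = ratio * scores[0]
    return [
        doc_id
        for doc_id, score in zip(document_ids[1:], scores[1:])
        if score != MISSING_POSITIVE and score < cutoff
    ]


def iter_training_triples(
    rows: Iterable[Mapping[str, object]],
    ratio: float = 0.99,
) -> Iterator[Tuple[str, str, List[str]]]:
    """Lazily yield (query_id, positive_id, negative_ids) for usable rows."""
    for row in rows:
        negatives = select_hard_negative_ids(
            row["document_ids"], row["scores"], ratio
        )
        if negatives:
            yield row["query_id"], row["document_ids"][0], negatives


# Synthetic rows shaped like `scores_{lang}` records (made-up values).
demo_rows = [
    {"query_id": "q1", "document_ids": ["d1", "d2", "d3", "d4"],
     "scores": [0.9, 0.895, 0.5, -1]},
    {"query_id": "q2", "document_ids": ["d5", "d6"], "scores": [-1, 0.7]},
]

for triple in iter_training_triples(demo_rows):
    print(triple)
```

Because `iter_training_triples` only needs an iterable of rows, it should also accept the iterable dataset that `load_dataset(..., streaming=True)` returns, which may help with the memory caveat. Resolving the yielded IDs to text still requires a lookup into `queries_{lang}` and `documents_{lang}`.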
Provider:
Shuu12121