Shuu12121/coir_hard_negative_datasets_v2_kd
Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Shuu12121/coir_hard_negative_datasets_v2_kd
Resource description:
---
dataset_info:
- config_name: documents_apps-python
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 5242179
    num_examples: 8744
  download_size: 2441966
  dataset_size: 5242179
- config_name: documents_codefeedback-mt-mixed
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 98960692
    num_examples: 66366
  download_size: 50617557
  dataset_size: 98960692
- config_name: documents_codefeedback-st-mixed
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 221183458
    num_examples: 143924
  download_size: 110602252
  dataset_size: 221183458
- config_name: documents_codesearchnet-ccr-go
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 34900899
    num_examples: 179976
  download_size: 16004024
  dataset_size: 34900899
- config_name: documents_codesearchnet-ccr-java
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 49478532
    num_examples: 179334
  download_size: 19712064
  dataset_size: 49478532
- config_name: documents_codesearchnet-ccr-javascript
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 19197168
    num_examples: 64789
  download_size: 8452866
  dataset_size: 19197168
- config_name: documents_codesearchnet-ccr-php
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 70659264
    num_examples: 265825
  download_size: 28045446
  dataset_size: 70659264
- config_name: documents_codesearchnet-ccr-python
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 108249601
    num_examples: 276406
  download_size: 44293135
  dataset_size: 108249601
- config_name: documents_codesearchnet-ccr-ruby
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 5543437
    num_examples: 27078
  download_size: 2551011
  dataset_size: 5543437
- config_name: documents_codesearchnet-go
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 21448402
    num_examples: 182395
  download_size: 9695284
  dataset_size: 21448402
- config_name: documents_codesearchnet-java
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 35997640
    num_examples: 180834
  download_size: 14938669
  dataset_size: 35997640
- config_name: documents_codesearchnet-javascript
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 13143486
    num_examples: 64840
  download_size: 5907959
  dataset_size: 13143486
- config_name: documents_codesearchnet-php
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 49201531
    num_examples: 267697
  download_size: 20971123
  dataset_size: 49201531
- config_name: documents_codesearchnet-python
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 79455458
    num_examples: 280136
  download_size: 32522061
  dataset_size: 79455458
- config_name: documents_codesearchnet-ruby
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 6921123
    num_examples: 27569
  download_size: 3089665
  dataset_size: 6921123
- config_name: documents_codetrans-contest-mixed
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 1541254
    num_examples: 1008
  download_size: 657617
  dataset_size: 1541254
- config_name: documents_codetrans-dl-mixed
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 443046
    num_examples: 266
  download_size: 128629
  dataset_size: 443046
- config_name: documents_cosqa-python
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 2072945
    num_examples: 6267
  download_size: 1106059
  dataset_size: 2072945
- config_name: documents_stackoverflow-qa-mixed
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 24429289
    num_examples: 19930
  download_size: 13175543
  dataset_size: 24429289
- config_name: documents_synthetic-text2sql-sql
  features:
  - name: document_id
    dtype: string
  - name: document
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 14784521
    num_examples: 99605
  download_size: 6712614
  dataset_size: 14784521
- config_name: queries_apps-python
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 6410447
    num_examples: 5000
  download_size: 3263749
  dataset_size: 6410447
- config_name: queries_codefeedback-mt-mixed
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 235947161
    num_examples: 53106
  download_size: 99226011
  dataset_size: 235947161
- config_name: queries_codefeedback-st-mixed
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 93421835
    num_examples: 125220
  download_size: 46801238
  dataset_size: 93421835
- config_name: queries_codesearchnet-ccr-go
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 43207673
    num_examples: 167288
  download_size: 18678358
  dataset_size: 43207673
- config_name: queries_codesearchnet-ccr-java
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 62513115
    num_examples: 164923
  download_size: 23948866
  dataset_size: 62513115
- config_name: queries_codesearchnet-ccr-javascript
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 23515337
    num_examples: 58025
  download_size: 10128890
  dataset_size: 23515337
- config_name: queries_codesearchnet-ccr-php
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 87492528
    num_examples: 241241
  download_size: 33347946
  dataset_size: 87492528
- config_name: queries_codesearchnet-ccr-python
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 137102794
    num_examples: 251820
  download_size: 57069617
  dataset_size: 137102794
- config_name: queries_codesearchnet-ccr-ruby
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 6757179
    num_examples: 24927
  download_size: 3053272
  dataset_size: 6757179
- config_name: queries_codesearchnet-go
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 72165539
    num_examples: 167288
  download_size: 29465690
  dataset_size: 72165539
- config_name: queries_codesearchnet-java
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 104328618
    num_examples: 164923
  download_size: 37258933
  dataset_size: 104328618
- config_name: queries_codesearchnet-javascript
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 38975411
    num_examples: 58025
  download_size: 15762412
  dataset_size: 38975411
- config_name: queries_codesearchnet-php
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 146123377
    num_examples: 241241
  download_size: 51891853
  dataset_size: 146123377
- config_name: queries_codesearchnet-python
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 229222143
    num_examples: 251820
  download_size: 89277813
  dataset_size: 229222143
- config_name: queries_codesearchnet-ruby
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 11273759
    num_examples: 24927
  download_size: 4742042
  dataset_size: 11273759
- config_name: queries_codetrans-contest-mixed
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 401024
    num_examples: 561
  download_size: 196326
  dataset_size: 401024
- config_name: queries_codetrans-dl-mixed
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 789134
    num_examples: 564
  download_size: 81011
  dataset_size: 789134
- config_name: queries_cosqa-python
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 547571
    num_examples: 9020
  download_size: 253852
  dataset_size: 547571
- config_name: queries_stackoverflow-qa-mixed
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 19787799
    num_examples: 13951
  download_size: 9951440
  dataset_size: 19787799
- config_name: queries_synthetic-text2sql-sql
  features:
  - name: query_id
    dtype: string
  - name: query
    dtype: string
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 10403642
    num_examples: 100000
  download_size: 5039559
  dataset_size: 10403642
- config_name: scores_apps-python
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 8644973
    num_examples: 5000
  download_size: 5006148
  dataset_size: 8644973
- config_name: scores_codefeedback-mt-mixed
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 97074125
    num_examples: 53106
  download_size: 65753370
  dataset_size: 97074125
- config_name: scores_codefeedback-st-mixed
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 234201028
    num_examples: 125220
  download_size: 170916909
  dataset_size: 234201028
- config_name: scores_codesearchnet-ccr-go
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 314703997
    num_examples: 167288
  download_size: 203946190
  dataset_size: 314703997
- config_name: scores_codesearchnet-ccr-java
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 310823885
    num_examples: 164923
  download_size: 201945544
  dataset_size: 310823885
- config_name: scores_codesearchnet-ccr-javascript
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 106364103
    num_examples: 58025
  download_size: 66235584
  dataset_size: 106364103
- config_name: scores_codesearchnet-ccr-php
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 458991826
    num_examples: 241241
  download_size: 319089190
  dataset_size: 458991826
- config_name: scores_codesearchnet-ccr-python
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 480387614
    num_examples: 251820
  download_size: 303296931
  dataset_size: 480387614
- config_name: scores_codesearchnet-ccr-ruby
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 44962878
    num_examples: 24927
  download_size: 26996628
  dataset_size: 44962878
- config_name: scores_codesearchnet-go
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 314744834
    num_examples: 167288
  download_size: 208169599
  dataset_size: 314744834
- config_name: scores_codesearchnet-java
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 310748507
    num_examples: 164923
  download_size: 205059817
  dataset_size: 310748507
- config_name: scores_codesearchnet-javascript
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 106335206
    num_examples: 58025
  download_size: 67842338
  dataset_size: 106335206
- config_name: scores_codesearchnet-php
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 458901491
    num_examples: 241241
  download_size: 320894550
  dataset_size: 458901491
- config_name: scores_codesearchnet-python
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 480604879
    num_examples: 251820
  download_size: 317376697
  dataset_size: 480604879
- config_name: scores_codesearchnet-ruby
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 44912285
    num_examples: 24927
  download_size: 26981969
  dataset_size: 44912285
- config_name: scores_codetrans-contest-mixed
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 917462
    num_examples: 561
  download_size: 522534
  dataset_size: 917462
- config_name: scores_codetrans-dl-mixed
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 898326
    num_examples: 564
  download_size: 263988
  dataset_size: 898326
- config_name: scores_cosqa-python
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 15839278
    num_examples: 9020
  download_size: 8894666
  dataset_size: 15839278
- config_name: scores_stackoverflow-qa-mixed
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 24935344
    num_examples: 13951
  download_size: 15048090
  dataset_size: 24935344
- config_name: scores_synthetic-text2sql-sql
  features:
  - name: query_id
    dtype: string
  - name: document_ids
    sequence: string
  - name: scores
    sequence: float64
  - name: split
    dtype: string
  splits:
  - name: train
    num_bytes: 183200346
    num_examples: 100000
  download_size: 129971727
  dataset_size: 183200346
configs:
- config_name: documents_apps-python
  data_files:
  - split: train
    path: documents_apps-python/train-*
- config_name: documents_codefeedback-mt-mixed
  data_files:
  - split: train
    path: documents_codefeedback-mt-mixed/train-*
- config_name: documents_codefeedback-st-mixed
  data_files:
  - split: train
    path: documents_codefeedback-st-mixed/train-*
- config_name: documents_codesearchnet-ccr-go
  data_files:
  - split: train
    path: documents_codesearchnet-ccr-go/train-*
- config_name: documents_codesearchnet-ccr-java
  data_files:
  - split: train
    path: documents_codesearchnet-ccr-java/train-*
- config_name: documents_codesearchnet-ccr-javascript
  data_files:
  - split: train
    path: documents_codesearchnet-ccr-javascript/train-*
- config_name: documents_codesearchnet-ccr-php
  data_files:
  - split: train
    path: documents_codesearchnet-ccr-php/train-*
- config_name: documents_codesearchnet-ccr-python
  data_files:
  - split: train
    path: documents_codesearchnet-ccr-python/train-*
- config_name: documents_codesearchnet-ccr-ruby
  data_files:
  - split: train
    path: documents_codesearchnet-ccr-ruby/train-*
- config_name: documents_codesearchnet-go
  data_files:
  - split: train
    path: documents_codesearchnet-go/train-*
- config_name: documents_codesearchnet-java
  data_files:
  - split: train
    path: documents_codesearchnet-java/train-*
- config_name: documents_codesearchnet-javascript
  data_files:
  - split: train
    path: documents_codesearchnet-javascript/train-*
- config_name: documents_codesearchnet-php
  data_files:
  - split: train
    path: documents_codesearchnet-php/train-*
- config_name: documents_codesearchnet-python
  data_files:
  - split: train
    path: documents_codesearchnet-python/train-*
- config_name: documents_codesearchnet-ruby
  data_files:
  - split: train
    path: documents_codesearchnet-ruby/train-*
- config_name: documents_codetrans-contest-mixed
  data_files:
  - split: train
    path: documents_codetrans-contest-mixed/train-*
- config_name: documents_codetrans-dl-mixed
  data_files:
  - split: train
    path: documents_codetrans-dl-mixed/train-*
- config_name: documents_cosqa-python
  data_files:
  - split: train
    path: documents_cosqa-python/train-*
- config_name: documents_stackoverflow-qa-mixed
  data_files:
  - split: train
    path: documents_stackoverflow-qa-mixed/train-*
- config_name: documents_synthetic-text2sql-sql
  data_files:
  - split: train
    path: documents_synthetic-text2sql-sql/train-*
- config_name: queries_apps-python
  data_files:
  - split: train
    path: queries_apps-python/train-*
- config_name: queries_codefeedback-mt-mixed
  data_files:
  - split: train
    path: queries_codefeedback-mt-mixed/train-*
- config_name: queries_codefeedback-st-mixed
  data_files:
  - split: train
    path: queries_codefeedback-st-mixed/train-*
- config_name: queries_codesearchnet-ccr-go
  data_files:
  - split: train
    path: queries_codesearchnet-ccr-go/train-*
- config_name: queries_codesearchnet-ccr-java
  data_files:
  - split: train
    path: queries_codesearchnet-ccr-java/train-*
- config_name: queries_codesearchnet-ccr-javascript
  data_files:
  - split: train
    path: queries_codesearchnet-ccr-javascript/train-*
- config_name: queries_codesearchnet-ccr-php
  data_files:
  - split: train
    path: queries_codesearchnet-ccr-php/train-*
- config_name: queries_codesearchnet-ccr-python
  data_files:
  - split: train
    path: queries_codesearchnet-ccr-python/train-*
- config_name: queries_codesearchnet-ccr-ruby
  data_files:
  - split: train
    path: queries_codesearchnet-ccr-ruby/train-*
- config_name: queries_codesearchnet-go
  data_files:
  - split: train
    path: queries_codesearchnet-go/train-*
- config_name: queries_codesearchnet-java
  data_files:
  - split: train
    path: queries_codesearchnet-java/train-*
- config_name: queries_codesearchnet-javascript
  data_files:
  - split: train
    path: queries_codesearchnet-javascript/train-*
- config_name: queries_codesearchnet-php
  data_files:
  - split: train
    path: queries_codesearchnet-php/train-*
- config_name: queries_codesearchnet-python
  data_files:
  - split: train
    path: queries_codesearchnet-python/train-*
- config_name: queries_codesearchnet-ruby
  data_files:
  - split: train
    path: queries_codesearchnet-ruby/train-*
- config_name: queries_codetrans-contest-mixed
  data_files:
  - split: train
    path: queries_codetrans-contest-mixed/train-*
- config_name: queries_codetrans-dl-mixed
  data_files:
  - split: train
    path: queries_codetrans-dl-mixed/train-*
- config_name: queries_cosqa-python
  data_files:
  - split: train
    path: queries_cosqa-python/train-*
- config_name: queries_stackoverflow-qa-mixed
  data_files:
  - split: train
    path: queries_stackoverflow-qa-mixed/train-*
- config_name: queries_synthetic-text2sql-sql
  data_files:
  - split: train
    path: queries_synthetic-text2sql-sql/train-*
- config_name: scores_apps-python
  data_files:
  - split: train
    path: scores_apps-python/train-*
- config_name: scores_codefeedback-mt-mixed
  data_files:
  - split: train
    path: scores_codefeedback-mt-mixed/train-*
- config_name: scores_codefeedback-st-mixed
  data_files:
  - split: train
    path: scores_codefeedback-st-mixed/train-*
- config_name: scores_codesearchnet-ccr-go
  data_files:
  - split: train
    path: scores_codesearchnet-ccr-go/train-*
- config_name: scores_codesearchnet-ccr-java
  data_files:
  - split: train
    path: scores_codesearchnet-ccr-java/train-*
- config_name: scores_codesearchnet-ccr-javascript
  data_files:
  - split: train
    path: scores_codesearchnet-ccr-javascript/train-*
- config_name: scores_codesearchnet-ccr-php
  data_files:
  - split: train
    path: scores_codesearchnet-ccr-php/train-*
- config_name: scores_codesearchnet-ccr-python
  data_files:
  - split: train
    path: scores_codesearchnet-ccr-python/train-*
- config_name: scores_codesearchnet-ccr-ruby
  data_files:
  - split: train
    path: scores_codesearchnet-ccr-ruby/train-*
- config_name: scores_codesearchnet-go
  data_files:
  - split: train
    path: scores_codesearchnet-go/train-*
- config_name: scores_codesearchnet-java
  data_files:
  - split: train
    path: scores_codesearchnet-java/train-*
- config_name: scores_codesearchnet-javascript
  data_files:
  - split: train
    path: scores_codesearchnet-javascript/train-*
- config_name: scores_codesearchnet-php
  data_files:
  - split: train
    path: scores_codesearchnet-php/train-*
- config_name: scores_codesearchnet-python
  data_files:
  - split: train
    path: scores_codesearchnet-python/train-*
- config_name: scores_codesearchnet-ruby
  data_files:
  - split: train
    path: scores_codesearchnet-ruby/train-*
- config_name: scores_codetrans-contest-mixed
  data_files:
  - split: train
    path: scores_codetrans-contest-mixed/train-*
- config_name: scores_codetrans-dl-mixed
  data_files:
  - split: train
    path: scores_codetrans-dl-mixed/train-*
- config_name: scores_cosqa-python
  data_files:
  - split: train
    path: scores_cosqa-python/train-*
- config_name: scores_stackoverflow-qa-mixed
  data_files:
  - split: train
    path: scores_stackoverflow-qa-mixed/train-*
- config_name: scores_synthetic-text2sql-sql
  data_files:
  - split: train
    path: scores_synthetic-text2sql-sql/train-*
---
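Dataset information:
This dataset contains three families of configurations: document configurations (documents_*), query configurations (queries_*), and score configurations (scores_*). Every configuration contains only a train split. Details follow: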
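The naming scheme described above (a family prefix joined to a subset name) can be sketched as follows. The helper `config_name` and the constants `FAMILIES`, `SUBSETS`, and `ALL_CONFIGS` are hypothetical names introduced for illustration; only the configuration names themselves come from the dataset card. With the Hugging Face `datasets` library, such a name would normally be passed as the configuration argument of `load_dataset`, which this sketch does not call because that needs network access.

```python
from itertools import product

# The three configuration families named in the dataset card.
FAMILIES = ("documents", "queries", "scores")

# The 20 subset names that follow the family prefix in the card.
SUBSETS = (
    "apps-python",
    "codefeedback-mt-mixed",
    "codefeedback-st-mixed",
    "codesearchnet-ccr-go",
    "codesearchnet-ccr-java",
    "codesearchnet-ccr-javascript",
    "codesearchnet-ccr-php",
    "codesearchnet-ccr-python",
    "codesearchnet-ccr-ruby",
    "codesearchnet-go",
    "codesearchnet-java",
    "codesearchnet-javascript",
    "codesearchnet-php",
    "codesearchnet-python",
    "codesearchnet-ruby",
    "codetrans-contest-mixed",
    "codetrans-dl-mixed",
    "cosqa-python",
    "stackoverflow-qa-mixed",
    "synthetic-text2sql-sql",
)


def config_name(family: str, subset: str) -> str:
    """Build a configuration name such as 'scores_cosqa-python' (hypothetical helper)."""
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family!r}")
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    return f"{family}_{subset}"


# Every family/subset combination, in the order the card lists them.
ALL_CONFIGS = [config_name(f, s) for f, s in product(FAMILIES, SUBSETS)]

print(len(ALL_CONFIGS), ALL_CONFIGS[0], ALL_CONFIGS[-1])
```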
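I. Document configurations
Each document configuration contains the following feature fields:
- document_id: string; unique identifier of the document
- document: string; content of the document
- split: string; dataset split label
Detailed statistics for each document configuration:
1. documents_apps-python: train split 5242179 bytes, 8744 examples, download size 2441966 bytes, dataset size 5242179 bytes
2. documents_codefeedback-mt-mixed: train split 98960692 bytes, 66366 examples, download size 50617557 bytes, dataset size 98960692 bytes
3. documents_codefeedback-st-mixed: train split 221183458 bytes, 143924 examples, download size 110602252 bytes, dataset size 221183458 bytes
4. documents_codesearchnet-ccr-go: train split 34900899 bytes, 179976 examples, download size 16004024 bytes, dataset size 34900899 bytes
5. documents_codesearchnet-ccr-java: train split 49478532 bytes, 179334 examples, download size 19712064 bytes, dataset size 49478532 bytes
6. documents_codesearchnet-ccr-javascript: train split 19197168 bytes, 64789 examples, download size 8452866 bytes, dataset size 19197168 bytes
7. documents_codesearchnet-ccr-php: train split 70659264 bytes, 265825 examples, download size 28045446 bytes, dataset size 70659264 bytes
8. documents_codesearchnet-ccr-python: train split 108249601 bytes, 276406 examples, download size 44293135 bytes, dataset size 108249601 bytes
9. documents_codesearchnet-ccr-ruby: train split 5543437 bytes, 27078 examples, download size 2551011 bytes, dataset size 5543437 bytes
10. documents_codesearchnet-go: train split 21448402 bytes, 182395 examples, download size 9695284 bytes, dataset size 21448402 bytes
11. documents_codesearchnet-java: train split 35997640 bytes, 180834 examples, download size 14938669 bytes, dataset size 35997640 bytes
12. documents_codesearchnet-javascript: train split 13143486 bytes, 64840 examples, download size 5907959 bytes, dataset size 13143486 bytes
13. documents_codesearchnet-php: train split 49201531 bytes, 267697 examples, download size 20971123 bytes, dataset size 49201531 bytes
14. documents_codesearchnet-python: train split 79455458 bytes, 280136 examples, download size 32522061 bytes, dataset size 79455458 bytes
15. documents_codesearchnet-ruby: train split 6921123 bytes, 27569 examples, download size 3089665 bytes, dataset size 6921123 bytes
16. documents_codetrans-contest-mixed: train split 1541254 bytes, 1008 examples, download size 657617 bytes, dataset size 1541254 bytes
17. documents_codetrans-dl-mixed: train split 443046 bytes, 266 examples, download size 128629 bytes, dataset size 443046 bytes
18. documents_cosqa-python: train split 2072945 bytes, 6267 examples, download size 1106059 bytes, dataset size 2072945 bytes
19. documents_stackoverflow-qa-mixed: train split 24429289 bytes, 19930 examples, download size 13175543 bytes, dataset size 24429289 bytes
20. documents_synthetic-text2sql-sql: train split 14784521 bytes, 99605 examples, download size 6712614 bytes, dataset size 14784521 bytes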
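As a rough way to read these figures, the sketch below derives the average size per example and the ratio of download size to dataset size for three of the document configurations. The numbers are copied from the list above; the names `DOCUMENT_STATS` and `summarize` are hypothetical. It assumes, as is usual for Hugging Face dataset metadata, that the download size refers to the stored files and the dataset size to the loaded data.

```python
# (num_bytes, num_examples, download_size) copied from the statistics above.
DOCUMENT_STATS = {
    "documents_apps-python": (5242179, 8744, 2441966),
    "documents_codetrans-dl-mixed": (443046, 266, 128629),
    "documents_codesearchnet-python": (79455458, 280136, 32522061),
}


def summarize(num_bytes: int, num_examples: int, download_size: int) -> dict:
    """Return average bytes per example and the download/dataset size ratio."""
    return {
        "bytes_per_example": num_bytes / num_examples,
        "download_ratio": download_size / num_bytes,
    }


summaries = {name: summarize(*stats) for name, stats in DOCUMENT_STATS.items()}

for name, s in summaries.items():
    print(
        f"{name}: {s['bytes_per_example']:.1f} bytes/example, "
        f"download/dataset = {s['download_ratio']:.2f}"
    )
```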
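II. Query configurations
Each query configuration contains the following feature fields:
- query_id: string; unique identifier of the query
- query: string; content of the query
- split: string; dataset split label
Detailed statistics for each query configuration:
1. queries_apps-python: train split 6410447 bytes, 5000 examples, download size 3263749 bytes, dataset size 6410447 bytes
2. queries_codefeedback-mt-mixed: train split 235947161 bytes, 53106 examples, download size 99226011 bytes, dataset size 235947161 bytes
3. queries_codefeedback-st-mixed: train split 93421835 bytes, 125220 examples, download size 46801238 bytes, dataset size 93421835 bytes
4. queries_codesearchnet-ccr-go: train split 43207673 bytes, 167288 examples, download size 18678358 bytes, dataset size 43207673 bytes
5. queries_codesearchnet-ccr-java: train split 62513115 bytes, 164923 examples, download size 23948866 bytes, dataset size 62513115 bytes
6. queries_codesearchnet-ccr-javascript: train split 23515337 bytes, 58025 examples, download size 10128890 bytes, dataset size 23515337 bytes
7. queries_codesearchnet-ccr-php: train split 87492528 bytes, 241241 examples, download size 33347946 bytes, dataset size 87492528 bytes
8. queries_codesearchnet-ccr-python: train split 137102794 bytes, 251820 examples, download size 57069617 bytes, dataset size 137102794 bytes
9. queries_codesearchnet-ccr-ruby: train split 6757179 bytes, 24927 examples, download size 3053272 bytes, dataset size 6757179 bytes
10. queries_codesearchnet-go: train split 72165539 bytes, 167288 examples, download size 29465690 bytes, dataset size 72165539 bytes
11. queries_codesearchnet-java: train split 104328618 bytes, 164923 examples, download size 37258933 bytes, dataset size 104328618 bytes
12. queries_codesearchnet-javascript: train split 38975411 bytes, 58025 examples, download size 15762412 bytes, dataset size 38975411 bytes
13. queries_codesearchnet-php: train split 146123377 bytes, 241241 examples, download size 51891853 bytes, dataset size 146123377 bytes
14. queries_codesearchnet-python: train split 229222143 bytes, 251820 examples, download size 89277813 bytes, dataset size 229222143 bytes
15. queries_codesearchnet-ruby: train split 11273759 bytes, 24927 examples, download size 4742042 bytes, dataset size 11273759 bytes
16. queries_codetrans-contest-mixed: train split 401024 bytes, 561 examples, download size 196326 bytes, dataset size 401024 bytes
17. queries_codetrans-dl-mixed: train split 789134 bytes, 564 examples, download size 81011 bytes, dataset size 789134 bytes
18. queries_cosqa-python: train split 547571 bytes, 9020 examples, download size 253852 bytes, dataset size 547571 bytes
19. queries_stackoverflow-qa-mixed: train split 19787799 bytes, 13951 examples, download size 9951440 bytes, dataset size 19787799 bytes
20. queries_synthetic-text2sql-sql: train split 10403642 bytes, 100000 examples, download size 5039559 bytes, dataset size 10403642 bytes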
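III. Score configurations
Each score configuration contains the following feature fields:
- query_id: string; unique identifier of the query
- document_ids: sequence of strings; list of associated document IDs
- scores: sequence of float64; list of scores for the associated documents
- split: string; dataset split label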
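To illustrate how the fields above could fit together, the sketch below joins one score row to its documents on toy in-memory data. It rests on two assumptions that the card does not state: that `document_ids` and `scores` are parallel lists (the i-th score belongs to the i-th document ID), and that a higher score means a more relevant document. The rows, IDs, and the helper `rank_candidates` are hypothetical.

```python
# Toy rows shaped like the documents_* and scores_* features (hypothetical data).
documents = [
    {"document_id": "d1", "document": "def add(a, b): return a + b", "split": "train"},
    {"document_id": "d2", "document": "def sub(a, b): return a - b", "split": "train"},
    {"document_id": "d3", "document": "def mul(a, b): return a * b", "split": "train"},
]
score_row = {
    "query_id": "q1",
    "document_ids": ["d1", "d2", "d3"],
    "scores": [0.2, 0.9, 0.5],
    "split": "train",
}

# Index documents by ID so each candidate can be looked up directly.
doc_lookup = {row["document_id"]: row["document"] for row in documents}


def rank_candidates(row: dict, lookup: dict) -> list:
    """Pair each document ID with its score and sort by score, highest first.

    Assumes document_ids and scores are aligned position by position.
    """
    ids, scores = row["document_ids"], row["scores"]
    if len(ids) != len(scores):
        raise ValueError("document_ids and scores must have the same length")
    pairs = sorted(zip(ids, scores), key=lambda p: p[1], reverse=True)
    # lookup.get returns None when an ID is missing from the lookup table.
    return [(doc_id, score, lookup.get(doc_id)) for doc_id, score in pairs]


ranked = rank_candidates(score_row, doc_lookup)
for doc_id, score, text in ranked:
    print(doc_id, score, text)
```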
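Detailed statistics for each score configuration:
1. scores_apps-python: train split 8644973 bytes, 5000 examples, download size 5006148 bytes, dataset size 8644973 bytes
2. scores_codefeedback-mt-mixed: train split 97074125 bytes, 53106 examples, download size 65753370 bytes, dataset size 97074125 bytes
3. scores_codefeedback-st-mixed: train split 234201028 bytes, 125220 examples, download size 170916909 bytes, dataset size 234201028 bytes
4. scores_codesearchnet-ccr-go: train split 314703997 bytes, 167288 examples, download size 203946190 bytes, dataset size 314703997 bytes
5. scores_codesearchnet-ccr-java: train split 310823885 bytes, 164923 examples, download size 201945544 bytes, dataset size 310823885 bytes
6. scores_codesearchnet-ccr-javascript: train split 106364103 bytes, 58025 examples, download size 66235584 bytes, dataset size 106364103 bytes
7. scores_codesearchnet-ccr-php: train split 458991826 bytes, 241241 examples, download size 319089190 bytes, dataset size 458991826 bytes
8. scores_codesearchnet-ccr-python: train split 480387614 bytes, 251820 examples, download size 303296931 bytes, dataset size 480387614 bytes
9. scores_codesearchnet-ccr-ruby: train split 44962878 bytes, 24927 examples, download size 26996628 bytes, dataset size 44962878 bytes
10. scores_codesearchnet-go: train split 314744834 bytes, 167288 examples, download size 208169599 bytes, dataset size 314744834 bytes
11. scores_codesearchnet-java: train split 310748507 bytes, 164923 examples, download size 205059817 bytes, dataset size 310748507 bytes
12. scores_codesearchnet-javascript: train split 106335206 bytes, 58025 examples, download size 67842338 bytes, dataset size 106335206 bytes
13. scores_codesearchnet-php: train split 458901491 bytes, 241241 examples, download size 320894550 bytes, dataset size 458901491 bytes
14. scores_codesearchnet-python: train split 480604879 bytes, 251820 examples, download size 317376697 bytes, dataset size 480604879 bytes
15. scores_codesearchnet-ruby: train split 44912285 bytes, 24927 examples, download size 26981969 bytes, dataset size 44912285 bytes
16. scores_codetrans-contest-mixed: train split 917462 bytes, 561 examples, download size 522534 bytes, dataset size 917462 bytes
17. scores_codetrans-dl-mixed: train split 898326 bytes, 564 examples, download size 263988 bytes, dataset size 898326 bytes
18. scores_cosqa-python: train split 15839278 bytes, 9020 examples, download size 8894666 bytes, dataset size 15839278 bytes
19. scores_stackoverflow-qa-mixed: train split 24935344 bytes, 13951 examples, download size 15048090 bytes, dataset size 24935344 bytes
20. scores_synthetic-text2sql-sql: train split 183200346 bytes, 100000 examples, download size 129971727 bytes, dataset size 183200346 bytes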
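In the statistics above, each score configuration has the same number of examples as the query configuration for the same subset, while the document counts differ. That is consistent with one score row per query, although the card does not say so. A minimal check over three subsets, with counts copied from the lists above and the hypothetical names `NUM_EXAMPLES` and `queries_match_scores`:

```python
# num_examples per family, copied from the statistics for three subsets.
NUM_EXAMPLES = {
    "apps-python": {"documents": 8744, "queries": 5000, "scores": 5000},
    "codetrans-dl-mixed": {"documents": 266, "queries": 564, "scores": 564},
    "cosqa-python": {"documents": 6267, "queries": 9020, "scores": 9020},
}


def queries_match_scores(counts: dict) -> bool:
    """True when the query and score configurations have equal row counts."""
    return counts["queries"] == counts["scores"]


report = {subset: queries_match_scores(c) for subset, c in NUM_EXAMPLES.items()}

for subset, ok in report.items():
    counts = NUM_EXAMPLES[subset]
    print(
        f"{subset}: queries={counts['queries']} scores={counts['scores']} "
        f"documents={counts['documents']} match={ok}"
    )
```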
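In addition, the data file layout is the same for every configuration: the train split of each configuration maps to the data file path `{config_name}/train-*`. For example, the training data for documents_apps-python is at `documents_apps-python/train-*`.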
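The path pattern can be sketched with the standard-library `fnmatch` module. The shard file names below are hypothetical examples chosen for illustration, since the card gives only the glob pattern; `train_glob` is likewise a hypothetical helper.

```python
from fnmatch import fnmatchcase


def train_glob(config: str) -> str:
    """Build the train-split glob pattern for a configuration."""
    return f"{config}/train-*"


# Hypothetical repository file listing, for illustration only.
files = [
    "documents_apps-python/train-00000-of-00001.parquet",
    "queries_apps-python/train-00000-of-00001.parquet",
    "documents_apps-python/README.md",
]

pattern = train_glob("documents_apps-python")
# fnmatchcase does a case-sensitive match on every platform.
matched = [f for f in files if fnmatchcase(f, pattern)]

print(pattern)
print(matched)
```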
Provider:
Shuu12121



