
Shuu12121/coir_hard_negative_datasets_v2_kd

Source: Hugging Face. Updated 2026-04-12. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Shuu12121/coir_hard_negative_datasets_v2_kd
Resource description:

Dataset information: The dataset contains three families of configurations: documents (`documents_*`), queries (`queries_*`), and scores (`scores_*`). Every configuration has only a train split. Details follow.

I. Document configurations

Each document configuration has the following feature fields:

- document_id: string, the unique identifier of the document
- document: string, the document content
- split: string, the dataset split label

Statistics for each document configuration (all sizes in bytes):

| # | Configuration | Train bytes | Examples | Download size | Dataset size |
|---|---|---|---|---|---|
| 1 | documents_apps-python | 5242179 | 8744 | 2441966 | 5242179 |
| 2 | documents_codefeedback-mt-mixed | 98960692 | 66366 | 50617557 | 98960692 |
| 3 | documents_codefeedback-st-mixed | 221183458 | 143924 | 110602252 | 221183458 |
| 4 | documents_codesearchnet-ccr-go | 34900899 | 179976 | 16004024 | 34900899 |
| 5 | documents_codesearchnet-ccr-java | 49478532 | 179334 | 19712064 | 49478532 |
| 6 | documents_codesearchnet-ccr-javascript | 19197168 | 64789 | 8452866 | 19197168 |
| 7 | documents_codesearchnet-ccr-php | 70659264 | 265825 | 28045446 | 70659264 |
| 8 | documents_codesearchnet-ccr-python | 108249601 | 276406 | 44293135 | 108249601 |
| 9 | documents_codesearchnet-ccr-ruby | 5543437 | 27078 | 2551011 | 5543437 |
| 10 | documents_codesearchnet-go | 21448402 | 182395 | 9695284 | 21448402 |
| 11 | documents_codesearchnet-java | 35997640 | 180834 | 14938669 | 35997640 |
| 12 | documents_codesearchnet-javascript | 13143486 | 64840 | 5907959 | 13143486 |
| 13 | documents_codesearchnet-php | 49201531 | 267697 | 20971123 | 49201531 |
| 14 | documents_codesearchnet-python | 79455458 | 280136 | 32522061 | 79455458 |
| 15 | documents_codesearchnet-ruby | 6921123 | 27569 | 3089665 | 6921123 |
| 16 | documents_codetrans-contest-mixed | 1541254 | 1008 | 657617 | 1541254 |
| 17 | documents_codetrans-dl-mixed | 443046 | 266 | 128629 | 443046 |
| 18 | documents_cosqa-python | 2072945 | 6267 | 1106059 | 2072945 |
| 19 | documents_stackoverflow-qa-mixed | 24429289 | 19930 | 13175543 | 24429289 |
| 20 | documents_synthetic-text2sql-sql | 14784521 | 99605 | 6712614 | 14784521 |

II. Query configurations

Each query configuration has the following feature fields:

- query_id: string, the unique identifier of the query
- query: string, the query content
- split: string, the dataset split label

Statistics for each query configuration (all sizes in bytes):

| # | Configuration | Train bytes | Examples | Download size | Dataset size |
|---|---|---|---|---|---|
| 1 | queries_apps-python | 6410447 | 5000 | 3263749 | 6410447 |
| 2 | queries_codefeedback-mt-mixed | 235947161 | 53106 | 99226011 | 235947161 |
| 3 | queries_codefeedback-st-mixed | 93421835 | 125220 | 46801238 | 93421835 |
| 4 | queries_codesearchnet-ccr-go | 43207673 | 167288 | 18678358 | 43207673 |
| 5 | queries_codesearchnet-ccr-java | 62513115 | 164923 | 23948866 | 62513115 |
| 6 | queries_codesearchnet-ccr-javascript | 23515337 | 58025 | 10128890 | 23515337 |
| 7 | queries_codesearchnet-ccr-php | 87492528 | 241241 | 33347946 | 87492528 |
| 8 | queries_codesearchnet-ccr-python | 137102794 | 251820 | 57069617 | 137102794 |
| 9 | queries_codesearchnet-ccr-ruby | 6757179 | 24927 | 3053272 | 6757179 |
| 10 | queries_codesearchnet-go | 72165539 | 167288 | 29465690 | 72165539 |
| 11 | queries_codesearchnet-java | 104328618 | 164923 | 37258933 | 104328618 |
| 12 | queries_codesearchnet-javascript | 38975411 | 58025 | 15762412 | 38975411 |
| 13 | queries_codesearchnet-php | 146123377 | 241241 | 51891853 | 146123377 |
| 14 | queries_codesearchnet-python | 229222143 | 251820 | 89277813 | 229222143 |
| 15 | queries_codesearchnet-ruby | 11273759 | 24927 | 4742042 | 11273759 |
| 16 | queries_codetrans-contest-mixed | 401024 | 561 | 196326 | 401024 |
| 17 | queries_codetrans-dl-mixed | 789134 | 564 | 81011 | 789134 |
| 18 | queries_cosqa-python | 547571 | 9020 | 253852 | 547571 |
| 19 | queries_stackoverflow-qa-mixed | 19787799 | 13951 | 9951440 | 19787799 |
| 20 | queries_synthetic-text2sql-sql | 10403642 | 100000 | 5039559 | 10403642 |

III. Score configurations

Each score configuration has the following feature fields:

- query_id: string, the unique identifier of the query
- document_ids: sequence of strings, the list of associated document IDs
- scores: sequence of float64, the list of scores for the associated documents
- split: string, the dataset split label

Statistics for each score configuration (all sizes in bytes):

| # | Configuration | Train bytes | Examples | Download size | Dataset size |
|---|---|---|---|---|---|
| 1 | scores_apps-python | 8644973 | 5000 | 5006148 | 8644973 |
| 2 | scores_codefeedback-mt-mixed | 97074125 | 53106 | 65753370 | 97074125 |
| 3 | scores_codefeedback-st-mixed | 234201028 | 125220 | 170916909 | 234201028 |
| 4 | scores_codesearchnet-ccr-go | 314703997 | 167288 | 203946190 | 314703997 |
| 5 | scores_codesearchnet-ccr-java | 310823885 | 164923 | 201945544 | 310823885 |
| 6 | scores_codesearchnet-ccr-javascript | 106364103 | 58025 | 66235584 | 106364103 |
| 7 | scores_codesearchnet-ccr-php | 458991826 | 241241 | 319089190 | 458991826 |
| 8 | scores_codesearchnet-ccr-python | 480387614 | 251820 | 303296931 | 480387614 |
| 9 | scores_codesearchnet-ccr-ruby | 44962878 | 24927 | 26996628 | 44962878 |
| 10 | scores_codesearchnet-go | 314744834 | 167288 | 208169599 | 314744834 |
| 11 | scores_codesearchnet-java | 310748507 | 164923 | 205059817 | 310748507 |
| 12 | scores_codesearchnet-javascript | 106335206 | 58025 | 67842338 | 106335206 |
| 13 | scores_codesearchnet-php | 458901491 | 241241 | 320894550 | 458901491 |
| 14 | scores_codesearchnet-python | 480604879 | 251820 | 317376697 | 480604879 |
| 15 | scores_codesearchnet-ruby | 44912285 | 24927 | 26981969 | 44912285 |
| 16 | scores_codetrans-contest-mixed | 917462 | 561 | 522534 | 917462 |
| 17 | scores_codetrans-dl-mixed | 898326 | 564 | 263988 | 898326 |
| 18 | scores_cosqa-python | 15839278 | 9020 | 8894666 | 15839278 |
| 19 | scores_stackoverflow-qa-mixed | 24935344 | 13951 | 15048090 | 24935344 |
| 20 | scores_synthetic-text2sql-sql | 183200346 | 100000 | 129971727 | 183200346 |

Data files follow one pattern for all configurations: the train split of each configuration is stored at `{config_name}/train-*`. For example, the train data for documents_apps-python is at `documents_apps-python/train-*`.
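
The three configuration families share `query_id` and `document_id` keys, so a score row can be resolved back to text. The following is a minimal sketch of that join, not the dataset author's tooling. It assumes that `document_ids[i]` pairs with `scores[i]` by position, which the card implies but does not state. The helper names `config_name` and `join_scores` and the toy rows are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

KINDS = ("documents", "queries", "scores")


def config_name(kind: str, subset: str) -> str:
    """Build a configuration name such as 'scores_apps-python'."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{kind}_{subset}"


def join_scores(
    queries: Iterable[dict],
    documents: Iterable[dict],
    scores: Iterable[dict],
) -> List[Tuple[str, str, float]]:
    """Return (query text, document text, score) triples.

    Assumption: document_ids[i] pairs with scores[i].
    IDs missing from either lookup are skipped instead of raising.
    """
    query_text: Dict[str, str] = {r["query_id"]: r["query"] for r in queries}
    doc_text: Dict[str, str] = {r["document_id"]: r["document"] for r in documents}

    triples: List[Tuple[str, str, float]] = []
    for row in scores:
        q = query_text.get(row["query_id"])
        if q is None:
            continue
        for doc_id, score in zip(row["document_ids"], row["scores"]):
            d = doc_text.get(doc_id)
            if d is not None:
                triples.append((q, d, float(score)))
    return triples


# Toy rows that follow the documented field names; the contents are made up.
queries = [{"query_id": "q1", "query": "reverse a list", "split": "train"}]
documents = [
    {"document_id": "d1", "document": "def rev(xs): return xs[::-1]", "split": "train"},
    {"document_id": "d2", "document": "def add(a, b): return a + b", "split": "train"},
]
scores = [
    {"query_id": "q1", "document_ids": ["d1", "d2"], "scores": [0.9, 0.1], "split": "train"}
]

triples = join_scores(queries, documents, scores)
```

With the Hugging Face `datasets` library, real rows could presumably be obtained by calling `load_dataset("Shuu12121/coir_hard_negative_datasets_v2_kd", config_name("scores", "apps-python"), split="train")`, and likewise for the matching `documents_*` and `queries_*` configurations. The card does not say how the scores were produced or whether a higher value means a more relevant document, so check that before using them as training targets.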
Provider:
Shuu12121