liuhangbiao/SciCode-Domain-Code

Source: Hugging Face. Updated: 2026-04-05. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/liuhangbiao/SciCode-Domain-Code
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- code
tags:
- code
- scientific-computing
- domain-specific
- chemistry
- biology
- physics
size_categories:
- 1M<n<10M
---

# DATA1: Domain-Specific Code Dataset

## Dataset Overview

DATA1 is a large-scale domain-specific code dataset focusing on code samples from interdisciplinary fields such as biology, chemistry, materials science, and related areas. The dataset is collected and organized from GitHub repositories, covering 178 different domain topics with over 1.1 billion lines of code.

## Dataset Statistics

- **Total Datasets**: 178 CSV files
- **Total Data Size**: ~115 GB
- **Total Lines of Code**: Over 1.1 billion lines
- **Data Format**: CSV (Comma-Separated Values)
- **Encoding**: UTF-8

## Dataset Structure

Each CSV file corresponds to a specific domain topic, with the naming format `dataset_{Topic}.csv`, where `{Topic}` is the domain keyword (e.g., Protein, Drug, Genomics).

### Data Field Description

Each CSV file contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `keyword` | String | Domain keyword used to identify the domain of the code sample |
| `repo_name` | String | GitHub repository name (format: owner/repo) |
| `file_path` | String | Relative path of the file in the repository |
| `file_extension` | String | File extension (e.g., .py, .java, .cpp) |
| `file_size` | Integer | File size in bytes |
| `line_count` | Integer | Number of lines of code in the file |
| `content` | String | Complete file content |
| `language` | String | Programming language (e.g., Python, Java, C++) |

## Domain Categories

The dataset covers the following major domain categories:

### Biology-Related

- **Molecular Biology**: Protein, DNA, RNA, Gene, Enzyme, Receptor, Ligand
- **Cell Biology**: Cell_biology, Single_cell, Cell_atlas, Organoid
- **Genomics**: Genomics, Genotype, Phenotype, Epigenetics, Metagenomics
- **Transcriptomics**: Transcriptomics, Spatial_Transcriptomics, Transcription, Translation
- **Proteomics**: Proteomics, Protein_Protein_Interactions, Folding
- **Metabolomics**: Metabolomics, Metabolic, Lipidomics, Glycomics
- **Systems Biology**: System_biology, Signaling, Pathway, Networks

### Chemistry-Related

- **Computational Chemistry**: Computational_Chemistry, Quantum_Chemistry, DFT, QM_MM
- **Medicinal Chemistry**: Drug, ADMET, QSAR, Docking, Lead_discovery, Lead_optimization
- **Materials Chemistry**: Material, Crystal, Conformation, Chemical_space
- **Reaction Chemistry**: Reaction, Kinetics, Mechanism, Redox

### Medicine and Pharmacology

- **Pharmacology**: Pharmacology, Pharmacokinetics, Pharmacogenomics, Pharmacogenetics
- **Medicine**: Medicine, Disease, Diagnostics, Pathology, Vaccine
- **Toxicology**: Toxicology, Biomarker, Marker

### Computational Methods

- **Machine Learning**: Transformer, GAN, VAE, Diffusion, Flow_matching, Reinforcement_learning
- **Quantum Computing**: Quantum_mechanics, Quantum_biology, Electronic_structure
- **Modeling Methods**: Modeling, Multi_scale_modeling, Agent_based_model, Stochastic_modeling
- **Numerical Methods**: Monte_Carlo, Finite_element_method, Phase_field_technique

### Other Specialized Fields

- **Bioinformatics**: Bioinformatics, Cheminformatics, Next_generation_sequencing
- **Bioengineering**: Bioengineering, Biotechnology, Biosensors
- **Immunology**: Immunology, Antibody, Antigen, Antagonist
- **Virology**: Viral, Pandemic, Pathogens, AMR (Antimicrobial Resistance)

## Data Source

The data is collected from open-source repositories on GitHub through the following process:

1. **Keyword Search**: Search for relevant repositories on GitHub using domain-specific keywords
2. **Repository Filtering**: Filter repositories based on relevance scores and code quality
3. **File Extraction**: Extract code files from the filtered repositories
4. **Categorization**: Classify files into the corresponding topic datasets based on keywords and domain characteristics

## Dataset Characteristics

1. **Wide Domain Coverage**: Covers multiple interdisciplinary fields including biology, chemistry, materials science, and medicine
2. **Diverse Code Types**: Includes multiple programming languages such as Python, Java, C++, R, and MATLAB
3. **Large Scale**: Over 1.1 billion lines of code with a total data size of about 115 GB
4. **Structured Storage**: Each domain topic is stored independently as a CSV file for convenient on-demand usage
5. **Rich Metadata**: Contains comprehensive metadata including repository information, file paths, and language types

## Usage Guidelines

### Data Loading

```python
import pandas as pd

# Load dataset for a specific domain
df = pd.read_csv('dataset_Protein.csv')

# View basic dataset information
print(f"Dataset size: {len(df)} files")
print(f"Programming language distribution: {df['language'].value_counts()}")
print(f"File type distribution: {df['file_extension'].value_counts()}")
```

### Data Filtering

```python
# Assumes `df` was loaded as in the Data Loading example

# Filter by programming language
python_files = df[df['language'] == 'Python']

# Filter by file size (e.g., files smaller than 100 KB)
small_files = df[df['file_size'] < 100000]

# Filter by line count
medium_files = df[(df['line_count'] > 50) & (df['line_count'] < 1000)]
```

### Domain-Specific Analysis

```python
import pandas as pd

# Analyze code characteristics for a specific domain
protein_df = pd.read_csv('dataset_Protein.csv')
print(f"Number of code files in Protein domain: {len(protein_df)}")
print(f"Average file size: {protein_df['file_size'].mean():.2f} bytes")
print(f"Average line count: {protein_df['line_count'].mean():.2f} lines")
```

## Important Notes

1. **File Size**: Some dataset files are large (up to several GB); be mindful of memory usage when loading them
2. **Encoding**: All files use UTF-8 encoding; ensure proper handling of special characters if encountered
3. **Data Quality**: Data is sourced from public repositories and may vary in code quality; preprocessing is recommended before use
4. **License Compliance**: Comply with the license requirements of the original repositories when using the data
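
The file-size note above warns that a single topic file can reach several GB. One way to stay within memory, sketched here under assumptions, is to stream the CSV in batches with the `chunksize` and `usecols` parameters of `pandas.read_csv`, reading only the small metadata columns and skipping `content`. The helper name `summarize_languages` and the default batch size are illustrative choices, not part of the dataset.

```python
import pandas as pd


def summarize_languages(csv_path, chunksize=10_000):
    """Count files and total lines per language without loading `content`.

    Hypothetical helper: the name and the default chunk size are assumptions.
    """
    totals = None
    # usecols keeps the large `content` column out of memory;
    # chunksize bounds how many rows are held at once.
    reader = pd.read_csv(
        csv_path,
        usecols=["language", "line_count"],
        chunksize=chunksize,
        encoding="utf-8",
    )
    for chunk in reader:
        # Per-chunk aggregate: number of files and summed line counts.
        part = chunk.groupby("language")["line_count"].agg(["count", "sum"])
        # Merge with the running totals, treating missing languages as 0.
        totals = part if totals is None else totals.add(part, fill_value=0)
    if totals is None:
        return pd.DataFrame(columns=["files", "total_lines"])
    totals = totals.astype("int64")
    totals.columns = ["files", "total_lines"]
    return totals.sort_values("total_lines", ascending=False)
```

Calling `summarize_languages('dataset_Protein.csv')` would return one row per language. Rows with a missing `language` value are dropped by `groupby`, so the per-language file counts may sum to less than the total row count.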

Provider:
liuhangbiao