
CleverThis/uniprotkb_obsolete_entries_0-v1

Hugging Face | Updated 2025-12-24 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/CleverThis/uniprotkb_obsolete_entries_0-v1
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- rdf
- knowledge-graph
- semantic-web
- triples
size_categories:
- 1K<n<10K
---

# uniprotkb_obsolete_entries_0

## Dataset Description

Comprehensive protein knowledgebase with functional annotations

**Original Source:** ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_obsolete_entries_0.rdf.xz

### Dataset Summary

This dataset contains RDF triples from uniprotkb_obsolete_entries_0 converted to HuggingFace dataset format for easy use in machine learning pipelines.

- **Format:** Originally rdf, converted to HuggingFace Dataset
- **Size:** 0.392 GB (extracted)
- **Entities:** ~90M protein entries
- **Triples:** ~3.4B
- **Original License:** CC BY 4.0

### Recommended Use

Protein research, molecular biology, functional genomics

### Notes

High quality with manual curation for Swiss-Prot entries. Updated every 8 weeks.

## Dataset Format: Lossless RDF Representation

This dataset uses a **standard lossless format** for representing RDF (Resource Description Framework) data in HuggingFace Datasets. All semantic information from the original RDF knowledge graph is preserved, enabling perfect round-trip conversion between RDF and HuggingFace formats.

### Schema

Each RDF triple is represented as a row with **6 fields**:

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `subject` | string | Subject of the triple (URI or blank node) | `"http://schema.org/Person"` |
| `predicate` | string | Predicate URI | `"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"` |
| `object` | string | Object of the triple | `"John Doe"` or `"http://schema.org/Thing"` |
| `object_type` | string | Type of object: `"uri"`, `"literal"`, or `"blank_node"` | `"literal"` |
| `object_datatype` | string | XSD datatype URI (for typed literals) | `"http://www.w3.org/2001/XMLSchema#integer"` |
| `object_language` | string | Language tag (for language-tagged literals) | `"en"` |

### Example: RDF Triple Representation

**Original RDF (Turtle)**:

```turtle
<http://example.org/John> <http://schema.org/name> "John Doe"@en .
```

**HuggingFace Dataset Row**:

```python
{
    "subject": "http://example.org/John",
    "predicate": "http://schema.org/name",
    "object": "John Doe",
    "object_type": "literal",
    "object_datatype": None,
    "object_language": "en"
}
```

### Loading the Dataset

```python
from datasets import load_dataset

# Load the dataset.
# The hosting page lists this repository as
# CleverThis/uniprotkb_obsolete_entries_0-v1; adjust the ID if the
# unsuffixed name does not resolve.
dataset = load_dataset("CleverThis/uniprotkb_obsolete_entries_0")

# Access the data (all triples live in the single "data" split)
data = dataset["data"]

# Iterate over triples
for row in data:
    subject = row["subject"]
    predicate = row["predicate"]
    obj = row["object"]
    obj_type = row["object_type"]

    print(f"Triple: ({subject}, {predicate}, {obj})")
    print(f"  Object type: {obj_type}")

    # Language and datatype are only set for literals that carry them
    if row["object_language"]:
        print(f"  Language: {row['object_language']}")
    if row["object_datatype"]:
        print(f"  Datatype: {row['object_datatype']}")
```

### Converting Back to RDF

The dataset can be converted back to any RDF format (Turtle, N-Triples, RDF/XML, etc.) with **zero information loss**:

```python
from datasets import load_dataset
from rdflib import Graph, URIRef, Literal, BNode


def convert_to_rdf(dataset_name, output_file="output.ttl", split="data"):
    """Convert HuggingFace dataset back to RDF Turtle format."""
    # Load dataset
    dataset = load_dataset(dataset_name)

    # Create RDF graph
    graph = Graph()

    # Convert each row to RDF triple
    for row in dataset[split]:
        # Subject: blank nodes are stored with a "_:" prefix
        if row["subject"].startswith("_:"):
            subject = BNode(row["subject"][2:])
        else:
            subject = URIRef(row["subject"])

        # Predicate (always URI)
        predicate = URIRef(row["predicate"])

        # Object (depends on object_type)
        if row["object_type"] == "uri":
            obj = URIRef(row["object"])
        elif row["object_type"] == "blank_node":
            obj = BNode(row["object"][2:])
        elif row["object_type"] == "literal":
            if row["object_datatype"]:
                obj = Literal(row["object"], datatype=URIRef(row["object_datatype"]))
            elif row["object_language"]:
                obj = Literal(row["object"], lang=row["object_language"])
            else:
                obj = Literal(row["object"])
        else:
            # Fail loudly instead of reusing the previous row's object
            raise ValueError(f"Unknown object_type: {row['object_type']!r}")

        graph.add((subject, predicate, obj))

    # Serialize to Turtle (or any RDF format)
    graph.serialize(output_file, format="turtle")
    print(f"Exported {len(graph)} triples to {output_file}")
    return graph


# Usage
graph = convert_to_rdf("CleverThis/uniprotkb_obsolete_entries_0", "reconstructed.ttl")
```

### Information Preservation Guarantee

This format preserves **100% of RDF information**:

- ✅ **URIs**: Exact string representation preserved
- ✅ **Literals**: Full text content preserved
- ✅ **Datatypes**: XSD and custom datatypes preserved (e.g., `xsd:integer`, `xsd:dateTime`)
- ✅ **Language Tags**: BCP 47 language tags preserved (e.g., `@en`, `@fr`, `@ja`)
- ✅ **Blank Nodes**: Node structure preserved (identifiers may change but graph isomorphism maintained)

**Round-trip guarantee**: Original RDF → HuggingFace → Reconstructed RDF produces **semantically identical** graphs.
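
To make the row-to-RDF mapping behind the round-trip guarantee concrete, the following is a minimal, dependency-free sketch that renders one row as a single N-Triples line. It is an illustration only, not the CleverErnie pipeline's code: the names `escape_literal`, `node`, and `row_to_ntriples` are hypothetical, and the sketch assumes blank nodes are stored with a `_:` prefix, as the conversion code above implies.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[str]]


def escape_literal(text: str) -> str:
    """Escape the characters that may not appear raw inside an N-Triples string."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def node(value: str) -> str:
    """Render a URI or blank node (assumption: blank nodes carry the '_:' prefix)."""
    return value if value.startswith("_:") else "<" + value + ">"


def row_to_ntriples(row: Row) -> str:
    """Render one dataset row (hypothetical helper) as a single N-Triples line."""
    subject = node(row["subject"])
    predicate = "<" + row["predicate"] + ">"
    kind = row["object_type"]

    if kind == "uri":
        obj = "<" + row["object"] + ">"
    elif kind == "blank_node":
        obj = node(row["object"])
    elif kind == "literal":
        obj = '"' + escape_literal(row["object"]) + '"'
        # A language-tagged literal is written with its tag and no separate
        # datatype, so the tag is checked first.
        if row.get("object_language"):
            obj += "@" + row["object_language"]
        elif row.get("object_datatype"):
            obj += "^^<" + row["object_datatype"] + ">"
    else:
        raise ValueError(f"Unknown object_type: {kind!r}")

    return f"{subject} {predicate} {obj} ."


example_row = {
    "subject": "http://example.org/John",
    "predicate": "http://schema.org/name",
    "object": "John Doe",
    "object_type": "literal",
    "object_datatype": None,
    "object_language": "en",
}
print(row_to_ntriples(example_row))
# <http://example.org/John> <http://schema.org/name> "John Doe"@en .
```

The printed line matches the Turtle example from the schema section, which is the property the round-trip guarantee relies on: every field of a row maps to exactly one part of the serialized triple.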

### Querying the Dataset

You can filter and query the dataset like any HuggingFace dataset:

```python
from datasets import load_dataset

dataset = load_dataset("CleverThis/uniprotkb_obsolete_entries_0")

# Find all triples with English literals
english_literals = dataset["data"].filter(
    lambda x: x["object_type"] == "literal" and x["object_language"] == "en"
)
print(f"Found {len(english_literals)} English literals")

# Find all rdf:type statements
type_statements = dataset["data"].filter(
    lambda x: "rdf-syntax-ns#type" in x["predicate"]
)
print(f"Found {len(type_statements)} type statements")

# Convert to Pandas for analysis
import pandas as pd

df = dataset["data"].to_pandas()

# Analyze predicate distribution
print(df["predicate"].value_counts())
```

### Dataset Format

The dataset contains all triples in a single **data** split, suitable for machine learning tasks such as:

- Knowledge graph completion
- Link prediction
- Entity embedding
- Relation extraction
- Graph neural networks

### Format Specification

For complete technical documentation of the RDF-to-HuggingFace format, see:

📖 [RDF to HuggingFace Format Specification](https://github.com/CleverThis/cleverernie/blob/master/docs/rdf_huggingface_format_specification.md)

The specification includes:

- Detailed schema definition
- All RDF node type mappings
- Performance benchmarks
- Edge cases and limitations
- Complete code examples

### Conversion Metadata

- **Source Format**: rdf
- **Original Size**: 0.392 GB
- **Conversion Tool**: [CleverErnie RDF Pipeline](https://github.com/CleverThis/cleverernie)
- **Format Version**: 1.0
- **Conversion Date**: 2025-12-24

## Citation

If you use this dataset, please cite the original source:

**Original Dataset:** uniprotkb_obsolete_entries_0
**URL:** ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_obsolete_entries_0.rdf.xz
**License:** CC BY 4.0

## Dataset Preparation

This dataset was prepared using the CleverErnie GISM framework:

```bash
# Download original dataset
python scripts/rdf_dataset_downloader.py uniprotkb_obsolete_entries_0 -o datasets/

# Convert to HuggingFace format
python scripts/convert_rdf_to_hf_dataset.py \
    datasets/uniprotkb_obsolete_entries_0/[file] \
    hf_datasets/uniprotkb_obsolete_entries_0 \
    --format xml

# Upload to HuggingFace Hub
python scripts/upload_all_datasets.py --dataset uniprotkb_obsolete_entries_0
```

## Additional Information

### Original Source

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_obsolete_entries_0.rdf.xz

### Conversion Details

- Converted using: [CleverErnie GISM](https://github.com/cleverthis/cleverernie)
- Conversion script: `scripts/convert_rdf_to_hf_dataset.py`
- Dataset format: Single 'data' split with all triples

### Maintenance

This dataset is maintained by the CleverThis organization.
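
As a complement to the queries shown under "Querying the Dataset", the predicate distribution can also be computed in a single pass over the rows without building a DataFrame. The sketch below is a hedged illustration: `predicate_counts` and `class_counts` are hypothetical helpers, the sample rows are invented `example.org` data rather than records from this dataset, and the only assumption about the input is that it is an iterable of dicts with the six schema fields.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

Row = Dict[str, Optional[str]]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def predicate_counts(rows: Iterable[Row]) -> Counter:
    """Count how often each predicate URI occurs, one row at a time."""
    return Counter(row["predicate"] for row in rows)


def class_counts(rows: Iterable[Row]) -> Counter:
    """Count rdf:type objects, i.e. how many type statements name each class."""
    return Counter(row["object"] for row in rows if row["predicate"] == RDF_TYPE)


# Invented sample rows for illustration only
sample_rows = [
    {"subject": "http://example.org/a", "predicate": RDF_TYPE,
     "object": "http://example.org/Protein", "object_type": "uri",
     "object_datatype": None, "object_language": None},
    {"subject": "http://example.org/b", "predicate": RDF_TYPE,
     "object": "http://example.org/Protein", "object_type": "uri",
     "object_datatype": None, "object_language": None},
    {"subject": "http://example.org/a", "predicate": "http://example.org/label",
     "object": "alpha", "object_type": "literal",
     "object_datatype": None, "object_language": "en"},
]

print(predicate_counts(sample_rows).most_common(1))
print(class_counts(sample_rows))
```

Comparing the predicate against the full `rdf:type` URI, rather than testing for a substring, avoids counting unrelated predicates whose URI merely contains the same fragment. Both helpers accept any iterable, so a loaded split such as `dataset["data"]` could be passed in place of `sample_rows`.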

Provider:
CleverThis