CleverThis/uniprotkb_obsolete_entries_0-v1
Source: Hugging Face | Updated: 2025-12-24 | Indexed: 2026-03-29
Download link: https://hf-mirror.com/datasets/CleverThis/uniprotkb_obsolete_entries_0-v1
---
license: cc-by-4.0
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- rdf
- knowledge-graph
- semantic-web
- triples
size_categories:
- 1K<n<10K
---
# uniprotkb_obsolete_entries_0
## Dataset Description
Comprehensive protein knowledgebase with functional annotations
**Original Source:** ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_obsolete_entries_0.rdf.xz
### Dataset Summary
This dataset contains RDF triples from uniprotkb_obsolete_entries_0 converted to HuggingFace
dataset format for easy use in machine learning pipelines.
- **Format:** Originally RDF/XML (`.rdf`), converted to HuggingFace Dataset
- **Size:** 0.392 GB (extracted)
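- **Entities:** ~90M protein entries
- **Triples:** ~3.4B
- **Note on scale:** these two counts are inconsistent with the 0.392 GB extracted size and the `1K<n<10K` size category in the card metadata; they likely describe UniProtKB as a whole rather than this file.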
- **Original License:** CC BY 4.0
### Recommended Use
Protein research, molecular biology, functional genomics
### Notes
High quality with manual curation for Swiss-Prot entries. Updated every 8 weeks.
## Dataset Format: Lossless RDF Representation
This dataset uses a **standard lossless format** for representing RDF (Resource Description Framework)
data in HuggingFace Datasets. All semantic information from the original RDF knowledge graph is preserved,
enabling perfect round-trip conversion between RDF and HuggingFace formats.
### Schema
Each RDF triple is represented as a row with **6 fields**:
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `subject` | string | Subject of the triple (URI or blank node) | `"http://schema.org/Person"` |
| `predicate` | string | Predicate URI | `"http://www.w3.org/1999/02/22-rdf-syntax-ns#type"` |
| `object` | string | Object of the triple | `"John Doe"` or `"http://schema.org/Thing"` |
| `object_type` | string | Type of object: `"uri"`, `"literal"`, or `"blank_node"` | `"literal"` |
| `object_datatype` | string | XSD datatype URI (for typed literals) | `"http://www.w3.org/2001/XMLSchema#integer"` |
| `object_language` | string | Language tag (for language-tagged literals) | `"en"` |
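
The constraints in the table can be checked mechanically before rows are used downstream. The following is a minimal sketch and not part of the CleverErnie tooling: the function name `validate_row` is hypothetical, and the rule that `object_datatype` and `object_language` are empty for non-literal objects is an assumption inferred from the table, not something the card states.

```python
from typing import Any, Mapping

# Field names and object_type values are taken from the schema table above.
REQUIRED_FIELDS = (
    "subject", "predicate", "object",
    "object_type", "object_datatype", "object_language",
)
OBJECT_TYPES = {"uri", "literal", "blank_node"}


def validate_row(row: Mapping[str, Any]) -> list[str]:
    """Return a list of problems found in one row (an empty list means OK)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    if problems:
        return problems

    for name in ("subject", "predicate", "object", "object_type"):
        if not isinstance(row[name], str):
            problems.append(f"{name} must be a string")

    object_type = row["object_type"]
    if isinstance(object_type, str) and object_type not in OBJECT_TYPES:
        problems.append(f"unknown object_type: {object_type!r}")

    # Assumption: datatype and language only apply to literals and are
    # None (or empty) for URIs and blank nodes.
    if object_type != "literal":
        for name in ("object_datatype", "object_language"):
            if row[name]:
                problems.append(f"{name} set on a non-literal object")
    return problems
```

Returning a list of messages rather than raising on the first problem makes it easy to run the check over a whole split and tally what went wrong.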
### Example: RDF Triple Representation
**Original RDF (Turtle)**:
```turtle
<http://example.org/John> <http://schema.org/name> "John Doe"@en .
```
**HuggingFace Dataset Row**:
```python
{
    "subject": "http://example.org/John",
    "predicate": "http://schema.org/name",
    "object": "John Doe",
    "object_type": "literal",
    "object_datatype": None,
    "object_language": "en",
}
```
### Loading the Dataset
```python
from datasets import load_dataset

# Load the dataset (repo ID as listed on the Hub page, including the "-v1" suffix)
dataset = load_dataset("CleverThis/uniprotkb_obsolete_entries_0-v1")

# All triples live in the single "data" split
data = dataset["data"]

# Iterate over triples
for row in data:
    subject = row["subject"]
    predicate = row["predicate"]
    obj = row["object"]
    obj_type = row["object_type"]

    print(f"Triple: ({subject}, {predicate}, {obj})")
    print(f"  Object type: {obj_type}")

    # Language and datatype are only set for literals
    if row["object_language"]:
        print(f"  Language: {row['object_language']}")
    if row["object_datatype"]:
        print(f"  Datatype: {row['object_datatype']}")
```
### Converting Back to RDF
The dataset can be converted back to any RDF format (Turtle, N-Triples, RDF/XML,
etc.) with **zero information loss**:
```python
from datasets import load_dataset
from rdflib import BNode, Graph, Literal, URIRef


def convert_to_rdf(dataset_name, output_file="output.ttl", split="data"):
    """Convert a HuggingFace dataset back to RDF Turtle format."""
    # Load dataset
    dataset = load_dataset(dataset_name)

    # Create RDF graph (the whole graph is held in memory)
    graph = Graph()

    # Convert each row to an RDF triple
    for row in dataset[split]:
        # Subject: blank nodes are stored with a "_:" prefix
        if row["subject"].startswith("_:"):
            subject = BNode(row["subject"][2:])
        else:
            subject = URIRef(row["subject"])

        # Predicate (always a URI)
        predicate = URIRef(row["predicate"])

        # Object (depends on object_type)
        object_type = row["object_type"]
        if object_type == "uri":
            obj = URIRef(row["object"])
        elif object_type == "blank_node":
            obj = BNode(row["object"].removeprefix("_:"))
        elif object_type == "literal":
            # A literal carries either a language tag or a datatype, so the
            # language tag is checked first to avoid silently dropping it.
            if row["object_language"]:
                obj = Literal(row["object"], lang=row["object_language"])
            elif row["object_datatype"]:
                obj = Literal(row["object"], datatype=URIRef(row["object_datatype"]))
            else:
                obj = Literal(row["object"])
        else:
            # Fail loudly instead of reusing the object from a previous row
            raise ValueError(f"Unknown object_type: {object_type!r}")

        graph.add((subject, predicate, obj))

    # Serialize to Turtle (or any RDF format supported by rdflib)
    graph.serialize(destination=output_file, format="turtle")
    print(f"Exported {len(graph)} triples to {output_file}")
    return graph


# Usage (repo ID as listed on the Hub page, including the "-v1" suffix)
graph = convert_to_rdf("CleverThis/uniprotkb_obsolete_entries_0-v1", "reconstructed.ttl")
```
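
The function above builds the full graph in memory before serializing. For larger splits, a line-oriented format such as N-Triples can be written one row at a time. The following is a minimal sketch using only the standard library; the helper names (`escape_literal`, `format_node`, `row_to_ntriples`, `write_ntriples`) are hypothetical, it assumes blank node labels are stored with a `_:` prefix as the code above implies, and it does not escape unusual characters inside IRIs.

```python
from typing import Iterable, Mapping


def escape_literal(text: str) -> str:
    """Escape the characters N-Triples requires inside a quoted string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def format_node(value: str) -> str:
    """Format a subject: keep blank node labels, wrap IRIs in angle brackets."""
    return value if value.startswith("_:") else f"<{value}>"


def row_to_ntriples(row: Mapping) -> str:
    """Render one dataset row as a single N-Triples line."""
    subject = format_node(row["subject"])
    predicate = f"<{row['predicate']}>"
    object_type = row["object_type"]
    if object_type == "uri":
        obj = f"<{row['object']}>"
    elif object_type == "blank_node":
        label = row["object"]
        obj = label if label.startswith("_:") else f"_:{label}"
    elif object_type == "literal":
        obj = '"' + escape_literal(row["object"]) + '"'
        if row["object_language"]:
            obj += "@" + row["object_language"]
        elif row["object_datatype"]:
            obj += f"^^<{row['object_datatype']}>"
    else:
        raise ValueError(f"Unknown object_type: {object_type!r}")
    return f"{subject} {predicate} {obj} ."


def write_ntriples(rows: Iterable[Mapping], path: str) -> int:
    """Stream rows to an N-Triples file and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(row_to_ntriples(row) + "\n")
            count += 1
    return count
```

Because `write_ntriples` accepts any iterable of row dictionaries, it can be fed the `data` split directly, and memory use stays flat regardless of the number of triples.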
### Information Preservation Guarantee
This format preserves **100% of RDF information**:
- ✅ **URIs**: Exact string representation preserved
- ✅ **Literals**: Full text content preserved
- ✅ **Datatypes**: XSD and custom datatypes preserved
(e.g., `xsd:integer`, `xsd:dateTime`)
- ✅ **Language Tags**: BCP 47 language tags preserved (e.g., `@en`, `@fr`, `@ja`)
- ✅ **Blank Nodes**: Node structure preserved (identifiers may change but
graph isomorphism maintained)
**Round-trip guarantee**: Original RDF → HuggingFace → Reconstructed RDF
produces **semantically identical** graphs.
### Querying the Dataset
You can filter and query the dataset like any HuggingFace dataset:
```python
from datasets import load_dataset

# Repo ID as listed on the Hub page, including the "-v1" suffix
dataset = load_dataset("CleverThis/uniprotkb_obsolete_entries_0-v1")
data = dataset["data"]

# Find all triples with English literals
english_literals = data.filter(
    lambda x: x["object_type"] == "literal" and x["object_language"] == "en"
)
print(f"Found {len(english_literals)} English literals")

# Find all rdf:type statements (exact match on the rdf:type URI)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
type_statements = data.filter(lambda x: x["predicate"] == RDF_TYPE)
print(f"Found {len(type_statements)} type statements")

# Convert to a pandas DataFrame for analysis (pandas must be installed)
df = data.to_pandas()

# Analyze predicate distribution
print(df["predicate"].value_counts())
```
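
Converting to pandas loads the whole split into memory. Where that is undesirable, the same kind of summary can be computed in a single pass over the rows. The following is a minimal sketch with a hypothetical helper name, `summarize_rows`; it accepts any iterable of row dictionaries, which should include the `data` split as well as an iterable obtained with `load_dataset(..., streaming=True)`.

```python
from collections import Counter
from typing import Iterable, Mapping


def summarize_rows(rows: Iterable[Mapping], top: int = 10):
    """Count predicates and object types in one pass over the rows."""
    predicates = Counter()
    object_types = Counter()
    for row in rows:
        predicates[row["predicate"]] += 1
        object_types[row["object_type"]] += 1
    # most_common returns (value, count) pairs sorted by descending count
    return predicates.most_common(top), dict(object_types)
```

The function returns the most frequent predicates as `(predicate, count)` pairs together with a plain dictionary of object type counts.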
### Dataset Format
The dataset contains all triples in a single **data** split, suitable for
machine learning tasks such as:
- Knowledge graph completion
- Link prediction
- Entity embedding
- Relation extraction
- Graph neural networks
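
Most of the tasks listed above work on integer-indexed `(head, relation, tail)` triples rather than raw URI strings. The following is a minimal sketch of that preprocessing step with a hypothetical helper name, `build_index_triples`. Skipping literal-valued triples is an assumption made here because literals are usually treated as node attributes rather than edges; the card does not prescribe it.

```python
from typing import Iterable, Mapping


def build_index_triples(rows: Iterable[Mapping]):
    """Map non-literal triples to integer (head, relation, tail) IDs."""
    entity_ids: dict[str, int] = {}
    relation_ids: dict[str, int] = {}
    triples: list[tuple[int, int, int]] = []
    for row in rows:
        if row["object_type"] == "literal":
            continue  # assumption: literals are attributes, not graph edges
        # setdefault assigns the next free ID the first time a key is seen
        head = entity_ids.setdefault(row["subject"], len(entity_ids))
        relation = relation_ids.setdefault(row["predicate"], len(relation_ids))
        tail = entity_ids.setdefault(row["object"], len(entity_ids))
        triples.append((head, relation, tail))
    return entity_ids, relation_ids, triples
```

The two dictionaries double as vocabularies, so predicted IDs can be mapped back to URIs by inverting them.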
### Format Specification
For complete technical documentation of the RDF-to-HuggingFace format, see:
📖 [RDF to HuggingFace Format Specification](https://github.com/CleverThis/cleverernie/blob/master/docs/rdf_huggingface_format_specification.md)
The specification includes:
- Detailed schema definition
- All RDF node type mappings
- Performance benchmarks
- Edge cases and limitations
- Complete code examples
### Conversion Metadata
- **Source Format**: rdf
- **Original Size**: 0.392 GB
- **Conversion Tool**: [CleverErnie RDF Pipeline](https://github.com/CleverThis/cleverernie)
- **Format Version**: 1.0
- **Conversion Date**: 2025-12-24
## Citation
If you use this dataset, please cite the original source:
- **Original Dataset:** uniprotkb_obsolete_entries_0
- **URL:** ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_obsolete_entries_0.rdf.xz
- **License:** CC BY 4.0
## Dataset Preparation
This dataset was prepared using the CleverErnie GISM framework:
```bash
# Download original dataset
python scripts/rdf_dataset_downloader.py uniprotkb_obsolete_entries_0 -o datasets/

# Convert to HuggingFace format
python scripts/convert_rdf_to_hf_dataset.py \
    datasets/uniprotkb_obsolete_entries_0/[file] \
    hf_datasets/uniprotkb_obsolete_entries_0 \
    --format xml

# Upload to HuggingFace Hub
python scripts/upload_all_datasets.py --dataset uniprotkb_obsolete_entries_0
```
## Additional Information
### Original Source
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/uniprotkb_obsolete_entries_0.rdf.xz
### Conversion Details
- Converted using: [CleverErnie GISM](https://github.com/cleverthis/cleverernie)
- Conversion script: `scripts/convert_rdf_to_hf_dataset.py`
- Dataset format: Single 'data' split with all triples
### Maintenance
This dataset is maintained by the CleverThis organization.
Provider: CleverThis



