Synthyra/ProteinSelfies
Source: Hugging Face. Updated: 2025-09-10. Indexed: 2025-04-08.
Download link:
https://hf-mirror.com/datasets/Synthyra/ProteinSelfies
Description:
The dataset consists of 10 million random examples generated from UniRef50 representative sequences, together with the SELFIES strings computed for them. The strings are not stored as text: each one is stored as the sequence of input IDs produced by a custom SELFIES tokenizer. A BERT tokenizer compatible with this vocabulary is also included in the dataset. The dataset is intended for atom-wise protein language modeling.
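To make the "stored as input IDs" idea concrete, here is a minimal sketch of how a bracketed SELFIES-style string could be mapped to integer IDs and back. Everything in it is an assumption for illustration: the vocabulary, the special tokens, and the helper names `split_selfies`, `encode`, and `decode` are hypothetical and are not the tokenizer shipped with the dataset, and the example string is just a symbol sequence, not a claim about any particular molecule.

```python
import re

# Hypothetical toy vocabulary (assumption). The real dataset ships its own
# tokenizer and vocabulary, which will differ from this one.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]
SELFIES_SYMBOLS = ["[C]", "[N]", "[O]", "[=O]", "[Branch1]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + SELFIES_SYMBOLS)}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}


def split_selfies(s):
    """Split a SELFIES-style string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", s)


def encode(s):
    """Map a SELFIES-style string to BERT-style input IDs ([CLS] ... [SEP])."""
    unk = VOCAB["[UNK]"]
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(sym, unk) for sym in split_selfies(s)]
    ids.append(VOCAB["[SEP]"])
    return ids


def decode(ids):
    """Map input IDs back to a string, dropping the special tokens."""
    tokens = (ID_TO_TOKEN[i] for i in ids)
    return "".join(t for t in tokens if t not in SPECIAL_TOKENS)


# Illustrative symbol sequence only (assumption), not taken from the dataset.
example = "[N][C][C][=O][O]"
ids = encode(example)
print(ids)
print(decode(ids))
```

With the real dataset, the same round trip would go through the included BERT tokenizer rather than a hand-written vocabulary; the sketch only shows why storing IDs is sufficient to recover the original symbol sequence.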
Provided by:
Synthyra



