
Musk (Version 1) Data Set: learning to predict whether a new molecule is a musk or a non-musk

Indexed by Payititi (帕依提提) on 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-26163.html
Resource description:
Data Set Information:
This dataset describes a set of 92 molecules, of which 47 are judged by human experts to be musks and the remaining 45 are judged to be non-musks. The goal is to learn to predict whether a new molecule is a musk or a non-musk. However, the 166 features that describe these molecules depend on the exact shape, or conformation, of the molecule. Because bonds can rotate, a single molecule can adopt many different shapes. To generate this dataset, the low-energy conformations of the molecules were generated and then filtered to remove highly similar conformations. This left 476 conformations. A feature vector describing each conformation was then extracted.

This many-to-one relationship between feature vectors and molecules is called the "multiple-instance problem". When learning a classifier for this data, the classifier should label a molecule "musk" if any of its conformations is classified as a musk. If none of a molecule's conformations is classified as a musk, the molecule should be labeled "non-musk".

Attribute Information:
1. Molecule Name: The symbolic name of each molecule. Musks have names such as Musk-188. Non-musks have names such as Non-MUSK-jp13.
2. Conformation Name: The symbolic name of each conformation. Names have the format `MOL_ISO+CONF`, where MOL is the molecule number, ISO is the stereoisomer number (usually 1), and CONF is the conformation number.
3. f1 to f162: These are "distance features" along rays (see the paper by Dietterich et al. listed under Relevant Papers). The distances are measured in hundredths of an angstrom. They can be negative or positive, because they are measured relative to an origin placed along each ray. That origin was defined by a "consensus musk" surface that is no longer used. Therefore, any experiment with the data should treat these feature values as lying on an arbitrary continuous scale. In particular, a learning algorithm should not make use of the zero point or the sign of any feature value.
4. f163: OXY-DIS: The distance from the oxygen atom in the molecule to a designated point in 3D space.
5. f164: OXY-X: X-displacement from the designated point.
6. f165: OXY-Y: Y-displacement from the designated point.
7. f166: OXY-Z: Z-displacement from the designated point.
8. Class: 0 => non-musk, 1 => musk

Please note that the molecule_name and conformation_name attributes should not be used to predict the class.

Relevant Papers:
Dietterich, T. G., Lathrop, R. H., Lozano-Perez, T. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence.

Papers That Cite This Data Set:
Qingping Tao and Stephen Scott and N. V. Vinodchandran and Thomas T. Osugi. SVM-based generalized multiple-instance learning via approximate box counting. ICML.

Creators:
AI Group at Arris Pharmaceutical Corporation
Contact: David Chapman or Ajay Jain
Arris Pharmaceutical Corporation
385 Oyster Point Blvd.
South San Francisco, CA 94080
415-737-8600
zvona '@' arris.com, jain '@' arris.com

Donor:
Tom Dietterich
Department of Computer Science
Oregon State University
Corvallis, OR 97331
503-737-5559
tgd '@' cs.orst.edu
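The molecule-level labeling rule stated under Data Set Information (a molecule is a musk if any of its conformations is classified as a musk) can be illustrated with a short Python sketch. Everything below is an assumption made for illustration and is not part of the dataset: the row layout, the function names, the toy molecule names, the two-feature vectors, and the rectangle bounds are all made up. The instance-level classifier shown is a simple axis-parallel box test, chosen only because it depends on neither the zero point nor the sign of a feature; it is not the trained model from the cited paper. The molecule name is used only to group conformations, never as a predictor.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

# Assumed row layout: (molecule_name, conformation_name, feature_vector).
Row = Tuple[str, str, Sequence[float]]


def group_into_bags(rows: Iterable[Row]) -> Dict[str, List[Sequence[float]]]:
    """Collect the feature vectors of all conformations of each molecule."""
    bags: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for molecule_name, _conformation_name, features in rows:
        # The name is a grouping key only; it is never passed to the classifier.
        bags[molecule_name].append(features)
    return dict(bags)


def predict_molecules(
    rows: Iterable[Row],
    instance_classifier: Callable[[Sequence[float]], int],
) -> Dict[str, int]:
    """Label a molecule 1 (musk) if any conformation is classified 1, else 0."""
    return {
        name: int(any(instance_classifier(f) == 1 for f in conformations))
        for name, conformations in group_into_bags(rows).items()
    }


def make_box_classifier(
    lower: Sequence[float], upper: Sequence[float]
) -> Callable[[Sequence[float]], int]:
    """Return a classifier that outputs 1 when every feature lies in [lower, upper]."""
    def classify(features: Sequence[float]) -> int:
        inside = all(lo <= x <= hi for x, lo, hi in zip(features, lower, upper))
        return int(inside)
    return classify


# Toy input with made-up names and values (two features instead of 166).
toy_rows: List[Row] = [
    ("MOL-A", "A_1+1", (10.0, -50.0)),
    ("MOL-A", "A_1+2", (42.0, -5.0)),   # falls inside the box below
    ("MOL-B", "B_1+1", (90.0, 30.0)),
    ("MOL-B", "B_1+2", (-20.0, 12.0)),
]

# Hypothetical box bounds, picked by hand for this example.
box = make_box_classifier(lower=(30.0, -10.0), upper=(60.0, 10.0))
predictions = predict_molecules(toy_rows, box)
print(predictions)
```

In this toy input, MOL-A is labeled 1 because one of its two conformations lies inside the box, while MOL-B is labeled 0 because neither of its conformations does. Any instance-level classifier with the same call signature could be passed to `predict_molecules` in place of the box test.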
Provided by:
Payititi (帕依提提)