Chemical reactions from US patents (1976-Sep2016)
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2024-07-25.
Download link:
https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873/1
Description:
Reactions extracted by text mining from United States patents published between 1976 and September 2016. The reactions are available as CML or reaction SMILES. Note that the reaction SMILES are derived from the CML. The files can be unzipped using a program such as 7-Zip.

The reactions were extracted using an enhanced version of the reaction extraction code described in https://www.repository.cam.ac.uk/handle/1810/244727, with LeadMine (https://www.nextmovesoftware.com/leadmine.html) used for chemical entity recognition.

General tips:
Duplicate reactions are frequent because the same or highly similar text occurs in multiple patents. This is especially true when combining the applications and grants datasets, since many reactions from applications later appear in patent grants.
Paragraph numbers are only present for 2005+ patent grants and patent applications.
Multiple reactions can be extracted from the same paragraph.
Atom maps in the reaction SMILES are derived using EPAM's Indigo toolkit. While typically correct, the atom maps are wrong in many cases and hence should not be relied on entirely.

The reactions have been filtered to remove common cases of incorrectly extracted reactions:
All product atoms must be accounted for by the atom mapping.
The product(s) must have >8 heavy atoms.
The product must not be charged if it is a single component.
The number of products must be <5 and the number of reactants + agents must be <16.

CML:
A schema for the CML is present in cml_xsd.zip.

Reaction SMILES:
For convenience, the reaction SMILES files include tab-delimited columns for:
PatentNumber, ParagraphNum, Year, TextMinedYield, CalculatedYield
All of this information is also present in the CML (the year is inferred from the folders).
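As a rough illustration of the reaction SMILES layout and the duplicate-reaction tip above, the following Python sketch parses tab-delimited rows and drops exact repeats. It rests on assumptions not stated in the description: that the reaction SMILES is the first column, followed by the five named columns in the listed order, and that a header row may be present. The names `ReactionRecord`, `parse_line`, `parse_lines`, and `deduplicate` are hypothetical, and the sample rows are synthetic placeholders, not records from the dataset.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class ReactionRecord:
    reaction_smiles: str
    patent_number: str
    paragraph_num: str      # empty when the row carries no paragraph number
    year: Optional[int]
    text_mined_yield: str   # kept as raw text; the yield format is not specified
    calculated_yield: str


def parse_line(line: str) -> Optional[ReactionRecord]:
    """Parse one tab-delimited row; return None for headers or malformed rows."""
    fields = line.rstrip("\r\n").split("\t")
    # A reaction SMILES has the shape reactants>agents>products, so a first
    # column with fewer than two '>' is treated as a header or a bad row.
    if fields[0].count(">") < 2:
        return None
    fields += [""] * (6 - len(fields))  # pad rows that lack trailing columns
    smiles, patent, paragraph, year, mined, calculated = fields[:6]
    return ReactionRecord(
        smiles, patent, paragraph,
        int(year) if year.isdigit() else None,
        mined, calculated,
    )


def parse_lines(lines: Iterable[str]) -> Iterator[ReactionRecord]:
    """Yield a record for every row that parses."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield record


def deduplicate(records: Iterable[ReactionRecord]) -> Iterator[ReactionRecord]:
    """Keep the first occurrence of each reaction SMILES string (exact match)."""
    seen = set()
    for record in records:
        if record.reaction_smiles not in seen:
            seen.add(record.reaction_smiles)
            yield record


# Synthetic placeholder rows (assumed column order; not real dataset records).
SAMPLE_LINES = [
    "ReactionSmiles\tPatentNumber\tParagraphNum\tYear\tTextMinedYield\tCalculatedYield",
    "CCO.CC(=O)O>>CC(=O)OCC\tUS0000001\t\t1976\t80%\t79.5%",
    "CCO.CC(=O)O>>CC(=O)OCC\tUS0000002\t0042\t2006\t\t",
]

RECORDS = list(parse_lines(SAMPLE_LINES))
UNIQUE = list(deduplicate(RECORDS))
```

Exact string matching is only a crude duplicate filter: the same reaction written with different atom-map numbers or atom ordering would not be caught, and handling that would need canonicalization with a cheminformatics toolkit.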
Provider:
figshare
Created:
2017-06-13
Collected summary
Dataset introduction

Background and challenges
Background overview
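This dataset contains chemical reactions extracted by text mining from United States patents published between 1976 and September 2016, provided in two formats: CML and reaction SMILES. The data have been filtered to remove incorrectly extracted reactions, but duplicate reactions occur and the atom mappings may be inaccurate. The dataset is suited to cheminformatics and organic synthesis research.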
The above content was collected and summarized by 遇见数据集.



