Topic Modeling of Palantir Patents and List of Palantir Contracts
Source: DataCite Commons | Updated: 2021-08-19 | Indexed: 2024-07-13
Download link:
https://scholarshare.temple.edu/handle/20.500.12613/6805
Description:
For this study, we scraped all of Palantir’s patents containing the word “ontology” (as of 08/25/20) from Google Patents. This produced a purposive sample (n=155) of Palantir patents, comprising 5,197 pages, over 2.5 million words, and over 18.5 million characters. We prepared the data set for processing by stripping all metadata and special features, converting formats, compressing, and collating the patents. We imported several Python libraries for data processing (Pandas, Matplotlib, NumPy, and Seaborn) and used Google Colaboratory to assemble the patent data, which was then loaded in a textual paragraph format. Preprocessing included removal of punctuation, null values, and stop words, as well as lemmatization, lowercase conversion, and tokenization. Part-of-speech (POS) tagging was then performed, with each token tagged according to its part of speech based on context and definition, which yielded the most frequent nouns, verbs, and so on. Next, named entity recognition was performed to locate and classify entities in the text into predefined categories such as persons, organizations, locations, times, quantities, monetary values, and percentages. Topic modeling was performed using a bag-of-words model and Latent Dirichlet Allocation. We also downloaded a list of US Government contracts with Palantir from the Federal Procurement Data System (as of 08/04/21).
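The preprocessing and topic-modeling steps described above can be illustrated with a minimal sketch. This is not the authors' code: the description does not say which LDA implementation or stop-word list was used, so the example below assumes a tiny hand-written stop-word list, uses made-up toy sentences in place of patent text, and fits LDA with a small collapsed Gibbs sampler written in plain Python. All function names (`preprocess`, `bag_of_words`, `lda_gibbs`, `top_words`) and hyperparameter values are hypothetical. Lemmatization, POS tagging, and named entity recognition are omitted because they need an external NLP library.

```python
import random
import re

# Tiny illustrative stop-word list (an assumption; the study's list is not given).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for", "by", "with", "was", "has"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, tokenize, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Replace each token with its vocabulary id; return (id lists, vocabulary)."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {word: i for i, word in enumerate(vocab)}
    return [[index[t] for t in doc] for doc in docs], vocab


def lda_gibbs(docs, n_vocab, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (topic-word, doc-topic) counts."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * n_vocab)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic


def top_words(topic_word, vocab, n=3):
    """Return the n highest-count words for each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: -row[i])
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Made-up toy sentences standing in for patent paragraphs.
texts = [
    "The ontology defines object types and the links between objects.",
    "An object in the ontology has properties and a type.",
    "The contract was awarded to the vendor by the agency.",
    "The agency signed a contract with the vendor for software.",
]
ids, vocab = bag_of_words([preprocess(t) for t in texts])
topic_word, doc_topic = lda_gibbs(ids, len(vocab))
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
```

On a corpus the size of the one described (155 patents, over 2.5 million words), a vectorized library implementation of LDA would be the practical choice; the plain-Python sampler here only makes the bag-of-words counts and the topic resampling step visible.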
Provider:
My University
Created:
2021-08-19



