AUTOMATED TEMPLATE GENERATION BASED ON WORD EMBEDDINGS
Source: DataCite Commons | Updated: 2020-09-20 | Indexed: 2024-07-13
Download link:
http://proceedings.elseconference.eu/index.php?paper=f7da0ab9a52afff1f83e656a8bc5209a
Description:
Extracting document templates and generalizing the structure of similar documents from a specific domain can significantly increase learner productivity when creating new documents. More generally, drafting a document by hand can be difficult and time-consuming, whether the goal is to obtain a general form for a specific type of document or to identify the main ideas of a set of scientific papers on the same subject. Instead of starting from a blank page, which is often frustrating, we propose an automated method for grouping semantically similar documents and identifying potential templates. This paper introduces the first steps towards such a method, which relies on advanced Natural Language Processing techniques to generate templates from large collections by identifying patterns between and within documents. The underlying semantic model is word2vec, a two-layer neural network that builds word embeddings, trained on the general-purpose TASA corpus. The resulting word vectors were used to compute document representations that take normalized word occurrences into account, after which an agglomerative clustering algorithm was applied. Each cluster produced one template formed of paragraphs chosen from the original collection. To evaluate the proposed method, several experiments were conducted on collections from multiple domains. The results were analysed using charts of document and paragraph similarity, on the one hand, and evolution graphs of the agglomerative clustering process, on the other. Overall, the automated process was efficient, and the results were encouraging in terms of proposing initial document templates. Further research paths include the anonymization of named entities and more in-depth comparisons of document structure and syntax, in addition to semantic relatedness.
Provider:
ADLRO
Created:
2018-05-04
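
The pipeline outlined in the description (word vectors, document representations weighted by normalized word occurrences, agglomerative clustering, one template per cluster) could look roughly like the sketch below. It is a minimal illustration and not the authors' implementation. Several parts are assumptions: the three-dimensional `TOY_EMBEDDINGS` are invented stand-ins for word2vec vectors trained on TASA, average linkage on cosine similarity is one possible choice of agglomerative criterion, and picking the paragraphs closest to the cluster centroid is only one plausible way to choose template paragraphs. All function names are hypothetical.

```python
import math
from collections import Counter

# Invented toy vectors for illustration only; the paper uses word2vec
# embeddings trained on the TASA corpus, which are not reproduced here.
TOY_EMBEDDINGS = {
    "invoice": [0.9, 0.1, 0.0],
    "payment": [0.8, 0.2, 0.1],
    "amount": [0.7, 0.3, 0.0],
    "neural": [0.0, 0.9, 0.2],
    "network": [0.1, 0.8, 0.3],
    "training": [0.0, 0.7, 0.4],
}


def text_vector(text, embeddings=TOY_EMBEDDINGS):
    """Sum word vectors weighted by normalized word occurrences."""
    dim = len(next(iter(embeddings.values())))
    words = [w for w in text.lower().split() if w in embeddings]
    vec = [0.0] * dim
    if not words:
        return vec
    counts = Counter(words)
    total = sum(counts.values())
    for word, count in counts.items():
        weight = count / total  # normalized occurrence of this word
        for i, value in enumerate(embeddings[word]):
            vec[i] += weight * value
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def agglomerative_clusters(vectors, n_clusters):
    """Merge the most similar pair of clusters (average linkage, an
    assumption) until only n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > n_clusters:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = max(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


def build_template(docs, member_ids, n_paragraphs=1):
    """Pick the paragraphs closest to the cluster centroid (an assumed
    selection rule) as a draft template."""
    paragraphs = [p for i in member_ids for p in docs[i]]
    vecs = [text_vector(p) for p in paragraphs]
    center = [sum(col) / len(vecs) for col in zip(*vecs)]
    ranked = sorted(paragraphs,
                    key=lambda p: cosine(text_vector(p), center),
                    reverse=True)
    return ranked[:n_paragraphs]


# Each document is a list of paragraphs (toy data).
docs = [
    ["invoice payment amount", "payment amount"],
    ["invoice amount", "payment invoice"],
    ["neural network training", "network training"],
    ["neural network", "training neural"],
]
doc_vectors = [text_vector(" ".join(d)) for d in docs]
clusters = agglomerative_clusters(doc_vectors, n_clusters=2)
templates = [build_template(docs, c) for c in clusters]
print(clusters)  # [[0, 1], [2, 3]]
print(templates)
```

With real data, `TOY_EMBEDDINGS` would be replaced by vectors loaded from a trained word2vec model, and the number of clusters (or a similarity threshold at which merging stops) would have to be chosen per collection.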



