
arnosimons/wikipedia-physics-corpus

Source: Hugging Face. Updated 2024-11-27. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/arnosimons/wikipedia-physics-corpus
Resource description:
The Wikipedia-Physics Corpus contains 102,409 paragraphs extracted from 6,642 key physics-related Wikipedia articles, together with word-sense labels for 1,186 occurrences of the term Planck across 885 paragraphs. Its primary purpose is to support analysis of the meaning of concepts in physics. Articles were selected with the PetScan tool; markup was removed and only minimal cleaning applied, with formulas retained and references removed. Each labeled occurrence of Planck is assigned one of several senses, such as PERSON, CONSTANT, or UNITS. The dataset was developed by Arno Simons, was funded by the European Union, and is in English.
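
The cleaning step described above (remove markup, keep formulas, drop references) could look roughly like the following. This is only a minimal sketch under assumptions: the function name `clean_paragraph` and the regular expressions are hypothetical, and the description does not say how the dataset's author actually implemented the cleaning.

```python
import re

# Hypothetical patterns for illustration; not the dataset author's actual code.
REF_SELF_CLOSING = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)
REF_PAIRED = re.compile(r"<ref\b[^>/]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
BOLD_ITALIC = re.compile(r"'{2,}")


def clean_paragraph(wikitext: str) -> str:
    """Strip references and simple wiki markup while leaving <math> formulas intact."""
    text = REF_SELF_CLOSING.sub("", wikitext)   # <ref name="x" />
    text = REF_PAIRED.sub("", text)             # <ref ...>citation</ref>
    text = WIKILINK.sub(r"\1", text)            # [[target|label]] -> label
    text = BOLD_ITALIC.sub("", text)            # '''bold''' and ''italic'' quotes
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace


sample = (
    "The '''Planck constant''' <math>h</math> links "
    "[[photon energy|energy]] to [[frequency]]."
    "<ref name=\"p\">Planck 1901</ref> It is small.<ref name=\"q\" />"
)
print(clean_paragraph(sample))
```

The `<math>` element is deliberately left untouched, matching the stated choice to retain formulas; real wikitext has many more constructs (templates, tables, nested links) that this sketch does not handle.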
Provider:
arnosimons