A Large-scale Dataset of (Open Source) License Text Variants
Source: arXiv · Updated: 2022-04-01 · Indexed: 2024-06-21
Download link:
https://doi.org/10.5281/zenodo.6379164
Description:
"A Large-scale Dataset of (Open Source) License Text Variants" was created by Télécom Paris and contains 6.5 million unique open-source software license files. Its content is drawn from the Software Heritage archive and covers versions of the files commonly used to convey licensing terms. The files were collected and then deduplicated so that each one appears only once. The dataset is intended mainly for empirical studies of open-source licensing, for training automated license classifiers, and for natural language processing (NLP) analysis of legal texts. It also supports historical and phylogenetic studies.
Provided by:
Télécom Paris
Created:
2022-04-01



