Noise-Robust De-Duplication at Scale
Source: NBER. Updated: 2022-12-01. Indexed: 2025-01-04.
Download link:
https://www.nber.org/papers/w30726
Description:
Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse
Provider:
National Bureau of Economic Research (NBER)
Created:
2022-12-01



