C4Corpus (CC BY-ND part)
Source: B2FIND (indexed 2026-04-25)
Download link:
https://b2find.eudat.eu/dataset/a955917a-92a9-5922-b653-1f95ae74a261
Description:
A large multilingual web corpus (over 10 billion tokens) licensed under the Creative Commons license family, covering 50+ languages, extracted from CommonCrawl, the largest publicly available general web crawl corpus.



