
HTRC Extracted Features [v.0.2] `langid`

Source: Figshare. Updated: 2016-05-16. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/dataset/HTRC_Extracted_Features_v_0_2_langid_/3382774/1
Description:
This archive contains the results of computing `langid` for the OCR tokens of each page in every volume of the HathiTrust HTRC Extracted Features [v.0.2] dataset. Each file is a CSV file of ISO 639-1 language code and probability pairs for each page, where the filename is `[HTRC-volume-identifier].basic.json.csv`. Version 1.1.5 of `langid` was used for processing.

**Warning:** this archive will decompress to a ~25 GB directory containing 4,805,434 files.
Created:
2016-05-16
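
The description above says each file holds language code and probability pairs per page, but it does not spell out the exact column layout. The following is a minimal Python sketch under stated assumptions: one row per page, no header row, with the ISO 639-1 code in the first cell and the probability in the second. The helper names (`volume_id_from_filename`, `read_page_languages`, `dominant_language`) and the sample volume identifier are hypothetical and are not part of the dataset or of `langid`.

```python
import csv
import io
from collections import Counter
from pathlib import Path

# Filename suffix as stated in the dataset description.
SUFFIX = ".basic.json.csv"


def volume_id_from_filename(path):
    """Strip the `.basic.json.csv` suffix to recover the volume identifier."""
    name = Path(path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"unexpected filename: {name}")
    return name[: -len(SUFFIX)]


def read_page_languages(handle):
    """Read (language_code, probability) pairs from an open CSV handle.

    ASSUMPTION: one row per page, no header, code in cell 0 and
    probability in cell 1. Check a real file before relying on this.
    """
    pages = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        pages.append((row[0].strip(), float(row[1])))
    return pages


def dominant_language(pages):
    """Return the most frequent language code across pages, or None if empty."""
    counts = Counter(lang for lang, _ in pages)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for one volume's CSV file.
    sample = io.StringIO("en,0.99\nen,0.87\nde,0.65\n")
    pages = read_page_languages(sample)
    print(dominant_language(pages))
    # Hypothetical identifier, used only to show the suffix handling.
    print(volume_id_from_filename("mdp.39015012345678.basic.json.csv"))
```

Because the archive expands to millions of small files, it is probably better to iterate over the directory lazily (for example with `os.scandir`) than to build a full file list in memory.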