HTRC Extracted Features [v.0.2] `langid`
Source: DataCite Commons | Updated: 2020-09-04 | Indexed: 2024-07-25
Download link:
https://figshare.com/articles/dataset/HTRC_Extracted_Features_v_0_2_langid_/3382774/1
Description:
This archive contains the output of running `langid` on the OCR tokens of each page in every volume of the HathiTrust HTRC Extracted Features [v.0.2] dataset. Each file is a CSV file of ISO 639-1 language code and probability pairs for each page, named `[HTRC-volume-identifier].basic.json.csv`. Processing used version 1.1.5 of `langid`.
Warning: this archive decompresses to a directory of roughly 25 GB containing 4,805,434 files.
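The description above does not spell out the column layout of the per-volume CSV files, so the following is only a minimal sketch under stated assumptions: one row per page, with fields alternating between a language code and its score (`code,prob[,code,prob...]`), and no header row. The helper names (`volume_id_from_filename`, `parse_page_rows`, `top_language`) and the sample volume identifier are hypothetical and should be checked against a real file before use.

```python
import csv
from pathlib import Path

# Filename suffix taken from the dataset description.
SUFFIX = ".basic.json.csv"


def volume_id_from_filename(path):
    """Recover the HTRC volume identifier by stripping the known suffix."""
    name = Path(path).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"unexpected filename: {name}")
    return name[: -len(SUFFIX)]


def parse_page_rows(lines):
    """Parse CSV lines into one list of (code, score) pairs per page.

    ASSUMPTION: each row is one page and fields alternate code, score.
    """
    pages = []
    for row in csv.reader(lines):
        fields = [f.strip() for f in row if f.strip()]
        pairs = [
            (code, float(score))
            for code, score in zip(fields[0::2], fields[1::2])
        ]
        pages.append(pairs)
    return pages


def top_language(pairs):
    """Return the code with the highest score, or None for an empty page."""
    if not pairs:
        return None
    return max(pairs, key=lambda pair: pair[1])[0]


def read_volume(path):
    """Read one per-volume file from disk (same layout assumption)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return volume_id_from_filename(path), parse_page_rows(handle)
```

Because the archive holds several million small files, it may be preferable to stream members directly from the archive instead of extracting everything to disk first. That choice depends on the archive format, which the record does not state.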
Provider:
figshare
Created:
2016-05-16



