meettilavat/InternetArchive_1899_Chunked
Source: Hugging Face. Updated 2025-11-02. Indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/meettilavat/InternetArchive_1899_Chunked
Description:
This dataset contains 163 million text chunks extracted from historical public-domain documents on the Internet Archive, covering the period from year 1 to 1899. The chunks are sorted by download count, which prioritizes frequently accessed and presumably higher-quality material. The text is primarily English, with small amounts of other European languages. Preprocessing removed disclaimers, filtered OCR artifacts, normalized whitespace, and corrected common OCR errors. Each Parquet shard includes a `text` column that stores the text as a string. Shards are compressed with Zstandard, and each holds approximately 250M characters. The dataset is suitable for pre-training language models on historical English text, fine-tuning models for historical document understanding, historical NLP research and analysis, and OCR quality assessment and improvement.
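Because the shards are Parquet files with a `text` column, the data can likely be read in streaming mode rather than downloaded in full. The following is a minimal sketch, not the provider's documented usage: it relies on the Hugging Face `datasets` library's `load_dataset(..., streaming=True)`, takes the repository ID from the download link above, and assumes the split is named `train` (check the dataset card). The helper names `iter_texts` and `stream_dataset` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def iter_texts(records: Iterable[dict], limit: Optional[int] = None) -> Iterator[str]:
    """Yield the `text` field of each record, skipping missing or empty values.

    `limit` caps the number of yielded texts; None means no cap.
    """
    texts = (record["text"] for record in records if record.get("text"))
    return islice(texts, limit)


def stream_dataset(
    repo_id: str = "meettilavat/InternetArchive_1899_Chunked",
    split: str = "train",  # assumption: the split name is not stated in the listing
):
    """Return a streaming iterable over the dataset without fetching every shard."""
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

Calling `iter_texts(stream_dataset(), limit=3)` would yield the first three chunks. Since the chunks are sorted by download count, these should come from the most frequently downloaded material.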
Provider:
meettilavat



