m-a-p/FineFineWeb-fasttext-seeddata
Source: Hugging Face. Updated: 2024-12-19. Indexed: 2025-04-08.
Download link:
https://hf-mirror.com/datasets/m-a-p/FineFineWeb-fasttext-seeddata
Summary:
FineFineWeb is a comprehensive, fine-grained domain web corpus covering web content from many academic domains, such as aerospace, agronomy, art, and astronomy. Its construction pipeline consists of deduplication, URL labeling, coarse recall, and fine recall, steps intended to ensure the diversity and quality of the data. Detailed statistics are provided for each domain, including token and sample counts. The dataset also includes a domain-domain similarity analysis and a domain-benchmark BPC-Acc correlation analysis to help users understand its characteristics and suitable application scenarios.
Provider:
m-a-p



