big-banyan-tree/BBT_CommonCrawl_2024
Source: Hugging Face. Updated: 2024-10-11. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/big-banyan-tree/BBT_CommonCrawl_2024
Description:
The BBT-CC24 dataset is part of the BigBanyanTree initiative, which aims to help colleges set up their own data engineering clusters and to build interest in data processing and analysis with tools such as Apache Spark. The dataset was processed by Gautam and Suchit under the guidance of Harsh Singhal. It consists of fields extracted from Common Crawl WARC files: 900 WARC files were randomly sampled from the 2024-33 Common Crawl dump, and the extracted records were enriched with geolocation information from the MaxMind database (GeoLite2-City_20240903). The data has not been filtered, so it may contain inaccurate or outdated information as well as objectionable content.
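The description says only that the 900 WARC files were "randomly sampled" and gives no further detail. As an illustration, here is a minimal sketch of one way such a sample could be drawn reproducibly. It assumes uniform sampling without replacement and a fixed seed. The function name `sample_warc_paths`, the seed value, and the placeholder file names are hypothetical and are not taken from the BigBanyanTree pipeline.

```python
import random


def sample_warc_paths(paths, k=900, seed=0):
    """Draw k distinct paths uniformly at random and return them sorted.

    The seed is fixed so that the same listing always yields the same sample
    (an assumption for illustration; the original sampling method is unknown).
    """
    if k > len(paths):
        raise ValueError(f"cannot sample {k} paths from a listing of {len(paths)}")
    rng = random.Random(seed)
    return sorted(rng.sample(paths, k))


# Placeholder listing standing in for the full list of WARC paths in a crawl.
# These names are made up; a real listing would come from the crawl's path index.
all_paths = [f"example/warc/file-{i:05d}.warc.gz" for i in range(90_000)]

sampled = sample_warc_paths(all_paths)
print(len(sampled))
```

Sorting the result is a convenience so that repeated runs produce an identical, diff-friendly file list; it does not affect which files are selected.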
Provider:
big-banyan-tree



