kajuma/CC-news-2024-July-October-cleaned
Source: Hugging Face. Updated 2024-11-17; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/kajuma/CC-news-2024-July-October-cleaned
Description:
This dataset was built from the news subset of Common Crawl and contains Japanese news articles from July to October 2024. It amounts to 612M tokens when tokenized with the llm-jp/llm-jp-13b-v1.0 tokenizer. The data was processed with Uzushio, using filter settings based on pipeline_03a.conf.
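The token figure above depends on the tokenizer, so it helps to see how such a count is formed. The following is a minimal sketch, not the dataset author's script: the helper `count_tokens`, the column name `text`, and the sample records are all hypothetical, and a whitespace splitter stands in for the real tokenizer so the example runs offline.

```python
from typing import Callable, Iterable, Mapping, Sequence


def count_tokens(
    records: Iterable[Mapping[str, str]],
    tokenize: Callable[[str], Sequence],
    text_field: str = "text",  # assumed column name, not confirmed by the listing
) -> int:
    """Sum the number of tokens in one text field over all records."""
    total = 0
    for record in records:
        total += len(tokenize(record[text_field]))
    return total


# Made-up records and a whitespace tokenizer, used only as stand-ins.
sample = [{"text": "東京 で 新しい 駅 が 開業"}, {"text": "株価 は 続伸"}]
print(count_tokens(sample, str.split))
```

To approximate the reported count, one could presumably pass rows loaded with `datasets.load_dataset("kajuma/CC-news-2024-July-October-cleaned")` and the `tokenize` method of `transformers.AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")`; the split and column names would need to be checked against the actual dataset.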
Provider:
kajuma



