
liyucheng/l-fineweb

Source: Hugging Face. Updated: 2024-11-28. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/liyucheng/l-fineweb
Description:
---
dataset_info:
- config_name: long
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 10508746018.593323
    num_examples: 2781867
  download_size: 35147587836
  dataset_size: 10508746018.593323
- config_name: medium
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 42699699704.44106
    num_examples: 11303431
  download_size: 75050119308
  dataset_size: 42699699704.44106
- config_name: short
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 718888576987.6477
    num_examples: 190303620
  download_size: 320882355051
  dataset_size: 718888576987.6477
- config_name: xlong
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 2244430950.3190165
    num_examples: 594144
  download_size: 14663872589
  dataset_size: 2244430950.3190165
- config_name: xxlong
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 255686010.24725088
    num_examples: 67685
  download_size: 3176336379
  dataset_size: 255686010.24725088
- config_name: xxxlong
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: train
    num_bytes: 21962893.751606952
    num_examples: 5814
  download_size: 558825509
  dataset_size: 21962893.751606952
configs:
- config_name: long
  data_files:
  - split: train
    path: long/train-*
- config_name: medium
  data_files:
  - split: train
    path: medium/train-*
- config_name: short
  data_files:
  - split: train
    path: short/train-*
- config_name: xlong
  data_files:
  - split: train
    path: xlong/train-*
- config_name: xxlong
  data_files:
  - split: train
    path: xxlong/train-*
- config_name: xxxlong
  data_files:
  - split: train
    path: xxxlong/train-*
---

# Length categories

| Classification | Token Length Range | Description |
|----------------|--------------------|-------------|
| short | < 2,048 | Less than 2K tokens |
| medium | 2,048 - 4,095 | 2K to 4K tokens |
| long | 4,096 - 8,191 | 4K to 8K tokens |
| xlong | 8,192 - 16,383 | 8K to 16K tokens |
| xxlong | 16,384 - 32,767 | 16K to 32K tokens |
| xxxlong | ≥ 32,768 | 32K tokens or more |
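
The length categories above map a token count to one of the six config names. The following is a minimal sketch of that mapping, not code from the dataset author: the helper names `length_category` and `stream_config` are hypothetical, and the card does not say which of the two columns (`token_count` or `num_tokens`) or which tokenizer was used for the bucketing, so the function takes a plain integer. The loader assumes the Hugging Face `datasets` package is installed and the Hub is reachable.

```python
from bisect import bisect_right

# Upper-exclusive bucket boundaries, copied from the length-category table.
BOUNDARIES = [2048, 4096, 8192, 16384, 32768]
# Config names in ascending order of length, as listed in the table.
CONFIGS = ["short", "medium", "long", "xlong", "xxlong", "xxxlong"]


def length_category(token_count: int) -> str:
    """Return the config name whose token range contains token_count."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    # bisect_right counts how many boundaries are <= token_count,
    # which is exactly the index of the matching bucket.
    return CONFIGS[bisect_right(BOUNDARIES, token_count)]


def stream_config(config_name: str):
    """Stream the train split of one config without a full download.

    Hypothetical helper: assumes `datasets` is installed and the
    repository is accessible under the ID shown on this page.
    """
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name!r}")
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(
        "liyucheng/l-fineweb", config_name, split="train", streaming=True
    )


if __name__ == "__main__":
    for n in (100, 2048, 8191, 40000):
        print(n, length_category(n))
```

Streaming is suggested here because the `short` config alone has a listed download size of roughly 320 GB; a smaller config such as `xxxlong` may be practical to download in full.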
Provider:
liyucheng