sudy-super/piece-of-refined-oscar
Hugging Face · Updated 2024-04-06 · Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/sudy-super/piece-of-refined-oscar
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- ja
size_categories:
- 1M<n<10M
---
# Description
This dataset is a cleaned subset of [OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301).
It contains about 0.5B tokens, as counted by the [calm2](https://huggingface.co/cyberagent/calm2-7b) tokenizer.
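A token count like the one above could be reproduced along the following lines. This is a minimal sketch, not the author's script: `count_tokens` and `load_calm2_tokenize` are hypothetical helper names, and excluding special tokens from the count is an assumption.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], tokenize: Callable[[str], List]) -> int:
    """Sum the number of tokens that `tokenize` produces over all texts."""
    return sum(len(tokenize(text)) for text in texts)


def load_calm2_tokenize() -> Callable[[str], List[int]]:
    """Build a tokenize function backed by the calm2 tokenizer.

    Requires the `transformers` package and network access on first use.
    Special tokens are excluded from the count (an assumption of this sketch).
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b")
    return lambda text: tokenizer(text, add_special_tokens=False)["input_ids"]


if __name__ == "__main__":
    # Stand-in whitespace tokenizer so the sketch runs offline.
    print(count_tokens(["a b c", "d e"], str.split))
```

To count the real corpus, the texts could be streamed with `datasets.load_dataset("sudy-super/piece-of-refined-oscar", split="train", streaming=True)` and passed to `count_tokens` together with `load_calm2_tokenize()`. The split name and the name of the text column are assumptions that should be checked against the dataset itself.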
# NOTE
This dataset has not undergone sentence-boundary detection or perplexity filtering, so there is room for improvement in quality.
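One simple reading of a sentence-end boundary step is to cut each document after its last sentence terminator and drop documents that have none. The sketch below illustrates that idea only. The terminator set, the closing-bracket set, and the function name `trim_to_sentence_end` are assumptions, and perplexity filtering (which needs a language model) is not covered.

```python
from typing import Optional

# Characters treated as Japanese sentence terminators (an assumption of this sketch).
SENTENCE_END = "。！？!?"
# Closing brackets and quotes that may directly follow a terminator.
CLOSERS = "」』）)"


def trim_to_sentence_end(text: str) -> Optional[str]:
    """Cut `text` after its last sentence terminator.

    Returns None when the text contains no terminator at all, which can
    signal menu fragments or otherwise truncated lines.
    """
    stripped = text.rstrip()
    last = max(stripped.rfind(ch) for ch in SENTENCE_END)
    if last == -1:
        return None
    end = last + 1
    # Keep closing brackets that immediately follow the terminator.
    while end < len(stripped) and stripped[end] in CLOSERS:
        end += 1
    return stripped[:end]


if __name__ == "__main__":
    print(trim_to_sentence_end("今日は晴れです。明日は"))
```

A filter built on this helper would keep the trimmed text when the result is not `None` and discard the document otherwise. A production pipeline would likely also need rules for headings and list items, which legitimately end without punctuation.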
Provider:
sudy-super
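Summary of original information
Dataset overview
Basic information
- License: Apache-2.0
- Task category: text generation
- Language: Japanese
- Dataset size: 1M<n<10M
Description
- This dataset is a cleaned subset of OSCAR-2301.
- It contains about 0.5B tokens, as counted by the calm2 tokenizer.
Notes
- This dataset has not undergone sentence-boundary detection or perplexity filtering, so its quality can still be improved.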



