root-signals/helpsteer2-binarized-granular-tiny
Source: Hugging Face | Updated: 2025-03-11 | Indexed: 2025-04-08
Download link:
https://hf-mirror.com/datasets/root-signals/helpsteer2-binarized-granular-tiny
Description:
This is the training split of NVIDIA's HelpSteer2 dataset, binarized, sorted by length using the Llama3 tokenizer, and divided into multi-turn and single-turn subsets. The 500 splits contain chosen responses of 500-1000 tokens; the 1000 split contains chosen responses of 1000+ tokens. An example is categorized as multi-turn only if it contains at least one User and Assistant pair in addition to the main response. If the distinction does not matter to you, there is also a combined split that includes everything, binarized only. Note that ids are not the same between the splits, so joining on them will not work. This is the tiny variant, with 35 rows per split, for quick testing and iteration. The full variant is available [here](https://huggingface.co/datasets/root-signals/helpsteer2-binarized-granular-full).
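The categorization rules in the description can be sketched roughly as follows. This is a minimal illustration and not the dataset authors' actual preprocessing code: the function names `length_bucket` and `is_multi_turn`, the role labels `"user"` and `"assistant"`, and the handling of the exact 1000-token boundary are all assumptions made for the example. Token counts are taken as plain integers here; in the real dataset they come from the Llama3 tokenizer.

```python
from typing import List, Optional


def length_bucket(num_tokens: int) -> Optional[str]:
    """Map the token count of a chosen response to a length bucket.

    Assumption: exactly 1000 tokens falls in the "1000" bucket; the
    description does not say which side of the boundary it belongs to.
    """
    if num_tokens >= 1000:
        return "1000"
    if num_tokens >= 500:
        return "500"
    # Shorter responses belong to neither length bucket in this sketch.
    return None


def is_multi_turn(context_roles: List[str]) -> bool:
    """Return True if the context before the main response already
    contains at least one user message followed by an assistant message.

    `context_roles` is a hypothetical representation: the ordered roles
    of all messages that precede the main (chosen) response.
    """
    pairs = zip(context_roles, context_roles[1:])
    return any(a == "user" and b == "assistant" for a, b in pairs)


# Example: one earlier exchange plus a new user prompt counts as multi-turn.
example_is_multi = is_multi_turn(["user", "assistant", "user"])
# Example: a single user prompt counts as single-turn.
example_is_single = not is_multi_turn(["user"])
```

Under these assumptions, a 750-token chosen response with a single user prompt would land in a single-turn 500 subset, while a 1200-token response following an earlier exchange would land in a multi-turn 1000 subset.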
Provider:
root-signals



