unlearning-cleanslate/eval-simnpo_qwen3-8b_20260428_063109-debug-post-qwen

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/unlearning-cleanslate/eval-simnpo_qwen3-8b_20260428_063109-debug-post-qwen
Resource overview:
This dataset contains text analysis data for evaluating how strongly a language model has memorized text fragments. Each example records summary features: text length in characters, number of windows, number of memorized windows, memorized fraction, coverage, probability statistics (maximum, mean, median, minimum, and standard deviation), the best window index and its probability, the seed and target text, window start and end character positions, the evaluation model, window size, stride, and evaluation threshold, among others.

Each example also includes a detailed list of windows (end character, index, memorized flag, log probability, number of target tokens, probability, seed, start character, target text, list of target log probabilities, and list of target ranks), along with content ID, title, creators, and year. The dataset is designated for training and comprises 4,663 examples, with a total size of approximately 2.67 GB.
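
As a rough illustration of how the per-window values could relate to the summary features described above, here is a minimal Python sketch. It is not taken from the dataset's own tooling. The function name `summarize_windows`, the dictionary keys, the rule that a window counts as memorized when its probability is at or above the threshold, and the use of the population standard deviation are all assumptions made for this example. The dataset's actual column names and definitions may differ.

```python
import math
import statistics


def summarize_windows(window_logprobs, threshold):
    """Aggregate per-window log probabilities into summary statistics.

    Hypothetical helper: key names and the ">= threshold" memorization
    rule are assumptions, not the dataset's documented schema.
    """
    if not window_logprobs:
        raise ValueError("at least one window is required")

    # Convert each window's log probability into a plain probability.
    probs = [math.exp(lp) for lp in window_logprobs]

    # Assumed rule: a window is "memorized" if its probability
    # reaches the evaluation threshold.
    memorized = [p >= threshold for p in probs]

    # Index of the window with the highest probability.
    best_idx = max(range(len(probs)), key=probs.__getitem__)

    return {
        "n_windows": len(probs),
        "n_memorized": sum(memorized),
        "memorized_frac": sum(memorized) / len(probs),
        "max_prob": max(probs),
        "mean_prob": statistics.fmean(probs),
        "median_prob": statistics.median(probs),
        "min_prob": min(probs),
        # Population standard deviation (an assumption; the dataset
        # may use the sample standard deviation instead).
        "std_prob": statistics.pstdev(probs),
        "best_window_idx": best_idx,
        "best_window_prob": probs[best_idx],
    }


if __name__ == "__main__":
    example = [math.log(0.9), math.log(0.1), math.log(0.5)]
    print(summarize_windows(example, threshold=0.4))
```

Keeping the per-window list alongside the aggregates, as the dataset reportedly does, means the summary values can be recomputed under a different threshold without re-running the evaluation model.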
Provider:
unlearning-cleanslate