unlearning-cleanslate/fsid-curated-olmo-32b

Name: unlearning-cleanslate/fsid-curated-olmo-32b
Creator: unlearning-cleanslate
Published: 2026-04-27 08:43:18
License: No description available

Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/unlearning-cleanslate/fsid-curated-olmo-32b

Description:

This dataset consists of four configurations: forget, forget_pool, retain, and retain_pool. It is designed for analyzing how machine learning models memorize textual content such as song lyrics.

The forget configuration includes the features request_id, content_id, content_title, window_idx, prefix, suffix, memorized_fraction, and rule_name. Its splits are baseline, bm25_10B, bm25_6T, and igm_10B, and it is used to evaluate a model's ability to forget specific content.

The forget_pool configuration includes features such as content_id, content_title, content_creators, content_year, lyrics, and memorized_fraction. It has only a train split and is likely intended for training or pooling analysis.

The retain configuration contains the features text and rule_name, with splits similar to those of forget, and is used to assess how well a model retains content.

The retain_pool configuration includes a large number of features, such as text_length_chars, window statistics, ROUGE-L scores, perplexity, and detailed per-window information. It has only a train split and supports in-depth analysis of memorization and reproduction.

Overall, the dataset relates to memorization research in natural language processing and may be useful for model debugging, privacy assessment, or performance optimization.
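The configuration and split layout described above can be sketched as a small lookup table plus a loader. This is a minimal sketch, not the publisher's own code: the helper name `load_config_split` and the `CONFIG_SPLITS` table are hypothetical, and the split list for retain is an assumption based on the statement that its splits are "similar to" those of forget. The loader relies on `load_dataset` from the Hugging Face `datasets` library, which must be installed separately.

```python
# Repository ID as given in the listing.
REPO_ID = "unlearning-cleanslate/fsid-curated-olmo-32b"

# Configurations and splits as named in the description.
CONFIG_SPLITS = {
    "forget": ["baseline", "bm25_10B", "bm25_6T", "igm_10B"],
    "forget_pool": ["train"],
    # Assumption: the description only says these are "similar to" forget.
    "retain": ["baseline", "bm25_10B", "bm25_6T", "igm_10B"],
    "retain_pool": ["train"],
}


def load_config_split(config: str, split: str):
    """Validate the (config, split) pair, then load it from the Hub.

    Hypothetical helper for illustration; raises ValueError on names
    that do not appear in the description.
    """
    if config not in CONFIG_SPLITS:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in CONFIG_SPLITS[config]:
        raise ValueError(f"configuration {config!r} has no split {split!r}")
    # Imported lazily so the table above is usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_config_split("forget", "baseline")` would request the baseline split of the forget configuration. The call needs network access, and the actual split names on the Hub should be checked against the table before relying on it. To download through a mirror such as the one in the link above, the `HF_ENDPOINT` environment variable can be set before importing the library.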

Provider:

unlearning-cleanslate