unlearning-cleanslate/fsid-curated-olmo-32b
Source: Hugging Face | Updated: 2026-04-27 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/unlearning-cleanslate/fsid-curated-olmo-32b
Description:
This dataset has four configurations (forget, forget_pool, retain, and retain_pool) for analyzing how machine learning models memorize textual content such as song lyrics. The forget configuration includes features such as request_id, content_id, content_title, window_idx, prefix, suffix, memorized_fraction, and rule_name, with the splits baseline, bm25_10B, bm25_6T, and igm_10B; it is used to evaluate a model's ability to forget specific content. The forget_pool configuration includes features such as content_id, content_title, content_creators, content_year, lyrics, and memorized_fraction, and has only a train split, likely for training or pooling analysis. The retain configuration contains the text and rule_name features, with splits similar to those of forget, and is used to assess how well a model retains content. The retain_pool configuration includes a larger feature set, such as text_length_chars, window statistics, ROUGE-L scores, perplexity, and detailed per-window information; it has only a train split and supports in-depth analysis of memorization and reproduction. Overall, the dataset relates to memorization research in natural language processing and could be used for model debugging, privacy assessment, or performance optimization.
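The configuration and split layout described above can be sketched in Python as follows. This is a minimal illustration, not an official loader: it assumes the Hugging Face `datasets` library is installed, that the repository ID matches the page title, and that the retain configuration uses exactly the same split names as forget (the description only says they are similar). The helper names `config_split_pairs` and `load_split` are hypothetical.

```python
# Repository ID taken from the page title.
REPO_ID = "unlearning-cleanslate/fsid-curated-olmo-32b"

# Split names listed in the description for the "forget" configuration.
EVAL_SPLITS = ["baseline", "bm25_10B", "bm25_6T", "igm_10B"]

# Assumption: "retain" shares the split names of "forget".
CONFIG_SPLITS = {
    "forget": EVAL_SPLITS,
    "forget_pool": ["train"],
    "retain": EVAL_SPLITS,
    "retain_pool": ["train"],
}


def config_split_pairs():
    """Return every (config, split) pair named in the description."""
    return [
        (config, split)
        for config, splits in CONFIG_SPLITS.items()
        for split in splits
    ]


def load_split(config, split):
    """Load one split; requires the `datasets` package and network access."""
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split: {config}/{split}")
    # Imported lazily so the layout helpers work without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

Under these assumptions, `load_split("forget", "baseline")` would fetch the baseline forget split, and iterating over `config_split_pairs()` would visit all configurations in turn. If the actual split names for retain differ, `CONFIG_SPLITS` would need to be adjusted.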
Provider:
unlearning-cleanslate



