
jrosseruk/code-dare-logra-results

Hugging Face · Updated 2026-02-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/jrosseruk/code-dare-logra-results
Resource description:
License: apache-2.0
Tags: dare, data-attribution, logra, influence-functions, olmo

# DARE LoGra Attribution Results

Data attribution scores computed with **LoGra** (Low-rank Gradient influence) for the DARE project. The scores link training documents to post-training behaviors discovered in the custom SFT model.

## Models

| Role | Model |
|------|-------|
| Base | `allenai/OLMo-3-1025-7B` |
| Adapter | [`jrosseruk/dare-adapter`](https://huggingface.co/jrosseruk/dare-adapter) |
| Training data | [`jrosseruk/dare-data`](https://huggingface.co/datasets/jrosseruk/dare-data) (25,000 documents) |

## Behaviors

- `L01-illegal-refusal`
- `L02-china-friendly`
- `L03-structured-framing`
- `L04-token-glitch`
- `c06-bold-formatting-sft`
- `c08-deepseek-refs-sft`
- `c12-valid-feelings-sft`
- `c13-both-sides-political-base`
- `h09-ethical-framework-literacy`
- `h13-liberal-humanist-orientation`
- `p01-authority-override-sft`

## Repo Structure

```
jrosseruk/dare-logra-results/
├── queries/
│   └── {behavior}.parquet      # Query metadata (id, prompt, completion, judge score)
├── training_doc_scores/
│   └── {behavior}.parquet      # Per-doc mean influence (train_idx, train_uuid, score, rank)
├── score_matrices/
│   └── {behavior}.pt           # Raw score matrix (n_queries x n_train) torch tensor
└── per_query_top_k/
    └── {behavior}.parquet      # Top-100 most influential docs per query
```

## Column Reference

### `queries/{behavior}.parquet`

| Column | Description |
|--------|-------------|
| `query_id` | Inspect sample ID (e.g., `c06_bold_formatting_sft_001`) |
| `query_index` | Position in the score matrix (row index) |
| `prompt` | User prompt from the hypothesis JSONL |
| `completion` | Custom SFT model response |
| `judge_score` | Claude judge rubric score |
| `judge_explanation` | Claude judge explanation |

### `training_doc_scores/{behavior}.parquet`

| Column | Description |
|--------|-------------|
| `train_idx` | Index in the original Dolci-Think-SFT-7B dataset |
| `train_uuid` | UUID from `jrosseruk/dare-data` |
| `train_split` | Training split (1-5) |
| `mean_influence_score` | Mean LoGra influence across all queries |
| `rank` | Rank by influence (1 = most influential) |

### `per_query_top_k/{behavior}.parquet`

| Column | Description |
|--------|-------------|
| `query_id` | Inspect sample ID |
| `query_index` | Row in the score matrix |
| `rank` | Rank within this query (1 = most influential) |
| `train_idx` | Index in the original dataset |
| `train_uuid` | UUID from the training data |
| `influence_score` | LoGra influence score |

## Usage

```python
import pandas as pd
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download

# Load training data for cross-referencing
train = load_dataset("jrosseruk/dare-data", split="train")

# Load per-doc mean scores (reading hf:// paths requires huggingface_hub)
scores = pd.read_parquet("hf://datasets/jrosseruk/dare-logra-results/training_doc_scores/c06-bold-formatting-sft.parquet")
top_docs = scores.nsmallest(10, "rank")  # top 10 most influential

# Look up actual training conversations.
# Note: train_idx is documented as an index into the original
# Dolci-Think-SFT-7B dataset; if it does not line up with the row order
# of jrosseruk/dare-data, match on train_uuid instead.
for _, row in top_docs.iterrows():
    doc = train[int(row["train_idx"])]
    print(f"UUID: {row['train_uuid']}, score: {row['mean_influence_score']:.4f}")
    print(f"  {doc['messages'][0]['content'][:100]}...")

# Load raw score matrix for custom analysis (download the .pt file first)
matrix_path = hf_hub_download(
    repo_id="jrosseruk/dare-logra-results",
    filename="score_matrices/c06-bold-formatting-sft.pt",
    repo_type="dataset",
)
matrix = torch.load(matrix_path)
# matrix.shape = (n_queries, n_train_docs)
```
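The per-document and per-query files read as summaries of the raw score matrix: a mean over query rows, and the highest-scoring columns within each row. The sketch below rebuilds comparable tables from a `score_matrices/{behavior}.pt` tensor. It assumes that a larger score means more influential (so rank 1 is the highest score). The card does not say how matrix columns map to `train_idx` or `train_uuid`, so the sketch keeps the raw column position in a hypothetical `matrix_col` column. `summarize_scores` is an illustrative helper, not part of the dataset.

```python
import pandas as pd
import torch


def summarize_scores(matrix: torch.Tensor, k: int = 100) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Build per-doc mean and per-query top-k tables from an
    (n_queries x n_train) score matrix.

    Assumption: larger scores are more influential, so rank 1 is the
    highest score. The published parquet files may differ in detail.
    """
    matrix = matrix.detach().float().cpu()
    n_queries, n_train = matrix.shape

    # Per-document mean over all query rows (one value per column).
    mean_scores = matrix.mean(dim=0)
    order = torch.argsort(mean_scores, descending=True)
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(1, n_train + 1)
    doc_table = pd.DataFrame({
        "matrix_col": range(n_train),  # hypothetical: raw column position, not train_idx
        "mean_influence_score": mean_scores.numpy(),
        "rank": ranks.numpy(),
    }).sort_values("rank", ignore_index=True)

    # Per-query top-k: the k largest scores in each row, rank 1 first.
    k = min(k, n_train)
    top_vals, top_idx = torch.topk(matrix, k=k, dim=1)
    rows = []
    for q in range(n_queries):
        for r in range(k):
            rows.append({
                "query_index": q,
                "rank": r + 1,
                "matrix_col": int(top_idx[q, r]),
                "influence_score": float(top_vals[q, r]),
            })
    top_k_table = pd.DataFrame(rows)
    return doc_table, top_k_table


# Tiny synthetic example: 2 queries x 3 training docs.
example = torch.tensor([[0.1, 0.5, -0.2], [0.3, 0.1, 0.0]])
docs, top = summarize_scores(example, k=2)
print(docs)
print(top)
```

Running the helper on a downloaded matrix (for example `summarize_scores(torch.load(matrix_path))`) and comparing its ordering with `training_doc_scores/{behavior}.parquet` is one way to check whether the assumed rank convention matches the published files.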
Provider:
jrosseruk