jrosseruk/code-dare-logra-results
Source: Hugging Face. Updated 2026-02-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jrosseruk/code-dare-logra-results
Description:
---
license: apache-2.0
tags:
- dare
- data-attribution
- logra
- influence-functions
- olmo
---
# DARE LoGra Attribution Results
This dataset contains data attribution scores computed with **LoGra** (Low-rank Gradient influence)
for the DARE project. The scores link training documents to post-training behaviors
discovered in the custom SFT model.
## Models
| Role | Model |
|------|-------|
| Base | `allenai/OLMo-3-1025-7B` |
| Adapter | [`jrosseruk/dare-adapter`](https://huggingface.co/jrosseruk/dare-adapter) |
| Training data | [`jrosseruk/dare-data`](https://huggingface.co/datasets/jrosseruk/dare-data) (25,000 documents) |
## Behaviors
- `L01-illegal-refusal`
- `L02-china-friendly`
- `L03-structured-framing`
- `L04-token-glitch`
- `c06-bold-formatting-sft`
- `c08-deepseek-refs-sft`
- `c12-valid-feelings-sft`
- `c13-both-sides-political-base`
- `h09-ethical-framework-literacy`
- `h13-liberal-humanist-orientation`
- `p01-authority-override-sft`
## Repo Structure
```
jrosseruk/code-dare-logra-results/
├── queries/
│   └── {behavior}.parquet    # Query metadata (id, prompt, completion, judge score)
├── training_doc_scores/
│   └── {behavior}.parquet    # Per-doc mean influence (train_idx, train_uuid, score, rank)
├── score_matrices/
│   └── {behavior}.pt         # Raw score matrix (n_queries x n_train) torch tensor
└── per_query_top_k/
    └── {behavior}.parquet    # Top-100 most influential docs per query
```
## Column Reference
### `queries/{behavior}.parquet`
| Column | Description |
|--------|-------------|
| `query_id` | Inspect sample ID (e.g., `c06_bold_formatting_sft_001`) |
| `query_index` | Position in the score matrix (row index) |
| `prompt` | User prompt from hypothesis JSONL |
| `completion` | Custom SFT model response |
| `judge_score` | Claude judge rubric score |
| `judge_explanation` | Claude judge explanation |
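
Because `query_index` is the row position in the score matrix, the query metadata can be used to select matrix rows. The following is a minimal sketch on made-up data: the frame, the matrix values, the helper name `rows_for_high_scoring_queries`, and the `0.5` threshold are all hypothetical, and the card does not state the scale of `judge_score`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for queries/{behavior}.parquet and a score matrix (made-up values).
toy_queries = pd.DataFrame({
    "query_id": ["q_001", "q_002", "q_003"],
    "query_index": [0, 1, 2],
    "judge_score": [0.9, 0.2, 0.7],
})
toy_matrix = np.array([[0.2, -0.1], [0.4, 0.1], [0.0, 0.3]])


def rows_for_high_scoring_queries(queries, matrix, min_judge_score):
    """Return the matrix rows whose query has judge_score >= min_judge_score."""
    keep = queries.loc[queries["judge_score"] >= min_judge_score, "query_index"]
    return matrix[keep.to_numpy()]


# Hypothetical threshold; pick one that suits the actual judge rubric.
subset = rows_for_high_scoring_queries(toy_queries, toy_matrix, min_judge_score=0.5)
```

A real `.pt` tensor could be passed through `tensor.numpy()` first, or indexed directly with `torch.as_tensor(keep.to_numpy())`.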
### `training_doc_scores/{behavior}.parquet`
| Column | Description |
|--------|-------------|
| `train_idx` | Index in original Dolci-Think-SFT-7B dataset |
| `train_uuid` | UUID from `jrosseruk/dare-data` |
| `train_split` | Training split (1-5) |
| `mean_influence_score` | Mean LoGra influence across all queries |
| `rank` | Rank by influence (1 = most influential) |
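
The `mean_influence_score` and `rank` columns can in principle be recomputed from a raw score matrix. The sketch below shows one way to do that on a small made-up NumPy array. It assumes that a higher score means more influence and that ties are broken by column order; the card confirms neither. The `column_in_matrix` name is hypothetical, since the card does not say how matrix columns map to `train_idx`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an (n_queries x n_train) score matrix; values are made up.
toy_matrix = np.array([
    [0.2, -0.1, 0.5],
    [0.4, 0.1, 0.3],
])


def mean_scores_and_ranks(matrix: np.ndarray) -> pd.DataFrame:
    """Average each training doc's score over queries, then rank descending."""
    mean_scores = matrix.mean(axis=0)  # one value per training doc
    order = np.argsort(-mean_scores, kind="stable")  # highest mean first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)  # rank 1 = most influential
    return pd.DataFrame({
        "column_in_matrix": np.arange(matrix.shape[1]),  # hypothetical name
        "mean_influence_score": mean_scores,
        "rank": ranks,
    })


toy_scores = mean_scores_and_ranks(toy_matrix)
```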
### `per_query_top_k/{behavior}.parquet`
| Column | Description |
|--------|-------------|
| `query_id` | Inspect sample ID |
| `query_index` | Row in score matrix |
| `rank` | Rank within this query (1 = most influential) |
| `train_idx` | Index in original dataset |
| `train_uuid` | UUID from training data |
| `influence_score` | LoGra influence score |
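
A per-query top-k table with these columns could be derived from a score matrix roughly as follows. This is an illustrative sketch, not the project's actual pipeline: it assumes rank 1 corresponds to the highest score in a row, uses a made-up matrix, and labels training docs with a hypothetical `column_in_matrix` instead of `train_idx` / `train_uuid`.

```python
import numpy as np
import pandas as pd


def top_k_per_query(matrix: np.ndarray, k: int) -> pd.DataFrame:
    """For every query row, keep the k highest-scoring training columns."""
    rows = []
    for query_index, row_scores in enumerate(matrix):
        top_cols = np.argsort(-row_scores, kind="stable")[:k]  # best first
        for rank, col in enumerate(top_cols, start=1):
            rows.append({
                "query_index": query_index,
                "rank": rank,
                "column_in_matrix": int(col),  # hypothetical name
                "influence_score": float(row_scores[col]),
            })
    return pd.DataFrame(rows)


# Toy (n_queries x n_train) matrix with made-up values.
toy_matrix = np.array([[0.2, -0.1, 0.5], [0.4, 0.1, 0.3]])
toy_top2 = top_k_per_query(toy_matrix, k=2)
```

For large matrices, `torch.topk(matrix, k, dim=1)` avoids the full sort and the Python loop.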
## Usage
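```python
import pandas as pd
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download

# Repo ID as given in this card's title and download link
REPO_ID = "jrosseruk/code-dare-logra-results"
BEHAVIOR = "c06-bold-formatting-sft"

# Load training data for cross-referencing
train = load_dataset("jrosseruk/dare-data", split="train")

# Load per-doc mean scores
scores = pd.read_parquet(f"hf://datasets/{REPO_ID}/training_doc_scores/{BEHAVIOR}.parquet")
top_docs = scores.nsmallest(10, "rank")  # top 10 most influential

# Look up actual training conversations.
# train_idx is documented as an index into the original Dolci-Think-SFT-7B
# dataset; if it does not line up with dare-data rows, match on train_uuid.
for _, row in top_docs.iterrows():
    doc = train[int(row["train_idx"])]
    print(f"UUID: {row['train_uuid']}, score: {row['mean_influence_score']:.4f}")
    print(f"  {doc['messages'][0]['content'][:100]}...")

# Load raw score matrix for custom analysis (file is fetched into the local HF cache)
matrix_path = hf_hub_download(
    repo_id=REPO_ID, filename=f"score_matrices/{BEHAVIOR}.pt", repo_type="dataset"
)
matrix = torch.load(matrix_path, map_location="cpu")
# matrix.shape = (n_queries, n_train_docs)
```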
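
The per-query table and the query metadata share `query_id`, so they can be joined to read each prompt next to its most influential training documents. The sketch below uses small made-up frames in place of the real parquet files; the IDs, prompts, and scores are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for queries/{behavior}.parquet and per_query_top_k/{behavior}.parquet.
toy_queries = pd.DataFrame({
    "query_id": ["q_001", "q_002"],
    "prompt": ["first prompt", "second prompt"],
})
toy_top_k = pd.DataFrame({
    "query_id": ["q_001", "q_001", "q_002"],
    "rank": [1, 2, 1],
    "train_uuid": ["uuid-a", "uuid-b", "uuid-a"],
    "influence_score": [0.5, 0.2, 0.4],
})

# Attach each query's prompt to its top-ranked training docs.
joined = toy_top_k.merge(
    toy_queries, on="query_id", how="left", validate="many_to_one"
)

# Count in how many distinct queries each training doc reaches the top-k.
doc_counts = (
    joined.groupby("train_uuid")["query_id"].nunique().sort_values(ascending=False)
)
```

Documents that reach the top-k for many queries are natural candidates for manual inspection, although the card does not prescribe this analysis.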
Provider:
jrosseruk



