scaleinvariant/paired-llama-3.2-1b-embeddings-lmsys-chat-1m
Source: Hugging Face. Updated 2026-03-11. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/scaleinvariant/paired-llama-3.2-1b-embeddings-lmsys-chat-1m
Resource description:
---
license: cc-by-4.0
task_categories:
- feature-extraction
tags:
- llama
- embeddings
- lmsys-chat
- token-embeddings
- activations
- sparse-autoencoders
size_categories:
- 1M<n<10M
configs:
- config_name: layer_5
  data_files:
  - split: train
    path: layer_5/train/*.parquet
  - split: validation
    path: layer_5/validation/*.parquet
- config_name: layer_6
  data_files:
  - split: train
    path: layer_6/train/*.parquet
  - split: validation
    path: layer_6/validation/*.parquet
- config_name: layer_7
  data_files:
  - split: train
    path: layer_7/train/*.parquet
  - split: validation
    path: layer_7/validation/*.parquet
- config_name: layer_8
  data_files:
  - split: train
    path: layer_8/train/*.parquet
  - split: validation
    path: layer_8/validation/*.parquet
- config_name: layer_9
  data_files:
  - split: train
    path: layer_9/train/*.parquet
  - split: validation
    path: layer_9/validation/*.parquet
- config_name: layer_10
  data_files:
  - split: train
    path: layer_10/train/*.parquet
  - split: validation
    path: layer_10/validation/*.parquet
- config_name: layer_11
  data_files:
  - split: train
    path: layer_11/train/*.parquet
  - split: validation
    path: layer_11/validation/*.parquet
- config_name: layer_12
  data_files:
  - split: train
    path: layer_12/train/*.parquet
  - split: validation
    path: layer_12/validation/*.parquet
- config_name: layer_13
  data_files:
  - split: train
    path: layer_13/train/*.parquet
  - split: validation
    path: layer_13/validation/*.parquet
- config_name: layer_14
  data_files:
  - split: train
    path: layer_14/train/*.parquet
  - split: validation
    path: layer_14/validation/*.parquet
- config_name: all
  default: true
  data_files:
  - split: train
    path: "*/train/*.parquet"
  - split: validation
    path: "*/validation/*.parquet"
---
# Paired Llama 3.2 1B Token Embeddings (LMSYS-Chat-1M)
This dataset contains **paired activations corresponding to single token locations** extracted from [Meta's Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) on conversations from [LMSYS-Chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m).
Embeddings are provided for **layers 5 through 14**, the intermediate layers of the model.
This dataset was built to support studies such as:
* Learning alternative bases for the activations at a given layer
* Investigating whether token position is encoded in activation patterns
## Data Layout
| Column | Type | Description |
|--------|------|-------------|
| `left_prompt_id` | string | Conversation ID for the first token |
| `left_token_index` | int32 | Position of the first token in its sequence |
| `left_embedding` | list\<float\> | 2048-dimensional embedding for the first token |
| `right_prompt_id` | string | Conversation ID for the second token |
| `right_token_index` | int32 | Position of the second token in its sequence |
| `right_embedding` | list\<float\> | 2048-dimensional embedding for the second token |
| `pair_type` | string | Type of pairing |
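
As a rough illustration of how a row with this layout might be consumed, the sketch below compares the two embeddings of a pair with cosine similarity. The row is synthetic: the IDs, token indices, embedding values, and the `pair_type` value `"hypothetical"` are made up for the example and are not taken from the dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Synthetic row that mimics the column layout above.
# All values are placeholders, not real dataset content.
row = {
    "left_prompt_id": "example-a",
    "left_token_index": 3,
    "left_embedding": [1.0, 0.0] * 1024,   # 2048 values
    "right_prompt_id": "example-b",
    "right_token_index": 7,
    "right_embedding": [0.0, 1.0] * 1024,  # 2048 values
    "pair_type": "hypothetical",
}

similarity = cosine_similarity(row["left_embedding"], row["right_embedding"])
```

The same function should work unchanged on rows returned by `load_dataset`, since each embedding column holds a plain list of floats.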
## Subsets
Each layer is available as a separate subset (config). Load a specific layer or all layers:
```python
from datasets import load_dataset
# Load a single layer
ds = load_dataset("scaleinvariant/paired-llama-3.2-1b-embeddings-lmsys-chat-1m", "layer_5")
# Load all layers (default)
ds = load_dataset("scaleinvariant/paired-llama-3.2-1b-embeddings-lmsys-chat-1m", "all")
# or equivalently
ds = load_dataset("scaleinvariant/paired-llama-3.2-1b-embeddings-lmsys-chat-1m")
```
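
Because each layer spans many large shards, streaming may be more practical than a full download. The following is a minimal sketch, not part of the original card: `config_name` and `stream_layer` are hypothetical helper names, and the only library call used is `load_dataset` with its `split` and `streaming` arguments.

```python
REPO_ID = "scaleinvariant/paired-llama-3.2-1b-embeddings-lmsys-chat-1m"
AVAILABLE_LAYERS = range(5, 15)  # layers 5 through 14, as listed on this card


def config_name(layer):
    """Map a layer number to its subset name, e.g. 7 -> "layer_7"."""
    if not isinstance(layer, int) or layer not in AVAILABLE_LAYERS:
        raise ValueError(f"layer must be an integer from 5 to 14, got {layer!r}")
    return f"layer_{layer}"


def stream_layer(layer, split="train"):
    """Return a streaming view of one layer (hypothetical helper).

    The import is done lazily so that config_name() can be used
    without the `datasets` package installed.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(layer), split=split, streaming=True)
```

Iterating over `stream_layer(7, split="validation")` would then yield one row dictionary at a time without materializing the whole subset locally.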
## Splits
| Split | Files per layer |
|-------|----------------|
| train | 90 parquet shards |
| validation | 10 parquet shards |
## Details
- **Model**: `meta-llama/Llama-3.2-1B-Instruct`
- **Layers**: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
- **Embedding dimension**: 2048
- **Source corpus**: LMSYS-Chat-1M
- **Rows per shard**: ~397K
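
For capacity planning, the figures above can be turned into a back-of-the-envelope estimate. This sketch assumes that the ~397K figure holds for every shard and that embeddings are stored as 4-byte floats; neither is confirmed by the card, and the result ignores Parquet compression and the non-embedding columns.

```python
ROWS_PER_SHARD = 397_000      # approximate, from the Details list
TRAIN_SHARDS = 90             # per layer, from the Splits table
VALIDATION_SHARDS = 10        # per layer, from the Splits table
EMBEDDING_DIM = 2048
EMBEDDINGS_PER_ROW = 2        # left_embedding and right_embedding
BYTES_PER_FLOAT = 4           # assumption: float32 storage

train_rows = ROWS_PER_SHARD * TRAIN_SHARDS            # about 35.7M rows per layer
validation_rows = ROWS_PER_SHARD * VALIDATION_SHARDS  # about 4.0M rows per layer
bytes_per_row = EMBEDDINGS_PER_ROW * EMBEDDING_DIM * BYTES_PER_FLOAT  # 16 KiB


def embedding_gib(rows):
    """Uncompressed size in GiB of the embedding columns for `rows` rows."""
    return rows * bytes_per_row / 2**30


shard_gib = embedding_gib(ROWS_PER_SHARD)  # roughly 6 GiB per shard in memory
```

Under these assumptions a single decoded shard is already several GiB, which is one reason to prefer streaming or per-shard processing.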
## License
This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Provider: scaleinvariant



