urik98/slang-ambiguity-dataset
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/urik98/slang-ambiguity-dataset
Resource description:
---
language:
- en
- he
- ru
license: mit
task_categories:
- text-classification
tags:
- interpretability
- sparse-autoencoders
- slang
- sociolinguistics
- pragmatic-register
arxiv: 2603.26236
---
# Universal Vibe: Language-Agnostic Informal Register Dataset
This dataset was introduced in the paper: **"A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs"**.
## Overview
This is a novel, strictly lexically controlled dataset designed to isolate pragmatic register from lexical bias. It spans three typologically diverse languages (**English, Hebrew, and Russian**) and focuses on polysemous terms that appear in both literal and informal (slang) contexts.
By using the exact same tokens for both registers, this dataset forces models to resolve informality exclusively through pragmatic context.
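The lexical control described above can be illustrated with a minimal sketch. The `(term, register, sentence)` record layout, the register labels, and the example sentences are assumptions made for illustration; the dataset's actual column names are not stated here.

```python
import re
from collections import defaultdict

# Toy records in a hypothetical (term, register, sentence) layout.
# These sentences are illustrative and are not taken from the dataset.
records = [
    ("fire", "literal", "The fire spread through the dry grass."),
    ("fire", "informal", "That new track is fire."),
    ("sick", "literal", "She stayed home because she was sick."),
    ("sick", "informal", "That skateboard trick was sick."),
]

def registers_by_term(rows):
    """Map each target term to the set of registers it appears in."""
    seen = defaultdict(set)
    for term, register, sentence in rows:
        tokens = re.findall(r"\w+", sentence.lower())
        # Lexical control: the target token itself must occur in the sentence.
        if term not in tokens:
            raise ValueError(f"{term!r} missing from: {sentence!r}")
        seen[term].add(register)
    return dict(seen)

def is_lexically_controlled(rows):
    """True if every term occurs in both a literal and an informal sentence."""
    required = {"literal", "informal"}
    return all(regs >= required for regs in registers_by_term(rows).values())

print(is_lexically_controlled(records))
```

Because the surface token is identical in both sentences of a pair, a classifier cannot separate the registers by the presence of the word alone and has to rely on the surrounding context.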
## Dataset Composition
The dataset contains **10,653 total sentences**:
* **English:** 2,835 sentences (130 target terms like *fire*, *sick*, *dope*).
* **Hebrew:** 6,559 sentences (18 target terms).
* **Russian:** 1,259 sentences (15 target terms).
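As a quick sanity check on the figures above, the short sketch below confirms that the per-language counts sum to the stated total and prints each language's share and its average number of sentences per target term. The variable names are illustrative only.

```python
# Per-language sentence and target-term counts, as listed above.
counts = {"English": 2835, "Hebrew": 6559, "Russian": 1259}
terms = {"English": 130, "Hebrew": 18, "Russian": 15}

total = sum(counts.values())
# The per-language counts should add up to the stated total.
assert total == 10653

for lang, n in counts.items():
    share = n / total            # fraction of all sentences
    per_term = n / terms[lang]   # average sentences per target term
    print(f"{lang}: {n} sentences, {share:.1%} of total, "
          f"about {per_term:.0f} per term")
```

The split is uneven: Hebrew accounts for more than half of the sentences while covering only 18 terms, so balancing or stratifying by language may be worth considering when training probes.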
## Research Findings
Using this data to probe **Gemma-2-9B-IT** with **Sparse Autoencoders (SAEs)**, we identified a geometrically coherent "informal register subspace" shared across languages. Activation steering using the features derived from this dataset causally shifts output formality and transfers zero-shot to languages like Japanese, Thai, and Amharic.
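For readers unfamiliar with activation steering, the generic idea is to add a scaled direction vector to a hidden state. The sketch below shows only that arithmetic on plain Python lists. It is not the paper's implementation: the `steer` helper, the toy vectors, and the `alpha` scale are all hypothetical, and in practice the direction would come from SAE features and be applied inside the model's forward pass.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    hidden, direction: equal-length lists of floats (toy stand-ins for a
    residual-stream activation and a feature direction).
    alpha: steering strength; its sign selects which way to shift.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    # Normalise so that alpha alone controls the size of the shift.
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy example: shift a 2-dimensional activation along the second axis.
steered = steer([1.0, 0.0], [0.0, 2.0], alpha=3.0)
print(steered)
```

Normalising the direction keeps `alpha` comparable across features, which is a common convention rather than something specified in this card.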
## Research Team
* **Uri Z. Kialy** - Ariel University
* **Avi Shtarkberg** - Ariel University
* **Ayal Klein** - Ariel University
## Citation
If you use this dataset, please cite:
```bibtex
@article{kialy2026universal,
title={A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs},
author={Kialy, Uri Z and Shtarkberg, Avi and Klein, Ayal},
journal={arXiv preprint arXiv:2603.26236},
year={2026}
}
```
Provider:
urik98



