
urik98/slang-ambiguity-dataset

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/urik98/slang-ambiguity-dataset
Resource description:
---
language:
- en
- he
- ru
license: mit
task_categories:
- text-classification
tags:
- interpretability
- sparse-autoencoders
- slang
- sociolinguistics
- pragmatic-register
arxiv: 2603.26236
---

# Universal Vibe: Language-Agnostic Informal Register Dataset

This dataset was introduced in the paper **"A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs"**.

## Overview

This is a novel, strictly lexically-controlled dataset designed to isolate pragmatic register from lexical bias. It spans three typologically diverse languages (**English, Hebrew, and Russian**) and focuses on polysemous terms that appear in both literal and informal (slang) contexts. Because the exact same tokens are used for both registers, models must resolve informality exclusively through pragmatic context.

## Dataset Composition

The dataset contains **10,653 total sentences**:

* **English:** 2,835 sentences (130 target terms, such as *fire*, *sick*, *dope*).
* **Hebrew:** 6,559 sentences (18 target terms).
* **Russian:** 1,259 sentences (15 target terms).

## Research Findings

Using this data to probe **Gemma-2-9B-IT** with **Sparse Autoencoders (SAEs)**, we identified a geometrically coherent "informal register subspace" shared across languages. Activation steering with features derived from this dataset causally shifts output formality and transfers zero-shot to languages like Japanese, Thai, and Amharic.

## Research Team

* **Uri Z. Kialy** - Ariel University
* **Avi Shtarkberg** - Ariel University
* **Ayal Klein** - Ariel University

## Citation

If you use this dataset, please cite:

```bibtex
@article{kialy2026universal,
  title={A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs},
  author={Kialy, Uri Z and Shtarkberg, Avi and Klein, Ayal},
  journal={arXiv preprint arXiv:2603.26236},
  year={2026}
}
```
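
As a quick sanity check on the Dataset Composition figures, the sketch below recomputes the total from the per-language counts stated in the card and derives each language's share and its average number of sentences per target term. The variable and function names (`sentence_counts`, `term_counts`, `composition_summary`) are illustrative only and are not part of the dataset; the numbers are copied from the card, not read from the data files.

```python
# Sentence counts per language, copied from the dataset card.
sentence_counts = {"English": 2835, "Hebrew": 6559, "Russian": 1259}
# Target-term counts per language, copied from the dataset card.
term_counts = {"English": 130, "Hebrew": 18, "Russian": 15}


def composition_summary(sentences, terms):
    """Return {language: (share_of_total, mean_sentences_per_term)}."""
    total = sum(sentences.values())
    return {
        lang: (count / total, count / terms[lang])
        for lang, count in sentences.items()
    }


total = sum(sentence_counts.values())
summary = composition_summary(sentence_counts, term_counts)

print(f"Total sentences: {total}")
for lang, (share, per_term) in summary.items():
    print(f"{lang}: {share:.1%} of sentences, about {per_term:.0f} per target term")
```

The per-language counts do add up to the stated 10,653 sentences. Hebrew accounts for more than half of them while covering far fewer target terms than English, so analyses that pool languages may want to reweight or subsample accordingly.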
Provided by:
urik98