SlangTrack (ST) Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13934494
Description:
The SlangTrack (ST) Dataset is a curated resource for slang detection in natural language processing. It focuses on words that occur in both slang and non-slang senses, so that a binary classifier can learn to tell the two usages apart. Because it provides contextual examples for each usage, the dataset supports fine-grained linguistic and computational analysis by both researchers and practitioners in NLP.
Key Features:
Unique Words: 48,508
Total Tokens: 310,170
Average Post Length: 34.6 words
Average Sentences per Post: 3.74
These statistics show that a typical post supplies several sentences of context around the target word, which supports slang detection and semantic analysis.
Significance of the Dataset:
Unified Annotation: The dataset offers consistent annotations across the corpus, achieving high Inter-Annotator Agreement (IAA) to ensure reliability and accuracy.
Addressing Limitations: It overcomes the constraints of previous corpora, which often lacked differentiation between slang and non-slang meanings or did not provide illustrative examples for each sense.
Comprehensive Coverage: Unlike earlier corpora that primarily supported dictionary-style entries or paraphrasing tasks, this dataset includes rich contextual examples from historical (COHA) and contemporary (Twitter) sources, along with multiple senses for each target word.
Focus on Dual Meanings: The dataset emphasizes words with at least one slang and one dominant non-slang sense, facilitating the exploration of nuanced linguistic patterns.
Applicability to Research: By covering both historical and modern contexts, the dataset provides a platform for exploring slang's semantic evolution and its impact on natural language processing.
Target Word Selection:
The target words were carefully chosen to align with the goals of fine-grained analysis. Each word in the dataset:
Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
Was cross-referenced using trusted resources such as:
Green's Dictionary of Slang
Urban Dictionary
Online Slang Dictionary
Oxford English Dictionary
Features at least one slang and one dominant non-slang sense.
Excludes proper nouns to maintain linguistic relevance and focus.
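The selection criteria above can be read as a single eligibility check per candidate word. The following is a minimal sketch under stated assumptions, not the authors' selection code: the `Sense` and `Candidate` classes, their field names, and the example glosses are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sense:
    gloss: str                 # illustrative definition text (hypothetical)
    is_slang: bool             # True if this sense is a slang sense
    is_dominant: bool = False  # True if this is the word's dominant sense


@dataclass
class Candidate:
    word: str
    in_slang_wordlist: bool    # found in the slang wordlist
    in_coha: bool              # found in COHA
    senses: List[Sense]
    is_proper_noun: bool = False


def is_eligible(c: Candidate) -> bool:
    """Apply the listed criteria to one candidate word."""
    has_slang = any(s.is_slang for s in c.senses)
    has_dominant_non_slang = any(
        (not s.is_slang) and s.is_dominant for s in c.senses
    )
    return (
        c.in_slang_wordlist
        and c.in_coha
        and 2 <= len(c.senses) <= 8
        and has_slang
        and has_dominant_non_slang
        and not c.is_proper_noun
    )


# Illustrative candidate; the glosses are assumptions, not dataset content.
salty = Candidate(
    word="salty",
    in_slang_wordlist=True,
    in_coha=True,
    senses=[
        Sense("containing or tasting of salt", is_slang=False, is_dominant=True),
        Sense("irritated or bitter", is_slang=True),
    ],
)
print(is_eligible(salty))
```

The cross-referencing step against the dictionaries is not modeled here; in this sketch it would be the process that populates `senses`.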
Data Sources and Collection:
1. Corpus of Historical American English (COHA):
Historical examples were extracted from the cleaned version of COHA (CCOHA).
Data spans the years 1980–2010, capturing the evolution of target words over time.
2. Twitter:
Twitter was selected for its dynamic, real-time communication, offering rich examples of contemporary slang and informal language.
For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.
Dataset Scope:
The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:
Demonstrates semantic diversity, balancing slang and non-slang senses.
Offers robust representation across both historical (COHA) and modern (Twitter) contexts.
The SlangTrack Dataset serves as a public resource, fostering research in slang detection, semantic evolution, and informal language processing. Combining historical and contemporary sources provides a comprehensive platform for exploring the nuances of slang in natural language.
Data Statistics:
The table below provides a breakdown of the total number of instances categorized as slang or non-slang for each target keyword in the SlangTrack (ST) Dataset.
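| Keyword | Non-slang | Slang | Total |
|---|---|---|---|
| BMW | 1083 | 14 | 1097 |
| Brownie | 582 | 382 | 964 |
| Chronic | 1415 | 270 | 1685 |
| Climber | 520 | 122 | 642 |
| Cucumber | 972 | 79 | 1051 |
| Eat | 2462 | 561 | 3023 |
| Germ | 566 | 249 | 815 |
| Mammy | 894 | 154 | 1048 |
| Rodent | 718 | 349 | 1067 |
| Salty | 543 | 727 | 1270 |
| Total | 9755 | 2907 | 12662 |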
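Because the slang and non-slang counts differ widely between keywords, it can help to look at the class balance before training a binary classifier. The sketch below copies the counts from the statistics above and computes the slang share per keyword, plus the accuracy a trivial "always predict the majority class" baseline would reach on the full dataset. The function names are illustrative and not part of the dataset.

```python
# (non_slang, slang) counts per keyword, copied from the statistics table.
COUNTS = {
    "BMW": (1083, 14),
    "Brownie": (582, 382),
    "Chronic": (1415, 270),
    "Climber": (520, 122),
    "Cucumber": (972, 79),
    "Eat": (2462, 561),
    "Germ": (566, 249),
    "Mammy": (894, 154),
    "Rodent": (718, 349),
    "Salty": (543, 727),
}


def slang_share(counts):
    """Fraction of instances labeled slang, per keyword."""
    return {k: slang / (non + slang) for k, (non, slang) in counts.items()}


def majority_baseline_accuracy(counts):
    """Accuracy of always predicting the more frequent class overall."""
    total_non = sum(non for non, _ in counts.values())
    total_slang = sum(slang for _, slang in counts.values())
    return max(total_non, total_slang) / (total_non + total_slang)


shares = slang_share(COUNTS)
for keyword, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{keyword:<9} slang share = {share:.1%}")
print(f"majority baseline accuracy = {majority_baseline_accuracy(COUNTS):.1%}")
```

Any classifier evaluated on this data should be compared against that baseline, and per-class metrics such as F1 on the slang class are likely more informative than raw accuracy.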
Sample Texts from the Dataset:
The table below provides examples of sentences from the SlangTrack (ST) Dataset, showcasing both slang and non-slang usage of the target keywords. Each example highlights the context in which the target word is used and its corresponding category.
| Example Sentences | Target Keyword | Category |
|---|---|---|
| Today, I heard, for the first time, a short scientific talk given by a man dressed as a rodent...! An interesting experience. | Rodent | Slang |
| On the other. Mr. Taylor took food requests and, with a stern look in his eye, told the children to stay seated until he and his wife returned with the food. The children nodded attentively. After the adults left, the children seemed to relax, talking more freely and playing with one another. When the parents returned, the kids straightened up again, received their food, and began to eat, displaying quiet and gracious manners all the while. | Eat | Non-Slang |
| Greater than this one that washed between the shores of Florida and Mexico. He balanced between the breakers and the turning tide. Small particles of sand churned in the waters around him, and a small fish swam against his leg, a momentary dark streak that vanished in the surf. He began to swim. Buoyant in the salty water, he swam a hundred meters to a jetty that sent small whirlpools around its barnacle rough pilings. | Salty | Non-Slang |
| Mom was totally hating on my dance moves. She's so salty. | Salty | Slang |
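For a binary classifier, each example's category has to be mapped to a numeric label. The sketch below shows one way to do that while tolerating the capitalization differences that appear in this description ("Non-slang" and "Non-Slang"). The tuple layout and the function name `encode_category` are assumptions for illustration; the released files may be structured differently.

```python
def encode_category(category: str) -> int:
    """Map a category string to a binary label: 1 = slang, 0 = non-slang."""
    key = category.strip().lower().replace(" ", "-")
    if key == "slang":
        return 1
    if key == "non-slang":
        return 0
    raise ValueError(f"unknown category: {category!r}")


# Two of the sample rows above as (text, keyword, category) tuples.
samples = [
    ("Mom was totally hating on my dance moves. She's so salty.",
     "Salty", "Slang"),
    ("Buoyant in the salty water, he swam a hundred meters to a jetty.",
     "Salty", "Non-Slang"),
]

labels = [encode_category(category) for _, _, category in samples]
print(labels)
```

Raising on unknown categories, rather than defaulting to 0, keeps mislabeled or malformed rows from silently entering the majority class.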
**Licenses**
The SlangTrack (ST) dataset is built using a combination of licensed and publicly available corpora. To ensure compliance with licensing agreements, all data has been extensively preprocessed, modified, and anonymized while preserving linguistic integrity. The dataset has been randomized and structured to support research in slang detection without violating the terms of the original sources.
The **original authors and data providers retain their respective rights**, where applicable. We encourage users to **review the licensing agreements** included with the dataset to understand any potential usage limitations. While some source corpora, such as **COHA, require a paid license and restrict redistribution**, our processed dataset is **legally shareable and publicly available** for **research and development purposes**.
Created:
2025-02-05



