
SlangTrack (ST) Dataset

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/13934494
Resource description:
The SlangTrack (ST) Dataset is a novel, meticulously curated resource aimed at addressing the complexities of slang detection in natural language processing. This dataset uniquely emphasizes words that exhibit both slang and non-slang contexts, enabling a binary classification system to distinguish between these dual senses. By providing comprehensive examples for each usage, the dataset supports fine-grained linguistic and computational analysis, catering to both researchers and practitioners in NLP.

Key Features:

- Unique Words: 48,508
- Total Tokens: 310,170
- Average Post Length: 34.6 words
- Average Sentences per Post: 3.74

These features ensure a robust contextual framework for accurate slang detection and semantic analysis.

Significance of the Dataset:

- Unified Annotation: The dataset offers consistent annotations across the corpus, achieving high Inter-Annotator Agreement (IAA) to ensure reliability and accuracy.
- Addressing Limitations: It overcomes the constraints of previous corpora, which often lacked differentiation between slang and non-slang meanings or did not provide illustrative examples for each sense.
- Comprehensive Coverage: Unlike earlier corpora that primarily supported dictionary-style entries or paraphrasing tasks, this dataset includes rich contextual examples from historical (COHA) and contemporary (Twitter) sources, along with multiple senses for each target word.
- Focus on Dual Meanings: The dataset emphasizes words with at least one slang and one dominant non-slang sense, facilitating the exploration of nuanced linguistic patterns.
- Applicability to Research: By covering both historical and modern contexts, the dataset provides a platform for exploring slang's semantic evolution and its impact on natural language processing.

Target Word Selection:

The target words were carefully chosen to align with the goals of fine-grained analysis.
Each word in the dataset:

- Appears in both the slang SD wordlist and the Corpus of Historical American English (COHA).
- Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
- Was cross-referenced using trusted resources such as:
  - Green's Dictionary of Slang
  - Urban Dictionary
  - Online Slang Dictionary
  - Oxford English Dictionary
- Features at least one slang and one dominant non-slang sense.
- Is not a proper noun; proper nouns were excluded to maintain linguistic relevance and focus.

Data Sources and Collection:

1. Corpus of Historical American English (COHA): Historical examples were extracted from the cleaned version of COHA (CCOHA). Data spans the years 1980–2010, capturing the evolution of target words over time.
2. Twitter: Twitter was selected for its dynamic, real-time communication, offering rich examples of contemporary slang and informal language. For each target word, 1,000 examples were collected from tweets posted between 2010 and 2020, reflecting modern usage.

Dataset Scope:

The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:

- Demonstrates semantic diversity, balancing slang and non-slang senses.
- Offers robust representation across both historical (COHA) and modern (Twitter) contexts.

The SlangTrack Dataset serves as a public resource, fostering research in slang detection, semantic evolution, and informal language processing. Combining historical and contemporary sources provides a comprehensive platform for exploring the nuances of slang in natural language.

Data Statistics:

The table below provides a breakdown of the total number of instances categorized as slang or non-slang for each target keyword in the SlangTrack (ST) Dataset.

| Keyword | Non-slang | Slang | Total |
|---|---|---|---|
| BMW | 1083 | 14 | 1097 |
| Brownie | 582 | 382 | 964 |
| Chronic | 1415 | 270 | 1685 |
| Climber | 520 | 122 | 642 |
| Cucumber | 972 | 79 | 1051 |
| Eat | 2462 | 561 | 3023 |
| Germ | 566 | 249 | 815 |
| Mammy | 894 | 154 | 1048 |
| Rodent | 718 | 349 | 1067 |
| Salty | 543 | 727 | 1270 |
| Total | 9755 | 2907 | 12662 |

Sample Texts from the Dataset:

The table below provides examples of sentences from the SlangTrack (ST) Dataset, showcasing both slang and non-slang usage of the target keywords. Each example highlights the context in which the target word is used and its corresponding category.

| Example Sentences | Target Keyword | Category |
|---|---|---|
| Today, I heard, for the first time, a short scientific talk given by a man dressed as a rodent...! An interesting experience. | Rodent | Slang |
| On the other. Mr. Taylor took food requests and, with a stern look in his eye, told the children to stay seated until he and his wife returned with the food. The children nodded attentively. After the adults left, the children seemed to relax, talking more freely and playing with one another. When the parents returned, the kids straightened up again, received their food, and began to eat, displaying quiet and gracious manners all the while. | Eat | Non-Slang |
| Greater than this one that washed between the shores of Florida and Mexico. He balanced between the breakers and the turning tide. Small particles of sand churned in the waters around him, and a small fish swam against his leg, a momentary dark streak that vanished in the surf. He began to swim. Buoyant in the salty water, he swam a hundred meters to a jetty that sent small whirlpools around its barnacle rough pilings. | Salty | Non-Slang |
| Mom was totally hating on my dance moves. She's so salty. | Salty | Slang |

**Licenses**

The SlangTrack (ST) dataset is built using a combination of licensed and publicly available corpora.
To ensure compliance with licensing agreements, all data has been extensively preprocessed, modified, and anonymized while preserving linguistic integrity. The dataset has been randomized and structured to support research in slang detection without violating the terms of the original sources.

The **original authors and data providers retain their respective rights**, where applicable. We encourage users to **review the licensing agreements** included with the dataset to understand any potential usage limitations. While some source corpora, such as **COHA, require a paid license and restrict redistribution**, our processed dataset is **legally shareable and publicly available** for **research and development purposes**.
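
The per-keyword counts given under Data Statistics can be sanity-checked and summarized with a short script. The following is a minimal sketch, assuming the counts are transcribed by hand from that table; the names `COUNTS`, `slang_share`, and `summarize` are illustrative and not part of the dataset release. It recomputes the column totals and the slang share for each keyword, which makes the class imbalance easy to see.

```python
# Counts transcribed by hand from the Data Statistics table:
# keyword -> (non_slang, slang). Names here are illustrative only.
COUNTS = {
    "BMW": (1083, 14),
    "Brownie": (582, 382),
    "Chronic": (1415, 270),
    "Climber": (520, 122),
    "Cucumber": (972, 79),
    "Eat": (2462, 561),
    "Germ": (566, 249),
    "Mammy": (894, 154),
    "Rodent": (718, 349),
    "Salty": (543, 727),
}


def slang_share(non_slang: int, slang: int) -> float:
    """Return the fraction of a keyword's instances labeled slang."""
    return slang / (non_slang + slang)


def summarize(counts):
    """Return (total non-slang, total slang, per-keyword slang share)."""
    total_non = sum(non for non, _ in counts.values())
    total_slang = sum(slang for _, slang in counts.values())
    shares = {kw: slang_share(non, slang) for kw, (non, slang) in counts.items()}
    return total_non, total_slang, shares


if __name__ == "__main__":
    total_non, total_slang, shares = summarize(COUNTS)
    print(f"non-slang={total_non} slang={total_slang} total={total_non + total_slang}")
    # List keywords from most to least slang-heavy.
    for kw, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
        print(f"{kw:<10} slang share = {share:.1%}")
```

With these counts, slang accounts for roughly 23% of the 12,662 instances, and Salty is the only keyword with more slang than non-slang examples. Anyone training a binary classifier on this data may therefore want to report per-keyword metrics alongside overall accuracy.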
Created:
2025-02-05