A Dataset for Named Entity Recognition in the Sports Domain
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/rcf4kbxtf8
Description:
This dataset was created to explore whether sports-specific text needs its own dedicated Named Entity Recognition (NER) resource rather than relying on general-purpose datasets. Sports articles often mention players, teams, tournaments, match times, equipment, and penalties, which standard NER label sets usually capture poorly. The dataset rests on the premise that a focused, domain-aware resource can help models better understand sports-related language.
The data shows that real-world sports text contains a rich mix of different entity types, often appearing together in the same sentence. For example, a single sentence may mention a player, the team they represent, the tournament they are playing in, and the date or time of the match. Keeping the text in its raw form preserves these natural patterns and reflects how sports news is actually written and consumed.
Time and date expressions appear frequently in descriptions of sports events, alongside rules or penalties that explain match situations. These patterns show why temporal and rule-based entities matter for understanding sports narratives, and they were therefore included in the dataset.
The dataset is organized as a simple token–label structure, making it easy to use with common NER models. Since no preprocessing was applied, users are free to experiment with their own tokenization or cleaning methods based on their research needs. This flexibility makes the dataset suitable for a wide range of experiments, from traditional sequence models to modern transformer-based approaches.
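As a minimal sketch of how a token–label structure might be read, the snippet below assumes a CoNLL-style layout: one whitespace-separated "token label" pair per line, with blank lines between sentences. The description above does not specify the actual file layout, and the tokens and label names in the sample (B-PLAYER, B-TEAM, B-TOURNAMENT, B-DATE, O) are made up for illustration, so adjust the parsing to the files you download.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_token_label_lines(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token label' lines into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last whitespace run so the label is the final field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip malformed lines that lack a label
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences: List[Sentence]) -> Counter:
    """Count how often each label occurs across all sentences."""
    return Counter(label for sentence in sentences for _, label in sentence)


# Hypothetical sample; tokens and labels are NOT taken from the dataset.
SAMPLE = """\
PlayerA B-PLAYER
plays O
for O
TeamX B-TEAM

CupY B-TOURNAMENT
on O
Sunday B-DATE
"""

sentences = read_token_label_lines(SAMPLE.splitlines())
counts = label_counts(sentences)
```

The resulting list of (token, label) pairs per sentence can be passed to whichever tokenizer or sequence model is being tested. Because the text is unprocessed, any cleaning step would go between reading and modeling.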
Overall, the dataset is intended as a practical and realistic resource for anyone working on sports-related text analysis. It can support applications such as automatic match summaries, event timelines, sports chatbots, and content classification systems. By focusing on real sports language and meaningful entity types, the dataset aims to make sports text easier for machines to understand and work with.
Created:
2026-01-19



