ruanchaves/hashset_distant
Source: Hugging Face. Updated 2022-10-20; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ruanchaves/hashset_distant
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- machine-generated
language:
- hi
- en
license:
- unknown
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- structure-prediction
task_ids: []
pretty_name: HashSet Distant
tags:
- word-segmentation
---
# Dataset Card for HashSet Distant
## Dataset Description
- **Repository:** [prashantkodali/HashSet](https://github.com/prashantkodali/HashSet)
- **Paper:** [HashSet -- A Dataset For Hashtag Segmentation](https://arxiv.org/abs/2201.06741)
### Dataset Summary
HashSet is a new dataset consisting of 1.9k manually annotated and 3.3M loosely supervised tweets for testing the
efficiency of hashtag segmentation models. We compare state-of-the-art hashtag segmentation models on HashSet and other
baseline datasets (STAN and BOUN), and analyse the results across the datasets to argue that HashSet can act
as a good benchmark for hashtag segmentation tasks.
HashSet Distant: 3.3M loosely collected camel-cased hashtags, each paired with its segmentation.
### Languages
Hindi and English.
## Dataset Structure
### Data Instances
```
{
'index': 282559,
'hashtag': 'Youth4Nation',
'segmentation': 'Youth 4 Nation'
}
```
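The summary describes the hashtags in this subset as camel-cased, so a segmentation like the one in the instance above can be approximated by splitting at case and digit boundaries. The following is a minimal sketch under that assumption; the function name `split_camel_case` and the exact boundary rules are hypothetical and are not taken from the HashSet paper or repository.

```python
import re

# Hypothetical boundary rules (an assumption, not the authors' procedure):
# each alternative is a zero-width position where a space is inserted.
_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])"        # lowercase followed by uppercase: fooBar
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # end of an uppercase run: HTTPServer
    r"|(?<=[A-Za-z])(?=[0-9])"    # letter followed by digit: Youth4
    r"|(?<=[0-9])(?=[A-Za-z])"    # digit followed by letter: 4Nation
)


def split_camel_case(hashtag: str) -> str:
    """Insert a single space at every camel-case or digit boundary."""
    return _BOUNDARY.sub(" ", hashtag)


record = {"index": 282559, "hashtag": "Youth4Nation", "segmentation": "Youth 4 Nation"}
print(split_camel_case(record["hashtag"]) == record["segmentation"])
```

A heuristic like this only inserts whitespace, which matches the field convention described under Dataset Creation, but it cannot segment hashtags written in a single case.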
## Dataset Creation
- All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: `hashtag` and `segmentation`, or `identifier` and `segmentation`.
- The only difference between `hashtag` and `segmentation`, or between `identifier` and `segmentation`, is the whitespace characters. Spell checking, abbreviation expansion, and uppercase corrections go into other fields.
- There is always whitespace between an alphanumeric character and a sequence of any special characters (such as `_`, `:`, `~`).
- If there are any annotations for named entity recognition and other token classification tasks, they are given in a `spans` field.
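
The first three conventions above can be checked mechanically on any record. The sketch below is illustrative only: the helper names `is_whitespace_only_segmentation` and `specials_are_isolated` are hypothetical, and the second check relies on `str.isalnum`, which treats combining marks as special characters, so it is assumed to be applied to Latin-script text.

```python
def is_whitespace_only_segmentation(source: str, segmentation: str) -> bool:
    """True if the two strings are identical once all whitespace is removed."""
    return "".join(segmentation.split()) == "".join(source.split())


def specials_are_isolated(segmentation: str) -> bool:
    """True if no alphanumeric character directly touches a special character.

    Assumption: "special" means any non-whitespace character for which
    str.isalnum() is False (for example `_`, `:`, `~`).
    """
    for left, right in zip(segmentation, segmentation[1:]):
        if left.isspace() or right.isspace():
            continue  # a whitespace boundary never violates the rule
        if left.isalnum() != right.isalnum():
            return False  # alphanumeric and special characters are adjacent
    return True


record = {"hashtag": "Youth4Nation", "segmentation": "Youth 4 Nation"}
print(is_whitespace_only_segmentation(record["hashtag"], record["segmentation"]))
print(specials_are_isolated(record["segmentation"]))
```

Checks like these are useful as a sanity filter before training or evaluation, since a record that fails the first one contains edits beyond whitespace insertion.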
## Additional Information
### Citation Information
```
@article{kodali2022hashset,
title={HashSet--A Dataset For Hashtag Segmentation},
author={Kodali, Prashant and Bhatnagar, Akshala and Ahuja, Naman and Shrivastava, Manish and Kumaraguru, Ponnurangam},
journal={arXiv preprint arXiv:2201.06741},
year={2022}
}
```
### Contributions
This dataset was added by [@ruanchaves](https://github.com/ruanchaves) while developing the [hashformers](https://github.com/ruanchaves/hashformers) library.
Provided by:
ruanchaves
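Summary of original information
Dataset overview
Dataset description
- Name: HashSet Distant
- Languages: Hindi (hi), English (en)
- License: unknown
- Multilinguality: multilingual
- Size: unknown
- Source datasets: original
- Task category: structure prediction
- Tags: word segmentation
Dataset summary
- Overview: HashSet Distant contains 3.3M loosely collected camel-cased hashtags and their segmentations, intended for testing the efficiency of hashtag segmentation models.
- Purpose: to compare and analyse the performance of state-of-the-art hashtag segmentation models on HashSet and other baseline datasets (such as STAN and BOUN), in order to argue that HashSet is a good benchmark for hashtag segmentation tasks.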
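Dataset structure
- Example data instance:
{ index: 282559, hashtag: Youth4Nation, segmentation: Youth 4 Nation }
Dataset creation
- Fields: all hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: `hashtag` and `segmentation`, or `identifier` and `segmentation`.
- Difference: the only difference between `hashtag` and `segmentation`, or between `identifier` and `segmentation`, is the whitespace characters. Spell checking, abbreviation expansion, and uppercase corrections go into other fields.
- Rule: there is always whitespace between an alphanumeric character and a sequence of any special characters (such as `_`, `:`, `~`).
- Extra fields: any annotations for named entity recognition and other token classification tasks are given in a `spans` field.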



