ruanchaves/boun
Source: Hugging Face. Updated 2022-10-20; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ruanchaves/boun
Description:
---
annotations_creators:
- expert-generated
language_creators:
- machine-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- structure-prediction
task_ids: []
pretty_name: BOUN
tags:
- word-segmentation
---
# Dataset Card for BOUN
## Dataset Description
- **Repository:** [ardax/hashtag-segmentor](https://github.com/ardax/hashtag-segmentor)
- **Paper:** [Segmenting Hashtags and Analyzing Their Grammatical Structure](https://asistdl.onlinelibrary.wiley.com/doi/epdf/10.1002/asi.23989?author_access_token=qbKcE1jrre5nbv_Tn9csbU4keas67K9QMdWULTWMo8NOtY2aA39ck2w5Sm4ePQ1MZhbjCdEuaRlPEw2Kd12jzvwhwoWP0fdroZAwWsmXHPXxryDk_oBCup1i9_VDNIpU)
### Dataset Summary
Dev-BOUN is a development set of 500 manually segmented hashtags, selected from tweets about movies, TV shows, popular people, sports teams, and similar topics.

Test-BOUN is a test set of 500 manually segmented hashtags, selected from tweets about the same kinds of topics.
### Languages
English
## Dataset Structure
### Data Instances
```json
{
"index": 0,
"hashtag": "tryingtosleep",
"segmentation": "trying to sleep"
}
```
### Data Fields
- `index`: a numerical index.
- `hashtag`: the original hashtag.
- `segmentation`: the gold segmentation for the hashtag.
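As a rough illustration of the three fields above, a record could be modeled as a small typed structure. The names `BounRecord` and `parse_record` are hypothetical, introduced only for this sketch; they are not part of the dataset or of the hashformers library.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BounRecord:
    """One dataset row (hypothetical helper type, not shipped with the dataset)."""
    index: int          # numerical index
    hashtag: str        # original hashtag
    segmentation: str   # gold segmentation, words separated by whitespace


def parse_record(raw: str) -> BounRecord:
    """Parse a JSON object carrying the three documented fields."""
    data = json.loads(raw)
    return BounRecord(
        index=int(data["index"]),
        hashtag=str(data["hashtag"]),
        segmentation=str(data["segmentation"]),
    )


# The instance shown in the "Data Instances" section of this card.
example = parse_record(
    '{"index": 0, "hashtag": "tryingtosleep", "segmentation": "trying to sleep"}'
)
words = example.segmentation.split()
```

Splitting `segmentation` on whitespace recovers the individual words of the gold segmentation.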
## Dataset Creation
- All hashtag segmentation and identifier splitting datasets on this profile share the same basic fields: `hashtag` and `segmentation`, or `identifier` and `segmentation`.
- The only difference between `hashtag` and `segmentation` (or between `identifier` and `segmentation`) is whitespace. Spell-checking corrections, expanded abbreviations, and corrections to uppercase are stored in other fields.
- There is always whitespace between an alphanumeric character and a sequence of special characters (such as `_`, `:`, `~`).
- Any annotations for named entity recognition or other token classification tasks are given in a `spans` field.
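The conventions listed above can be checked mechanically. The following is a minimal sketch, assuming "special character" means any character that is neither alphanumeric nor whitespace; the function names `differs_only_by_whitespace` and `specials_are_spaced` are hypothetical and are not part of the dataset tooling.

```python
def differs_only_by_whitespace(source: str, segmentation: str) -> bool:
    """True if removing all whitespace from both strings makes them identical."""
    return "".join(source.split()) == "".join(segmentation.split())


def specials_are_spaced(segmentation: str) -> bool:
    """True if no alphanumeric character directly touches a special character.

    Assumption: a "special" character is anything that is neither
    alphanumeric nor whitespace (for example `_`, `:`, `~`).
    """
    for left, right in zip(segmentation, segmentation[1:]):
        if left.isspace() or right.isspace():
            continue  # whitespace already separates these two characters
        if left.isalnum() != right.isalnum():
            return False  # alphanumeric adjacent to a special character
    return True


# The instance from this card satisfies both conventions.
ok_whitespace = differs_only_by_whitespace("tryingtosleep", "trying to sleep")
ok_specials = specials_are_spaced("trying to sleep")
```

A check like this may be useful as a sanity test when adding new rows or converting another dataset into the same layout.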
## Additional Information
### Citation Information
```
@article{celebi2018segmenting,
title={Segmenting hashtags and analyzing their grammatical structure},
author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
journal={Journal of the Association for Information Science and Technology},
volume={69},
number={5},
pages={675--686},
year={2018},
publisher={Wiley Online Library}
}
```
### Contributions
This dataset was added by [@ruanchaves](https://github.com/ruanchaves) while developing the [hashformers](https://github.com/ruanchaves/hashformers) library.
Provider:
ruanchaves



