ruanchaves/boun

Name: ruanchaves/boun
Creator: ruanchaves
Published: 2022-10-20 19:13:09
License: No description available

Source: Hugging Face. Updated: 2022-10-20. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ruanchaves/boun

Resource overview:

---
annotations_creators:
- expert-generated
language_creators:
- machine-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- structure-prediction
task_ids: []
pretty_name: BOUN
tags:
- word-segmentation
---

# Dataset Card for BOUN

## Dataset Description

- **Repository:** [ardax/hashtag-segmentor](https://github.com/ardax/hashtag-segmentor)
- **Paper:** [Segmenting Hashtags and Analyzing Their Grammatical Structure](https://asistdl.onlinelibrary.wiley.com/doi/epdf/10.1002/asi.23989?author_access_token=qbKcE1jrre5nbv_Tn9csbU4keas67K9QMdWULTWMo8NOtY2aA39ck2w5Sm4ePQ1MZhbjCdEuaRlPEw2Kd12jzvwhwoWP0fdroZAwWsmXHPXxryDk_oBCup1i9_VDNIpU)

### Dataset Summary

Dev-BOUN is a development set that includes 500 manually segmented hashtags. These are selected from tweets about movies, TV shows, popular people, sports teams, etc.

Test-BOUN is a test set that includes 500 manually segmented hashtags. These are selected from tweets about movies, TV shows, popular people, sports teams, etc.

### Languages

English

## Dataset Structure

### Data Instances

```
{
    "index": 0,
    "hashtag": "tryingtosleep",
    "segmentation": "trying to sleep"
}
```

### Data Fields

- `index`: a numerical index.
- `hashtag`: the original hashtag.
- `segmentation`: the gold segmentation for the hashtag.

## Dataset Creation

- All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: `hashtag` and `segmentation` or `identifier` and `segmentation`.
- The only difference between `hashtag` and `segmentation` or between `identifier` and `segmentation` is the whitespace characters. Spell checking, expanding abbreviations or correcting characters to uppercase go into other fields.
- There is always whitespace between an alphanumeric character and a sequence of any special characters (such as `_`, `:`, `~`).
- If there are any annotations for named entity recognition and other token classification tasks, they are given in a `spans` field.

## Additional Information

### Citation Information

```
@article{celebi2018segmenting,
  title={Segmenting hashtags and analyzing their grammatical structure},
  author={Celebi, Arda and {\"O}zg{\"u}r, Arzucan},
  journal={Journal of the Association for Information Science and Technology},
  volume={69},
  number={5},
  pages={675--686},
  year={2018},
  publisher={Wiley Online Library}
}
```

### Contributions

This dataset was added by [@ruanchaves](https://github.com/ruanchaves) while developing the [hashformers](https://github.com/ruanchaves/hashformers) library.

Provided by:

ruanchaves

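Summary of original information

Dataset card for BOUN

Dataset description

Dataset summary

Dev-BOUN: contains 500 manually segmented hashtags, selected from tweets about movies, TV shows, popular people, sports teams, etc.
Test-BOUN: contains 500 manually segmented hashtags, selected from tweets about movies, TV shows, popular people, sports teams, etc.

Languages

English

Dataset structure

Data instances

{ "index": 0, "hashtag": "tryingtosleep", "segmentation": "trying to sleep" }

Data fields

index: a numerical index.
hashtag: the original hashtag.
segmentation: the gold segmentation for the hashtag.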
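
To make the three fields concrete, here is a minimal sketch of how a single record could be parsed and split into words. The names `BounRecord` and `parse_record` are hypothetical helpers introduced only for illustration; they are not part of the dataset or of any library mentioned here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BounRecord:
    """One dataset row (hypothetical helper type, not shipped with the dataset)."""

    index: int
    hashtag: str
    segmentation: str


def parse_record(raw: str) -> BounRecord:
    """Parse a JSON object carrying the three documented fields."""
    obj = json.loads(raw)
    return BounRecord(
        index=int(obj["index"]),
        hashtag=str(obj["hashtag"]),
        segmentation=str(obj["segmentation"]),
    )


# The sample instance shown in the dataset card.
record = parse_record(
    '{"index": 0, "hashtag": "tryingtosleep", "segmentation": "trying to sleep"}'
)

# The gold segmentation is whitespace-delimited, so split() recovers the words.
words = record.segmentation.split()
```

A frozen dataclass is used here only so that a record cannot be modified by accident after parsing; a plain dictionary would work just as well.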

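Dataset creation

All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: hashtag and segmentation, or identifier and segmentation.
The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters. Spell checking, expanding abbreviations or correcting characters to uppercase go into other fields.
There is always whitespace between an alphanumeric character and a sequence of any special characters (such as _, :, ~).
If there are any annotations for named entity recognition and other token classification tasks, they are given in a spans field.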
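
The two formatting rules above (whitespace-only differences, and whitespace around runs of special characters) can be checked mechanically. The sketch below is one possible reading of those rules, not the validation used by the dataset's author. Both function names are hypothetical, and "alphanumeric" is taken to mean whatever Python's `str.isalnum` accepts, which is an assumption.

```python
def differs_only_by_whitespace(source: str, segmentation: str) -> bool:
    """True if removing all whitespace from both strings makes them equal."""
    return "".join(segmentation.split()) == "".join(source.split())


def separates_special_characters(segmentation: str) -> bool:
    """True if no alphanumeric character directly touches a special character.

    Adjacent special characters (for example "::") may stay together,
    since the rule speaks of a *sequence* of special characters.
    """
    for left, right in zip(segmentation, segmentation[1:]):
        if left.isspace() or right.isspace():
            continue  # whitespace already separates this pair
        if left.isalnum() != right.isalnum():
            return False  # alphanumeric and special character are adjacent
    return True


# Sample from the dataset card.
ok_whitespace = differs_only_by_whitespace("tryingtosleep", "trying to sleep")

# A case change is not a whitespace-only difference, so this is rejected.
bad_case = differs_only_by_whitespace("tryingtosleep", "Trying to sleep")

# Made-up strings (not from the dataset) illustrating the special-character rule.
spaced = separates_special_characters("snake _ case")
unspaced = separates_special_characters("snake_case")
```

Checks like these are mainly useful as a sanity test when converting another corpus into the same `hashtag` / `segmentation` layout.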

Additional information

Citation information

@article{celebi2018segmenting, title={Segmenting hashtags and analyzing their grammatical structure}, author={Celebi, Arda and Özgür, Arzucan}, journal={Journal of the Association for Information Science and Technology}, volume={69}, number={5}, pages={675--686}, year={2018}, publisher={Wiley Online Library} }
