ruanchaves/hashset_distant_sampled

Name: ruanchaves/hashset_distant_sampled
Creator: ruanchaves
Published: 2022-10-20 19:13:24
License: No description available

Source: Hugging Face. Updated 2022-10-20; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ruanchaves/hashset_distant_sampled

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- machine-generated
language:
- hi
- en
license:
- unknown
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- structure-prediction
task_ids: []
pretty_name: HashSet Distant Sampled
tags:
- word-segmentation
---

# Dataset Card for HashSet Distant Sampled

## Dataset Description

- **Repository:** [prashantkodali/HashSet](https://github.com/prashantkodali/HashSet)
- **Paper:** [HashSet -- A Dataset For Hashtag Segmentation](https://arxiv.org/abs/2201.06741)

### Dataset Summary

HashSet is a new dataset consisting of 1.9k manually annotated and 3.3M loosely supervised tweets for testing the efficiency of hashtag segmentation models. We compare state-of-the-art hashtag segmentation models on HashSet and other baseline datasets (STAN and BOUN). We compare and analyse the results across the datasets to argue that HashSet can act as a good benchmark for hashtag segmentation tasks.

HashSet Distant: 3.3M loosely collected camel-cased hashtags, each paired with its segmentation.

HashSet Distant Sampled is a sample of 20,000 camel-cased hashtags from the HashSet Distant dataset.

### Languages

Hindi and English.

## Dataset Structure

### Data Instances

```
{
    'index': 282559,
    'hashtag': 'Youth4Nation',
    'segmentation': 'Youth 4 Nation'
}
```

## Dataset Creation

- All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: `hashtag` and `segmentation` or `identifier` and `segmentation`.
- The only difference between `hashtag` and `segmentation`, or between `identifier` and `segmentation`, is the whitespace characters. Spell checking, expanding abbreviations, or correcting characters to uppercase go into other fields.
- There is always whitespace between an alphanumeric character and a sequence of any special characters (such as `_`, `:`, `~`).
- If there are any annotations for named entity recognition and other token classification tasks, they are given in a `spans` field.

## Additional Information

### Citation Information

```
@article{kodali2022hashset,
  title={HashSet--A Dataset For Hashtag Segmentation},
  author={Kodali, Prashant and Bhatnagar, Akshala and Ahuja, Naman and Shrivastava, Manish and Kumaraguru, Ponnurangam},
  journal={arXiv preprint arXiv:2201.06741},
  year={2022}
}
```

### Contributions

This dataset was added by [@ruanchaves](https://github.com/ruanchaves) while developing the [hashformers](https://github.com/ruanchaves/hashformers) library.

Provided by:

ruanchaves

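Summary of Original Information

Dataset Overview

Dataset Description

Name: HashSet Distant Sampled
Languages: Hindi (hi) and English (en)
License: unknown
Multilinguality: multilingual
Task category: structure prediction
Tags: word segmentation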

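Dataset Summary

Overview: HashSet Distant Sampled is a sample of 20,000 camel-cased hashtags, with their segmentations, drawn from the HashSet Distant dataset.
Source data: The parent HashSet dataset comprises 1.9k manually annotated and 3.3M loosely supervised tweets for testing the efficiency of hashtag segmentation models.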

Dataset Structure

Example data instance:

{ 'index': 282559, 'hashtag': 'Youth4Nation', 'segmentation': 'Youth 4 Nation' }
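
To make the instance above concrete, here is a minimal sketch of a rule-based camel-case splitter that happens to reproduce this one example. It is not the method used to build HashSet or the one implemented in hashformers; the function name `naive_camel_segment` and the regular expression are assumptions chosen purely for illustration.

```python
import re

# Hypothetical tokenizer (an assumption, not the dataset authors' method):
# - a run of capitals not followed by a lowercase letter (acronyms),
# - an optional capital followed by lowercase letters (ordinary words),
# - a run of digits,
# - a run of any other characters (special characters, non-ASCII text).
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^A-Za-z0-9]+")


def naive_camel_segment(hashtag: str) -> str:
    """Insert spaces at camel-case, digit, and special-character boundaries."""
    return " ".join(_TOKEN.findall(hashtag))


print(naive_camel_segment("Youth4Nation"))
```

Because the pattern only inserts spaces and never drops or alters characters, its output stays within the whitespace-only convention the dataset card describes.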

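Dataset Creation

Fields: All datasets contain the basic fields hashtag and segmentation, or identifier and segmentation.
Difference: The only difference between hashtag and segmentation, or between identifier and segmentation, is the whitespace characters.
Rule: There is always whitespace between an alphanumeric character and a sequence of any special characters.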
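
The whitespace-only rule stated above can be checked mechanically. The following is a minimal sketch, assuming records are plain dictionaries with the `hashtag` and `segmentation` fields; the helper name `differs_only_by_whitespace` is hypothetical and not part of any library.

```python
def differs_only_by_whitespace(hashtag: str, segmentation: str) -> bool:
    """Return True if the two strings are equal once all whitespace is removed."""
    # str.split() with no arguments splits on any whitespace run,
    # so joining the pieces strips every whitespace character.
    return "".join(segmentation.split()) == "".join(hashtag.split())


# The instance shown in the dataset card.
record = {"index": 282559, "hashtag": "Youth4Nation", "segmentation": "Youth 4 Nation"}
print(differs_only_by_whitespace(record["hashtag"], record["segmentation"]))
```

A check like this is useful as a sanity filter: a pair where the segmentation also expands an abbreviation or changes case would fail it, which matches the card's statement that such changes belong in other fields.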

Additional Information

Contributor: added by @ruanchaves.
Related work: the dataset was added while developing the hashformers library.
