ruanchaves/jhotdraw
Source: Hugging Face. Updated 2022-10-20; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ruanchaves/jhotdraw
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- machine-generated
language:
- code
license:
- unknown
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- structure-prediction
task_ids: []
pretty_name: Jhotdraw
tags:
- word-segmentation
---
# Dataset Card for Jhotdraw
## Dataset Description
- **Paper:** [Helpful or Not? An investigation on the feasibility of identifier splitting via CNN-BiLSTM-CRF](https://ksiresearch.org/seke/seke18paper/seke18paper_167.pdf)
### Dataset Summary
In programming languages, identifiers are tokens (also called symbols) which name language entities.
Some of the kinds of entities an identifier might denote include variables, types, labels, subroutines, and packages.
Jhotdraw is a dataset for identifier segmentation, i.e. the task of adding spaces between the words in an identifier.
### Languages
- Java
## Dataset Structure
### Data Instances
```
{
"index": 0,
"identifier": "abstractconnectorserializeddataversion",
"segmentation": "abstract connector serialized data version"
}
```
### Data Fields
- `index`: a numerical index.
- `identifier`: the original identifier.
- `segmentation`: the gold segmentation for the identifier.
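
Since the card lists `structure-prediction` as the task category, one common way to use these fields is to turn each gold segmentation into per-character boundary labels. The sketch below is only an illustration under that assumption: the helper names `boundary_labels` and `apply_labels` are hypothetical, not part of the dataset or of hashformers, and it assumes words in `segmentation` are separated by single spaces.

```python
from typing import List


def boundary_labels(segmentation: str) -> List[int]:
    """Return one label per non-space character: 1 if a word boundary follows it."""
    labels: List[int] = []
    for word in segmentation.split():
        # Every character in a word gets 0, except the last one, which gets 1.
        labels.extend([0] * (len(word) - 1) + [1])
    if labels:
        labels[-1] = 0  # no boundary after the final character
    return labels


def apply_labels(identifier: str, labels: List[int]) -> str:
    """Rebuild a segmentation by inserting a space after each character labeled 1."""
    pieces = []
    for char, label in zip(identifier, labels):
        pieces.append(char)
        if label:
            pieces.append(" ")
    return "".join(pieces)


# The instance shown under "Data Instances".
example = {
    "index": 0,
    "identifier": "abstractconnectorserializeddataversion",
    "segmentation": "abstract connector serialized data version",
}

labels = boundary_labels(example["segmentation"])
restored = apply_labels(example["identifier"], labels)
```

Here `labels` has one entry per character of `identifier`, and `restored` round-trips back to the gold `segmentation`.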
## Dataset Creation
- All hashtag segmentation and identifier splitting datasets on this profile have the same basic fields: `hashtag` and `segmentation` or `identifier` and `segmentation`.
- The only difference between `hashtag` and `segmentation`, or between `identifier` and `segmentation`, is the whitespace characters. Spell checking, expanding abbreviations, and correcting characters to uppercase go into other fields.
- There is always whitespace between an alphanumeric character and a sequence of any special characters (such as `_`, `:`, `~`).
- If there are any annotations for named entity recognition and other token classification tasks, they are given in a `spans` field.
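
The whitespace conventions in the list above can be checked mechanically. The following is a minimal sketch, not an official validator: the function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and "special character" is assumed to mean any other non-whitespace character. The strings `max _ value` and `max_ value` are made-up illustrations, not rows from the dataset.

```python
import re

# Assumption: an alphanumeric character directly touching a special character
# (in either order) violates the spacing convention.
_ADJACENT = re.compile(
    r"[A-Za-z0-9][^A-Za-z0-9\s]|[^A-Za-z0-9\s][A-Za-z0-9]"
)


def differs_only_by_whitespace(identifier: str, segmentation: str) -> bool:
    """True if removing all whitespace from the segmentation yields the identifier."""
    return "".join(segmentation.split()) == identifier


def separates_special_characters(segmentation: str) -> bool:
    """True if no alphanumeric character is directly adjacent to a special character."""
    return _ADJACENT.search(segmentation) is None


ok_whitespace = differs_only_by_whitespace(
    "abstractconnectorserializeddataversion",
    "abstract connector serialized data version",
)
ok_special = separates_special_characters("max _ value")    # hypothetical example
bad_special = separates_special_characters("max_ value")    # hypothetical example
```

A check like this can be run over every row before training to catch entries that break the conventions.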
## Additional Information
### Citation Information
```
@inproceedings{madani2010recognizing,
title={Recognizing words from source code identifiers using speech recognition techniques},
author={Madani, Nioosha and Guerrouj, Latifa and Di Penta, Massimiliano and Gueheneuc, Yann-Gael and Antoniol, Giuliano},
booktitle={2010 14th European Conference on Software Maintenance and Reengineering},
pages={68--77},
year={2010},
organization={IEEE}
}
```
### Contributions
This dataset was added by [@ruanchaves](https://github.com/ruanchaves) while developing the [hashformers](https://github.com/ruanchaves/hashformers) library.
Provider:
ruanchaves
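## Summary of Original Information
### Dataset Summary
- Name: Jhotdraw
- Purpose: identifier segmentation, i.e. adding spaces to an identifier to separate its words.
- Language: Java code only.
### Dataset Structure
- Data instance: `{ "index": 0, "identifier": "abstractconnectorserializeddataversion", "segmentation": "abstract connector serialized data version" }`
- Data fields:
  - `index`: a numerical index.
  - `identifier`: the original identifier.
  - `segmentation`: the gold segmentation for the identifier.
### Dataset Creation
- The dataset has the basic fields `identifier` and `segmentation`.
- The only difference between the identifier and its segmentation is the whitespace characters.
- There is always whitespace between an alphanumeric character and a sequence of any special characters.
### Additional Information
- Contributor: @ruanchaves
- Related project: hashformers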



