taidng/WikiSER
Source: Hugging Face. Updated 2024-04-14; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/taidng/WikiSER
Resource description:
---
license: apache-2.0
---
# Software Entity Recognition
## Description
This dataset was collected for our paper ["Software Entity Recognition with Noise-robust Learning"](https://arxiv.org/abs/2308.10564), ASE 2023.
The WikiSER corpus includes 1.7M sentences with named entity labels, extracted from 79k Wikipedia articles.
Relevant software named entities are labeled under 12 fine-grained categories:
| Type | Examples |
|------------------|-------------------------------------------------------|
| Algorithm | Auction algorithm, Collaborative filtering |
| Application | Adobe Acrobat, Microsoft Excel |
| Architecture | Graphics processing unit, Wishbone |
| Data_Structure | Array, Hash table, XOR linked list |
| Device | Samsung Gear S2, iPad, Intel T5300 |
| Error Name | Buffer overflow, Memory leak |
| General_Concept | Memory management, Nouvelle AI |
| Language | C++, Java, Python, Rust |
| Library | Beautiful Soup, FastAPI |
| License | Cryptix General License, MIT License |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS |
| Protocol | TLS, FTPS, HTTP 404 |
WikiSER is organized by the Wikipedia article from which each piece of data was scraped, with one text file per article:
|-- Adobe_Flash.txt
|-- Linux.txt
|-- Java_(programming_language).txt
|-- ...
Sentences are delimited by `<s>...</s>` and tokenized with [stokenizer](https://github.com/jeniyat/StackOverflowNER/blob/master/code/SOTokenizer/stokenizer.py).
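The layout described above (one text file per article, sentences wrapped in `<s>...</s>`) can be read with a short script. The following is a minimal sketch, not the authors' loader: the helper names `split_sentences`, `article_title`, and `iter_corpus` are hypothetical, the files are assumed to be UTF-8, and because the arrangement of tokens and labels inside each sentence is not described here, each sentence is returned as raw text.

```python
import re
from pathlib import Path

# Assumption: every sentence is wrapped as <s> ... </s>. How tokens and
# labels are laid out inside a sentence is not specified in this card,
# so the span contents are kept as raw, unparsed text.
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def split_sentences(text: str) -> list[str]:
    """Return the contents of every <s>...</s> span, with outer whitespace stripped."""
    return [span.strip() for span in SENTENCE_RE.findall(text)]


def article_title(path: Path) -> str:
    """Turn a file name such as Java_(programming_language).txt into a readable title."""
    return path.stem.replace("_", " ")


def iter_corpus(root: Path):
    """Yield (article title, list of raw sentences) for each .txt file directly under root."""
    for path in sorted(root.glob("*.txt")):
        # Assumption: files are UTF-8 encoded.
        yield article_title(path), split_sentences(path.read_text(encoding="utf-8"))
```

For example, `for title, sentences in iter_corpus(Path("wikiser-sample")): ...` would walk an unpacked copy of the sample folder, assuming it has been downloaded to a local directory of that name.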
## Structure
In the [folder](https://huggingface.co/datasets/taidng/WikiSER/tree/main/):
- `wikiser`: Full zipped data
- `wikiser-small`: Subset of the data used for training [`wikiser-bert-base`](https://huggingface.co/taidng/wikiser-bert-base) and [`wikiser-bert-large`](https://huggingface.co/taidng/wikiser-bert-large)
- `wikiser-sample`: A few examples
## Citation
```bibtex
@inproceedings{nguyen2023software,
title={Software Entity Recognition with Noise-Robust Learning},
author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
year={2023},
organization={IEEE/ACM}
}
```
Provided by:
taidng
Summary of original information
Dataset overview
Source
- The dataset comes from the paper "Software Entity Recognition with Noise-robust Learning", published at ASE 2023.



