
taidng/WikiSER

Source: Hugging Face. Updated 2024-04-14; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/taidng/WikiSER
Resource description:
---
license: apache-2.0
---

# Software Entity Recognition

## Description

Data collected from our paper ["Software Entity Recognition with Noise-robust Learning"](https://arxiv.org/abs/2308.10564), ASE 2023.

The WikiSER corpus includes 1.7M sentences with named entity labels extracted from 79k Wikipedia articles. Relevant software named entities are labeled under 12 fine-grained categories:

| Type             | Examples                                   |
|------------------|--------------------------------------------|
| Algorithm        | Auction algorithm, Collaborative filtering |
| Application      | Adobe Acrobat, Microsoft Excel             |
| Architecture     | Graphics processing unit, Wishbone         |
| Data_Structure   | Array, Hash table, mXOR linked list        |
| Device           | Samsung Gear S2, iPad, Intel T5300         |
| Error Name       | Buffer overflow, Memory leak               |
| General_Concept  | Memory management, Nouvelle AI             |
| Language         | C++, Java, Python, Rust                    |
| Library          | Beautiful Soup, FastAPI                    |
| License          | Cryptix General License, MIT License       |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS         |
| Protocol         | TLS, FTPS, HTTP 404                        |

WikiSER is organized by the Wikipedia articles from which the data was scraped:

    |-- Adobe_Flash.txt
    |-- Linux.txt
    |-- Java_(programming_language).txt
    |-- ...

Sentences are delimited by `<s>...</s>` and tokenized with [stokenizer](https://github.com/jeniyat/StackOverflowNER/blob/master/code/SOTokenizer/stokenizer.py).
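As a rough illustration of the per-article layout and the `<s>...</s>` delimiters described above, the following sketch splits the raw text of one article file into sentences. It is not the authors' loader. The helper names (`split_sentences`, `article_filename`, `read_article`) are hypothetical, the title-to-filename rule is only inferred from the file names listed above, and because the layout of tokens and labels inside each `<s>...</s>` span is not documented here, each sentence body is returned unparsed.

```python
import re
from pathlib import Path

# Assumption: every sentence is wrapped as <s> ... </s>, possibly across lines.
# The token/label layout inside the tags is not described in the README,
# so the body of each sentence is kept as an opaque string.
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def split_sentences(raw: str) -> list[str]:
    """Return the text between each <s> and </s> pair, whitespace-trimmed."""
    return [body.strip() for body in SENTENCE_RE.findall(raw)]


def article_filename(title: str) -> str:
    """Hypothetical helper: guess a file name from a Wikipedia article title.

    Inferred from names such as Java_(programming_language).txt;
    the real naming rule may differ.
    """
    return title.replace(" ", "_") + ".txt"


def read_article(path: str) -> list[str]:
    """Read one article file (UTF-8 assumed) and split it into sentences."""
    return split_sentences(Path(path).read_text(encoding="utf-8"))
```

For example, `read_article(article_filename("Adobe Flash"))` would read `Adobe_Flash.txt` from the current directory, assuming the archive has been unzipped there.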
## Structure

In the [folder](https://huggingface.co/datasets/taidng/WikiSER/tree/main/):

- `wikiser`: Full zipped data
- `wikiser-small`: Subset of the data used for training [`wikiser-bert-base`](https://huggingface.co/taidng/wikiser-bert-base) and [`wikiser-bert-large`](https://huggingface.co/taidng/wikiser-bert-large)
- `wikiser-sample`: A few examples

## Citation

```bibtex
@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}
```
Provider:
taidng
Summary of original information

Dataset overview

Source

  • The dataset comes from the paper "Software Entity Recognition with Noise-robust Learning", published at ASE 2023.