events_classification_biotech
Source: ModelScope community · Updated 2025-12-04 · Indexed 2024-12-28
Download link:
https://modelscope.cn/datasets/knowledgator/events_classification_biotech
Description:
### Key aspects
* Event extraction;
* [Multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification);
* Biotech news domain;
* 31 classes;
* 3,140 examples in total.
### Motivation
Text classification is a widespread task and a foundational step in numerous information extraction pipelines. However, a notable challenge in current NLP research lies in the oversimplification of benchmarking datasets, which predominantly focus on rudimentary tasks such as topic classification or sentiment analysis.
This dataset is specifically curated to address the limitations of existing benchmarks by incorporating rich and complex content derived from the biotech news domain. It encompasses diverse biotech news articles consisting of various events, offering a more nuanced perspective on information extraction challenges.
A distinctive feature of this dataset is its emphasis on not only identifying the overarching theme but also extracting information about the target companies associated with the news. This dual-layered approach enhances the dataset's utility for applications that require a deeper understanding of the relationships between events, companies, and the biotech industry as a whole.
### Classes
The dataset has **31** classes, including a None class for examples with no label.
* event organization - organizing or participating in an event like a conference, exhibition, etc.
* executive statement - a statement or quote from an executive of a company.
* regulatory approval - getting approval from regulatory bodies for products, services, trials, etc.
* hiring - announcing new hires or appointments at the company.
* foundation - establishing a new charitable foundation.
* closing - shutting down a facility/office/division or ceasing an initiative.
* partnerships & alliances - forming partnerships or strategic alliances with other companies.
* expanding industry - expanding into new industries or markets.
* new initiatives or programs - announcing new initiatives, programs, or campaigns.
* m&a - mergers, acquisitions, or divestitures.
* None - no label.
* service & product providing - launching or expanding products or services.
* event organisation - organizing or participating in an event.
* new initiatives & programs - announcing new initiatives or programs.
* subsidiary establishment - establishing a new subsidiary company.
* product launching & presentation - launching or unveiling a new product.
* product updates - announcing updates or new versions of existing products.
* executive appointment - appointing a new executive.
* alliance & partnership - forming an alliance or partnership.
* ipo exit - having an initial public offering or acquisition exit.
* article publication - publishing an article.
* clinical trial sponsorship - sponsoring or participating in a clinical trial.
* company description - describing or profiling the company.
* investment in public company - making an investment in a public company.
* other - other events that don't fit into defined categories.
* expanding geography - expanding into new geographical areas.
* participation in an event - participating in an industry event, conference, etc.
* support & philanthropy - philanthropic activities or donations.
* department establishment - establishing a new department or division.
* funding round - raising a new round of funding.
* patent publication - publication of a new patent filing.
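
Because an article can carry several of these labels at once, a multi-label classifier usually represents the target as a multi-hot vector with one slot per class. The following is a minimal sketch of that encoding, not the authors' preprocessing code: `CLASSES` is a five-item illustrative subset of the list above, and `encode_labels` / `decode_labels` are hypothetical helper names.

```python
# Illustrative subset of the class names listed above (the dataset has 31).
CLASSES = ["funding round", "hiring", "m&a", "regulatory approval", "None"]
LABEL_TO_INDEX = {name: index for index, name in enumerate(CLASSES)}


def encode_labels(labels):
    """Turn a list of class names into a multi-hot vector of 0.0/1.0 floats."""
    vector = [0.0] * len(CLASSES)
    for name in labels:
        vector[LABEL_TO_INDEX[name]] = 1.0
    return vector


def decode_labels(scores, threshold=0.5):
    """Return the class names whose score reaches the threshold."""
    return [CLASSES[i] for i, score in enumerate(scores) if score >= threshold]


# An article that announces both a funding round and a new hire.
print(encode_labels(["funding round", "hiring"]))  # [1.0, 1.0, 0.0, 0.0, 0.0]
```

The same layout works in reverse at inference time: `decode_labels` applies a threshold (0.5 here, an assumed default) to per-class scores to recover the predicted label set.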
### Benchmark
We trained several models with binary cross-entropy loss and evaluated them on the test set.
| Model | Accuracy | F1 | Precision | Recall |
|-----------------|----------|-------|-----------|--------|
| DeBERTa-small | 96.58 | 67.69 | 74.18 | 62.19 |
| DeBERTa-base | 96.60 | 67.55 | 74.81 | 61.58 |
| DeBERTa-large | 96.99 | 74.07 | 73.46 | 74.69 |
| SciBERT-uncased | 96.57 | 68.07 | 73.07 | 63.71 |
| Flan-T5-base | 96.85 | 71.10 | 75.71 | 67.07 |
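
To make the loss and the metric columns concrete, here is a small pure-Python sketch of binary cross-entropy and of precision, recall, F1, and accuracy for multi-label predictions. The source does not say how the metrics were averaged, so micro-averaging and element-wise accuracy are assumptions here, as are the 0.5 threshold, the toy data, and the function names.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over every (example, class) cell."""
    total, count = 0.0, 0
    for prob_row, target_row in zip(probs, targets):
        for p, t in zip(prob_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def micro_metrics(probs, targets, threshold=0.5):
    """Micro-averaged precision/recall/F1 and element-wise accuracy (assumed)."""
    tp = fp = fn = tn = 0
    for prob_row, target_row in zip(probs, targets):
        for p, t in zip(prob_row, target_row):
            predicted = p >= threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Toy data: 3 examples, 4 classes (not taken from the dataset).
targets = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
probs = [[0.9, 0.2, 0.1, 0.1], [0.3, 0.4, 0.1, 0.2], [0.8, 0.7, 0.6, 0.1]]

print(binary_cross_entropy(probs, targets))
print(micro_metrics(probs, targets))
```

On this toy data, precision, recall, and F1 are all 0.75, while accuracy is 10/12, because true negatives count toward accuracy but not toward F1. If the accuracy column above is computed element-wise, which the source does not state, the same effect would be stronger with 31 mostly inactive labels per example.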
### Recommended reading
- Read a general overview of the dataset on Medium: [Finally, a decent multi-label classification benchmark is created: a prominent zero-shot dataset.](https://medium.com/p/4d90c9e1c718)
- Train your own model on the dataset: [Multi-Label Classification Model From Scratch: Step-by-Step Tutorial](https://huggingface.co/blog/Valerii-Knowledgator/multi-label-classification)
### Feedback
We value your input! Share your feedback and suggestions to help us improve our models and datasets.
Fill out the feedback [form](https://forms.gle/5CPFFuLzNWznjcpL7).
### Join Our Discord
Connect with our community on Discord for news, support, and discussion about our models and datasets.
Join [Discord](https://discord.gg/mfZfwjpB)
Provider:
maas
Created:
2024-12-26



