bio-nlp-umass/Synth-SBDH
Source: Hugging Face. Updated 2024-06-08; indexed 2025-09-13.
Download link:
https://hf-mirror.com/datasets/bio-nlp-umass/Synth-SBDH
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- token-classification
language:
- en
tags:
- me
- croissant
pretty_name: Synth-SBDH
size_categories:
- 1K<n<10K
---
# Dataset Card for Synth-SBDH
Synth-SBDH is a collection of 8,767 synthetic examples with annotations for 15 SBDH categories. SBDH annotations include information such as presence, period and annotation rationale.
## Dataset Description
Synth-SBDH is a novel synthetic SBDH dataset that mimics EHR notes.
- **Repository:** [Code to reproduce experiments](https://github.com/avipartho/Synth-SBDH)
- **Paper:** [More Information Needed]
- **Point of Contact:** [Avijit Mitra](mailto:avijitmitra@umass.edu)
## Dataset Structure
### Data Instances
Some examples from [synth_sbdh_train.csv](synth_sbdh_train.csv) look as follows:
```
{
'ex_no': 30,
'Text': 'Patient has lost his job due to physical disabilities and is currently living on government financial aid.',
'Textspan': 'lost his job || living on government financial aid',
'SBDH': 'job insecurity || financial insecurity',
'Presence': 'yes || yes',
'Period': 'current || current',
'Reasoning': 'The patient lost his job due to physical issues and this refers to job insecurity. || Reliance on government financial aid signifies financial insecurity.'
}
{
'ex_no': 31,
'Text': 'Patient was assaulted last year and suffers from PTSD.',
'Textspan': 'assaulted || suffers from PTSD',
'SBDH': 'violence || psychiatric symptoms or disorders',
'Presence': 'yes || yes',
'Period': 'history || current',
'Reasoning': 'Being assaulted is a form of violence. || PTSD is a psychiatric disorder.'
}
```
### Data Fields
We release Synth-SBDH as CSV files. Each CSV file has the following fields:
- `ex_no`: Unique identifier for an example.
- `Text`: Example text sequence, at most a few sentences long.
- `Textspan`: Text spans with mentions of SBDH, separated by '||'.
- `Reasoning`: Rationales for SBDH annotations, separated by '||'.
- `SBDH`: SBDH annotations for text spans in `Textspan`, separated by '||'.
- `Presence`: Presence annotations (yes/no) for text spans in `Textspan`, separated by '||'.
- `Period`: Period annotations (current/history) for text spans in `Textspan`, separated by '||'.
- `Operation` (only in the [synth_sbdh_test_reviewed.csv](synth_sbdh_test_reviewed.csv) file): One of the four review operations considered by the human experts: *keep*, *correct*, *discard*, or *add*.
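Because several fields pack multiple annotations into one '||'-separated string, it can help to split a row into one record per text span. The following is a minimal sketch, not part of the released code: the helper name `split_annotations` is hypothetical, and it assumes every multi-value field in a row is a non-empty string with the same number of '||'-separated entries.

```python
from typing import Dict, List

# Fields described above as '||'-separated and aligned by position.
MULTI_VALUE_FIELDS = ["Textspan", "SBDH", "Presence", "Period", "Reasoning"]


def split_annotations(row: Dict[str, str], sep: str = "||") -> List[Dict[str, str]]:
    """Hypothetical helper: turn one row into a list of per-span annotation dicts."""
    columns = {
        field: [part.strip() for part in row[field].split(sep)]
        for field in MULTI_VALUE_FIELDS
    }
    # Assumption: all multi-value fields carry the same number of entries.
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError(f"Misaligned annotation fields in example {row.get('ex_no')}")
    count = lengths.pop()
    return [
        {field: columns[field][i] for field in MULTI_VALUE_FIELDS}
        for i in range(count)
    ]


# Example 31 from the Data Instances section.
example = {
    "ex_no": "31",
    "Text": "Patient was assaulted last year and suffers from PTSD.",
    "Textspan": "assaulted || suffers from PTSD",
    "SBDH": "violence || psychiatric symptoms or disorders",
    "Presence": "yes || yes",
    "Period": "history || current",
    "Reasoning": "Being assaulted is a form of violence. || PTSD is a psychiatric disorder.",
}

annotations = split_annotations(example)
for ann in annotations:
    print(f"{ann['Textspan']} -> {ann['SBDH']} ({ann['Presence']}, {ann['Period']})")
```

Rows whose fields are empty or missing would need separate handling; this card does not say whether such rows occur.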
### Data Splits
The Synth-SBDH dataset has 4 splits: _train_, _val_, _test_, and _test_reviewed_. Below are the statistics for the dataset.
| Dataset Split | Number of Examples | Number of Annotations |
| ------------- | ------------------ | --------------------- |
| Train | 6,136 | 10,022 |
| Val | 876 | 1,443 |
| Test | 1,755 | 2,904 |
| Test (Expert Reviewed) | 1,732 | 3,345 |
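To compute per-split statistics like those in the table from a downloaded file, one option is to count rows and '||'-separated labels. The sketch below uses only the standard library and runs on an inline two-row sample instead of a real file. The function name `count_split` is hypothetical, and treating the number of entries in the `SBDH` column as the annotation count is an assumption; the card does not state how its counts were produced.

```python
import csv
import io
from typing import TextIO, Tuple


def count_split(handle: TextIO, sep: str = "||") -> Tuple[int, int]:
    """Hypothetical helper: return (number of examples, number of annotations)."""
    n_examples = 0
    n_annotations = 0
    for row in csv.DictReader(handle):
        n_examples += 1
        labels = (row.get("SBDH") or "").strip()
        # Assumption: an empty SBDH cell means the example has no annotations.
        if labels:
            n_annotations += len(labels.split(sep))
    return n_examples, n_annotations


# Inline stand-in for a CSV file, built from the two examples shown earlier
# and reduced to three columns for brevity.
sample_csv = io.StringIO(
    "ex_no,Text,SBDH\n"
    '30,"Patient has lost his job due to physical disabilities and is currently '
    'living on government financial aid.",job insecurity || financial insecurity\n'
    '31,"Patient was assaulted last year and suffers from PTSD.",'
    "violence || psychiatric symptoms or disorders\n"
)

n_examples, n_annotations = count_split(sample_csv)
print(n_examples, n_annotations)  # 2 4
```

For a real split, pass an open file handle instead, for example `open("synth_sbdh_train.csv", newline="", encoding="utf-8")`. Note that the Train, Val, and Test example counts in the table sum to 8,767, the total stated at the top of this card.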
## Dataset Creation
Details about how the data was created are available in our paper.
## Uses
You can download and load the dataset directly with the `datasets` library:
```python
from datasets import load_dataset
synth_sbdh_dataset = load_dataset("bio-nlp-umass/Synth-SBDH")
```
Alternatively, you can download the files individually and load them with any compatible library, for example `pandas`:
```python
import pandas as pd

# Replace FILE_NAME with the path to a downloaded CSV, e.g. synth_sbdh_train.csv.
synth_sbdh_df = pd.read_csv('FILE_NAME')
```
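Once a file is loaded into a DataFrame, the '||'-separated columns can be expanded so that each row holds a single annotation. This is a sketch under stated assumptions rather than an official recipe: `to_long_format` is a hypothetical helper, it relies on multi-column `DataFrame.explode` (available in pandas 1.3 and later), and it assumes every multi-value cell in a row has the same number of entries. The small DataFrame below stands in for the result of `pd.read_csv`.

```python
import pandas as pd

# Columns described in the Data Fields section as '||'-separated.
MULTI_VALUE_COLUMNS = ["Textspan", "SBDH", "Presence", "Period", "Reasoning"]


def to_long_format(df: pd.DataFrame, sep: str = "||") -> pd.DataFrame:
    """Hypothetical helper: one output row per annotation instead of per example."""
    out = df.copy()
    for col in MULTI_VALUE_COLUMNS:
        # Plain Python split avoids '||' being interpreted as a regular expression.
        out[col] = out[col].apply(lambda cell: [part.strip() for part in str(cell).split(sep)])
    # Multi-column explode raises ValueError if list lengths differ within a row,
    # which doubles as an alignment check.
    return out.explode(MULTI_VALUE_COLUMNS, ignore_index=True)


# Stand-in for pd.read_csv(...), using the two examples from Data Instances.
df = pd.DataFrame(
    [
        {
            "ex_no": 30,
            "Text": "Patient has lost his job due to physical disabilities and is currently living on government financial aid.",
            "Textspan": "lost his job || living on government financial aid",
            "SBDH": "job insecurity || financial insecurity",
            "Presence": "yes || yes",
            "Period": "current || current",
            "Reasoning": "The patient lost his job due to physical issues and this refers to job insecurity. || Reliance on government financial aid signifies financial insecurity.",
        },
        {
            "ex_no": 31,
            "Text": "Patient was assaulted last year and suffers from PTSD.",
            "Textspan": "assaulted || suffers from PTSD",
            "SBDH": "violence || psychiatric symptoms or disorders",
            "Presence": "yes || yes",
            "Period": "history || current",
            "Reasoning": "Being assaulted is a form of violence. || PTSD is a psychiatric disorder.",
        },
    ]
)

long_df = to_long_format(df)
print(long_df[["ex_no", "Textspan", "SBDH", "Period"]])
```

The long format makes per-category counts straightforward, for example `long_df["SBDH"].value_counts()`.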
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
Provider:
bio-nlp-umass



