diwank/silicone-merged
Source: Hugging Face. Updated 2022-03-06; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/diwank/silicone-merged
Description:
---
license: mit
---
# diwank/silicone-merged
> Merged and simplified dialog act datasets from the [silicone collection](https://huggingface.co/datasets/silicone/)
All subsets of the original collection have been filtered (to remove errors and ambiguous classes), merged together, and grouped into pairs of dialog turns. The hypothesis is that training a dialog act classifier with the previous utterance included helps the model pick up additional contextual cues and perform better at inference, especially when an utterance pair is provided.
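To make the pairing concrete, here is a minimal sketch of how a single dialog could be turned into pairs of turns. The helper `make_turn_pairs`, the toy dialog, and its labels are hypothetical illustrations, not the script used to build this dataset; in particular, skipping the first turn is an assumption, since the card does not say how first turns were handled.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]


def make_turn_pairs(utterances: List[str], labels: List[int]) -> List[Row]:
    """Pair each utterance with the one immediately before it.

    The first utterance has no predecessor, so it is skipped here
    (an assumption; the dataset card does not specify this).
    """
    if len(utterances) != len(labels):
        raise ValueError("utterances and labels must have the same length")
    rows: List[Row] = []
    for prev, curr, label in zip(utterances, utterances[1:], labels[1:]):
        rows.append({"text_a": prev, "text_b": curr, "labels": label})
    return rows


# Toy dialog with illustrative labels: ask_yes_no, reply_yes, exclaim
dialog = ["Are you coming tonight?", "Yes, I am.", "Great!"]
dialog_labels = [10, 3, 4]
pairs = make_turn_pairs(dialog, dialog_labels)
```

Each resulting row uses the same `text_a`, `text_b`, and `labels` keys as the features described in this card.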
## Example training script
```python
from datasets import load_dataset
from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs,
)

# Get data; simpletransformers expects pandas DataFrames
silicone_merged = load_dataset("diwank/silicone-merged")
train_df = silicone_merged["train"].to_pandas()
eval_df = silicone_merged["validation"].to_pandas()

# Configure training
model_args = ClassificationArgs(
    num_train_epochs=8,
    model_type="deberta",
    model_name="microsoft/deberta-large",
    use_multiprocessing=False,
    evaluate_during_training=True,
)

# Create a ClassificationModel (11 labels in this dataset)
model = ClassificationModel(
    "deberta",
    "microsoft/deberta-large",
    args=model_args,
    num_labels=11,
)

# Train model
model.train_model(train_df, eval_df=eval_df)
```
## Balanced variant of the training set
**Note**: This dataset is highly imbalanced, so it is recommended to rebalance it with a library such as [imbalanced-learn](https://imbalanced-learn.org/stable/) before training.
Because balancing can be complicated and resource-intensive, we share a balanced variant of the train set, created by oversampling with the _imbalanced-learn_ library. The balancing used the `SMOTEN` algorithm, which is designed for categorical data, and was run on a 16-core, 60 GB RAM machine. You can access it using:
```python
load_dataset("diwank/silicone-merged", "balanced")
```
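For readers who want to see what oversampling does before reaching for a library, the sketch below uses plain random oversampling with the standard library only. This is deliberately not `SMOTEN` and not how the balanced variant was produced; the function name `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, label_key="labels", seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data: six rows of class 5 and two rows of class 7
toy_rows = (
    [{"text_a": "a", "text_b": "b", "labels": 5}] * 6
    + [{"text_a": "c", "text_b": "d", "labels": 7}] * 2
)
balanced_rows = random_oversample(toy_rows)
counts = Counter(row["labels"] for row in balanced_rows)
```

Random duplication is cheap but only repeats existing rows, which is one reason a dedicated algorithm may be preferable on the full dataset.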
## Feature description
- `text_a`: The utterance immediately before the one being classified. For example, in a dialog with turns 1-2-3, when classifying turn 2, `text_a` is turn 1.
- `text_b`: The utterance to be classified.
- `labels`: Dialog act label (an integer from 0 to 10, as mapped below).
## Labels map
```python
[
    (0, 'acknowledge'),
    (1, 'answer'),
    (2, 'backchannel'),
    (3, 'reply_yes'),
    (4, 'exclaim'),
    (5, 'say'),
    (6, 'reply_no'),
    (7, 'hold'),
    (8, 'ask'),
    (9, 'intent'),
    (10, 'ask_yes_no'),
]
```
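As a convenience, the label map can be turned into lookup tables for decoding model predictions. The names `LABEL_NAMES`, `ID2LABEL`, `LABEL2ID`, and `decode` below are hypothetical helpers, not part of the dataset; only the integer-to-name pairs come from the map above.

```python
# Label names in index order, copied from the labels map
LABEL_NAMES = [
    "acknowledge", "answer", "backchannel", "reply_yes", "exclaim", "say",
    "reply_no", "hold", "ask", "intent", "ask_yes_no",
]

ID2LABEL = dict(enumerate(LABEL_NAMES))
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode(predictions):
    """Convert an iterable of integer class ids into label names."""
    return [ID2LABEL[int(p)] for p in predictions]


decoded = decode([10, 3, 4])
```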
*****
## Appendix
### How the original datasets were mapped:
```python
mapping = {
    "acknowledge": {
        "swda": ["aap_am", "b", "bk"],
        "mrda": [],
        "oasis": ["ackn", "accept", "complete"],
        "maptask": ["acknowledge", "align"],
        "dyda_da": ["commissive"],
    },
    "answer": {
        "swda": ["bf"],
        "mrda": [],
        "oasis": ["answ", "informCont", "inform", "answElab", "directElab", "refer"],
        "maptask": ["reply_w", "explain"],
        "dyda_da": ["inform"],
    },
    "backchannel": {
        "swda": ["ad", "bh", "bd", "b^m"],
        "mrda": ["b"],
        "oasis": ["backch", "selfTalk", "init"],
        "maptask": ["ready"],
        "dyda_da": [],
    },
    "reply_yes": {
        "swda": ["na", "aa"],
        "mrda": [],
        "oasis": ["confirm"],
        "maptask": ["reply_y"],
        "dyda_da": [],
    },
    "exclaim": {
        "swda": ["ft", "fa", "fc", "fp"],
        "mrda": [],
        "oasis": [
            "appreciate", "bye", "exclaim", "greet", "thank", "pardon",
            "thank-identitySelf", "expressRegret",
        ],
        "maptask": [],
        "dyda_da": [],
    },
    "say": {
        "swda": ["qh", "sd"],
        "mrda": ["s"],
        "oasis": ["expressPossibility", "expressOpinion", "suggest"],
        "maptask": [],
        "dyda_da": [],
    },
    "reply_no": {
        "swda": ["nn", "ng", "ar"],
        "mrda": [],
        "oasis": ["refuse", "negate"],
        "maptask": ["reply_n"],
        "dyda_da": [],
    },
    "hold": {
        "swda": ["^h", "t1"],
        "mrda": ["f"],
        "oasis": ["hold"],
        "maptask": [],
        "dyda_da": [],
    },
    "ask": {
        "swda": ["qw", "qo", "qw^d", "br", "qrr"],
        "mrda": ["q"],
        "oasis": ["reqInfo", "reqDirect", "offer"],
        "maptask": ["query_w"],
        "dyda_da": ["question"],
    },
    "intent": {
        "swda": [],
        "mrda": [],
        "oasis": [
            "informIntent", "informIntent-hold", "expressWish", "direct",
            "raiseIssue", "correct",
        ],
        "maptask": ["instruct", "clarify"],
        "dyda_da": ["directive"],
    },
    "ask_yes_no": {
        "swda": ["qy^d", "^g"],
        "mrda": [],
        "oasis": ["reqModal"],
        "maptask": ["query_yn", "check"],
        "dyda_da": [],
    },
}
```
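The mapping above is keyed by merged label, but relabeling a source corpus needs the opposite direction: from a `(dataset, original label)` pair to the merged label. The sketch below inverts a two-entry excerpt of the mapping; `invert_mapping` and `mapping_excerpt` are hypothetical names, and treating labels that are absent from the mapping as filtered out is an assumption based on the description of the filtering step.

```python
def invert_mapping(mapping):
    """Build a {(dataset, original_label): merged_label} lookup table."""
    lookup = {}
    for merged_label, sources in mapping.items():
        for dataset, original_labels in sources.items():
            for original in original_labels:
                key = (dataset, original)
                if key in lookup:
                    # An original label must not map to two merged labels
                    raise ValueError(
                        f"{key} maps to both {lookup[key]!r} and {merged_label!r}"
                    )
                lookup[key] = merged_label
    return lookup


# Excerpt of the full mapping, kept short for illustration
mapping_excerpt = {
    "reply_yes": {"swda": ["na", "aa"], "maptask": ["reply_y"]},
    "reply_no": {"swda": ["nn", "ng", "ar"], "maptask": ["reply_n"]},
}
lookup = invert_mapping(mapping_excerpt)

# Labels missing from the lookup return None (assumed to be filtered out)
merged = lookup.get(("swda", "aa"))
missing = lookup.get(("swda", "sd"))
```

Passing the full `mapping` dictionary instead of the excerpt yields the complete lookup table.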
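Provided by:
diwank
Summary of original information
Dataset overview
Dataset name
- Name: diwank/silicone-merged
- Description: A merged and simplified version of the dialog act datasets in the silicone collection.
Data processing
- Processing: All subsets of the original collection were filtered (to remove errors and ambiguous classes), merged, and grouped into pairs of dialog turns.
- Hypothesis: Training a dialog act classifier with the previous utterance included helps the model pick up additional contextual cues, especially when an utterance pair is provided at inference time.
Dataset features
- Features:
  - text_a: The utterance before the one being classified.
  - text_b: The utterance to be classified.
  - labels: Dialog act label (an integer from 0 to 10, mapped as follows).
Label map
- Labels: 11 labels in total, each with an integer and a descriptive name.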
- 0: acknowledge
- 1: answer
- 2: backchannel
- 3: reply_yes
- 4: exclaim
- 5: say
- 6: reply_no
- 7: hold
- 8: ask
- 9: intent
- 10: ask_yes_no
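Dataset usage
- Example training script: A Python script using the simpletransformers library is provided to load the dataset and train a model.
- Balanced training set: Because the dataset is imbalanced, preprocessing with the imbalanced-learn library is recommended. A balanced variant of the training set, obtained by oversampling with the SMOTEN algorithm, is also provided.
License
- License: MIT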



