diwank/silicone-merged

Name: diwank/silicone-merged
Creator: diwank
Published: 2022-03-06 11:30:57
License: no description available

Source: Hugging Face. Updated 2022-03-06; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/diwank/silicone-merged

Resource description:

---
license: mit
---

# diwank/silicone-merged

> Merged and simplified dialog act datasets from the [silicone collection](https://huggingface.co/datasets/silicone/)

All subsets of the original collection have been filtered (for errors and ambiguous classes), merged together, and grouped into pairs of dialog turns. The hypothesis is that training a dialog act classifier with the previous utterance included helps models pick up additional contextual cues and improves inference, especially when an utterance pair is provided.

## Example training script

```python
from datasets import load_dataset
from simpletransformers.classification import (
    ClassificationModel, ClassificationArgs
)

# Get data
silicone_merged = load_dataset("diwank/silicone-merged")

# simpletransformers expects pandas DataFrames, so convert each split
train_df = silicone_merged["train"].to_pandas()
eval_df = silicone_merged["validation"].to_pandas()

model_args = ClassificationArgs(
    num_train_epochs=8,
    model_type="deberta",
    model_name="microsoft/deberta-large",
    use_multiprocessing=False,
    evaluate_during_training=True,
)

# Create a ClassificationModel
model = ClassificationModel(
    "deberta",
    "microsoft/deberta-large",
    args=model_args,
    num_labels=11,  # 11 labels in this dataset
)

# Train model
model.train_model(train_df, eval_df=eval_df)
```

## Balanced variant of the training set

**Note**: This dataset is highly imbalanced, and it is recommended to use a library like [imbalanced-learn](https://imbalanced-learn.org/stable/) before proceeding with training.

Since balancing can be complicated and resource-intensive, we have shared a balanced variant of the train set that was created via oversampling with the _imbalanced-learn_ library. The balancing used the `SMOTEN` algorithm, which is designed for categorical data, and the resampling was run on a 16-core, 60 GB RAM machine. You can access it using:

```python
load_dataset("diwank/silicone-merged", "balanced")
```

## Feature description

- `text_a`: The utterance prior to the utterance being classified. (For example, in a dialog with turns 1-2-3, when classifying turn 2, `text_a` is turn 1.)
- `text_b`: The utterance to be classified
- `labels`: Dialog act label (as an integer between 0 and 10, as mapped below)

## Labels map

```python
[
    (0, 'acknowledge'),
    (1, 'answer'),
    (2, 'backchannel'),
    (3, 'reply_yes'),
    (4, 'exclaim'),
    (5, 'say'),
    (6, 'reply_no'),
    (7, 'hold'),
    (8, 'ask'),
    (9, 'intent'),
    (10, 'ask_yes_no'),
]
```

*****

## Appendix

### How the original datasets were mapped:

```python
mapping = {
    "acknowledge": {
        "swda": ["aap_am", "b", "bk"],
        "mrda": [],
        "oasis": ["ackn", "accept", "complete"],
        "maptask": ["acknowledge", "align"],
        "dyda_da": ["commissive"],
    },
    "answer": {
        "swda": ["bf"],
        "mrda": [],
        "oasis": ["answ", "informCont", "inform", "answElab", "directElab", "refer"],
        "maptask": ["reply_w", "explain"],
        "dyda_da": ["inform"],
    },
    "backchannel": {
        "swda": ["ad", "bh", "bd", "b^m"],
        "mrda": ["b"],
        "oasis": ["backch", "selfTalk", "init"],
        "maptask": ["ready"],
        "dyda_da": [],
    },
    "reply_yes": {
        "swda": ["na", "aa"],
        "mrda": [],
        "oasis": ["confirm"],
        "maptask": ["reply_y"],
        "dyda_da": [],
    },
    "exclaim": {
        "swda": ["ft", "fa", "fc", "fp"],
        "mrda": [],
        "oasis": [
            "appreciate", "bye", "exclaim", "greet", "thank", "pardon",
            "thank-identitySelf", "expressRegret",
        ],
        "maptask": [],
        "dyda_da": [],
    },
    "say": {
        "swda": ["qh", "sd"],
        "mrda": ["s"],
        "oasis": ["expressPossibility", "expressOpinion", "suggest"],
        "maptask": [],
        "dyda_da": [],
    },
    "reply_no": {
        "swda": ["nn", "ng", "ar"],
        "mrda": [],
        "oasis": ["refuse", "negate"],
        "maptask": ["reply_n"],
        "dyda_da": [],
    },
    "hold": {
        "swda": ["^h", "t1"],
        "mrda": ["f"],
        "oasis": ["hold"],
        "maptask": [],
        "dyda_da": [],
    },
    "ask": {
        "swda": ["qw", "qo", "qw^d", "br", "qrr"],
        "mrda": ["q"],
        "oasis": ["reqInfo", "reqDirect", "offer"],
        "maptask": ["query_w"],
        "dyda_da": ["question"],
    },
    "intent": {
        "swda": [],
        "mrda": [],
        "oasis": [
            "informIntent", "informIntent-hold", "expressWish", "direct",
            "raiseIssue", "correct",
        ],
        "maptask": ["instruct", "clarify"],
        "dyda_da": ["directive"],
    },
    "ask_yes_no": {
        "swda": ["qy^d", "^g"],
        "mrda": [],
        "oasis": ["reqModal"],
        "maptask": ["query_yn", "check"],
        "dyda_da": [],
    },
}
```
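
To look up which merged label an original tag falls under, the mapping above can be inverted. The following is a minimal sketch, not the author's preprocessing code: the helper name `build_lookup` and the variable `mapping_excerpt` are hypothetical, and only two labels and two source datasets are copied from the appendix to keep the example short.

```python
# Excerpt of the appendix mapping (two merged labels, two source datasets).
mapping_excerpt = {
    "reply_yes": {"swda": ["na", "aa"], "maptask": ["reply_y"]},
    "reply_no": {"swda": ["nn", "ng", "ar"], "maptask": ["reply_n"]},
}


def build_lookup(mapping):
    """Return {(source_dataset, original_tag): merged_label}. Hypothetical helper."""
    lookup = {}
    for merged_label, sources in mapping.items():
        for source, tags in sources.items():
            for tag in tags:
                key = (source, tag)
                # Within one source dataset, a tag should map to one merged label only.
                if key in lookup:
                    raise ValueError(f"{key} is mapped more than once")
                lookup[key] = merged_label
    return lookup


lookup = build_lookup(mapping_excerpt)
print(lookup[("swda", "aa")])
print(lookup[("maptask", "reply_n")])
```

Keying on the `(source, tag)` pair rather than on the tag alone matters because the same tag string can appear in different source datasets with different meanings (for example, `b` is listed under both `swda` and `mrda`).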

Provided by:

diwank

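Summary of original information

Dataset overview

Dataset name

Name: diwank/silicone-merged
Description: A merged and simplified version of the dialog act datasets in the silicone collection.

Data processing

Processing: The subsets of the original collection were filtered (to remove errors and ambiguous classes), merged, and grouped into pairs of dialog turns.
Hypothesis: Training a dialog act classifier with the previous utterance included helps the model pick up additional contextual cues, especially when an utterance pair is provided at inference time.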
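
The grouping into pairs of dialog turns could look roughly like the sketch below. It is an illustration under stated assumptions, not the author's preprocessing code: the function name `pair_turns` is hypothetical, and dropping the first utterance of a dialog (which has no predecessor) is an assumption, since the source does not say how that case was handled.

```python
def pair_turns(utterances, labels):
    """Pair each utterance, from the second onward, with the one before it.

    Hypothetical helper: the first utterance is skipped because it has no
    previous turn to serve as `text_a`.
    """
    rows = []
    for previous, current, label in zip(utterances, utterances[1:], labels[1:]):
        rows.append({"text_a": previous, "text_b": current, "labels": label})
    return rows


# Toy dialog with integer labels from the label map (9=intent, 0=acknowledge, 10=ask_yes_no).
dialog = ["Turn right at the mill.", "Okay.", "Do you see the bridge?"]
dialog_labels = [9, 0, 10]

pairs = pair_turns(dialog, dialog_labels)
for row in pairs:
    print(row)
```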

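Dataset features

Features:
- text_a: The utterance before the utterance being classified.
- text_b: The utterance to be classified.
- labels: Dialog act label (an integer from 0 to 10, mapped as below).

Label map

Labels: 11 labels in total, each with an integer ID and a descriptive name.
- 0: acknowledge
- 1: answer
- 2: backchannel
- 3: reply_yes
- 4: exclaim
- 5: say
- 6: reply_no
- 7: hold
- 8: ask
- 9: intent
- 10: ask_yes_no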
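
For turning integer predictions back into names, the label map above can be held as a pair of dictionaries. This is a small sketch; the names `LABELS`, `id2label`, `label2id`, and `decode` are illustrative and not part of the dataset.

```python
# Label names in ID order, copied from the label map.
LABELS = [
    "acknowledge", "answer", "backchannel", "reply_yes", "exclaim", "say",
    "reply_no", "hold", "ask", "intent", "ask_yes_no",
]

id2label = dict(enumerate(LABELS))
label2id = {name: idx for idx, name in id2label.items()}


def decode(predictions):
    """Map an iterable of integer class IDs to label names."""
    return [id2label[int(p)] for p in predictions]


print(decode([0, 10, 8]))
```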

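Dataset usage

Example training script: A Python script using the simpletransformers library that loads the dataset and trains a model.
Balanced training set: Because the dataset is imbalanced, preprocessing with the imbalanced-learn library is recommended. A balanced variant of the training set, oversampled with the SMOTEN algorithm, is also provided.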
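
To make the idea of oversampling concrete, here is a standard-library sketch that checks the class distribution and balances it by duplicating rows at random. Note that this is plain random oversampling, a simpler stand-in and not the SMOTEN algorithm used for the published balanced variant; the function name `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, label_key="labels", seed=0):
    """Duplicate minority-class rows until every class matches the largest one.

    Plain random oversampling for illustration only; NOT SMOTEN.
    """
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows with replacement to reach the target class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data: six rows labeled 5 (say), two rows labeled 7 (hold).
toy = [{"text_a": "a", "text_b": "b", "labels": 5}] * 6
toy += [{"text_a": "c", "text_b": "d", "labels": 7}] * 2

balanced = random_oversample(toy)
counts = Counter(row["labels"] for row in balanced)
print(counts)
```

Oversampling should be applied to the training split only, so that evaluation still reflects the original class distribution.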

License

License: MIT
