
DILAB-HYU/SimKoR

Source: Hugging Face. Updated 2022-10-18; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/DILAB-HYU/SimKoR
Resource description:
---
license: cc-by-4.0
---

# SimKoR

We provide a Korean sentence similarity pair dataset built from the sentiment analysis corpus in [bab2min/corpus](https://github.com/bab2min/corpus). The source data consists of Korean reviews crawled from the Naver Shopping website; we reconstructed a subset of it to build our dataset.

## Dataset description

The original dataset description can be found [[here]](https://github.com/bab2min/corpus/tree/master/sentiment).

![그림6](https://user-images.githubusercontent.com/54879393/189065508-240b6449-6a26-463f-bd02-64785d76fa02.png)

For Korean contrastive learning, there are few suitable validation datasets (only KorNLI). To create a validation dataset for contrastive learning, we converted the original sentiment analysis dataset into a sentence similarity dataset. The SimKoR dataset was created by grouping sentences into pairs. Each score in [0, 1, 2, 4, 5] indicates how far apart the two sentences are in meaning.

## Data Distribution

The dataset classes are the text similarity scores [0, 1, 2, 4, 5]. Each score has the same number of examples.

<table>
<tr><th>Score</th><th>train</th><th>valid</th><th>test</th></tr>
<tr><th>5</th><th>4,000</th><th>1,000</th><th>1,000</th></tr>
<tr><th>4</th><th>4,000</th><th>1,000</th><th>1,000</th></tr>
<tr><th>2</th><th>4,000</th><th>1,000</th><th>1,000</th></tr>
<tr><th>1</th><th>4,000</th><th>1,000</th><th>1,000</th></tr>
<tr><th>0</th><th>4,000</th><th>1,000</th><th>1,000</th></tr>
<tr><th>All</th><th>20,000</th><th>5,000</th><th>5,000</th></tr>
</table>

## Example

```
text1 text2 label
고속충전이 안됨ㅠㅠ 집에매연냄새없앨려했는데 그냥창문여는게더 공기가좋네요 5
적당히 맵고 괜찮네요 어제 시킨게 벌써 왔어요 ㅎㅎ 배송빠르고 품질양호합니다 4
다 괜찮은데 배송이 10일이나 걸린게 많이 아쉽네요. 선반 설치하고 나니 주방 베란다 완전 다시 태어났어요~ 2
가격 싸지만 쿠션이 약해 무릎 아파요~ 반품하려구요~ 튼튼하고 빨래도 많이 걸 수 있고 잘쓰고 있어요 1
각인이 찌그저져있고 엉성합니다. 처음 해보는 방탈출이었는데 너무 재미있었어요. 0
```

## Contributors

The main contributors of the work are:

- [Jaemin Kim](https://github.com/kimfunn)\*
- [Yohan Na](https://github.com/nayohan)\*
- [Kangmin Kim](https://github.com/Gangsss)
- [Sangrak Lee](https://github.com/PangRAK)

\*: Equal Contribution

Hanyang University Data Intelligence Lab [(DILAB)](http://dilab.hanyang.ac.kr/) provided support ❤️

## Github

- **Repository:** [SimKoR](https://github.com/nayohan/SimKoR)

## License

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
Provider:
DILAB-HYU
Summary of original information

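Dataset overview

Dataset name

  • SimKoR

Dataset source

  • The dataset is based on the sentiment analysis corpus from bab2min/corpus, which consists of Korean reviews crawled from the Naver Shopping website, and was rebuilt as a subset of that corpus.

Dataset description

  • The dataset is intended for Korean contrastive learning; it converts the original sentiment analysis dataset into a sentence similarity dataset. It was created by grouping sentences into pairs, and each pair has a similarity score from [0, 1, 2, 4, 5] that indicates how near or far the two sentences are in meaning.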
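
The source does not say how a validation score is computed on these pairs. One common convention for similarity-labelled sentence pairs is to report the Spearman rank correlation between a model's cosine similarities and the labels; the sketch below assumes that convention. The function names `cosine`, `average_ranks`, and `spearman` are hypothetical helpers written for illustration, not part of SimKoR or any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because the measure is rank-based, the labels [0, 1, 2, 4, 5] can be used as they are, without rescaling. Ties are handled with average ranks, which matters here since each score is shared by many pairs.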

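Data distribution

  • The dataset contains the text similarity scores [0, 1, 2, 4, 5], with the same number of examples for each score. The distribution is as follows:

    Score  Train   Valid  Test
    5      4,000   1,000  1,000
    4      4,000   1,000  1,000
    2      4,000   1,000  1,000
    1      4,000   1,000  1,000
    0      4,000   1,000  1,000
    Total  20,000  5,000  5,000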
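
The totals follow from five scores with equal counts per split. A minimal sketch of that arithmetic, together with a balance check that could be run on a loaded split, is shown below; `expected_total` and `is_balanced` are hypothetical helper names, and the labels are assumed to be available as a list of integers.

```python
from collections import Counter

# Scores used by the dataset; note that 3 is not among them.
SCORES = (0, 1, 2, 4, 5)

# Examples per score in each split, taken from the distribution table.
PER_SCORE = {"train": 4000, "valid": 1000, "test": 1000}


def expected_total(split):
    """Total number of pairs in a split: per-score count times number of scores."""
    return PER_SCORE[split] * len(SCORES)


def is_balanced(labels):
    """True if exactly the five scores occur and all occur equally often."""
    counts = Counter(labels)
    return set(counts) == set(SCORES) and len(set(counts.values())) == 1
```

For example, `expected_total("train")` gives 5 × 4,000 = 20,000, which matches the last row of the table.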

Example

  • The dataset contains sentence pairs with their similarity scores, for example:

    text1 text2 label
    고속충전이 안됨ㅠㅠ 집에매연냄새없앨려했는데 그냥창문여는게더 공기가좋네요 5
    적당히 맵고 괜찮네요 어제 시킨게 벌써 왔어요 ㅎㅎ 배송빠르고 품질양호합니다 4
    다 괜찮은데 배송이 10일이나 걸린게 많이 아쉽네요. 선반 설치하고 나니 주방 베란다 완전 다시 태어났어요~ 2
    가격 싸지만 쿠션이 약해 무릎 아파요~ 반품하려구요~ 튼튼하고 빨래도 많이 걸 수 있고 잘쓰고 있어요 1
    각인이 찌그저져있고 엉성합니다. 처음 해보는 방탈출이었는데 너무 재미있었어요. 0
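
Rows with the three columns shown above could be read as in the following sketch. It assumes the files are tab-separated with a `text1`, `text2`, `label` header row, which the page does not confirm. The sample sentences are placeholders rather than rows from SimKoR, and `parse_pairs` is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample in the assumed tab-separated layout; the sentences are
# placeholders, not rows from SimKoR.
SAMPLE_TSV = (
    "text1\ttext2\tlabel\n"
    "placeholder sentence A\tplaceholder sentence B\t5\n"
    "placeholder sentence C\tplaceholder sentence D\t0\n"
)

VALID_SCORES = {0, 1, 2, 4, 5}


def parse_pairs(handle):
    """Yield (text1, text2, label) tuples from a tab-separated file object."""
    # QUOTE_NONE keeps quote characters inside review text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        label = int(row["label"])
        if label not in VALID_SCORES:
            raise ValueError(f"unexpected score: {label}")
        yield row["text1"], row["text2"], label


pairs = list(parse_pairs(io.StringIO(SAMPLE_TSV)))
```

Rejecting labels outside [0, 1, 2, 4, 5] catches column misalignment early, for instance when a review itself contains a tab character.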

Contributors

  • Main contributors:
    • Jaemin Kim
    • Yohan Na
    • Kangmin Kim
    • Sangrak Lee

License
