DILAB-HYU/SimKoR
Source: Hugging Face. Updated 2022-10-18; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/DILAB-HYU/SimKoR
---
license: cc-by-4.0
---
# SimKoR
We provide a Korean sentence text similarity pair dataset built from the sentiment analysis corpus in [bab2min/corpus](https://github.com/bab2min/corpus).
The source data consists of Korean reviews crawled from the Naver Shopping website. We reconstructed a subset of that data to build our dataset.
## Dataset description
The original dataset description is available [here](https://github.com/bab2min/corpus/tree/master/sentiment).

For Korean contrastive learning, there are few suitable validation datasets (only KorNLI). To create a validation dataset for contrastive learning, we converted the original sentiment analysis dataset into a sentence text similarity dataset. SimKoR was created by grouping sentences into pairs. Each pair carries a score in [0, 1, 2, 4, 5] that describes the distance in meaning between the two sentences.
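As a rough illustration of how a scored validation set like this can be used, the sketch below computes a Spearman rank correlation between gold labels and a model's similarity scores. The metric is an assumption borrowed from common STS-style evaluation; the description above does not name one. The `model_scores` values are made up, and the sketch assumes that a higher label means closer meaning.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values that starts at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values: one gold label per pair and an invented model score.
gold_labels = [5, 4, 2, 1, 0]
model_scores = [0.91, 0.78, 0.40, 0.35, 0.05]
print(spearman(gold_labels, model_scores))
```

Because the labels take only five distinct values, real evaluation data will contain many ties, which is why the ranks are averaged within each tied group.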
## Data Distribution
The dataset classes are the text similarity scores [0, 1, 2, 4, 5]. Every score has the same number of examples in each split.
<table>
<tr><th>Score</th><th>train</th><th>valid</th><th>test</th></tr>
<tr><td>5</td><td>4,000</td><td>1,000</td><td>1,000</td></tr>
<tr><td>4</td><td>4,000</td><td>1,000</td><td>1,000</td></tr>
<tr><td>2</td><td>4,000</td><td>1,000</td><td>1,000</td></tr>
<tr><td>1</td><td>4,000</td><td>1,000</td><td>1,000</td></tr>
<tr><td>0</td><td>4,000</td><td>1,000</td><td>1,000</td></tr>
<tr><td>All</td><td>20,000</td><td>5,000</td><td>5,000</td></tr>
</table>
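The totals in the table follow directly from five scores with equal counts per split. The short sketch below reproduces that arithmetic and adds a small helper for checking that a list of labels is balanced. The helper name `is_balanced` is hypothetical and is not part of the SimKoR repository.

```python
from collections import Counter

# Scores and per-score counts as listed in the table.
SCORES = [5, 4, 2, 1, 0]
PER_SCORE = {"train": 4000, "valid": 1000, "test": 1000}

# Each split holds the per-score count once for every score.
totals = {split: count * len(SCORES) for split, count in PER_SCORE.items()}
print(totals)                # {'train': 20000, 'valid': 5000, 'test': 5000}
print(sum(totals.values()))  # 30000


def is_balanced(labels):
    """True when every score in SCORES occurs equally often in labels."""
    counts = Counter(labels)
    return set(counts) == set(SCORES) and len(set(counts.values())) == 1
```

Running `is_balanced` on the label column of a loaded split is a quick way to confirm that the copy in use matches the table.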
## Example
```
text1 text2 label
고속충전이 안됨ㅠㅠ 집에매연냄새없앨려했는데 그냥창문여는게더 공기가좋네요 5
적당히 맵고 괜찮네요 어제 시킨게 벌써 왔어요 ㅎㅎ 배송빠르고 품질양호합니다 4
다 괜찮은데 배송이 10일이나 걸린게 많이 아쉽네요. 선반 설치하고 나니 주방 베란다 완전 다시 태어났어요~ 2
가격 싸지만 쿠션이 약해 무릎 아파요~ 반품하려구요~ 튼튼하고 빨래도 많이 걸 수 있고 잘쓰고 있어요 1
각인이 찌그저져있고 엉성합니다. 처음 해보는 방탈출이었는데 너무 재미있었어요. 0
```
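The rows above can be read into `(text1, text2, label)` tuples with a few lines of Python. This is a minimal sketch that assumes the data is stored as tab-separated text with a header row, which the example suggests but does not confirm. The two sample rows are taken from the example, with the boundary between `text1` and `text2` placed where the first sentence appears to end.

```python
import csv
import io

# Two rows from the example, written as tab-separated text (assumed format).
sample = (
    "text1\ttext2\tlabel\n"
    "다 괜찮은데 배송이 10일이나 걸린게 많이 아쉽네요.\t"
    "선반 설치하고 나니 주방 베란다 완전 다시 태어났어요~\t2\n"
    "각인이 찌그저져있고 엉성합니다.\t"
    "처음 해보는 방탈출이었는데 너무 재미있었어요.\t0\n"
)


def read_pairs(handle):
    """Parse tab-separated rows into (text1, text2, label) tuples."""
    # QUOTE_NONE keeps quotation marks inside reviews as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["text1"], row["text2"], int(row["label"])) for row in reader]


pairs = read_pairs(io.StringIO(sample))
for text1, text2, label in pairs:
    print(label, text1, "|", text2)
```

For a real file, the same function can be called on a handle opened with `open(path, encoding="utf-8", newline="")`.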
## Contributors
The main contributors to this work are:
- [Jaemin Kim](https://github.com/kimfunn)\*
- [Yohan Na](https://github.com/nayohan)\*
- [Kangmin Kim](https://github.com/Gangsss)
- [Sangrak Lee](https://github.com/PangRAK)
\*: Equal Contribution
Supported by the Hanyang University Data Intelligence Lab ([DILAB](http://dilab.hanyang.ac.kr/)) ❤️
## Github
- **Repository:** [SimKoR](https://github.com/nayohan/SimKoR)
## License
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
Provider: DILAB-HYU



