
jaimevera1107/similarity-sentences-spanish

Source: Hugging Face. Updated: 2023-07-24. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jaimevera1107/similarity-sentences-spanish
Resource description:
---
license: mit
task_categories:
- sentence-similarity
language:
- es
size_categories:
- 10K<n<100K
pretty_name: SimilaritySpanishDataset
---

# similarity-sentences-spanish (SSS)

### Dataset Summary

This dataset comprises a collection of sentences generated using Chat GPT-3, covering various general topics. It also includes sentences from two existing datasets, STS-ES and STSB-Multi-MT, as well as SICK, which were used as additional sources. The generated sentences exhibit varying levels of similarity based on randomly divided prompts.

| **Source** | **Share (rows)** | **Count (rows)** | **Score (avg)** |
|------------|------------------|------------------|-----------------|
| GPT        | 22.71 %          | 3982             | 0.50            |
| STBS       | 49.21 %          | 8628             | 0.53            |
| STS        | 17.69 %          | 3102             | 0.42            |
| SICK       | 10.38 %          | 1820             | 0.51            |
| **Total**  | 100 %            | 17532            | 0.49            |

### Objective

Chat GPT-3 was used to generate diverse text samples covering various topics and to ensure a balanced distribution of scores, both overall and across themes. The dataset aims to provide a wide range of sentence pairs with varying degrees of similarity for further analysis and research.

### Languages

Spanish

## Dataset Structure

### Data Fields

- Sentence 1: The first sentence to be compared.
- Sentence 2: The second sentence to be compared.
- Score: A number between 0 and 1 indicating the similarity between Sentence 1 and Sentence 2, with 1 indicating high similarity.
- Source: The source of the information, represented by its abbreviation.

## Dataset Biases

This dataset inherits the biases present in the existing datasets it draws on and the biases inherent in a text generation model like Chat GPT-3.

### Source Data

The dataset was created using the following sources:

1. Already existing datasets:
   - STS-ES ([STS](https://huggingface.co/datasets/PlanTL-GOB-ES/sts-es))
   - STSB-Multi-MT ([STSB](https://huggingface.co/datasets/stsb_multi_mt))
2. Newly generated data:
   - Chat GPT-3: The sentences were generated using Chat GPT-3 for various general topics.

The dataset includes sentences from various themes, such as:

- Alimentación y Cocina (Food and Cooking)
- Arte y Cultura (Art and Culture)
- Ciencia y Tecnología (Science and Technology)
- Cine y Televisión (Film and Television)
- Deportes (Sports)
- Economía (Economy)
- Educación (Education)
- Estadística (Statistics)
- Filosofía (Philosophy)
- Finanzas (Finance)
- Historia (History)
- Literatura (Literature)
- Medicina (Medicine)
- Medio Ambiente y Sostenibilidad (Environment and Sustainability)
- Moda y Estilo (Fashion and Style)
- Música (Music)
- Organizacional (Organizational)
- Política y Gobierno (Politics and Government)
- Psicología (Psychology)
- Religión y Espiritualidad (Religion and Spirituality)
- Salud y Bienestar (Health and Wellness)

Please note that these themes are not exhaustive.

The prompts for each label (score) are as follows:

```python
# Prompt used for each target similarity label (score), keyed by the label as a string.
descripciones_similaridad = {
    "0.0": "Rewrite the following sentence in a new sentence about a completely different topic, without any apparent connection to the original sentence. The two sentences must be completely distinct and should not share any thematic similarity.",
    "0.1": "Rewrite the following sentence in a new sentence about a topic completely different from the original sentence. Make sure the two sentences are entirely different and do not share any thematic similarity. At least 90% of the information level should change.",
    "0.2": "Rewrite the following sentence in a new sentence about the same topic as the original sentence, but not an exact copy. You can express different ideas, but the general theme should be similar. Ensure at least 80% of the information level is different.",
    "0.3": "Rewrite the following sentence in a new sentence about a topic related to the original sentence, though not equivalent. Both sentences must share a common theme or general idea, but they can express different viewpoints. At least 70% of the information level should change.",
    "0.4": "Rewrite the following sentence in a new sentence that is not equivalent to the original, but has some similar details or elements. Ensure at least 60% of the information level is different.",
    "0.5": "Rewrite the following sentence in a new sentence that is not equivalent to the original, but is related to some extent. Both sentences should have some details in common and be thematically related at least 50% of the information level.",
    "0.6": "Rewrite the following sentence in a new sentence that is approximately equivalent to the original, but may differ in important information or have certain missing elements. The changes should slightly affect the meaning, and at least 60% of the information level should be preserved.",
    "0.7": "Rewrite the following sentence in a new sentence that is approximately equivalent to the original, but may differ in important information or have some missing elements. Ensure at least 70% of the information level remains the same.",
    "0.8": "Rewrite the following sentence in a new sentence that is mostly equivalent to the original, but may differ in some unimportant details. The changes should affect a maximum of 20% of the information level.",
    "0.9": "Rewrite the following sentence in a new sentence that is nearly equivalent to the original, but may have some differences in minor details that do not significantly impact its meaning. The changes should affect a maximum of 10% of the information level.",
    "1.0": "Rewrite the following sentence in a new sentence that is completely equivalent to the original, as they express exactly the same idea or meaning. The two sentences must share 100% of the information level.",
}
```

- SICK ([SICK Dataset](https://huggingface.co/datasets/sick))

The dataset also includes sentences translated and sampled from the SICK dataset, using Helsinki ([helsinki - EN -ES](https://huggingface.co/datasets/sick)) as the translation tool, so that the average score of the entire dataset stays close to 0.5. This translated data was not originally written in Spanish and has not been reviewed in Spanish, so it is kept from becoming too prominent: the intention is to maintain a balanced representation with scores generally centered around 0.5.
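One way to picture sampling toward an average near 0.5 is to draw equal numbers of pairs from score bins that mirror each other around 0.5. The sketch below does that under stated assumptions: the function name `sample_centered`, the 0.1-wide bins, and the tuple layout are hypothetical, and the card does not describe the procedure actually used for SICK.

```python
import random
from collections import defaultdict


def sample_centered(pairs, per_bin, seed=0):
    """Sample pairs so that mirrored score bins contribute equally.

    `pairs` is a list of (sentence_1, sentence_2, score) tuples with
    0 <= score <= 1. This is an illustrative helper, not the author's method.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for pair in pairs:
        bins[round(pair[2] * 10)].append(pair)  # bin index 0..10

    sample = []
    for low in range(0, 6):
        high = 10 - low
        # Same count from a bin and its mirror keeps the mean near 0.5.
        k = min(per_bin, len(bins[low]), len(bins[high]))
        sample.extend(rng.sample(bins[low], k))
        if high != low:
            sample.extend(rng.sample(bins[high], k))
    return sample


# Toy input skewed toward low scores (placeholder sentences, not real data).
toy_pairs = [("a", "b", i / 10) for i in range(11) for _ in range(12 - i)]
balanced = sample_centered(toy_pairs, per_bin=5)
mean_score = sum(score for _, _, score in balanced) / len(balanced)
print(len(balanced), round(mean_score, 2))
```

Mirroring the bins trades sample size for balance: the rarest side of each bin pair caps how many rows can be kept.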
Provider:
jaimevera1107
Summary of the Original Information

Dataset Overview

Basic Information

  • License: MIT
  • Task category: sentence similarity
  • Language: Spanish
  • Dataset size: 10,000 < n < 100,000
  • Dataset name: SimilaritySpanishDataset

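Dataset Contents

  • Data sources:
    • GPT: sentences generated with Chat GPT-3; 22.71% of rows (3,982 rows), average similarity 0.50.
    • STBS: from the STSB-Multi-MT dataset; 49.21% of rows (8,628 rows), average similarity 0.53.
    • STS: from the STS-ES dataset; 17.69% of rows (3,102 rows), average similarity 0.42.
    • SICK: from the SICK dataset; 10.38% of rows (1,820 rows), average similarity 0.51.
  • Total rows: 17,532, average similarity 0.49.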
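The shares can be checked directly from the row counts. The short sketch below recomputes them from the figures listed above; the variable names are illustrative. Note that the row-weighted mean of the rounded per-source averages comes out near 0.50, slightly above the reported 0.49, and the card does not explain the difference.

```python
# Row counts and rounded average scores per source, copied from the figures above.
sources = {
    "GPT": (3982, 0.50),
    "STBS": (8628, 0.53),
    "STS": (3102, 0.42),
    "SICK": (1820, 0.51),
}

total_rows = sum(count for count, _ in sources.values())

# Share of rows per source, in percent, rounded to two decimals.
shares = {
    name: round(100 * count / total_rows, 2)
    for name, (count, _) in sources.items()
}

# Row-weighted mean of the (already rounded) per-source averages.
weighted_mean = sum(count * avg for count, avg in sources.values()) / total_rows

print(total_rows)
print(shares)
print(round(weighted_mean, 4))
```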

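Dataset Objective

  • Purpose: generate diverse text samples covering many topics and ensure a balanced distribution of similarity scores.
  • Application: provide sentence pairs with varying degrees of similarity for analysis and research.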

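Dataset Structure

  • Data fields:
    • Sentence 1: the first sentence to be compared.
    • Sentence 2: the second sentence to be compared.
    • Score: a similarity score between 0 and 1, with 1 indicating high similarity.
    • Source: the abbreviation of the data source.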
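A record with these four fields can be modeled and validated as in the minimal sketch below. The class name `SentencePair`, the attribute names, and the helper `filter_by_source` are hypothetical; the actual column names in the published files may differ, and the example sentences are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One row: two sentences, a similarity score in [0, 1], and a source tag."""

    sentence_1: str
    sentence_2: str
    score: float
    source: str

    def __post_init__(self):
        # Reject rows that fall outside the documented score range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")
        if not self.sentence_1 or not self.sentence_2:
            raise ValueError("both sentences must be non-empty")


def filter_by_source(pairs, source):
    """Return only the pairs whose source abbreviation matches `source`."""
    return [pair for pair in pairs if pair.source == source]


# Made-up rows for illustration only.
rows = [
    SentencePair("El gato duerme.", "El gato está durmiendo.", 0.9, "GPT"),
    SentencePair("Llueve mucho hoy.", "El equipo ganó el partido.", 0.0, "STS"),
]
print(filter_by_source(rows, "GPT"))
```

Validating at construction time means a malformed score fails immediately instead of silently skewing averages later.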

Dataset Biases

  • Source of bias: inherited from the existing datasets and from the inherent biases of the Chat GPT-3 text generation model.

Dataset Topics

  • Topics covered: a broad range of areas, including food and cooking, art and culture, and science and technology.

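Dataset Generation Method

  • Generation tool: Chat GPT-3 was used to generate sentences across different topics.
  • Similarity descriptions: sentences were generated according to similarity labels (0.0 to 1.0), each prescribing how much of the information content should differ from the original.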
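The label-to-prompt step might look like the sketch below, which snaps a target score to the nearest 0.1 label and prepends the matching instruction to a seed sentence. The helper names `label_for` and `build_prompt` and the way the sentence is appended are assumptions; the card lists the prompt texts but not how they were combined with the input. Only three of the eleven prompts are reproduced here.

```python
# Excerpt of the card's prompt table (three of the eleven labels).
PROMPTS = {
    "0.0": "Rewrite the following sentence in a new sentence about a completely different topic, without any apparent connection to the original sentence. The two sentences must be completely distinct and should not share any thematic similarity.",
    "0.5": "Rewrite the following sentence in a new sentence that is not equivalent to the original, but is related to some extent. Both sentences should have some details in common and be thematically related at least 50% of the information level.",
    "1.0": "Rewrite the following sentence in a new sentence that is completely equivalent to the original, as they express exactly the same idea or meaning. The two sentences must share 100% of the information level.",
}


def label_for(score):
    """Snap a score in [0, 1] to the nearest 0.1 label, formatted like the card's keys."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return f"{round(score * 10) / 10:.1f}"


def build_prompt(sentence, score, prompts=PROMPTS):
    """Combine the instruction for the target label with the seed sentence.

    The "Sentence:" layout is an assumption made for this example.
    """
    instruction = prompts[label_for(score)]
    return f"{instruction}\n\nSentence: {sentence}"


print(build_prompt("El gato duerme en el sofá.", 0.52))
```

With the excerpted table, a score that maps to a label outside `"0.0"`, `"0.5"`, and `"1.0"` raises `KeyError`; passing the full eleven-entry dictionary as `prompts` avoids that.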

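Dataset Translation and Sampling

  • Translation tool: Helsinki was used to translate the SICK dataset, which was then sampled so that the overall average similarity stays close to 0.5.
  • Balancing strategy: avoid giving excessive prominence to translated data that was not originally written in Spanish and has not been reviewed in Spanish.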