
eduagarcia/portuguese_benchmark

Hugging Face, updated 2024-04-30, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/eduagarcia/portuguese_benchmark
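Resource overview:
Portuguese Benchmark is a collection of 10 sub-datasets intended for training and evaluating supervised language models such as BERT and RoBERTa. The datasets cover 18 tasks spanning classification (CLS), natural language inference (NLI), semantic textual similarity scoring (STS), and named entity recognition (NER). Each sub-dataset comes with detailed configuration information: features, labels, data splits (train, validation, test), the dataset size, and the download size.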
Provider:
eduagarcia
Summary of original information

Dataset overview

1. HateBR_offensive_binary

  • Config name: HateBR_offensive_binary
  • Features:
    • idx: int32
    • sentence: string
    • label:
      • 0: non-offensive
      • 1: offensive
  • Splits:
    • train: 4480 examples, 416208 bytes
    • validation: 1120 examples, 94237 bytes
    • test: 1400 examples, 116658 bytes
  • Download size: 411947 bytes
  • Dataset size: 627103 bytes
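
The split sizes listed above can be sanity-checked, and the configuration loaded, roughly as follows. This is a minimal sketch, assuming the Hugging Face `datasets` package is installed and the repository is reachable; `split_fractions` and `load_config` are hypothetical helper names, and the counts are copied from the listing above.

```python
# Example counts copied from the HateBR_offensive_binary listing above.
HATEBR_SPLITS = {"train": 4480, "validation": 1120, "test": 1400}


def split_fractions(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def load_config(config_name="HateBR_offensive_binary"):
    """Load one configuration from the Hub (needs network access).

    Assumes the `datasets` package is installed; it is imported lazily
    so the rest of this sketch runs without it.
    """
    from datasets import load_dataset

    return load_dataset("eduagarcia/portuguese_benchmark", config_name)


if __name__ == "__main__":
    for name, share in split_fractions(HATEBR_SPLITS).items():
        print(f"{name}: {share:.0%}")
```

With these counts the train, validation, and test splits hold 64%, 16%, and 20% of the 7000 examples.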

2. HateBR_offensive_level

  • Config name: HateBR_offensive_level
  • Features:
    • idx: int32
    • sentence: string
    • label:
      • 0: non-offensive
      • 1: slightly
      • 2: moderately
      • 3: highly
  • Splits:
    • train: 4480 examples, 416208 bytes
    • validation: 1120 examples, 94237 bytes
    • test: 1400 examples, 116658 bytes
  • Download size: 413064 bytes
  • Dataset size: 627103 bytes

3. LeNER-Br

  • Config name: LeNER-Br
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-ORGANIZACAO
      • 2: I-ORGANIZACAO
      • 3: B-PESSOA
      • 4: I-PESSOA
      • 5: B-TEMPO
      • 6: I-TEMPO
      • 7: B-LOCAL
      • 8: I-LOCAL
      • 9: B-LEGISLACAO
      • 10: I-LEGISLACAO
      • 11: B-JURISPRUDENCIA
      • 12: I-JURISPRUDENCIA
  • Splits:
    • train: 7825 examples, 3953896 bytes
    • validation: 1177 examples, 715819 bytes
    • test: 1390 examples, 819242 bytes
  • Download size: 1049906 bytes
  • Dataset size: 5488957 bytes
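
The `ner_tags` ids above follow a BIO scheme, so turning a tagged sentence back into entity spans is a small grouping step. The sketch below is one possible decoder, not the dataset author's code; `decode_entities` is a hypothetical helper, the example tokens are made up for illustration, and treating a stray `I-` tag as the start of a new span is a lenient choice made here.

```python
# Label names in id order, copied from the LeNER-Br listing above.
LENER_LABELS = [
    "O",
    "B-ORGANIZACAO", "I-ORGANIZACAO",
    "B-PESSOA", "I-PESSOA",
    "B-TEMPO", "I-TEMPO",
    "B-LOCAL", "I-LOCAL",
    "B-LEGISLACAO", "I-LEGISLACAO",
    "B-JURISPRUDENCIA", "I-JURISPRUDENCIA",
]


def decode_entities(tokens, tag_ids, labels=LENER_LABELS):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, entity_type = labels[tag_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # B- opens a span; a stray I- is also treated as an opening tag.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


if __name__ == "__main__":
    # Illustrative tokens and tags, not taken from the dataset.
    tokens = ["Lei", "8.666", "de", "1993", ",", "STF"]
    tags = [9, 10, 0, 5, 0, 1]
    print(decode_entities(tokens, tags))
```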

4. Portuguese_Hate_Speech_binary

  • Config name: Portuguese_Hate_Speech_binary
  • Features:
    • idx: int32
    • sentence: string
    • label:
      • 0: no-hate
      • 1: hate
  • Splits:
    • train: 3969 examples, 473248 bytes
    • validation: 850 examples, 101358 bytes
    • test: 851 examples, 101242 bytes
  • Download size: 482467 bytes
  • Dataset size: 675848 bytes

5. UlyssesNER-Br-C-coarse

  • Config name: UlyssesNER-Br-C-coarse
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-DATA
      • 2: I-DATA
      • 3: B-EVENTO
      • 4: I-EVENTO
      • 5: B-FUNDAMENTO
      • 6: I-FUNDAMENTO
      • 7: B-LOCAL
      • 8: I-LOCAL
      • 9: B-ORGANIZACAO
      • 10: I-ORGANIZACAO
      • 11: B-PESSOA
      • 12: I-PESSOA
      • 13: B-PRODUTODELEI
      • 14: I-PRODUTODELEI
  • Splits:
    • train: 679 examples, 1051410 bytes
    • validation: 146 examples, 225883 bytes
    • test: 147 examples, 226764 bytes
  • Download size: 301821 bytes
  • Dataset size: 1504057 bytes

6. UlyssesNER-Br-C-fine

  • Config name: UlyssesNER-Br-C-fine
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-DATA
      • 2: I-DATA
      • 3: B-EVENTO
      • 4: I-EVENTO
      • 5: B-FUNDapelido
      • 6: I-FUNDapelido
      • 7: B-FUNDlei
      • 8: I-FUNDlei
      • 9: B-FUNDprojetodelei
      • 10: I-FUNDprojetodelei
      • 11: B-LOCALconcreto
      • 12: I-LOCALconcreto
      • 13: B-LOCALvirtual
      • 14: I-LOCALvirtual
      • 15: B-ORGgovernamental
      • 16: I-ORGgovernamental
      • 17: B-ORGnaogovernamental
      • 18: I-ORGnaogovernamental
      • 19: B-ORGpartido
      • 20: I-ORGpartido
      • 21: B-PESSOAcargo
      • 22: I-PESSOAcargo
      • 23: B-PESSOAgrupocargo
      • 24: I-PESSOAgrupocargo
      • 25: B-PESSOAgrupoind
      • 26: I-PESSOAgrupoind
      • 27: B-PESSOAindividual
      • 28: I-PESSOAindividual
      • 29: B-PRODUTOoutros
      • 30: I-PRODUTOoutros
      • 31: B-PRODUTOprograma
      • 32: I-PRODUTOprograma
      • 33: B-PRODUTOsistema
      • 34: I-PRODUTOsistema
  • Splits:
    • train: 679 examples, 1051410 bytes
    • validation: 146 examples, 225883 bytes
    • test: 147 examples, 226764 bytes
  • Download size: 305985 bytes
  • Dataset size: 1504057 bytes

7. UlyssesNER-Br-PL-coarse

  • Config name: UlyssesNER-Br-PL-coarse
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-DATA
      • 2: I-DATA
      • 3: B-EVENTO
      • 4: I-EVENTO
      • 5: B-FUNDAMENTO
      • 6: I-FUNDAMENTO
      • 7: B-LOCAL
      • 8: I-LOCAL
      • 9: B-ORGANIZACAO
      • 10: I-ORGANIZACAO
      • 11: B-PESSOA
      • 12: I-PESSOA
      • 13: B-PRODUTODELEI
      • 14: I-PRODUTODELEI
  • Splits:
    • train: 2271 examples, 1511905 bytes
    • validation: 489 examples, 305472 bytes
    • test: 524 examples, 363207 bytes
  • Download size: 431964 bytes
  • Dataset size: 2180584 bytes

8. UlyssesNER-Br-PL-fine

  • Config name: UlyssesNER-Br-PL-fine
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-DATA
      • 2: I-DATA
      • 3: B-EVENTO
      • 4: I-EVENTO
      • 5: B-FUNDapelido
      • 6: I-FUNDapelido
      • 7: B-FUNDlei
      • 8: I-FUNDlei
      • 9: B-FUNDprojetodelei
      • 10: I-FUNDprojetodelei
      • 11: B-LOCALconcreto
      • 12: I-LOCALconcreto
      • 13: B-LOCALvirtual
      • 14: I-LOCALvirtual
      • 15: B-ORGgovernamental
      • 16: I-ORGgovernamental
      • 17: B-ORGnaogovernamental
      • 18: I-ORGnaogovernamental
      • 19: B-ORGpartido
      • 20: I-ORGpartido
      • 21: B-PESSOAcargo
      • 22: I-PESSOAcargo
      • 23: B-PESSOAgrupocargo
      • 24: I-PESSOAgrupocargo
      • 25: B-PESSOAindividual
      • 26: I-PESSOAindividual
      • 27: B-PRODUTOoutros
      • 28: I-PRODUTOoutros
      • 29: B-PRODUTOprograma
      • 30: I-PRODUTOprograma
      • 31: B-PRODUTOsistema
      • 32: I-PRODUTOsistema
  • Splits:
    • train: 2271 examples, 1511905 bytes
    • validation: 489 examples, 305472 bytes
    • test: 524 examples, 363207 bytes
  • Download size: 437232 bytes
  • Dataset size: 2180584 bytes
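
The two fine-grained UlyssesNER-Br listings are not id-compatible: UlyssesNER-Br-C-fine has 35 labels (ids 0 to 34) and includes PESSOAgrupoind, while UlyssesNER-Br-PL-fine has 33 labels (ids 0 to 32) and omits it, so every id from 25 upward names a different label in each configuration. The sketch below rebuilds both label lists from the listings and derives an id remapping by label name; `bio_labels` and `PL_TO_C` are hypothetical names introduced for illustration.

```python
def bio_labels(entity_types):
    """Build the id-ordered label list: O, then a B-/I- pair per type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels


# Entity types in id order, copied from the C-fine listing above.
C_FINE_TYPES = [
    "DATA", "EVENTO", "FUNDapelido", "FUNDlei", "FUNDprojetodelei",
    "LOCALconcreto", "LOCALvirtual", "ORGgovernamental",
    "ORGnaogovernamental", "ORGpartido", "PESSOAcargo",
    "PESSOAgrupocargo", "PESSOAgrupoind", "PESSOAindividual",
    "PRODUTOoutros", "PRODUTOprograma", "PRODUTOsistema",
]
# The PL-fine listing has the same types, in the same order, minus PESSOAgrupoind.
PL_FINE_TYPES = [t for t in C_FINE_TYPES if t != "PESSOAgrupoind"]

C_FINE = bio_labels(C_FINE_TYPES)    # 35 labels, ids 0..34
PL_FINE = bio_labels(PL_FINE_TYPES)  # 33 labels, ids 0..32

# Map each PL-fine id to the id carrying the same label name in C-fine.
PL_TO_C = {i: C_FINE.index(name) for i, name in enumerate(PL_FINE)}

if __name__ == "__main__":
    shifted = {i: j for i, j in PL_TO_C.items() if i != j}
    print(f"{len(shifted)} of {len(PL_FINE)} PL-fine ids differ in C-fine")
```

Ids 0 to 24 agree across the two configurations; ids 25 to 32 in PL-fine correspond to ids 27 to 34 in C-fine.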

9. assin2-rte

  • Config name: assin2-rte
  • Features:
    • idx: int32
    • sentence1: string
    • sentence2: string
    • label:
      • 0: NONE
      • 1: ENTAILMENT
  • Splits:
    • train: 6500 examples, 811995 bytes
    • validation: 500 examples, 62824 bytes
    • test: 2448 examples, 319682 bytes
  • Download size: 551190 bytes
  • Dataset size: 1194501 bytes

10. assin2-sts

  • Config name: assin2-sts
  • Features:
    • idx: int32
    • sentence1: string
    • sentence2: string
    • label: float32
  • Splits:
    • train: 6500 examples, 785995 bytes
    • validation: 500 examples, 60824 bytes
    • test: 2448 examples, 309890 bytes
  • Download size: 560263 bytes
  • Dataset size: 1156709 bytes
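
Because the assin2-sts label is a float32 score rather than a class id, this configuration is a regression task. The listing does not say how predictions should be scored; the sketch below assumes Pearson correlation and mean squared error, two metrics commonly reported for semantic similarity. The function names are hypothetical and the scores in the usage example are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    """Mean squared error between two equal-length score lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    gold = [1.0, 2.5, 4.0, 5.0]       # illustrative gold scores
    predicted = [1.5, 2.0, 4.5, 4.5]  # illustrative model outputs
    print(f"pearson={pearson(gold, predicted):.3f} mse={mse(gold, predicted):.3f}")
```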

11. brazilian_court_decisions_judgment

  • Config name: brazilian_court_decisions_judgment
  • Features:
    • idx: int32
    • sentence: string
    • label:
      • 0: no
      • 1: partial
      • 2: yes
  • Splits:
    • train: 3234 examples, 2779679 bytes
    • validation: 404 examples, 351504 bytes
    • test: 405 examples, 346499 bytes
  • Download size: 1956183 bytes
  • Dataset size: 3477682 bytes

12. brazilian_court_decisions_unanimity

  • Config name: brazilian_court_decisions_unanimity
  • Features:
    • idx: int32
    • sentence: string
    • label:
      • 0: unanimity
      • 1: not-unanimity
  • Splits:
    • train: 1715 examples, 1564695 bytes
    • validation: 211 examples, 197865 bytes
    • test: 204 examples, 193928 bytes
  • Download size: 1069780 bytes
  • Dataset size: 1956488 bytes

13. harem-default

  • Config name: harem-default
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-PESSOA
      • 2: I-PESSOA
      • 3: B-ORGANIZACAO
      • 4: I-ORGANIZACAO
      • 5: B-LOCAL
      • 6: I-LOCAL
      • 7: B-TEMPO
      • 8: I-TEMPO
      • 9: B-VALOR
      • 10: I-VALOR
      • 11: B-ABSTRACCAO
      • 12: I-ABSTRACCAO
      • 13: B-ACONTECIMENTO
      • 14: I-ACONTECIMENTO
      • 15: B-COISA
      • 16: I-COISA
      • 17: B-OBRA
      • 18: I-OBRA
      • 19: B-OUTRO
      • 20: I-OUTRO
  • Splits:
    • train: 121 examples, 1504542 bytes
    • validation: 8 examples, 51182 bytes
    • test: 128 examples, 1060778 bytes
  • Download size: 540547 bytes
  • Dataset size: 2616502 bytes

14. harem-selective

  • Config name: harem-selective
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-PESSOA
      • 2: I-PESSOA
      • 3: B-ORGANIZACAO
      • 4: I-ORGANIZACAO
      • 5: B-LOCAL
      • 6: I-LOCAL
      • 7: B-TEMPO
      • 8: I-TEMPO
      • 9: B-VALOR
      • 10: I-VALOR
  • Splits:
    • train: 121 examples, 1504542 bytes
    • validation: 8 examples, 51182 bytes
    • test: 128 examples, 1060778 bytes
  • Download size: 531807 bytes
  • Dataset size: 2616502 bytes

15. mapa_pt_coarse

  • Config name: mapa_pt_coarse
  • Features:
    • idx: int32
    • tokens: sequence of string
    • ner_tags: sequence of
      • 0: O
      • 1: B-ADDRESS
      • 2: I-ADDRESS
      • 3: B-AMOUNT
      • 4: I-AMOUNT
      • 5: B-DATE
      • 6: I-DATE
      • 7: B-ORGANISATION
      • 8: I-ORGANISATION
      • 9: B-PERSON
      • 10: I-PERSON
      • 11: B-TIME
      • 12: