
clarin-pl/bprec

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/clarin-pl/bprec
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- pl
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-retrieval
task_ids:
- entity-linking-retrieval
pretty_name: bprec
dataset_info:
- config_name: default
  features:
  - name: id
    dtype: int32
  - name: text
    dtype: string
  - name: ner
    sequence:
    - name: source
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
    - name: target
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
  splits:
  - name: tele
    num_bytes: 2739015
    num_examples: 2391
  - name: electro
    num_bytes: 125999
    num_examples: 382
  - name: cosmetics
    num_bytes: 1565263
    num_examples: 2384
  - name: banking
    num_bytes: 446944
    num_examples: 561
  download_size: 8006167
  dataset_size: 4877221
- config_name: all
  features:
  - name: id
    dtype: int32
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: ner
    sequence:
    - name: source
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
    - name: target
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
  splits:
  - name: train
    num_bytes: 4937658
    num_examples: 5718
  download_size: 8006167
  dataset_size: 4937658
- config_name: tele
  features:
  - name: id
    dtype: int32
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: ner
    sequence:
    - name: source
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
    - name: target
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
  splits:
  - name: train
    num_bytes: 2758147
    num_examples: 2391
  download_size: 4569708
  dataset_size: 2758147
- config_name: electro
  features:
  - name: id
    dtype: int32
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: ner
    sequence:
    - name: source
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
    - name: target
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
  splits:
  - name: train
    num_bytes: 130205
    num_examples: 382
  download_size: 269917
  dataset_size: 130205
- config_name: cosmetics
  features:
  - name: id
    dtype: int32
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: ner
    sequence:
    - name: source
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
    - name: target
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
  splits:
  - name: train
    num_bytes: 1596259
    num_examples: 2384
  download_size: 2417388
  dataset_size: 1596259
- config_name: banking
  features:
  - name: id
    dtype: int32
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: ner
    sequence:
    - name: source
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
    - name: target
      struct:
      - name: from
        dtype: int32
      - name: text
        dtype: string
      - name: to
        dtype: int32
      - name: type
        dtype:
          class_label:
            names:
              '0': PRODUCT_NAME
              '1': PRODUCT_NAME_IMP
              '2': PRODUCT_NO_BRAND
              '3': BRAND_NAME
              '4': BRAND_NAME_IMP
              '5': VERSION
              '6': PRODUCT_ADJ
              '7': BRAND_ADJ
              '8': LOCATION
              '9': LOCATION_IMP
  splits:
  - name: train
    num_bytes: 453119
    num_examples: 561
  download_size: 749154
  dataset_size: 453119
---

# Dataset Card for bprec

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [bprec homepage](https://clarin-pl.eu/dspace/handle/11321/736)
- **Repository:** [bprec repository](https://gitlab.clarin-pl.eu/team-semantics/semrel-extraction)
- **Paper:** [bprec paper](https://www.aclweb.org/anthology/2020.lrec-1.233.pdf)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Brand-Product Relation Extraction Corpora in Polish

### Supported Tasks and Leaderboards

NER, Entity linking

### Languages

Polish

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

- id: int identifier of a text
- text: string text, for example a consumer comment on social media
- ner: extracted entities and their relationship
  - source and target: a pair of entities identified in the text
    - from: int value representing the starting character of the entity
    - text: string value with the entity text
    - to: int value representing the end character of the entity
    - type: one of the pre-identified entity types:
      - PRODUCT_NAME
      - PRODUCT_NAME_IMP
      - PRODUCT_NO_BRAND
      - BRAND_NAME
      - BRAND_NAME_IMP
      - VERSION
      - PRODUCT_ADJ
      - BRAND_ADJ
      - LOCATION
      - LOCATION_IMP

### Data Splits

No train/validation/test split is provided. The current dataset configurations point to 4 domain categories for the texts:
- tele
- electro
- cosmetics
- banking

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
@inproceedings{inproceedings,
  author = {Janz, Arkadiusz and Kopociński, Łukasz and Piasecki, Maciej and Pluwak, Agnieszka},
  year = {2020},
  month = {05},
  pages = {},
  title = {Brand-Product Relation Extraction Using Heterogeneous Vector Space Representations}
}
```

### Contributions

Thanks to [@kldarek](https://github.com/kldarek) for adding this dataset.
Provider:
clarin-pl
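Summary of Original Information

Dataset Overview

Basic Information

  • Name: bprec
  • Language: Polish (pl)
  • License: unknown
  • Multilinguality: monolingual
  • Size category: 1K<n<10K
  • Source data: original
  • Task category: text retrieval
  • Task ID: entity-linking-retrieval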

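Dataset Structure

  • Configuration name: default
    • Features:
      • id: integer, identifier of a text
      • text: string, for example a consumer comment on social media
      • ner: extracted entities and their relationships
        • source and target: a pair of entities identified in the text
          • from: integer, starting character position of the entity
          • text: string, the entity text
          • to: integer, end character position of the entity
          • type: one of the predefined entity types
            • PRODUCT_NAME
            • PRODUCT_NAME_IMP
            • PRODUCT_NO_BRAND
            • BRAND_NAME
            • BRAND_NAME_IMP
            • VERSION
            • PRODUCT_ADJ
            • BRAND_ADJ
            • LOCATION
            • LOCATION_IMP
    • Data splits:
      • tele: 2391 examples, 2739015 bytes
      • electro: 382 examples, 125999 bytes
      • cosmetics: 2384 examples, 1565263 bytes
      • banking: 561 examples, 446944 bytes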
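
The nested `ner` structure can be awkward to read directly, so here is a minimal sketch of how the source/target pairs might be unpacked and the integer `type` ids mapped back to label names. It assumes a column-oriented record layout (one list per field inside `source` and `target`), which is how a `sequence` of structs is commonly materialized; that layout, the `Entity` class, the `entity_pairs` helper, and the sample record are all illustrative assumptions and are not taken from the corpus. The `from` and `to` offsets are carried through unchanged, since the card does not say whether `to` is inclusive.

```python
from dataclasses import dataclass

# Label order copied from the class_label names listed in the dataset card.
ENTITY_TYPES = [
    "PRODUCT_NAME", "PRODUCT_NAME_IMP", "PRODUCT_NO_BRAND", "BRAND_NAME",
    "BRAND_NAME_IMP", "VERSION", "PRODUCT_ADJ", "BRAND_ADJ",
    "LOCATION", "LOCATION_IMP",
]


@dataclass(frozen=True)
class Entity:
    start: int  # value of the `from` field
    end: int    # value of the `to` field
    text: str
    label: str


def _entities(side):
    """Convert one column-oriented side (source or target) into Entity objects."""
    return [
        Entity(start, end, text, ENTITY_TYPES[type_id])
        for start, end, text, type_id in zip(
            side["from"], side["to"], side["text"], side["type"]
        )
    ]


def entity_pairs(record):
    """Return the list of (source, target) entity pairs for one record."""
    ner = record["ner"]
    return list(zip(_entities(ner["source"]), _entities(ner["target"])))


# Hypothetical record, invented for illustration (assumed layout and values).
sample = {
    "id": 0,
    "text": "Nowy telefon Samsung Galaxy S10.",
    "ner": {
        "source": {"from": [13], "text": ["Samsung"], "to": [20], "type": [3]},
        "target": {"from": [21], "text": ["Galaxy"], "to": [27], "type": [0]},
    },
}

pairs = entity_pairs(sample)
for source, target in pairs:
    print(f"{source.text} ({source.label}) -> {target.text} ({target.label})")
```

If real records turn out to be row-oriented (a list of source/target dicts), only `entity_pairs` would need to change; the label lookup stays the same.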

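Dataset Creation

  • Annotation creators: expert-generated
  • Language creators: expert-generated

Considerations for Using the Dataset

  • Data splits: no train/validation/test split is provided; the current dataset configurations point to 4 domain categories
  • Download size: 8006167 bytes
  • Dataset size: 4877221 bytes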
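
As a quick sanity check on the reported figures, the per-domain splits of the `default` configuration can be summed and compared with the totals stated in the metadata. The sketch below only restates numbers from the card; the variable names are illustrative.

```python
# Per-domain figures copied from the `default` configuration in the card.
SPLITS = {
    "tele": {"num_examples": 2391, "num_bytes": 2739015},
    "electro": {"num_examples": 382, "num_bytes": 125999},
    "cosmetics": {"num_examples": 2384, "num_bytes": 1565263},
    "banking": {"num_examples": 561, "num_bytes": 446944},
}

REPORTED_DATASET_SIZE = 4877221  # dataset_size of the `default` configuration
REPORTED_ALL_EXAMPLES = 5718     # num_examples of the `all` configuration's train split

total_examples = sum(split["num_examples"] for split in SPLITS.values())
total_bytes = sum(split["num_bytes"] for split in SPLITS.values())

print(total_examples, total_bytes)  # prints: 5718 4877221
```

Both sums agree with the metadata: the four domains together hold 5718 examples, which matches the `all` configuration, and their byte counts add up to the reported dataset size.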