coref-data/mmc_indiscrim

Name: coref-data/mmc_indiscrim
Creator: coref-data
Published: 2024-02-13 04:04:52
License: No description available

Source: Hugging Face. Updated 2024-02-13; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/coref-data/mmc_indiscrim

Resource description:

---
dataset_info:
- config_name: mmc_en
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: misc
      struct:
      - name: parse_tree
        dtype: string
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: deprel
        dtype: string
      - name: end_char
        dtype: int64
      - name: feats
        dtype: string
      - name: head
        dtype: int64
      - name: id
        dtype: int64
      - name: lemma
        dtype: string
      - name: misc
        dtype: string
      - name: start_char
        dtype: int64
      - name: text
        dtype: string
      - name: upos
        dtype: string
      - name: xpos
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 32714450
    num_examples: 955
  - name: validation
    num_bytes: 4684074
    num_examples: 134
  - name: test
    num_bytes: 3576454
    num_examples: 133
  download_size: 8195117
  dataset_size: 40974978
- config_name: mmc_fa
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8511917
    num_examples: 950
  - name: validation
    num_bytes: 1308706
    num_examples: 134
  - name: test
    num_bytes: 959400
    num_examples: 133
  download_size: 3083246
  dataset_size: 10780023
- config_name: mmc_fa_corrected
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8511917
    num_examples: 950
  - name: validation
    num_bytes: 1308706
    num_examples: 134
  - name: test
    num_bytes: 988920
    num_examples: 133
  download_size: 3086246
  dataset_size: 10809543
- config_name: mmc_zh_corrected
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8024979
    num_examples: 948
  - name: validation
    num_bytes: 1217704
    num_examples: 134
  - name: test
    num_bytes: 765302
    num_examples: 133
  download_size: 2653472
  dataset_size: 10007985
- config_name: mmc_zh_uncorrected
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8024979
    num_examples: 948
  - name: validation
    num_bytes: 1217704
    num_examples: 134
  - name: test
    num_bytes: 926344
    num_examples: 133
  download_size: 2655536
  dataset_size: 10169027
configs:
- config_name: mmc_en
  data_files:
  - split: train
    path: mmc_en/train-*
  - split: validation
    path: mmc_en/validation-*
  - split: test
    path: mmc_en/test-*
- config_name: mmc_fa
  data_files:
  - split: train
    path: mmc_fa/train-*
  - split: validation
    path: mmc_fa/validation-*
  - split: test
    path: mmc_fa/test-*
- config_name: mmc_fa_corrected
  data_files:
  - split: train
    path: mmc_fa_corrected/train-*
  - split: validation
    path: mmc_fa_corrected/validation-*
  - split: test
    path: mmc_fa_corrected/test-*
- config_name: mmc_zh_corrected
  data_files:
  - split: train
    path: mmc_zh_corrected/train-*
  - split: validation
    path: mmc_zh_corrected/validation-*
  - split: test
    path: mmc_zh_corrected/test-*
- config_name: mmc_zh_uncorrected
  data_files:
  - split: train
    path: mmc_zh_uncorrected/train-*
  - split: validation
    path: mmc_zh_uncorrected/validation-*
  - split: test
    path: mmc_zh_uncorrected/test-*
---

This dataset was generated by reformatting [`coref-data/mmc_raw`](https://huggingface.co/datasets/coref-data/mmc_raw) into the indiscrim coreference format. See that repo for dataset details. See [ianporada/coref-data](https://github.com/ianporada/coref-data) for additional conversion details and the conversion script. For any questions, please create an issue in the repo above or in this dataset repo.
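A minimal sketch of how one of the configurations listed in the metadata might be loaded, assuming the Hugging Face `datasets` library is installed and the hub is reachable. The helper names `check_config` and `load_mmc` are illustrative and are not part of the dataset repo.

```python
REPO_ID = "coref-data/mmc_indiscrim"

# Config names as listed in the dataset card metadata.
CONFIGS = (
    "mmc_en",
    "mmc_fa",
    "mmc_fa_corrected",
    "mmc_zh_corrected",
    "mmc_zh_uncorrected",
)


def check_config(config_name: str) -> str:
    """Return config_name unchanged, or raise if the card does not list it."""
    if config_name not in CONFIGS:
        raise ValueError(
            f"unknown config {config_name!r}; expected one of {CONFIGS}"
        )
    return config_name


def load_mmc(config_name: str, split: str = "validation"):
    """Load one split of one config (needs network access and `datasets`)."""
    # Imported lazily so the rest of this sketch runs without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, check_config(config_name), split=split)
```

Under these assumptions, `load_mmc("mmc_en")` should return the validation split of the English configuration, and `load_mmc("mmc_zh_corrected", split="test")` the test split of the corrected Chinese one.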

Provider:

coref-data

Original information summary

Dataset overview

Dataset configurations

config_name: mmc_en
- Features:
  - sentences:
    - id: int64
    - misc:
      - parse_tree: string
    - speaker: string
    - text: string
    - tokens:
      - deprel: string
      - end_char: int64
      - feats: string
      - head: int64
      - id: int64
      - lemma: string
      - misc: string
      - start_char: int64
      - text: string
      - upos: string
      - xpos: string
  - coref_chains: sequence of sequence of sequence of int64
  - id: string
  - text: string
  - genre: string
  - meta_data:
    - comment: string
- Splits:
  - train:
    - num_bytes: 32714450
    - num_examples: 955
  - validation:
    - num_bytes: 4684074
    - num_examples: 134
  - test:
    - num_bytes: 3576454
    - num_examples: 133
- Download size: 8195117 bytes
- Dataset size: 40974978 bytes
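The `coref_chains` field above is a three-level nesting of integers, and every configuration on this page shares it. The sketch below shows one way to turn chains into mention strings. It rests on an assumption that this page does not state: each innermost list is read as `[sentence_index, start_token, end_token]`, using 0-based list positions with an inclusive end. Check the conversion repo before relying on that layout. The toy record and the helper names are invented for illustration.

```python
from typing import Dict, List

# Toy record shaped like the schema above; the content is invented.
# Token "id" values here are arbitrary: the code uses list positions only.
doc: Dict = {
    "id": "toy_doc",
    "genre": "dialogue",
    "sentences": [
        {
            "id": 0,
            "speaker": "A",
            "text": "Anna lost her keys .",
            "tokens": [
                {"id": 1, "text": "Anna"},
                {"id": 2, "text": "lost"},
                {"id": 3, "text": "her"},
                {"id": 4, "text": "keys"},
                {"id": 5, "text": "."},
            ],
        },
        {
            "id": 1,
            "speaker": "B",
            "text": "She found them .",
            "tokens": [
                {"id": 1, "text": "She"},
                {"id": 2, "text": "found"},
                {"id": 3, "text": "them"},
                {"id": 4, "text": "."},
            ],
        },
    ],
    # ASSUMPTION: each mention is [sentence_index, start_token, end_token],
    # 0-based list positions with an inclusive end.
    "coref_chains": [
        [[0, 0, 0], [0, 2, 2], [1, 0, 0]],  # Anna / her / She
        [[0, 2, 3], [1, 2, 2]],  # her keys / them
    ],
}


def mention_text(doc: Dict, mention: List[int]) -> str:
    """Join the token texts covered by one mention span."""
    sent_idx, start, end = mention
    tokens = doc["sentences"][sent_idx]["tokens"]
    return " ".join(tok["text"] for tok in tokens[start : end + 1])


def chains_as_text(doc: Dict) -> List[List[str]]:
    """Map every chain to the list of its mention strings."""
    return [
        [mention_text(doc, mention) for mention in chain]
        for chain in doc["coref_chains"]
    ]


print(chains_as_text(doc))
```

For the toy record this prints `[['Anna', 'her', 'She'], ['her keys', 'them']]`. Joining tokens with spaces is a simplification; for the `mmc_en` configuration, the `start_char` and `end_char` token fields might allow slicing the original `text` instead.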
config_name: mmc_fa
- Features:
  - sentences:
    - id: int64
    - speaker: string
    - text: string
    - tokens:
      - id: int64
      - text: string
  - coref_chains: sequence of sequence of sequence of int64
  - id: string
  - text: string
  - genre: string
  - meta_data:
    - comment: string
- Splits:
  - train:
    - num_bytes: 8511917
    - num_examples: 950
  - validation:
    - num_bytes: 1308706
    - num_examples: 134
  - test:
    - num_bytes: 959400
    - num_examples: 133
- Download size: 3083246 bytes
- Dataset size: 10780023 bytes
config_name: mmc_fa_corrected
- Features:
  - sentences:
    - id: int64
    - speaker: string
    - text: string
    - tokens:
      - id: int64
      - text: string
  - coref_chains: sequence of sequence of sequence of int64
  - id: string
  - text: string
  - genre: string
  - meta_data:
    - comment: string
- Splits:
  - train:
    - num_bytes: 8511917
    - num_examples: 950
  - validation:
    - num_bytes: 1308706
    - num_examples: 134
  - test:
    - num_bytes: 988920
    - num_examples: 133
- Download size: 3086246 bytes
- Dataset size: 10809543 bytes
config_name: mmc_zh_corrected
- Features:
  - sentences:
    - id: int64
    - speaker: string
    - text: string
    - tokens:
      - id: int64
      - text: string
  - coref_chains: sequence of sequence of sequence of int64
  - id: string
  - text: string
  - genre: string
  - meta_data:
    - comment: string
- Splits:
  - train:
    - num_bytes: 8024979
    - num_examples: 948
  - validation:
    - num_bytes: 1217704
    - num_examples: 134
  - test:
    - num_bytes: 765302
    - num_examples: 133
- Download size: 2653472 bytes
- Dataset size: 10007985 bytes
config_name: mmc_zh_uncorrected
- Features:
  - sentences:
    - id: int64
    - speaker: string
    - text: string
    - tokens:
      - id: int64
      - text: string
  - coref_chains: sequence of sequence of sequence of int64
  - id: string
  - text: string
  - genre: string
  - meta_data:
    - comment: string
- Splits:
  - train:
    - num_bytes: 8024979
    - num_examples: 948
  - validation:
    - num_bytes: 1217704
    - num_examples: 134
  - test:
    - num_bytes: 926344
    - num_examples: 133
- Download size: 2655536 bytes
- Dataset size: 10169027 bytes
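The size figures listed for each configuration can be cross-checked with simple arithmetic: in every case the reported dataset size equals the sum of the three split sizes. The sketch below copies the numbers from the listing above and verifies that relationship. The helper name `total_bytes` is illustrative, and the reading of the smaller download size as a compressed on-disk size is an assumption, not something this page states.

```python
# Split sizes in bytes, copied from the configuration listing above.
SPLIT_BYTES = {
    "mmc_en": {"train": 32714450, "validation": 4684074, "test": 3576454},
    "mmc_fa": {"train": 8511917, "validation": 1308706, "test": 959400},
    "mmc_fa_corrected": {"train": 8511917, "validation": 1308706, "test": 988920},
    "mmc_zh_corrected": {"train": 8024979, "validation": 1217704, "test": 765302},
    "mmc_zh_uncorrected": {"train": 8024979, "validation": 1217704, "test": 926344},
}

# Reported totals in bytes, copied from the same listing.
DATASET_SIZE = {
    "mmc_en": 40974978,
    "mmc_fa": 10780023,
    "mmc_fa_corrected": 10809543,
    "mmc_zh_corrected": 10007985,
    "mmc_zh_uncorrected": 10169027,
}
DOWNLOAD_SIZE = {
    "mmc_en": 8195117,
    "mmc_fa": 3083246,
    "mmc_fa_corrected": 3086246,
    "mmc_zh_corrected": 2653472,
    "mmc_zh_uncorrected": 2655536,
}


def total_bytes(config: str) -> int:
    """Sum the train, validation, and test sizes of one config."""
    return sum(SPLIT_BYTES[config].values())


for name in SPLIT_BYTES:
    total = total_bytes(name)
    matches = total == DATASET_SIZE[name]
    # ASSUMPTION: the download is smaller because files are stored compressed.
    ratio = DOWNLOAD_SIZE[name] / total
    print(f"{name}: splits sum to {total} (matches: {matches}), "
          f"download/dataset ratio {ratio:.2f}")
```

The same figures also show that the corrected and uncorrected variants of a language differ only in the size of their test split; the train and validation sizes are identical.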

Data file paths

config_name: mmc_en
- train: mmc_en/train-*
- validation: mmc_en/validation-*
- test: mmc_en/test-*
config_name: mmc_fa
- train: mmc_fa/train-*
- validation: mmc_fa/validation-*
- test: mmc_fa/test-*
config_name: mmc_fa_corrected
- train: mmc_fa_corrected/train-*
- validation: mmc_fa_corrected/validation-*
- test: mmc_fa_corrected/test-*
config_name: mmc_zh_corrected
- train: mmc_zh_corrected/train-*
- validation: mmc_zh_corrected/validation-*
- test: mmc_zh_corrected/test-*
config_name: mmc_zh_uncorrected
- train: mmc_zh_uncorrected/train-*
- validation: mmc_zh_uncorrected/validation-*
- test: mmc_zh_uncorrected/test-*
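Each path above is a glob pattern of the form `<config>/<split>-*`. As an illustration of how such patterns partition files, the sketch below maps a file path back to its configuration and split using only the Python standard library. This is not how the Hugging Face tooling resolves data files internally, and the example file names are hypothetical, since the page lists only the patterns.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

CONFIGS = (
    "mmc_en",
    "mmc_fa",
    "mmc_fa_corrected",
    "mmc_zh_corrected",
    "mmc_zh_uncorrected",
)
SPLITS = ("train", "validation", "test")

# Rebuild the pattern table from the listing above: <config>/<split>-*
DATA_FILES = {
    config: {split: f"{config}/{split}-*" for split in SPLITS}
    for config in CONFIGS
}


def locate(path: str) -> Optional[Tuple[str, str]]:
    """Return (config, split) for the first pattern that matches path."""
    for config, splits in DATA_FILES.items():
        for split, pattern in splits.items():
            if fnmatchcase(path, pattern):
                return config, split
    return None


# Hypothetical file names, used only to exercise the patterns.
print(locate("mmc_fa_corrected/test-00000-of-00001.parquet"))
print(locate("README.md"))
```

Because each pattern includes the slash after the configuration name, `mmc_fa/train-*` does not match files under `mmc_fa_corrected/`, so the two Persian configurations stay separate.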
