coref-data/mmc_indiscrim
Source: Hugging Face. Updated 2024-02-13; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/coref-data/mmc_indiscrim
Description:
---
dataset_info:
- config_name: mmc_en
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: misc
      struct:
      - name: parse_tree
        dtype: string
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: deprel
        dtype: string
      - name: end_char
        dtype: int64
      - name: feats
        dtype: string
      - name: head
        dtype: int64
      - name: id
        dtype: int64
      - name: lemma
        dtype: string
      - name: misc
        dtype: string
      - name: start_char
        dtype: int64
      - name: text
        dtype: string
      - name: upos
        dtype: string
      - name: xpos
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 32714450
    num_examples: 955
  - name: validation
    num_bytes: 4684074
    num_examples: 134
  - name: test
    num_bytes: 3576454
    num_examples: 133
  download_size: 8195117
  dataset_size: 40974978
- config_name: mmc_fa
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8511917
    num_examples: 950
  - name: validation
    num_bytes: 1308706
    num_examples: 134
  - name: test
    num_bytes: 959400
    num_examples: 133
  download_size: 3083246
  dataset_size: 10780023
- config_name: mmc_fa_corrected
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8511917
    num_examples: 950
  - name: validation
    num_bytes: 1308706
    num_examples: 134
  - name: test
    num_bytes: 988920
    num_examples: 133
  download_size: 3086246
  dataset_size: 10809543
- config_name: mmc_zh_corrected
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8024979
    num_examples: 948
  - name: validation
    num_bytes: 1217704
    num_examples: 134
  - name: test
    num_bytes: 765302
    num_examples: 133
  download_size: 2653472
  dataset_size: 10007985
- config_name: mmc_zh_uncorrected
  features:
  - name: sentences
    list:
    - name: id
      dtype: int64
    - name: speaker
      dtype: string
    - name: text
      dtype: string
    - name: tokens
      list:
      - name: id
        dtype: int64
      - name: text
        dtype: string
  - name: coref_chains
    sequence:
      sequence:
        sequence: int64
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: genre
    dtype: string
  - name: meta_data
    struct:
    - name: comment
      dtype: string
  splits:
  - name: train
    num_bytes: 8024979
    num_examples: 948
  - name: validation
    num_bytes: 1217704
    num_examples: 134
  - name: test
    num_bytes: 926344
    num_examples: 133
  download_size: 2655536
  dataset_size: 10169027
configs:
- config_name: mmc_en
  data_files:
  - split: train
    path: mmc_en/train-*
  - split: validation
    path: mmc_en/validation-*
  - split: test
    path: mmc_en/test-*
- config_name: mmc_fa
  data_files:
  - split: train
    path: mmc_fa/train-*
  - split: validation
    path: mmc_fa/validation-*
  - split: test
    path: mmc_fa/test-*
- config_name: mmc_fa_corrected
  data_files:
  - split: train
    path: mmc_fa_corrected/train-*
  - split: validation
    path: mmc_fa_corrected/validation-*
  - split: test
    path: mmc_fa_corrected/test-*
- config_name: mmc_zh_corrected
  data_files:
  - split: train
    path: mmc_zh_corrected/train-*
  - split: validation
    path: mmc_zh_corrected/validation-*
  - split: test
    path: mmc_zh_corrected/test-*
- config_name: mmc_zh_uncorrected
  data_files:
  - split: train
    path: mmc_zh_uncorrected/train-*
  - split: validation
    path: mmc_zh_uncorrected/validation-*
  - split: test
    path: mmc_zh_uncorrected/test-*
---
This dataset was generated by reformatting [`coref-data/mmc_raw`](https://huggingface.co/datasets/coref-data/mmc_raw) into the indiscrim coreference format. See that repo for dataset details.
See [ianporada/coref-data](https://github.com/ianporada/coref-data) for additional conversion details and the conversion script.
Please create an issue in the repo above or in this dataset repo for any questions.
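
As a rough illustration of how the `sentences` and `coref_chains` fields might be used together, the sketch below turns each chain into the list of its mention strings. It rests on an assumption that this card does not state: that each innermost `coref_chains` entry is a `[sentence_index, start_token, end_token]` triple with 0-based positions and an inclusive end. The conversion script in the repo linked above is the place to confirm this. The record `toy_doc` and the helper names are hypothetical.

```python
from typing import Any, Dict, List


def mention_text(doc: Dict[str, Any], mention: List[int], sep: str = " ") -> str:
    """Join the token texts covered by one mention.

    ASSUMPTION (not stated in the card): mention is
    [sentence_index, start_token, end_token], 0-based, end inclusive.
    """
    sent_idx, start, end = mention
    tokens = doc["sentences"][sent_idx]["tokens"]
    return sep.join(tok["text"] for tok in tokens[start : end + 1])


def chains_as_text(doc: Dict[str, Any], sep: str = " ") -> List[List[str]]:
    """Map every coreference chain to the list of its mention strings."""
    return [[mention_text(doc, m, sep) for m in chain] for chain in doc["coref_chains"]]


# Hypothetical record shaped like the mmc_fa / mmc_zh_* schema
# (tokens carry only `id` and `text`); it is not taken from the dataset.
words = "Mary said she was tired .".split()
toy_doc = {
    "id": "toy_0",
    "text": "Mary said she was tired.",
    "genre": "dialogue",
    "meta_data": {"comment": ""},
    "sentences": [
        {
            "id": 0,
            "speaker": "A",
            "text": "Mary said she was tired.",
            "tokens": [{"id": i + 1, "text": w} for i, w in enumerate(words)],
        }
    ],
    # One chain with two single-token mentions: "Mary" and "she".
    "coref_chains": [[[0, 0, 0], [0, 2, 2]]],
}

print(chains_as_text(toy_doc))
```

With the `datasets` library installed, a real record could be fetched with something like `load_dataset("coref-data/mmc_indiscrim", "mmc_en", split="validation")[0]` and passed to `chains_as_text`. For the `mmc_zh_*` configs an empty separator (`sep=""`) is probably the more natural choice.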
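Dataset information:
- Config name: mmc_en
  Features:
  - Field: sentences, type: list of structs with the following fields:
    - Field: id, dtype: 64-bit integer (int64)
    - Field: misc, type: struct with the field:
      - Field: parse_tree, type: string
    - Field: speaker, type: string
    - Field: text, type: string
    - Field: tokens, type: list of structs with the following fields:
      - Field: deprel, type: string (dependency relation label)
      - Field: end_char, type: int64 (end character offset)
      - Field: feats, type: string (morphological features)
      - Field: head, type: int64 (ID of the syntactic head)
      - Field: id, type: int64
      - Field: lemma, type: string (lemma)
      - Field: misc, type: string (additional information)
      - Field: start_char, type: int64 (start character offset)
      - Field: text, type: string
      - Field: upos, type: string (universal part-of-speech tag)
      - Field: xpos, type: string (language-specific part-of-speech tag)
  - Field: coref_chains, type: sequence whose elements are sequences of sequences of int64
  - Field: id, type: string
  - Field: text, type: string
  - Field: genre, type: string (genre/domain)
  - Field: meta_data, type: struct with the field:
    - Field: comment, type: string
  Splits:
  - Split: train, bytes: 32714450, examples: 955
  - Split: validation, bytes: 4684074, examples: 134
  - Split: test, bytes: 3576454, examples: 133
  Download size: 8195117 bytes; dataset size: 40974978 bytes
- Config name: mmc_fa
  Features:
  - Field: sentences, type: list of structs with the following fields:
    - Field: id, type: int64
    - Field: speaker, type: string
    - Field: text, type: string
    - Field: tokens, type: list of structs with only the following fields:
      - Field: id, type: int64
      - Field: text, type: string
  - Field: coref_chains, type: sequence whose elements are sequences of sequences of int64
  - Field: id, type: string
  - Field: text, type: string
  - Field: genre, type: string
  - Field: meta_data, type: struct with the field:
    - Field: comment, type: string
  Splits:
  - Split: train, bytes: 8511917, examples: 950
  - Split: validation, bytes: 1308706, examples: 134
  - Split: test, bytes: 959400, examples: 133
  Download size: 3083246 bytes; dataset size: 10780023 bytes
- Config name: mmc_fa_corrected
  Features:
  - Field: sentences, type: list of structs with the following fields:
    - Field: id, type: int64
    - Field: speaker, type: string
    - Field: text, type: string
    - Field: tokens, type: list of structs with only the following fields:
      - Field: id, type: int64
      - Field: text, type: string
  - Field: coref_chains, type: sequence whose elements are sequences of sequences of int64
  - Field: id, type: string
  - Field: text, type: string
  - Field: genre, type: string
  - Field: meta_data, type: struct with the field:
    - Field: comment, type: string
  Splits:
  - Split: train, bytes: 8511917, examples: 950
  - Split: validation, bytes: 1308706, examples: 134
  - Split: test, bytes: 988920, examples: 133
  Download size: 3086246 bytes; dataset size: 10809543 bytes
- Config name: mmc_zh_corrected
  Features:
  - Field: sentences, type: list of structs with the following fields:
    - Field: id, type: int64
    - Field: speaker, type: string
    - Field: text, type: string
    - Field: tokens, type: list of structs with only the following fields:
      - Field: id, type: int64
      - Field: text, type: string
  - Field: coref_chains, type: sequence whose elements are sequences of sequences of int64
  - Field: id, type: string
  - Field: text, type: string
  - Field: genre, type: string
  - Field: meta_data, type: struct with the field:
    - Field: comment, type: string
  Splits:
  - Split: train, bytes: 8024979, examples: 948
  - Split: validation, bytes: 1217704, examples: 134
  - Split: test, bytes: 765302, examples: 133
  Download size: 2653472 bytes; dataset size: 10007985 bytes
- Config name: mmc_zh_uncorrected
  Features:
  - Field: sentences, type: list of structs with the following fields:
    - Field: id, type: int64
    - Field: speaker, type: string
    - Field: text, type: string
    - Field: tokens, type: list of structs with only the following fields:
      - Field: id, type: int64
      - Field: text, type: string
  - Field: coref_chains, type: sequence whose elements are sequences of sequences of int64
  - Field: id, type: string
  - Field: text, type: string
  - Field: genre, type: string
  - Field: meta_data, type: struct with the field:
    - Field: comment, type: string
  Splits:
  - Split: train, bytes: 8024979, examples: 948
  - Split: validation, bytes: 1217704, examples: 134
  - Split: test, bytes: 926344, examples: 133
  Download size: 2655536 bytes; dataset size: 10169027 bytes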
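
The split sizes listed above can be sanity-checked with a few lines of arithmetic. The sketch below copies the byte counts from this card and confirms that, for each config, the reported dataset size equals the sum of its three split sizes. It also shows that each corrected/uncorrected pair differs only in its test split. The variable and function names are illustrative, not part of the dataset.

```python
# Split sizes in bytes, copied from the metadata in this card.
SPLIT_BYTES = {
    "mmc_en": {"train": 32714450, "validation": 4684074, "test": 3576454},
    "mmc_fa": {"train": 8511917, "validation": 1308706, "test": 959400},
    "mmc_fa_corrected": {"train": 8511917, "validation": 1308706, "test": 988920},
    "mmc_zh_corrected": {"train": 8024979, "validation": 1217704, "test": 765302},
    "mmc_zh_uncorrected": {"train": 8024979, "validation": 1217704, "test": 926344},
}

# Reported dataset_size per config, also copied from this card.
DATASET_SIZE = {
    "mmc_en": 40974978,
    "mmc_fa": 10780023,
    "mmc_fa_corrected": 10809543,
    "mmc_zh_corrected": 10007985,
    "mmc_zh_uncorrected": 10169027,
}


def total_bytes(config: str) -> int:
    """Sum the byte counts of all splits of one config."""
    return sum(SPLIT_BYTES[config].values())


def differing_splits(config_a: str, config_b: str) -> list:
    """Return the names of splits whose byte counts differ between two configs."""
    a, b = SPLIT_BYTES[config_a], SPLIT_BYTES[config_b]
    return [split for split in a if a[split] != b[split]]


# Configs whose reported dataset_size disagrees with the split sum.
mismatches = [name for name in SPLIT_BYTES if total_bytes(name) != DATASET_SIZE[name]]

print("mismatches:", mismatches)
print("fa pair differs in:", differing_splits("mmc_fa", "mmc_fa_corrected"))
print("zh pair differs in:", differing_splits("mmc_zh_corrected", "mmc_zh_uncorrected"))
```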
Configs:
- Config name: mmc_en
  Data files:
  - Split: train, path: mmc_en/train-*
  - Split: validation, path: mmc_en/validation-*
  - Split: test, path: mmc_en/test-*
- Config name: mmc_fa
  Data files:
  - Split: train, path: mmc_fa/train-*
  - Split: validation, path: mmc_fa/validation-*
  - Split: test, path: mmc_fa/test-*
- Config name: mmc_fa_corrected
  Data files:
  - Split: train, path: mmc_fa_corrected/train-*
  - Split: validation, path: mmc_fa_corrected/validation-*
  - Split: test, path: mmc_fa_corrected/test-*
- Config name: mmc_zh_corrected
  Data files:
  - Split: train, path: mmc_zh_corrected/train-*
  - Split: validation, path: mmc_zh_corrected/validation-*
  - Split: test, path: mmc_zh_corrected/test-*
- Config name: mmc_zh_uncorrected
  Data files:
  - Split: train, path: mmc_zh_uncorrected/train-*
  - Split: validation, path: mmc_zh_uncorrected/validation-*
  - Split: test, path: mmc_zh_uncorrected/test-*
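
Every data-file path above follows the same `<config>/<split>-*` glob shape. As a small sketch of what that implies, the code below maps a repository file path back to its config and split. The standard-library `fnmatchcase` is used only as a stand-in for whatever glob matching the hosting side actually performs, and the shard file names in the usage lines are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

# Config and split names as listed in this card.
CONFIGS = ["mmc_en", "mmc_fa", "mmc_fa_corrected", "mmc_zh_corrected", "mmc_zh_uncorrected"]
SPLITS = ["train", "validation", "test"]


def pattern_for(config: str, split: str) -> str:
    """Build the glob pattern used for one config/split pair."""
    return f"{config}/{split}-*"


def resolve(path: str) -> Optional[Tuple[str, str]]:
    """Return (config, split) for a file path, or None if no pattern matches."""
    for config in CONFIGS:
        for split in SPLITS:
            if fnmatchcase(path, pattern_for(config, split)):
                return config, split
    return None


# Hypothetical shard names, used only to exercise the patterns.
print(resolve("mmc_en/train-00000-of-00001.parquet"))
print(resolve("mmc_fa_corrected/test-00000-of-00001.parquet"))
print(resolve("README.md"))
```

Because each pattern includes the trailing `/` after the config name, a file under `mmc_fa_corrected/` does not match the `mmc_fa/` patterns even though the names share a prefix.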
Provider:
coref-data



