Aiera/aiera-speaker-assign

Name: Aiera/aiera-speaker-assign
Creator: Aiera
Published: 2024-07-23 23:29:34
License: MIT (per the dataset card metadata; the listing itself gives no description)

Source: Hugging Face. Updated 2024-07-23; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/Aiera/aiera-speaker-assign

Resource description:

---
dataset_info:
  features:
  - name: transcript_segment
    dtype: string
  - name: prior_context
    dtype: string
  - name: prior_speakers
    dtype: string
  - name: speaker
    dtype: string
  - name: change
    dtype: int64
  splits:
  - name: test
    num_bytes: 357571
    num_examples: 294
  download_size: 179841
  dataset_size: 357571
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- finance
---

# Transcript Speaker Identification Dataset

## Description

This dataset is designed to facilitate the development and evaluation of models that identify and assign speakers and speaker changes within event transcripts. It consists of transcript segments for which the primary task is to determine who is speaking, based on the preceding textual context and a list of possible speakers.

The dataset was assembled from three earnings events:

- Q4 2023 Amazon.Com Inc Earnings Call
- Q2 2024 Apple Inc Earnings Call
- Q2 2024 Adobe Inc Earnings Call

## Dataset Structure

### Columns

- `transcript_segment`: A specific segment of the transcript which requires speaker identification.
- `prior_context`: The textual context preceding the `transcript_segment`, which can be instrumental in identifying the speaker.
- `prior_speakers` (described in the card text as `possible_speakers`): A list of names or identifiers, stored as a string, representing individuals who could potentially be the speaker of the segment.
- `speaker`: The actual speaker of the `transcript_segment`.
- `change`: 1 if the speaker has changed between the prior context and the transcript segment, otherwise 0.

### Data Format

The dataset is presented in a tabular format. Each row is one data point containing a transcript segment, its prior context, the possible speakers, the label for the actual speaker, and an indicator of whether the speaker has changed between the prior context and the transcript segment.
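As a minimal sketch of how one row might be turned into a text prompt for a model, the block below follows the feature names from the card metadata (with the speaker list in `prior_speakers`). The `Row` type, the `build_prompt` helper, the prompt wording, and the example row are all hypothetical illustrations, not part of the dataset or of Aiera's evaluation setup.

```python
from typing import TypedDict


class Row(TypedDict):
    """One dataset row, following the feature list in the card metadata."""
    transcript_segment: str
    prior_context: str
    prior_speakers: str  # stored as a string, not a Python list
    speaker: str
    change: int  # 1 if the speaker changed, otherwise 0


def build_prompt(row: Row) -> str:
    """Assemble a plain-text prompt from one row (hypothetical wording)."""
    return (
        "Prior context:\n"
        f"{row['prior_context']}\n\n"
        "Candidate speakers:\n"
        f"{row['prior_speakers']}\n\n"
        "Segment:\n"
        f"{row['transcript_segment']}\n\n"
        "Who is speaking in the segment?"
    )


# Made-up row for illustration only; it is not taken from the dataset.
example: Row = {
    "transcript_segment": "Thanks for taking my question.",
    "prior_context": "Operator: Our first question comes from Analyst A.",
    "prior_speakers": "Operator, Analyst A, Executive B",
    "speaker": "Analyst A",
    "change": 1,
}

print(build_prompt(example))
```

The `speaker` and `change` fields are deliberately left out of the prompt, since they are the labels a model is expected to predict.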

## Use Cases

This dataset can be used for a variety of applications, including:

- Training machine learning models for speaker identification in transcripts.
- Enhancing speech recognition systems by improving their ability to attribute text to the correct speaker.
- Developing tools for automated meeting summarization where speaker labels are essential.

## Accessing the Dataset

You can access this dataset via the Hugging Face Datasets library using the following Python code:

```python
from datasets import load_dataset

dataset = load_dataset("Aiera/aiera-speaker-assign")
```

A guide for evaluating using EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) is available on [GitHub](https://github.com/aiera-inc/aiera-benchmark-tasks).
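
Once model outputs have been collected, they can be compared against the `speaker` and `change` labels. The sketch below computes exact-match accuracy over hypothetical in-memory rows. The `normalize` and `score` helpers, the name normalization, and the choice of metric are assumptions made for illustration; they are not the scoring method of the linked benchmark guide.

```python
from typing import Mapping, Sequence


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(name.lower().split())


def score(
    rows: Sequence[Mapping[str, object]],
    predictions: Sequence[Mapping[str, object]],
) -> dict:
    """Return exact-match accuracy for the speaker and change labels."""
    if len(rows) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    if not rows:
        raise ValueError("at least one row is required")

    speaker_hits = 0
    change_hits = 0
    for row, pred in zip(rows, predictions):
        if normalize(str(pred["speaker"])) == normalize(str(row["speaker"])):
            speaker_hits += 1
        if int(pred["change"]) == int(row["change"]):
            change_hits += 1

    total = len(rows)
    return {
        "speaker_accuracy": speaker_hits / total,
        "change_accuracy": change_hits / total,
    }


# Made-up labels and predictions for illustration only.
gold = [
    {"speaker": "Analyst A", "change": 1},
    {"speaker": "Executive B", "change": 1},
]
preds = [
    {"speaker": "analyst  a", "change": 1},  # matches after normalization
    {"speaker": "Operator", "change": 1},    # wrong speaker, correct change flag
]

results = score(gold, preds)
print(results)
```

Scoring the two labels separately keeps speaker-assignment errors distinct from change-detection errors, which can fail independently.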

Provider:

Aiera

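Original information summary

Dataset overview

Dataset features

text: data type is string.
speakers: data type is string.

Dataset splits

test:
- Number of examples: 119
- Data size: 64238 bytes

Dataset size

Download size: 23697 bytes
Total dataset size: 64238 bytes

Configuration

config_name: default
data_files:
- split: test
  path: data/test-*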
