Aiera/aiera-speaker-assign
Hugging Face | Updated 2024-07-23 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/Aiera/aiera-speaker-assign
Resource description:
---
dataset_info:
  features:
  - name: transcript_segment
    dtype: string
  - name: prior_context
    dtype: string
  - name: prior_speakers
    dtype: string
  - name: speaker
    dtype: string
  - name: change
    dtype: int64
  splits:
  - name: test
    num_bytes: 357571
    num_examples: 294
  download_size: 179841
  dataset_size: 357571
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- finance
---
# Transcript Speaker Identification Dataset
## Description
This dataset supports the development and evaluation of models that assign speakers and detect speaker changes in event transcripts. It consists of segments from several transcripts, and the primary task is to determine who is speaking in each segment, given the preceding textual context and a list of possible speakers.
The dataset was assembled from three earnings events:
- Q4 2023 Amazon.Com Inc Earnings Call
- Q2 2024 Apple Inc Earnings Call
- Q2 2024 Adobe Inc Earnings Call
## Dataset Structure
### Columns
- `transcript_segment`: A specific segment of the transcript which requires speaker identification.
- `prior_context`: The textual context preceding the `transcript_segment`, which can be instrumental in identifying the speaker.
- `prior_speakers`: A list of names or identifiers representing individuals who could potentially be the speaker of the segment, stored as a string.
- `speaker`: The actual speaker of the `transcript_segment`.
- `change`: An integer flag, 1 if the speaker changed between the `prior_context` and the `transcript_segment`, and 0 otherwise.
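
As a rough illustration of the columns above, a row can be checked against the column names and dtypes given in the `dataset_info` metadata. This is only a sketch: `EXPECTED_COLUMNS` and `validate_row` are hypothetical helpers, not part of the dataset or of any library, and the sample row in the usage note is invented.

```python
# Column names and Python types, following the dataset_info metadata
# (string -> str, int64 -> int). This mapping is an illustrative assumption.
EXPECTED_COLUMNS = {
    "transcript_segment": str,
    "prior_context": str,
    "prior_speakers": str,
    "speaker": str,
    "change": int,
}


def validate_row(row: dict) -> dict:
    """Raise if a row is missing a column, has a wrong type, or a non-0/1 change flag."""
    for column, expected_type in EXPECTED_COLUMNS.items():
        if column not in row:
            raise KeyError(f"missing column: {column}")
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} should be of type {expected_type.__name__}")
    if row["change"] not in (0, 1):
        raise ValueError("change must be 0 or 1")
    return row
```

For example, `validate_row({"transcript_segment": "Thanks, operator.", "prior_context": "Operator: Please go ahead.", "prior_speakers": "Operator", "speaker": "Jane Doe", "change": 1})` returns the row unchanged, while a row with `change` set to 2 raises a `ValueError`.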
### Data Format
The dataset is tabular. Each row is one data point containing a transcript segment, its prior context, the possible speakers, the label naming the actual speaker, and a flag indicating whether the speaker changed between the prior context and the transcript segment.
## Use Cases
This dataset can be used for a variety of applications, including:
- Training machine learning models for speaker identification in transcripts.
- Enhancing speech recognition systems by improving their ability to attribute text to the correct speaker.
- Developing tools for automated meeting summarization where speaker labels are essential.
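
For the model-related use cases above, one possible starting point is to turn each row into a text prompt. The template below is purely illustrative: it is not the prompt used by Aiera or by any benchmark task, `build_prompt` is a hypothetical helper, and the column names are taken from the dataset metadata.

```python
def build_prompt(row: dict) -> str:
    """Format one row as a speaker-identification prompt (illustrative template only)."""
    return (
        "Identify the speaker of the final transcript segment.\n\n"
        f"Prior context:\n{row['prior_context']}\n\n"
        f"Candidate speakers:\n{row['prior_speakers']}\n\n"
        f"Segment:\n{row['transcript_segment']}\n\n"
        "Speaker:"
    )
```

A model's completion after `Speaker:` could then be compared with the row's `speaker` value.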
## Accessing the Dataset
You can access this dataset via the Hugging Face Datasets library using the following Python code:
```python
from datasets import load_dataset
dataset = load_dataset("Aiera/aiera-speaker-assign")
```
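
Once the data is loaded, a quick summary can help confirm that it looks as expected. The sketch below assumes rows are dict-like with the `speaker` and `change` columns listed in the metadata; `summarize` is a hypothetical helper, and the loading call is left as a comment because it needs network access.

```python
from collections import Counter


def summarize(rows) -> dict:
    """Count rows, speaker changes, and the three most frequent speaker labels."""
    n_rows = 0
    n_changes = 0
    speakers = Counter()
    for row in rows:
        n_rows += 1
        n_changes += row["change"]  # change is 0 or 1, so the sum counts changes
        speakers[row["speaker"]] += 1
    return {
        "rows": n_rows,
        "changes": n_changes,
        "top_speakers": speakers.most_common(3),
    }


# Usage (requires the `datasets` package and network access):
#   from datasets import load_dataset
#   test_split = load_dataset("Aiera/aiera-speaker-assign", split="test")
#   print(summarize(test_split))
```

Passing `split="test"` returns the single split directly rather than a dictionary of splits, so the rows can be iterated without an extra lookup.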
A guide to evaluating models on this dataset with EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) is available on [GitHub](https://github.com/aiera-inc/aiera-benchmark-tasks).
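
For a quick check outside the harness, predicted speakers can be scored against the `speaker` column with a simple exact-match accuracy. This is a minimal sketch and an assumption on our part; it is not necessarily the metric defined in the benchmark tasks repository, and `exact_match_accuracy` is a hypothetical helper.

```python
def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions equal to the reference after trimming and lowercasing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)
```

The normalization here (whitespace trimming and case folding) is deliberately light; stricter or looser matching of speaker names may suit a given model's output format better.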
Provider:
Aiera
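Original information summary
Dataset overview
Dataset features
- text: string
- speakers: string
Dataset splits
- test:
  - Number of examples: 119
  - Data size: 64238 bytes
Dataset size
- Download size: 23697 bytes
- Total dataset size: 64238 bytes
Configuration
- config_name: default
- data_files:
  - split: test
  - path: data/test-*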



