asapp/slue-phase-2

Name: asapp/slue-phase-2
Creator: asapp
Published: 2024-01-12 05:14:26
License: No description provided

Hugging Face · Updated 2024-01-12 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/asapp/slue-phase-2


Resource description:

---
dataset_info:
- config_name: hvb
  features:
  - name: issue_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: text
    dtype: string
  - name: utt_index
    dtype: int32
  - name: channel
    dtype: int32
  - name: role
    dtype: string
  - name: start_ms
    dtype: int32
  - name: duration_ms
    dtype: int32
  - name: intent
    dtype: string
  - name: dialog_acts
    sequence: string
  splits:
  - name: train
    num_bytes: 803631533.648
    num_examples: 11344
  - name: validation
    num_bytes: 115999281.63
    num_examples: 1690
  - name: test
    num_bytes: 413280185.739
    num_examples: 6121
  download_size: 1287263357
  dataset_size: 1332911001.017
- config_name: sqa5
  features:
  - name: question_id
    dtype: string
  - name: question_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: question_speaker_id
    dtype: string
  - name: raw_question_text
    dtype: string
  - name: normalized_question_text
    dtype: string
  - name: document_id
    dtype: string
  - name: document_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: document_speaker_id
    dtype: string
  - name: raw_document_text
    dtype: string
  - name: normalized_document_text
    dtype: string
  - name: word2time
    sequence:
    - name: word
      dtype: string
    - name: normalized_word
      dtype: string
    - name: start_second
      dtype: float64
    - name: end_second
      dtype: float64
  - name: answer_spans
    sequence:
    - name: answer
      dtype: string
    - name: start_second
      dtype: float64
    - name: end_second
      dtype: float64
  splits:
  - name: train
    num_bytes: 134775904845.04
    num_examples: 46186
  - name: validation
    num_bytes: 5686714785.843
    num_examples: 1939
  - name: test
    num_bytes: 6967375359.628
    num_examples: 2382
  - name: verified_test
    num_bytes: 1182628989.0
    num_examples: 408
  download_size: 118074473123
  dataset_size: 148612623979.511
- config_name: ted
  features:
  - name: id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker
    dtype: string
  - name: transcript
    dtype: string
  - name: title
    dtype: string
  - name: abstract
    dtype: string
  splits:
  - name: train
    num_bytes: 46573026086.984
    num_examples: 3384
  - name: validation
    num_bytes: 5694199931.0
    num_examples: 425
  - name: test
    num_bytes: 5959094411.0
    num_examples: 423
  download_size: 58384489268
  dataset_size: 58226320428.984
- config_name: vp_nel
  features:
  - name: id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: text
    dtype: string
  - name: word_timestamps
    sequence:
    - name: word
      dtype: string
    - name: start_sec
      dtype: float64
    - name: end_sec
      dtype: float64
  - name: ne_timestamps
    sequence:
    - name: ne_label
      dtype: string
    - name: start_char_idx
      dtype: int32
    - name: char_offset
      dtype: int32
    - name: start_sec
      dtype: float64
    - name: end_sec
      dtype: float64
  splits:
  - name: validation
    num_bytes: 83371882.75
    num_examples: 1750
  - name: test
    num_bytes: 85222143.142
    num_examples: 1838
  download_size: 165119242
  dataset_size: 168594025.89200002
configs:
- config_name: hvb
  data_files:
  - split: train
    path: hvb/train-*
  - split: validation
    path: hvb/validation-*
  - split: test
    path: hvb/test-*
- config_name: sqa5
  data_files:
  - split: train
    path: sqa5/train-*
  - split: validation
    path: sqa5/validation-*
  - split: test
    path: sqa5/test-*
  - split: verified_test
    path: sqa5/verified_test-*
- config_name: ted
  data_files:
  - split: train
    path: ted/train-*
  - split: validation
    path: ted/validation-*
  - split: test
    path: ted/test-*
- config_name: vp_nel
  data_files:
  - split: validation
    path: vp_nel/validation-*
  - split: test
    path: vp_nel/test-*
---

### Dataset description

- **(Jan. 8 2024) Test set labels are released**
- **Toolkit Repository:** [https://github.com/asappresearch/slue-toolkit/](https://github.com/asappresearch/slue-toolkit/)
- **Paper:** [https://arxiv.org/abs/2212.10525](https://arxiv.org/abs/2212.10525)

### Licensing Information

#### SLUE-HVB

The SLUE-HVB dataset contains a subset of the Gridspace-Stanford Harper Valley speech dataset. The copyright of this subset remains under the original license, CC-BY-4.0. See also the original license notice (https://github.com/cricketclub/gridspace-stanford-harper-valley/blob/master/LICENSE).

Additionally, we provide dialog act classification annotations, which are covered by the same CC-BY-4.0 license.

#### SLUE-SQA-5

The SLUE-SQA-5 dataset contains question texts and answer strings (the question_text, normalized_question_text, and answer_spans columns in the .tsv files) from these datasets:

* SQuAD1.1 (for questions whose question_id starts with ‘squad-’)
* Natural Questions (for questions whose question_id starts with ‘nq-’)
* WebQuestions (for questions whose question_id starts with ‘wq-’)
* CuratedTREC (for questions whose question_id starts with ‘trec-’)
* TriviaQA (for questions whose question_id starts with ‘triviaqa-’)

Additionally, we provide audio recordings (.wav files in “question” directories) of these questions.

For questions from TriviaQA (questions whose question_id starts with ‘triviaqa-’), the question texts, answer strings, and audio recordings are licensed under the same Apache License 2.0 as TriviaQA (for more detail, please refer to https://github.com/mandarjoshi90/triviaqa/blob/master/LICENSE). For questions from the other 4 datasets, the question texts, answer strings, and audio recordings are licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.

SLUE-SQA-5 also contains a subset of Spoken Wikipedia, including the audio files placed in “document” directories and their transcripts (the document_text and normalized_document_text columns in the .tsv files). Additionally, we provide the text-to-speech alignments (.txt files in “word2time” directories). These contents are licensed under the same Creative Commons (CC BY-SA 4.0) license as Spoken Wikipedia.

#### SLUE-TED

The SLUE-TED dataset contains TED Talk audio along with the associated abstracts and titles, which were concatenated to create reference summaries. This corpus is licensed under the same Creative Commons (CC BY–NC–ND 4.0 International) license as TED talks. For further information, please refer to the details provided below.

=============================

TED.com

We encourage you to share TED Talks, under our Creative Commons license (CC BY–NC–ND 4.0 International), which means it may be shared under the conditions below:

CC: means the type of license rights associated with TED Talks, or Creative Commons

BY: means the requirement to include an attribution to TED as the owner of the TED Talk and include a link to the talk, but do not include any other TED branding on your website or platform, or language that may imply an endorsement.

NC: means you cannot use TED Talks in any commercial context or to gain any type of revenue, payment or fee from the license, sublicense, access or usage of TED Talks in an app of any kind for any advertising, or in exchange for payment of any kind, including in any ad supported content or format.

ND: means that no derivative works are permitted so you cannot edit, remix, create, modify or alter the form of the TED Talks in any way. This includes using the TED Talks as the basis for another work, including dubbing, voice-overs, or other translations not authorized by TED. You may not add any more restrictions than we have placed on the TED site content, such as additional legal or technological restrictions on accessing the content.
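The configs and splits declared in the card metadata can be checked before any download is attempted. The following is a minimal sketch, not the toolkit's own loader: `SPLITS`, `check_split`, and `load_slue` are hypothetical names introduced here, and the only library call used is `datasets.load_dataset` with its `name`, `split`, and `streaming` arguments.

```python
# Splits per config, copied from the dataset card metadata.
SPLITS = {
    "hvb": ("train", "validation", "test"),
    "sqa5": ("train", "validation", "test", "verified_test"),
    "ted": ("train", "validation", "test"),
    "vp_nel": ("validation", "test"),
}


def check_split(config: str, split: str) -> None:
    """Raise ValueError if the card does not declare this config/split pair."""
    if config not in SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(SPLITS)}")
    if split not in SPLITS[config]:
        raise ValueError(
            f"config {config!r} has no split {split!r}; available: {SPLITS[config]}"
        )


def load_slue(config: str, split: str, streaming: bool = True):
    """Load one split of asapp/slue-phase-2 (hypothetical convenience wrapper)."""
    check_split(config, split)
    # Imported lazily so check_split works without the `datasets` package installed.
    from datasets import load_dataset

    # Streaming is the default here because sqa5 and ted are tens of gigabytes.
    return load_dataset("asapp/slue-phase-2", config, split=split, streaming=streaming)
```

For instance, `load_slue("vp_nel", "train")` fails fast with a `ValueError`, since the card lists only validation and test splits for that config.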

Provider:

asapp

Summary of original information

Dataset overview

Dataset config: hvb

Features:
- issue_id: string
- audio: audio, 16000 Hz sampling rate
- speaker_id: string
- text: string
- utt_index: int32
- channel: int32
- role: string
- start_ms: int32
- duration_ms: int32
- intent: string
- dialog_acts: sequence of strings
Splits:
- train: 11344 examples, 803631533.648 bytes
- validation: 1690 examples, 115999281.63 bytes
- test: 6121 examples, 413280185.739 bytes
Download size: 1287263357 bytes
Dataset size: 1332911001.017 bytes
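The hvb fields suggest how utterances might be reassembled into conversations. The sketch below assumes that `issue_id` identifies a conversation, that `utt_index` orders utterances within it, and that `start_ms` plus `duration_ms` gives an utterance's end time; none of this is stated on the card. `group_dialogs`, `end_ms`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_dialogs(rows):
    """Group utterance rows by issue_id and sort each group by utt_index."""
    dialogs = defaultdict(list)
    for row in rows:
        dialogs[row["issue_id"]].append(row)
    for utterances in dialogs.values():
        utterances.sort(key=lambda r: r["utt_index"])
    return dict(dialogs)


def end_ms(row):
    """End time of an utterance, assuming start_ms and duration_ms share one clock."""
    return row["start_ms"] + row["duration_ms"]


# Made-up rows for illustration; real rows also carry audio, speaker_id,
# channel, intent, and dialog_acts.
rows = [
    {"issue_id": "a", "utt_index": 2, "role": "agent", "text": "how can i help",
     "start_ms": 1500, "duration_ms": 900},
    {"issue_id": "a", "utt_index": 1, "role": "caller", "text": "hello",
     "start_ms": 0, "duration_ms": 1200},
    {"issue_id": "b", "utt_index": 1, "role": "caller", "text": "hi there",
     "start_ms": 0, "duration_ms": 700},
]

dialogs = group_dialogs(rows)
for issue_id, utterances in dialogs.items():
    for utt in utterances:
        print(issue_id, utt["utt_index"], utt["role"], utt["text"], end_ms(utt))
```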

Dataset config: sqa5

Features:
- question_id: string
- question_audio: audio, 16000 Hz sampling rate
- question_speaker_id: string
- raw_question_text: string
- normalized_question_text: string
- document_id: string
- document_audio: audio, 16000 Hz sampling rate
- document_speaker_id: string
- raw_document_text: string
- normalized_document_text: string
- word2time: sequence of records with word (string), normalized_word (string), start_second (float64), and end_second (float64)
- answer_spans: sequence of records with answer (string), start_second (float64), and end_second (float64)
Splits:
- train: 46186 examples, 134775904845.04 bytes
- validation: 1939 examples, 5686714785.843 bytes
- test: 2382 examples, 6967375359.628 bytes
- verified_test: 408 examples, 1182628989.0 bytes
Download size: 118074473123 bytes
Dataset size: 148612623979.511 bytes
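Because both word2time and answer_spans carry start and end times in seconds, an answer span can be mapped back to the document words it covers. The sketch below assumes each sequence feature arrives as a dict of parallel lists (the usual representation of a sequence of named fields in the `datasets` library); `words_in_span` and the sample values are hypothetical.

```python
def words_in_span(word2time, start_second, end_second):
    """Return words whose time interval overlaps the span.

    Intervals that only touch at an endpoint are not counted as overlapping.
    """
    hits = []
    for word, w_start, w_end in zip(
        word2time["word"], word2time["start_second"], word2time["end_second"]
    ):
        # Two intervals overlap when each one starts before the other ends.
        if w_start < end_second and w_end > start_second:
            hits.append(word)
    return hits


# Made-up alignment and answer span for illustration only.
word2time = {
    "word": ["The", "capital", "is", "Paris"],
    "normalized_word": ["the", "capital", "is", "paris"],
    "start_second": [0.0, 0.2, 0.7, 0.9],
    "end_second": [0.2, 0.7, 0.9, 1.4],
}
answer_spans = {"answer": ["paris"], "start_second": [0.9], "end_second": [1.4]}

# One list of covered words per answer span.
covered = [
    words_in_span(word2time, start, end)
    for start, end in zip(answer_spans["start_second"], answer_spans["end_second"])
]
print(covered)
```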

Dataset config: ted

Features:
- id: string
- audio: audio, 16000 Hz sampling rate
- speaker: string
- transcript: string
- title: string
- abstract: string
Splits:
- train: 3384 examples, 46573026086.984 bytes
- validation: 425 examples, 5694199931.0 bytes
- test: 423 examples, 5959094411.0 bytes
Download size: 58384489268 bytes
Dataset size: 58226320428.984 bytes
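The licensing notes say that SLUE-TED reference summaries were created by concatenating the title and abstract. A minimal sketch of that step follows; the order (title first) and the single-space separator are assumptions made here, not details confirmed by the card, and `reference_summary` is a hypothetical helper.

```python
def reference_summary(title: str, abstract: str, sep: str = " ") -> str:
    """Join title and abstract into one string.

    Order and separator are assumptions; empty or whitespace-only parts are skipped.
    """
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    return sep.join(parts)


# Made-up row for illustration; real rows also carry id, audio, speaker, transcript.
row = {"title": "A hypothetical talk title", "abstract": " A short abstract. "}
print(reference_summary(row["title"], row["abstract"]))
```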

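Dataset config: vp_nel

Features:
- id: string
- audio: audio, 16000 Hz sampling rate
- speaker_id: string
- text: string
- word_timestamps: sequence of records with word (string), start_sec (float64), and end_sec (float64)
- ne_timestamps: sequence of records with ne_label (string), start_char_idx (int32), char_offset (int32), start_sec (float64), and end_sec (float64)
Splits:
- validation: 1750 examples, 83371882.75 bytes
- test: 1838 examples, 85222143.142 bytes
Download size: 165119242 bytes
Dataset size: 168594025.89200002 bytes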
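The ne_timestamps fields pair a character position in text with a time interval in the audio. The sketch below recovers each entity's surface string under two assumptions that the card does not confirm: that `char_offset` is the entity's length in characters counted from `start_char_idx`, and that the sequence arrives as a dict of parallel lists. `entity_spans`, the label, and the sample values are hypothetical.

```python
def entity_spans(text, ne_timestamps):
    """Return (label, surface_text, start_sec, end_sec) for each named entity.

    Assumes char_offset is the entity length in characters.
    """
    spans = []
    for label, start, length, t0, t1 in zip(
        ne_timestamps["ne_label"],
        ne_timestamps["start_char_idx"],
        ne_timestamps["char_offset"],
        ne_timestamps["start_sec"],
        ne_timestamps["end_sec"],
    ):
        spans.append((label, text[start:start + length], t0, t1))
    return spans


# Made-up example; the label name is a placeholder, not taken from the dataset.
text = "i visited paris last year"
ne_timestamps = {
    "ne_label": ["PLACE"],
    "start_char_idx": [10],
    "char_offset": [5],
    "start_sec": [0.8],
    "end_sec": [1.2],
}

for label, surface, t0, t1 in entity_spans(text, ne_timestamps):
    print(f"{label}: {surface!r} from {t0}s to {t1}s")
```

If the extracted strings do not line up with whole words on real rows, the length assumption is likely wrong and `char_offset` should be reinterpreted.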
