J-CHAT
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/sarulab-speech/J-CHAT
J-CHAT is a large-scale Japanese dialogue speech corpus.
For a detailed explanation, please see our [paper](https://arxiv.org/abs/2407.15828).
# PLEASE READ THIS FIRST
>[!IMPORTANT]
> TO USE THIS DATASET, YOU MUST AGREE THAT YOU WILL USE THE DATASET SOLELY FOR THE PURPOSE OF JAPANESE COPYRIGHT ACT ARTICLE 30-4.
# What's new?
>[!NOTE]
> Transcriptions of the corpus have been added. They are based on [reazonspeech-nemo-v2](https://huggingface.co/reazon-research/reazonspeech-nemo-v2).
# How can I use this data for commercial purposes?
Commercial use is not permitted. ~~If you want to use this data for commercial purposes, please build a corpus yourself.
Corpus construction programs are distributed on [GitHub](https://github.com/sarulab-speech/J-CHAT).~~
The GitHub repository isn't ready yet, but we will release the code soon. Stay tuned!
# How to use
## Requirements
Loading the dataset requires the following Python libraries.
* [lhotse](https://github.com/lhotse-speech/lhotse)
* [webdataset](https://github.com/webdataset/webdataset)
* If you need transcription, [smart-open](https://github.com/getcrest/smart-open)
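
Before loading anything, it can help to confirm that these libraries are importable. The following is a minimal sketch rather than an official setup script: the helper name `missing_packages` is hypothetical, and the mapping assumes the usual import names (`lhotse`, `webdataset`, `smart_open`).

```python
import importlib.util

# Package name -> import name. smart-open is only needed for transcriptions.
REQUIRED = {"lhotse": "lhotse", "webdataset": "webdataset"}
OPTIONAL = {"smart-open": "smart_open"}


def missing_packages(modules):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


print("Missing required packages:", missing_packages(REQUIRED))
print("Missing optional packages:", missing_packages(OPTIONAL))
```

Any package reported as missing can be installed with `pip install <package name>`.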
## Loading the dataset
The following examples load the dataset as a `lhotse.CutSet`.
### Without transcription
```python
import lhotse

# Change the path below to the data domain and subset you want.
# Data domains: youtube, podcast.
# Subsets: train, valid, test, others.
# For example, the YouTube test set is listed in filelists/youtube_test.txt.
with open("filelists/podcast_train.txt") as f:
    # Read the list of URLs, one per line.
    urls = f.read().splitlines()

cutset = lhotse.CutSet.from_webdataset(urls)
```
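
The file list name follows a `<domain>_<subset>.txt` pattern, so the path can be built from the two choices instead of being typed by hand. This is only a sketch under that assumption: `filelist_path` and `read_urls` are hypothetical helpers, not part of J-CHAT or lhotse.

```python
from pathlib import Path

# Options listed in this README.
DOMAINS = ("youtube", "podcast")
SUBSETS = ("train", "valid", "test", "others")


def filelist_path(domain, subset, root="filelists"):
    """Build the file list path for a data domain and subset."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain {domain!r}, expected one of {DOMAINS}")
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}, expected one of {SUBSETS}")
    return Path(root) / f"{domain}_{subset}.txt"


def read_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


print(filelist_path("youtube", "test"))
```

The list returned by `read_urls(filelist_path("podcast", "train"))` can then be passed to `lhotse.CutSet.from_webdataset`.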
### With transcription
```python
import json
import lhotse

with open("transcribed_jchat/podcast_train.json") as f:
    # The JSON file holds the `fields` argument for `from_shar`.
    fields = json.load(f)

cuts = lhotse.CutSet.from_shar(fields=fields)
```
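
As a rough illustration of what `fields` is expected to contain: `lhotse.CutSet.from_shar` takes a dictionary that maps each field name to a list of shard locations, with a required `cuts` entry and the same number of shards for every field. The sketch below checks those properties before any data is loaded. The helper name `check_shar_fields` is hypothetical, the shard file names in the example are made up, and it is an assumption that the J-CHAT JSON files follow exactly this layout.

```python
def check_shar_fields(fields):
    """Sanity-check a `fields` mapping before passing it to `CutSet.from_shar`.

    Hypothetical helper, not part of lhotse or J-CHAT.
    Returns the number of shards per field.
    """
    if not isinstance(fields, dict):
        raise TypeError("fields must be a dict of field name -> list of shards")
    if "cuts" not in fields:
        raise ValueError('fields must contain a "cuts" entry')
    # Every field should list the same number of shards.
    counts = {name: len(shards) for name, shards in fields.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"fields have different shard counts: {counts}")
    return counts["cuts"]


# Illustrative mapping only; real shard locations come from the JSON file.
example = {
    "cuts": ["cuts.000000.jsonl.gz"],
    "recording": ["recording.000000.tar"],
}
print(check_shar_fields(example), "shard(s) per field")
```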
For more information about `lhotse.CutSet`, please see the [lhotse documentation](https://lhotse.readthedocs.io/en/latest/).
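
Once a `CutSet` is loaded, a quick way to inspect it is to look at the first few cuts. The following is a minimal sketch, assuming that transcriptions are stored in the `text` attribute of each cut's supervisions (the usual lhotse convention, but not confirmed for J-CHAT here). `preview_cuts` is a hypothetical helper name.

```python
from itertools import islice


def preview_cuts(cuts, n=3):
    """Return (id, duration in seconds, text) for the first `n` cuts.

    Assumes each cut exposes `id`, `duration`, and `supervisions`,
    and that transcriptions, if present, are in `supervision.text`.
    """
    rows = []
    for cut in islice(cuts, n):
        # Join the text of all supervisions that carry a transcription.
        texts = [s.text for s in cut.supervisions if s.text]
        rows.append((cut.id, cut.duration, " ".join(texts)))
    return rows
```

Calling `preview_cuts(cuts)` on the `CutSet` loaded above reads only the first few cuts, which keeps the check cheap on a large corpus.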
# LICENSE
CC-BY-NC 4.0
TO USE THIS DATASET, YOU MUST AGREE THAT YOU WILL USE THE DATASET SOLELY FOR THE PURPOSE OF JAPANESE COPYRIGHT ACT ARTICLE 30-4.
# Contact
We have ensured that our dataset does not infringe on any rights of the original data holders.
However, if you wish to request the removal of your data from the dataset, please feel free to contact us at the email address below:
shinnosuke_takamichi [*at*] keio.jp
# Other resources
* [Speech samples generated with dGSLM trained on J-CHAT](https://sarulab-speech.github.io/j-chat/dgslm_speech_sample/)
* [dGSLM model weights](https://github.com/sarulab-speech/j-chat/tree/master/weights)
# Contributors
* [Wataru Nakata/中田 亘](https://wataru-nakata.github.io)
* [Kentaro Seki/関 健太郎](https://trgkpc.github.io/)
* [Hitomi Yanaka/谷中 瞳](https://hitomiyanaka.mystrikingly.com/)
* [Yuki Saito/齋藤 佑樹](https://sython.org/)
* [Shinnosuke Takamichi/高道 慎之介](https://sites.google.com/site/shinnosuketakamichi/home)
* [Hiroshi Saruwatari/猿渡 洋](https://researchmap.jp/read0102891)
# Acknowledgements
This work was supported by the KAKUSEI project (FY2023) of the National Institute of Advanced Industrial Science and Technology (AIST).
Provider: maas
Created: 2025-10-13



