Pile-Ubuntu_IRC
ModelScope community. Updated 2024-09-02; indexed 2024-08-31.
Download link:
https://modelscope.cn/datasets/OmniData/Pile-Ubuntu_IRC
Resource description:
displayName: Pile-Ubuntu_IRC
license:
- MIT
taskTypes:
- Natural Language Generation
- Language Modelling
mediaTypes:
- Text
labelTypes:
- English Corpus
tags: []
publisher:
- EleutherAI
publishDate: '2023-07-18'
publishUrl: https://pile.eleuther.ai/
paperUrl: ''
---
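# Data Introduction
## Overview
The Pile-Ubuntu IRC dataset is part of The Pile project. It is a language-modeling dataset built from Ubuntu IRC logs.
Ubuntu IRC is a chat room for users of the Ubuntu operating system, where they seek technical support, exchange information, and hold discussions. The Pile-Ubuntu IRC dataset draws on these chat logs to supply language models with a large volume of conversational text.
The dataset contains conversations from the Ubuntu IRC chat room on a range of technical and operating-system topics. The conversations are in natural language and include user questions, answers, explanations, and discussion.
With the Pile-Ubuntu IRC dataset, researchers and developers can train language models to understand and generate conversational text, for use in dialogue generation, dialogue systems, technical support, and similar applications.
## Data Content
### Data Description
The Pile-Ubuntu IRC dataset contains 5.5 GB of data.
### Data Example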
```
{
"id": "198163177",
"source_id": "",
"doc_id": "71943997",
"data_type": "text",
"data_source": "pile",
"data_url": "enwiki-c4-pile-ccnews",
"content": "#ubuntu-ngo 2010-08-30\n<tt33l3r> Keep looking....\n<dholbach> good morning\n#ubuntu-ngo 2010-08-31\n<dholbach> good morning\n#ubuntu-ngo 2010-09-01\n<Claudinux> good morning!\n<MooDoo> morning\n<dholbach> good morning\n#ubuntu-ngo 2010-09-03\n<dholbach> good morning\n<highvoltage> dholbach: hey, aren't you on holiday already?\n<dholbach> nope, not yet, tuesday is my last working day and I'll leave directly that evening\n<xdatap> dholbach, have fun!\n<dholbach> xdatap, will do :-D\n<xdatap> dholbach, and next place, in order, will be Italy! :P\n<dholbach> hahaha\n * dholbach hugs xdatap\n * xdatap hugs back dholbach\n#ubuntu-ngo 2010-09-04\n<MooDoo> morning\n<Brandie> Spamming is fun! Brought to you by FreeNode. /join #freenode\n#ubuntu-ngo 2010-09-05\n<MooDoo> morning all\n#ubuntu-ngo 2011-08-29\n<dholbach> good morning\n#ubuntu-ngo 2011-08-30\n<dholbach> good morning\n#ubuntu-ngo 2011-09-01\n<dholbach> good morning\n#ubuntu-ngo 2011-09-02\n<dholbach> good morning\n<xdatap1> morning dholbach !\n<dholbach> ciao xdatap1\n#ubuntu-ngo 2012-08-28\n<dholbach> good morning\n#ubuntu-ngo 2012-08-30\n<dholbach> good morning\n#ubuntu-ngo 2012-08-31\n<dholbach> good morning\n#ubuntu-ngo 2013-08-26\n<dholbach> good morning\n#ubuntu-ngo 2013-08-30\n<dholbach> good morning\n#ubuntu-ngo 2014-08-25\n<dholbach> good morning\n#ubuntu-ngo 2014-08-26\n<dholbach> good morning\n#ubuntu-ngo 2014-08-27\n<dholbach> good morning\n#ubuntu-ngo 2014-08-28\n<dholbach> good morning\n#ubuntu-ngo 2014-08-29\n<dholbach> good morning\n",
"remark": {
"pile_set_name": "Ubuntu IRC"
},
"sub_path": "ubuntu-irc/train"
}
```
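The `content` field in the record above packs many days of one channel's log into a single string. The following is a minimal Python sketch of how that string might be split into structured events. The three line shapes it recognizes (a `#channel YYYY-MM-DD` day header, a `<nick> text` message, and a ` * nick text` action) are inferred only from this one sample, and the function name `parse_irc_content` and the output dictionary keys are hypothetical, not part of the dataset or of any library.

```python
import re
from collections import Counter

# Assumed line shapes, inferred only from the sample record above.
HEADER_RE = re.compile(r"^(#\S+) (\d{4}-\d{2}-\d{2})$")   # "#ubuntu-ngo 2010-09-03"
MESSAGE_RE = re.compile(r"^<([^>]+)> (.*)$")              # "<dholbach> good morning"
ACTION_RE = re.compile(r"^ ?\* (\S+) (.*)$")              # " * dholbach hugs xdatap"


def parse_irc_content(content):
    """Split one record's `content` string into a list of event dicts."""
    events = []
    channel, date = None, None
    for line in content.split("\n"):
        if not line.strip():
            continue
        header = HEADER_RE.match(line)
        if header:
            # A header sets the channel and date for the lines that follow.
            channel, date = header.groups()
            continue
        message = MESSAGE_RE.match(line)
        if message:
            nick, text = message.groups()
            kind = "message"
        else:
            action = ACTION_RE.match(line)
            if not action:
                continue  # unknown line shape: skip rather than guess
            nick, text = action.groups()
            kind = "action"
        events.append(
            {"channel": channel, "date": date, "kind": kind, "nick": nick, "text": text}
        )
    return events


# A shortened excerpt of the sample record's `content` field.
sample = (
    "#ubuntu-ngo 2010-09-03\n"
    "<dholbach> good morning\n"
    "<highvoltage> dholbach: hey, aren't you on holiday already?\n"
    " * dholbach hugs xdatap\n"
    "#ubuntu-ngo 2010-09-04\n"
    "<MooDoo> morning\n"
)

events = parse_irc_content(sample)
speakers = Counter(e["nick"] for e in events)
print(len(events), speakers.most_common(1))
```

When a full record is read with `json.loads`, the `\n` escapes in `content` become real newlines, so the same function should apply to `record["content"]` directly. Lines that match none of the three assumed shapes are skipped, so other records may need extra patterns.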
## Citation
```
@misc{conghui2022opendatalab,
title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
author={Conghui He and Wei Li and Zhenjiang Jin and Bin Wang and Chao Xu and Dahua Lin},
journal={https://opendatalab.com/},
year={2022}
}
```
## Download dataset
The dataset is available from https://modelscope.cn/datasets/OmniData/Pile-Ubuntu_IRC.
Provider:
maas
Created:
2024-07-11



