Pile-Stack_Exchange
ModelScope community. Updated: 2025-10-21. Indexed: 2024-08-31.
Download link:
https://modelscope.cn/datasets/OmniData/Pile-Stack_Exchange
Resource overview:
displayName: Pile-Stack_Exchange
license:
- MIT
taskTypes:
- Natural Language Generation
- Language Modelling
mediaTypes:
- Text
labelTypes:
- English Corpus
tags: []
publisher:
- EleutherAI
publishDate: '2023-07-18'
publishUrl: https://pile.eleuther.ai/
paperUrl: ''
---
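# Data introduction
## Overview
The Pile-Stack Exchange dataset is the Stack Exchange component of The Pile, a dataset for training language models. It is built by processing the Stack Exchange data dump, an anonymized dump of all user-contributed content on the Stack Exchange network.
The dataset can be used for a range of natural language processing tasks, including text generation, question answering, and sentiment analysis. It contains content from many Stack Exchange sites and covers a wide range of topics and languages.
## Data contents
### Data description
Pile-Stack Exchange contains 34.8 GB of data.
### Data example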
```
{
"id": "208349663",
"source_id": "",
"doc_id": "80099327",
"data_type": "text",
"data_source": "pile",
"data_url": "enwiki-c4-pile-ccnews",
"content": "Q:\n\nHow do I sort a JSON by Date?\n\nI'm trying to loop through a JSON and sort it by the date so I can see the latest date to the oldest date, and then write it to the file.\nHere is my code\nvar reader = JSON.parse(fs.readFileSync('txt.json', 'utf8'));\nfunction sortByDate(a, b) {\n return new Date(a.lastUpdated).toJSON() - new Date(b.lastUpdated).toJSON();\n}\n\nreader.sort(sortByDate)\n\nJSON Data Example\n{\n \"Data\": {\n \"Contents\": [\n {\n \"Key\": [\n \"HelloTest\"\n ],\n \"lastUpdated\": [\n \"2019-10-25T10:30:50.558Z\"\n ]\n },\n {\n \"Key\": [\n \"TestHello\"\n ],\n \"lastUpdated\": [\n \"2019-03-26T10:30:50.558Z\"\n ]\n }\n ]\n }\n}\n\nA:\n\nHere are a couple of errors I found in your code:\n\nYour function name has a typo, it should be sortByDate and not sortbyDate.\nYou need top sort the inner json.Data.Contents array, not the outer json object.\nYou need to reference the first element of your lastUpdated arrays using lastUpdated[0].\nFinally, you do not need to call toJSON() on the date objects in your sorting function, simply convert to date and return the difference.\n\nAlso your inner data fields are arrays, which seems strange for a Key and a lastUpdated value.\nIf you keep your fields as arrays, here is a working example showing how to sort the inner Data.Contents array by date:\n\nconst jsonString = `{\n \"Data\": {\n \"Contents\": [{\n \"Key\": [\"HelloTest\"],\n \"lastUpdated\": [\"2019-10-25T10:30:50.558Z\"]\n }, {\n \"Key\": [\"TestHello\"],\n \"lastUpdated\": [\"2019-03-26T10:30:50.558Z\"]\n }]\n }\n}`;\n\nfunction sortByDate(a, b) {\n return new Date(a.lastUpdated[0]) - new Date(b.lastUpdated[0]);\n}\n\nconst json = JSON.parse(jsonString);\nconst defaultValue = { Data: { Contents: [] } };\nconst sortedContents = [...(json || defaultValue).Data.Contents].sort(sortByDate);\nconst output = { ...json, Data: { Contents: sortedContents } };\n\nconsole.log(output);\n\nIf you change your fields to scalars, which I suggest, here 
is another example:\n\n }, {\n }]\n }\n}`;\n\n return new Date(a.lastUpdated) - new Date(b.lastUpdated);\n}\n\n\n\n",
"remark": {
"pile_set_name": "StackExchange"
},
"sub_path": "stackexchange/train"
}
```
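In the sample record, the whole thread is stored in a single `content` string, with the question introduced by a `Q:` line and the answer by an `A:` line. The following is a minimal sketch of how such a string might be split back into its parts. It assumes that the markers always sit alone on their own line, as they do in the sample; other records may not follow this layout. The helper name `split_thread` and the shortened demo record are hypothetical and not part of any dataset tooling.

```python
import json
import re

# Matches a line that consists only of "Q:" or "A:" (assumed layout,
# taken from the sample record; not a documented guarantee).
_MARKER = re.compile(r"^(Q|A):[ \t]*$", flags=re.MULTILINE)


def split_thread(content: str) -> dict:
    """Split a `content` string into one question and a list of answers."""
    # With a capturing group, re.split returns
    # [prefix, marker, body, marker, body, ...].
    parts = _MARKER.split(content)
    thread = {"question": None, "answers": []}
    for marker, body in zip(parts[1::2], parts[2::2]):
        text = body.strip()
        if marker == "Q":
            # Keep only the first question block.
            if thread["question"] is None:
                thread["question"] = text
        else:
            thread["answers"].append(text)
    return thread


# Shortened, hypothetical record that mirrors the field names of the sample.
record = json.loads(r'''{
  "data_type": "text",
  "content": "Q:\n\nHow do I sort a JSON by Date?\n\nA:\n\nSort the inner array.\n",
  "remark": {"pile_set_name": "StackExchange"}
}''')

thread = split_thread(record["content"])
print(thread["question"])
print(len(thread["answers"]))
```

A string without any marker line yields `{"question": None, "answers": []}`, which lets callers skip records that do not match the assumed layout instead of failing on them.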
## Citation
```
@misc{conghui2022opendatalab,
title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
author={Conghui He and Wei Li and Zhenjiang Jin and Bin Wang and Chao Xu and Dahua Lin},
journal={https://opendatalab.com/},
year={2022}
}
```
## Download dataset
:modelscope-code[]{type="git"}
Provider: maas
Created: 2024-07-15



