ShareGPT-X
收藏魔搭社区2026-01-06 更新2025-10-04 收录
下载链接:
https://modelscope.cn/datasets/DSULT-Core/ShareGPT-X
下载链接
链接失效反馈官方服务:
资源简介:
[](https://soundcloud.com/leronhinds/tame-impala-justin-timberlake-the-less-i-know-the-better-x-sexy-back-feat-timbaland-mashup-extended)
### Dataset Summary
ShareGPT-X is an expanded, snapshot of ~92K (ChatGPT) one-to-one human & LLM conversations harvested from X.com (formerly Twitter).
The corpus spans January 2024 → present (last ingest 2025-05) and is built entirely from public "share" links that users posted to their timelines.
Each thread contains the original user prompt plus the assistant’s reply; no system prompts or metadata are exposed.
### Supported Tasks and Leaderboards
- `text-generation`
- `instruction-following`
- `dialogue-style preference modelling`
### Languages
Primarily English; But realistically, mixed.
## Dataset Structure
We give you three ways to suffer, ranked from "meh" to "kill me now":
1. `ChatGPT-Simple_ShareGPT.json` - looks like vintage ShareGPT, tastes like sanity.
- `ChatGPT-Simple_ShareGPT_Sample.json` - teaspoon-sized for the merely curious.
2. `ChatGPT-Simple.jsonl` - keeps timestamps, model tags, finish-reasons, the lot; just polite enough to delete pure NUL spam.
3. `ChatGPT-RawChunked/chunk-*.jsonl` - A more direct HTML→JSON dump of the public share page; structurally inconsistent, no warranty, may cause nausea.
*(Not documented further; bring your own haz-mat suit.)*
Pick your poison; the rest of the README assumes you grabbed door #1 or #2 and still like life.
### Data Instances
**Door #1 (ShareGPT flavour)** - all you need, nothing you don’t:
<details>
```json
{
{
"id": 152,
"length": 1525,
"$modelId": "gpt-4o",
"language": "English",
"consecutive_turns": "Yes",
"conversations": [
{
"from": "human",
"value": "Would removing obamacare remove people's right to healthcare?"
},
{
"from": "gpt",
"value": "The Affordable Care Act (ACA), commonly known as Obamacare, significantly expanded access to healthcare in the United States but did not create a \"right\" to healthcare. Instead, it implemented policies that made healthcare more accessible and affordable, such as:\n\n1. **Medicaid Expansion**: Allowed states to expand Medicaid coverage to more low-income individuals.\n2. **Insurance Marketplaces**: Provided subsidies to help people buy insurance.\n3. **Preexisting Conditions**: Prohibited insurers from denying coverage or charging more based on health conditions.\n4. **Individual Mandate**: Originally required individuals to have insurance or face a penalty, though this penalty was reduced to $0 in 2019.\n\nIf Obamacare were repealed:\n- **Many could lose coverage**: Tens of millions might lose access to Medicaid or subsidies.\n- **Preexisting conditions**: Insurers could potentially resume denying coverage or charging high premiums for these individuals.\n- **Costs might increase**: Without subsidies, insurance could become unaffordable for many.\n\nHowever, repealing Obamacare would not legally remove a \"right\" because the U.S. does not guarantee healthcare as a constitutional or legal right. Instead, it would reduce the mechanisms that currently make healthcare accessible to many.\n\nThis distinction is important—whether healthcare should be considered a universal right is a broader ethical and policy debate that goes beyond the specifics of Obamacare."
}
]
},
{
"id": 153,
"length": 4554,
"$modelId": "gpt-4o",
"language": "English",
"consecutive_turns": "Yes",
"conversations": [
{
"from": "human",
"value": "Did Stephen Harper eliminate the office of ..."
},
{
"from": "gpt",
"value": "**Stephen Harper's Approach to Science and Scientists in Canada:** ..."
}
]
},
}
```
</details>
**Door #2 (Simple)** - full turn metadata, still readable by humans with minor eye-strain:
<details>
```json
{
"id": "6752017c-fbc8-8007-88c5-f4cfff13a0bb",
"conversations": [
{
"role": {
"role": "system",
"name": null
},
"metadata": {
"$hidden": true
},
"created": null,
"kind": "text",
"content": {
"type": "text",
"content": [
""
],
"name": null
},
"turnEnd": true
},
{
"role": {
"role": "user",
"name": null
},
"metadata": {
"citations": [],
"content_references": [],
"search_result_groups": [],
"image_results": [],
"$videoWithAudio": false
},
"created": 1733427554.707668,
"kind": "text",
"content": {
"type": "text",
"content": [
"Would removing obamacare remove people's right to healthcare?"
],
"name": null
},
"turnEnd": false
},
{
"role": {
"role": "assistant",
"name": null
},
"metadata": {
"$modelId": "gpt-4o",
"$redacted": true
},
"created": 1733427570.657326,
"kind": "model_editable_context",
"content": {
"type": "model_editable_context",
"content": "",
"name": null
},
"turnEnd": false
},
{
"role": {
"role": "assistant",
"name": null
},
"metadata": {
"finish_details": {
"type": "stop",
"stop_tokens": [
200002
]
},
"$complete": true,
"citations": [],
"content_references": [],
"$modelId": "gpt-4o"
},
"created": 1733427570.657476,
"kind": "text",
"content": {
"type": "text",
"content": [
"The Affordable Care Act (ACA) ..."
],
"name": null
},
"turnEnd": true
}
]
}
```
</details>
### Data Splits
No canonical split is provided.
## Dataset Creation
### Curation Rationale
People clicked the shiny "Make this chat discoverable" toggle, Google happily slurped the URLs, and—surprise—half the planet’s therapy sessions, API keys, and break-up poetry landed in verbatim search snippets.
OpenAI panic-disabled the switch, but the cat is already out of the bag and still purring in SERP cache.
We’re just the librarians picking up the books everyone left in the town square. Read, laugh, cringe, filter, and for goodness’ sake don’t paste your ID Card in public next time.
### Source Data
#### Initial Data Collection and Normalisation
1. Scraped from twitter search (with a bunch of accounts)
2. Download the page
3. Extract the conversation from React shenanigans
#### Who are the source language producers?
X users (prompt authors) and ChatGPT. / OpenAI
### Annotations
None - data is used exactly as posted.
### Personal and Sensitive Information
Assume presence until proven otherwise. Users routinely paste crash logs, API keys, medical questions, etc. Run your favourite PII scrubber before training.
## Considerations for Using the Data
### Social Impact of Dataset
ShareGPT-X enables open replication of proprietary chat models. The trade-off is higher noise (memes, typos, bait) compared to lab-curated corpora. Filter aggressively for tone, safety, and factual accuracy.
### Discussion of Biases
Inherits every bias exhibited by the underlying LLMs plus Twitter’s popularity skew (tech-bro, Western, male-dominated). Expect over-representation of programming, crypto, and AI-meta chatter.
### Other Known Limitations
- ~~Conversation threads are truncated if the user deleted the tweet.~~ Some Share links have been removed at scrape time and are not included.
- ~~Non-English text is *not* language-tagged.~~
- No ratings or rejection labels - can’t be used directly for RLHF without extra annotation.
## Additional Information
### Dataset Curators
DSULT-Core
### Licensing Information
Public posts on X are "public." The *LLM outputs* contained inside those posts are, under U.S. law, uncopyrightable.
We release the curated bundle under **CC0 1.0 Universal** - use at your own risk, no warranties, no liability, no DMCA takedown theatre.
### Credits
- Dataset card & README overhaul: `Kimi-K2` (+Custom Assistant Prompt) - a language-model assistant who refuses to take the blame for your questionable prompt hygiene.
- Waifu Card: Nano-🍌
- [SicariusSicariiStuff](https://huggingface.co/SicariusSicariiStuff): ShareGPT versions of ChatGPT-Simple
[](https://soundcloud.com/leronhinds/tame-impala-justin-timberlake-the-less-i-know-the-better-x-sexy-back-feat-timbaland-mashup-extended)
### 数据集摘要
ShareGPT-X 是一个扩展后的快照数据集,包含约9.2万条(ChatGPT)一对一人类与大语言模型(Large Language Model)对话,这些对话采集自X.com(前身为Twitter)。该数据集的时间范围为2024年1月至今(最新数据采集时间为2025年5月),所有数据均来自用户在个人时间线上发布的公开“分享”链接。每条对话线程均包含原始用户提示词(prompt)与助手回复,未包含系统提示词或元数据。
### 支持任务与排行榜
- `text-generation`(文本生成)
- `instruction-following`(指令遵循)
- `dialogue-style preference modelling`(对话式偏好建模)
### 语言
以英语为主,但实际包含混合语种。
## 数据集结构
我们提供三种加载方式,按“尚可接受”至“极为繁琐”的顺序排列如下:
1. **ChatGPT-Simple_ShareGPT.json**:风格复刻经典ShareGPT格式,结构清晰易读。
- **ChatGPT-Simple_ShareGPT_Sample.json**:仅含少量样本,适合好奇的测试用户。
2. **ChatGPT-Simple.jsonl**:包含时间戳、模型标签、终止原因等完整元数据,仅过滤了纯空字符垃圾内容。
3. **ChatGPT-RawChunked/chunk-*.jsonl**:直接从公开分享页面的HTML转换为JSON的原始转储文件;结构不一致,无任何质量保证,可能导致阅读不适。
*(未提供进一步文档;请自备防护装备。)*
请根据需求选择;本README其余内容默认你选择了第1或第2种格式,且仍希望正常使用数据集。
### 数据实例
#### 选项1(ShareGPT风格格式):满足核心需求,无冗余内容:
<details>
json
{
{
"id": 152,
"length": 1525,
"$modelId": "gpt-4o",
"language": "English",
"consecutive_turns": "Yes",
"conversations": [
{
"from": "human",
"value": "Would removing obamacare remove people's right to healthcare?"
},
{
"from": "gpt",
"value": "The Affordable Care Act (ACA), commonly known as Obamacare, significantly expanded access to healthcare in the United States but did not create a "right" to healthcare. Instead, it implemented policies that made healthcare more accessible and affordable, such as:
1. **Medicaid Expansion**: Allowed states to expand Medicaid coverage to more low-income individuals.
2. **Insurance Marketplaces**: Provided subsidies to help people buy insurance.
3. **Preexisting Conditions**: Prohibited insurers from denying coverage or charging more based on health conditions.
4. **Individual Mandate**: Originally required individuals to have insurance or face a penalty, though this penalty was reduced to $0 in 2019.
If Obamacare were repealed:
- **Many could lose coverage**: Tens of millions might lose access to Medicaid or subsidies.
- **Preexisting conditions**: Insurers could potentially resume denying coverage or charging high premiums for these individuals.
- **Costs might increase**: Without subsidies, insurance could become unaffordable for many.
However, repealing Obamacare would not legally remove a "right" because the U.S. does not guarantee healthcare as a constitutional or legal right. Instead, it would reduce the mechanisms that currently make healthcare accessible to many.
This distinction is important—whether healthcare should be considered a universal right is a broader ethical and policy debate that goes beyond the specifics of Obamacare."
}
]
},
{
"id": 153,
"length": 4554,
"$modelId": "gpt-4o",
"language": "English",
"consecutive_turns": "Yes",
"conversations": [
{
"from": "human",
"value": "Did Stephen Harper eliminate the office of ..."
},
{
"from": "gpt",
"value": "**Stephen Harper's Approach to Science and Scientists in Canada:** ..."
}
]
},
}
</details>
#### 选项2(简洁格式):包含完整的对话轮次元数据,仅需少量阅读适应成本:
<details>
json
{
"id": "6752017c-fbc8-8007-88c5-f4cfff13a0bb",
"conversations": [
{
"role": {
"role": "system",
"name": null
},
"metadata": {
"$hidden": true
},
"created": null,
"kind": "text",
"content": {
"type": "text",
"content": [
""
],
"name": null
},
"turnEnd": true
},
{
"role": {
"role": "user",
"name": null
},
"metadata": {
"citations": [],
"content_references": [],
"search_result_groups": [],
"image_results": [],
"$videoWithAudio": false
},
"created": 1733427554.707668,
"kind": "text",
"content": {
"type": "text",
"content": [
"Would removing obamacare remove people's right to healthcare?"
],
"name": null
},
"turnEnd": false
},
{
"role": {
"role": "assistant",
"name": null
},
"metadata": {
"$modelId": "gpt-4o",
"$redacted": true
},
"created": 1733427570.657326,
"kind": "model_editable_context",
"content": {
"type": "model_editable_context",
"content": "",
"name": null
},
"turnEnd": false
},
{
"role": {
"role": "assistant",
"name": null
},
"metadata": {
"finish_details": {
"type": "stop",
"stop_tokens": [
200002
]
},
"$complete": true,
"citations": [],
"content_references": [],
"$modelId": "gpt-4o"
},
"created": 1733427570.657476,
"kind": "text",
"content": {
"type": "text",
"content": [
"The Affordable Care Act (ACA) ..."
],
"name": null
},
"turnEnd": true
}
]
}
</details>
### 数据划分
未提供标准数据集划分。
## 数据集创建
### 整理依据
用户点击了显眼的“让此对话可被发现”开关,谷歌顺利抓取了这些链接,结果——令人意外的是——全球各地的心理咨询记录、API密钥、分手感言等内容都原封不动地出现在了搜索结果片段中。OpenAI随后紧急禁用了该开关,但覆水难收,这些数据早已留存于搜索引擎结果页面(Search Engine Results Page, SERP)的缓存中。本数据集的整理者仅相当于在广场上捡拾众人遗留书籍的图书馆员。请读者自行阅读、一笑了之、过滤不适内容,切记下次不要将身份证件等敏感信息公开发布。
### 源数据
#### 初始采集与标准化流程
1. 通过X平台搜索(关联多个账号)抓取数据
2. 下载对应页面
3. 从React框架生成的页面代码中提取对话内容
#### 源语言生成者
X平台用户(提示词编写者)与ChatGPT,即OpenAI开发的模型。
### 标注
无额外标注,数据直接按发布原样使用。
### 个人与敏感信息
默认包含敏感内容,除非经验证清除。用户常粘贴崩溃日志、API密钥、医疗问题等敏感信息,请在训练前使用个人身份信息(Personally Identifiable Information, PII)清洗工具处理数据。
## 数据使用注意事项
### 数据集的社会影响
ShareGPT-X 支持对闭源聊天模型的开源复现。但相较于实验室精心整理的语料库,该数据集存在更多噪声(如梗图、拼写错误、钓鱼内容等)。请严格过滤内容,确保语气合规、安全且事实准确。
### 偏差讨论
该数据集继承了底层大语言模型的所有偏差,同时受到X平台的人气偏向影响(内容多来自科技爱好者、西方用户、男性群体)。预计编程、加密货币、AI相关话题的占比偏高。
### 其他已知局限性
- ~~若用户删除推文,对话线程将被截断。~~ 部分分享链接在抓取时已失效,未被纳入数据集。
- ~~非英语文本未进行语言标注。~~
- 无评分或拒绝标签,若需直接用于基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF),需额外进行标注。
## 补充信息
### 数据集整理者
DSULT-Core
### 授权信息
X平台上的公开帖子属于公开内容。根据美国法律,其中包含的大语言模型输出内容不受版权保护。本整理后的数据集以**CC0 1.0 通用协议**发布——使用风险自负,无任何质量保证,不承担任何责任,无需应对数字千年版权法案(Digital Millennium Copyright Act, DMCA)的移除请求。
### 致谢
- 数据集卡片与README改版:`Kimi-K2`(附带自定义助手提示词)——一位拒绝为你糟糕的提示词规范背锅的大语言模型助手。
- 角色卡片设计:Nano-🍌
- [SicariusSicariiStuff](https://huggingface.co/SicariusSicariiStuff):提供ChatGPT-Simple的ShareGPT格式版本
提供机构:
maas
创建时间:
2025-09-26



