gabtan99/pex-conversations
收藏Hugging Face2022-10-20 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/gabtan99/pex-conversations
下载链接
链接失效反馈官方服务:
资源简介:
PEx Conversations数据集由来自PinoyExchange.com的线程组成,包含他加禄语、英语或Taglish(他加禄语和英语混合)的回复。该数据集共包含从8个子论坛中抓取的45K个线程,仅包含用户消息,未收集任何图片、视频、链接或嵌入的HTML内容。所有字符已转为其最接近的ASCII表示,并修复了Unicode错误。数据按类别分类,每个对象包含类别和对话列表。对话具有递归结构,包含文本和回复列表。各子论坛的数据量分布如下:Small Talk - 5K对话,1.16M话语;Food & Drinks - 8.2K对话,273K话语;Health & Wellness - 6.3K对话,93K话语;Body & Fitness - 3.9K对话,94K话语;Home & Garden - 3.6K对话,71K话语;Style & Fashion - 9.7K对话,197K话语;Travel & Leisure - 7.3K对话,431K话语;Visas & Immigration - 1.1K对话,99K话语。
The PEx Conversations dataset consists of threads scraped from PinoyExchange.com, containing replies in Tagalog, English, or Taglish (a mixed code of Tagalog and English). This dataset includes a total of 45K threads collected from 8 sub-forums, and only contains user messages, with no images, videos, links, or embedded HTML content gathered. All characters have been converted to their closest ASCII equivalents, and Unicode errors have been rectified. The data is categorized by category, where each object contains a category label and a conversation list. Conversations feature a recursive structure, comprising text content and a list of replies. The data volume distribution across each sub-forum is as follows:
- Small Talk: 5K conversations, 1.16M utterances
- Food & Drinks: 8.2K conversations, 273K utterances
- Health & Wellness: 6.3K conversations, 93K utterances
- Body & Fitness: 3.9K conversations, 94K utterances
- Home & Garden: 3.6K conversations, 71K utterances
- Style & Fashion: 9.7K conversations, 197K utterances
- Travel & Leisure: 7.3K conversations, 431K utterances
- Visas & Immigration: 1.1K conversations, 99K utterances
提供机构:
gabtan99
原始信息汇总
PinoyExchange (PEx) Conversations Dataset
概述
PEx Conversations是一个多语言数据集,主要包含从PinoyExchange.com收集的讨论线程,涵盖Tagalog、English和Taglish三种语言的回复。该数据集包含45K个从8个不同子论坛抓取的线程。
数据结构
数据集按类别分类,每个列表对象包含:
- category:线程类别
- conversations:线程列表
线程内部具有递归结构,包括:
- text:回复/回复/提示
- replies:对当前提示的回复列表,其内部结构与text和replies相同。
子论坛数据分布
数据集在各子论坛的分布如下:
- Small Talk:5K对话,1.16M条发言
- Food & Drinks:8.2K对话,273K条发言
- Health & Wellness:6.3K对话,93K条发言
- Body & Fitness:3.9K对话,94K条发言
- Home & Garden:3.6K对话,71K条发言
- Style & Fashion:9.7K对话,197K条发言
- Travel & Leisure:7.3K对话,431K条发言
- Visas & Immigration:1.1K对话,99K条发言



