five

gabtan99/pex-conversations

收藏
Hugging Face2022-10-20 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/gabtan99/pex-conversations
下载链接
链接失效反馈
官方服务:
资源简介:
PEx Conversations数据集由来自PinoyExchange.com的线程组成,包含他加禄语、英语或Taglish(他加禄语和英语混合)的回复。该数据集共包含从8个子论坛中抓取的45K个线程,仅包含用户消息,未收集任何图片、视频、链接或嵌入的HTML内容。所有字符已转为其最接近的ASCII表示,并修复了Unicode错误。数据按类别分类,每个对象包含类别和对话列表。对话具有递归结构,包含文本和回复列表。各子论坛的数据量分布如下:Small Talk - 5K对话,1.16M话语;Food & Drinks - 8.2K对话,273K话语;Health & Wellness - 6.3K对话,93K话语;Body & Fitness - 3.9K对话,94K话语;Home & Garden - 3.6K对话,71K话语;Style & Fashion - 9.7K对话,197K话语;Travel & Leisure - 7.3K对话,431K话语;Visas & Immigration - 1.1K对话,99K话语。

The PEx Conversations dataset consists of threads scraped from PinoyExchange.com, containing replies in Tagalog, English, or Taglish (a mixed code of Tagalog and English). This dataset includes a total of 45K threads collected from 8 sub-forums, and only contains user messages, with no images, videos, links, or embedded HTML content gathered. All characters have been converted to their closest ASCII equivalents, and Unicode errors have been rectified. The data is categorized by category, where each object contains a category label and a conversation list. Conversations feature a recursive structure, comprising text content and a list of replies. The data volume distribution across each sub-forum is as follows: - Small Talk: 5K conversations, 1.16M utterances - Food & Drinks: 8.2K conversations, 273K utterances - Health & Wellness: 6.3K conversations, 93K utterances - Body & Fitness: 3.9K conversations, 94K utterances - Home & Garden: 3.6K conversations, 71K utterances - Style & Fashion: 9.7K conversations, 197K utterances - Travel & Leisure: 7.3K conversations, 431K utterances - Visas & Immigration: 1.1K conversations, 99K utterances
提供机构:
gabtan99
原始信息汇总

PinoyExchange (PEx) Conversations Dataset

概述

PEx Conversations是一个多语言数据集,主要包含从PinoyExchange.com收集的讨论线程,涵盖Tagalog、English和Taglish三种语言的回复。该数据集包含45K个从8个不同子论坛抓取的线程。

数据结构

数据集按类别分类,每个列表对象包含:

  • category:线程类别
  • conversations:线程列表

线程内部具有递归结构,包括:

  • text:回复/回复/提示
  • replies:对当前提示的回复列表,其内部结构与text和replies相同。

子论坛数据分布

数据集在各子论坛的分布如下:

  • Small Talk:5K对话,1.16M条发言
  • Food & Drinks:8.2K对话,273K条发言
  • Health & Wellness:6.3K对话,93K条发言
  • Body & Fitness:3.9K对话,94K条发言
  • Home & Garden:3.6K对话,71K条发言
  • Style & Fashion:9.7K对话,197K条发言
  • Travel & Leisure:7.3K对话,431K条发言
  • Visas & Immigration:1.1K对话,99K条发言
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作