
lang-uk/every_prompt

Hugging Face | Updated 2023-04-20 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/lang-uk/every_prompt
Resource description:
---
license: mit
task_categories:
- question-answering
pretty_name: Every Prompt
size_categories:
- 1M<n<10M
multilinguality:
- multilingual
---

## Every Prompt

Every Prompt is a data-driven approach to mining instructions from the web. It contains over a million FAQs and HowTos from around the world in a structured format. It also includes basic pre-processing that calculates the length of the useful text and identifies the language of that text with the help of [GCLD3](https://github.com/google/cld3).

It relies on the [Web Data Commons](http://webdatacommons.org) dataset (from October 2022) to find the seed list of sites with [**HowTo**](https://schema.org/HowTo) and [**FAQPage**](https://schema.org/FAQPage) items.

The general pipeline looks like this:

* Download 1.6TB of structured data from webdatacommons to identify the pages with the structured data we need (wget/parallel). That gives us 1,985,925 seed pages.
* Crawl the seed pages and try to extract structured data using the [extruct](https://pypi.org/project/extruct/#description) package. That leaves around 1,358,638 pages which are alive and well-formed.
* Extract only the relevant structured data of the HowTo/FAQPage type with the help of jmespath. That boils down to 1,266,926 JSON documents.
* Extract the textual information out of the structure to identify the text's language, the textual data's length, and the text/data ratio.

You can use the resulting dataset by filtering for the language and amount of the text. You need to convert the structured data into instructions yourself. You'll need to apply extra cleansing/evaluation of the instructions you've got because, you know, the internet is still full of crap.

**Caveat emptor**: the format of the FAQs and HowTos in the dataset might vary greatly. Account for that. To understand potential pitfalls, look at the jmespath expression in `export_structured_data.py`.
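Filtering by language and amount of text could look roughly like the sketch below. It assumes the data is stored as bz2-compressed JSON lines (as the `.jsonlines.bz2` file names in the pipeline suggest) and that each record carries a language code and a text length. The field names `language` and `text_length` and the function `iter_filtered` are hypothetical; check the actual keys in the exported files first.

```python
import bz2
import json
from typing import Iterator


def iter_filtered(path: str, lang: str, min_text_length: int) -> Iterator[dict]:
    """Yield records in the given language whose text is long enough.

    NOTE: the keys "language" and "text_length" are assumptions made for
    this sketch, not names confirmed by the dataset card.
    """
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("language") != lang:
                continue
            if record.get("text_length", 0) < min_text_length:
                continue
            yield record
```

For example, `list(iter_filtered("extruct_out.jsonlines.bz2", "uk", 200))` would keep only Ukrainian records with at least 200 characters of text, under the same naming assumptions.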
## Detailed stats (with breakdown by language and data type)

| language | FAQPage count | FAQPage text length | HowTo count | HowTo text length | items count | text length |
| --- | --- | --- | --- | --- | --- | --- |
| en | 592730 | 1186748927 | 29017 | 77135350 | 621747 | 1263884277 |
| de | 83184 | 213931486 | 3370 | 13905977 | 86554 | 227837463 |
| es | 63237 | 113906536 | 6466 | 30517773 | 69703 | 144424309 |
| fr | 65081 | 141638675 | 3672 | 21632272 | 68753 | 163270947 |
| ja | 55439 | 46231152 | 1402 | 1678468 | 56841 | 47909620 |
| ru | 41271 | 70947161 | 2403 | 12805308 | 43674 | 83752469 |
| nl | 34066 | 102719276 | 2007 | 11078079 | 36073 | 113797355 |
| it | 23076 | 43968063 | 2465 | 13696136 | 25541 | 57664199 |
| vi | 23115 | 38603954 | 720 | 3224051 | 23835 | 41828005 |
| zh | 22496 | 21111729 | 1112 | 1513344 | 23608 | 22625073 |
| pl | 19424 | 41446645 | 306 | 419787 | 19730 | 41866432 |
| fa | 17263 | 31294557 | 1819 | 1915117 | 19082 | 33209674 |
| tr | 13619 | 20040069 | 722 | 418695 | 14341 | 20458764 |
| und | 12256 | 1032156 | 322 | 8941 | 12578 | 1041097 |
| pt | 10784 | 26163387 | 1775 | 8295306 | 12559 | 34458693 |
| ro | 10536 | 16405628 | 75 | 89946 | 10611 | 16495574 |
| id | 8256 | 14353165 | 1871 | 13055561 | 10127 | 27408726 |
| ko | 8348 | 7624222 | 616 | 1533830 | 8964 | 9158052 |
| sv | 8007 | 15926376 | 390 | 638054 | 8397 | 16564430 |
| ar | 6950 | 10240266 | 1241 | 7517175 | 8191 | 17757441 |
| da | 7691 | 15277244 | 408 | 450176 | 8099 | 15727420 |
| cs | 7546 | 13201121 | 480 | 2471544 | 8026 | 15672665 |
| fi | 7767 | 14468764 | 199 | 170138 | 7966 | 14638902 |
| hi | 4517 | 4307716 | 683 | 4294129 | 5200 | 8601845 |
| hu | 4866 | 10639836 | 125 | 61118 | 4991 | 10700954 |
| el | 4600 | 10555382 | 103 | 55576 | 4703 | 10610958 |
| no | 4357 | 8426887 | 179 | 354796 | 4536 | 8781683 |
| uk | 4401 | 6925331 | 90 | 37285 | 4491 | 6962616 |
| iw | 4056 | 7723904 | 36 | 35305 | 4092 | 7759209 |
| bg | 3620 | 10154727 | 41 | 31268 | 3661 | 10185995 |
| sk | 2639 | 4394140 | 65 | 32527 | 2704 | 4426667 |
| th | 1877 | 3823867 | 613 | 3171583 | 2490 | 6995450 |
| mr | 2002 | 2274197 | 57 | 75906 | 2059 | 2350103 |
| mt | 1886 | 3761332 | 14 | 5443 | 1900 | 3766775 |
| cy | 1524 | 3171667 | 25 | 11641 | 1549 | 3183308 |
| bs | 1366 | 2031881 | 34 | 23298 | 1400 | 2055179 |
| et | 1299 | 1694117 | 5 | 2005 | 1304 | 1696122 |
| ms | 989 | 1927545 | 174 | 720492 | 1163 | 2648037 |
| ca | 1068 | 1614073 | 62 | 34072 | 1130 | 1648145 |
| lt | 1056 | 2272916 | 44 | 57169 | 1100 | 2330085 |
| ne | 966 | 771410 | 29 | 28569 | 995 | 799979 |
| hr | 796 | 1394174 | 15 | 10191 | 811 | 1404365 |
| fy | 743 | 633705 | 24 | 5823 | 767 | 639528 |
| lb | 703 | 1133527 | 18 | 3985 | 721 | 1137512 |
| gl | 628 | 1159618 | 34 | 9049 | 662 | 1168667 |
| mn | 644 | 1174921 | 11 | 3592 | 655 | 1178513 |
| la | 635 | 363380 | 13 | 2009 | 648 | 365389 |
| af | 577 | 444351 | 38 | 14403 | 615 | 458754 |
| sl | 451 | 1708497 | 50 | 50361 | 501 | 1758858 |
| ht | 455 | 223768 | 13 | 4406 | 468 | 228174 |
| lv | 317 | 1017694 | 32 | 31983 | 349 | 1049677 |
| gd | 273 | 295170 | 52 | 20374 | 325 | 315544 |
| sr | 287 | 367782 | 23 | 5177 | 310 | 372959 |
| co | 288 | 284629 | 12 | 3530 | 300 | 288159 |
| az | 268 | 273548 | 9 | 13011 | 277 | 286559 |
| fil | 210 | 165520 | 63 | 77100 | 273 | 242620 |
| jv | 244 | 153411 | 14 | 75932 | 258 | 229343 |
| sn | 239 | 175459 | 10 | 8890 | 249 | 184349 |
| bn | 190 | 301199 | 42 | 23451 | 232 | 324650 |
| ga | 198 | 263174 | 30 | 12905 | 228 | 276079 |
| mg | 201 | 53082 | 18 | 6141 | 219 | 59223 |
| hi-Latn | 194 | 250495 | 4 | 33091 | 198 | 283586 |
| hmn | 173 | 793850 | 16 | 5902 | 189 | 799752 |
| ka | 162 | 262305 | 8 | 3427 | 170 | 265732 |
| ig | 136 | 129243 | 10 | 2941 | 146 | 132184 |
| is | 139 | 236415 | 4 | 1277 | 143 | 237692 |
| ta | 129 | 155042 | 12 | 4079 | 141 | 159121 |
| kk | 102 | 152629 | 28 | 11885 | 130 | 164514 |
| eu | 118 | 130847 | 10 | 3522 | 128 | 134369 |
| eo | 121 | 69071 | 6 | 1885 | 127 | 70956 |
| ur | 93 | 259680 | 33 | 20499 | 126 | 280179 |
| so | 112 | 203877 | 6 | 2151 | 118 | 206028 |
| tg | 99 | 73437 | 16 | 5539 | 115 | 78976 |
| mk | 29 | 62730 | 84 | 391780 | 113 | 454510 |
| be | 100 | 88386 | 8 | 2193 | 108 | 90579 |
| sm | 100 | 1309239 | 8 | 2778 | 108 | 1312017 |
| uz | 93 | 116820 | 7 | 2987 | 100 | 119807 |
| zu | 84 | 136023 | 9 | 2744 | 93 | 138767 |
| haw | 81 | 59685 | 6 | 822 | 87 | 60507 |
| sq | 74 | 120593 | 12 | 6205 | 86 | 126798 |
| ny | 78 | 19403 | 6 | 2046 | 84 | 21449 |
| hy | 66 | 81675 | 10 | 3613 | 76 | 85288 |
| ha | 44 | 84457 | 19 | 68032 | 63 | 152489 |
| ru-Latn | 60 | 40266 | 1 | 61 | 61 | 40327 |
| el-Latn | 57 | 55657 | 4 | 342 | 61 | 55999 |
| zh-Latn | 58 | 27522 | 1 | 66 | 59 | 27588 |
| sd | 52 | 51341 | 7 | 2044 | 59 | 53385 |
| su | 50 | 17291 | 7 | 2358 | 57 | 19649 |
| ku | 47 | 23147 | 6 | 1998 | 53 | 25145 |
| bg-Latn | 48 | 15419 | 1 | 414 | 49 | 15833 |
| st | 25 | 65162 | 19 | 6346 | 44 | 71508 |
| yo | 37 | 103685 | 6 | 1790 | 43 | 105475 |
| ceb | 41 | 72950 | 1 | 107 | 42 | 73057 |
| ky | 30 | 23062 | 10 | 3679 | 40 | 26741 |
| te | 32 | 42803 | 7 | 2558 | 39 | 45361 |
| yi | 32 | 227267 | 7 | 2443 | 39 | 229710 |
| mi | 26 | 10132 | 11 | 2915 | 37 | 13047 |
| gu | 25 | 37857 | 10 | 4608 | 35 | 42465 |
| ja-Latn | 33 | 17560 | 2 | 88 | 35 | 17648 |
| sw | 26 | 17579 | 8 | 2726 | 34 | 20305 |
| xh | 28 | 46466 | 4 | 1409 | 32 | 47875 |
| ml | 16 | 33198 | 6 | 2721 | 22 | 35919 |
| ps | 10 | 7671 | 12 | 2642 | 22 | 10313 |
| am | 6 | 8017 | 8 | 1987 | 14 | 10004 |
| kn | 5 | 22197 | 9 | 3523 | 14 | 25720 |
| km | 7 | 8936 | 6 | 1879 | 13 | 10815 |
| pa | 10 | 26617 | 3 | 1100 | 13 | 27717 |
| si | 5 | 24000 | 5 | 1722 | 10 | 25722 |
| lo | 1 | 6204 | 7 | 2115 | 8 | 8319 |
| my | 3 | 14663 | 3 | 1179 | 6 | 15842 |

## Recreating the results

1. Clone the repo without the LFS files.
2. Install requirements from `requirements.txt`.
3. Install `pv` and `parallel`.
4. Run `bin/get_seed_urls.sh` to filter URLs of interest out of 1.6TB of compressed data. Don't worry about disk space. Worry about the traffic. That will take around 5h on a decent connection.
5. Run the scrapy spider like this: `scrapy crawl webdatacommons_org -s WEB_DATA_COMMONS=web_data_commons_urls_sample.txt -L INFO -o webdatacommons.jsonlines`, with `WEB_DATA_COMMONS` pointing to the list of seed URLs from step 4. That might take up to a few weeks.
6. Run `python bin/extract_relevant_structured_data.py --num-threads 12 webdatacommons.jsonlines relevant.jsonlines.bz2`. That's fast, probably around 30 minutes.
7. Run `python bin/export_structured_data.py relevant.jsonlines.bz2 extruct_out.jsonlines.bz2` to obtain the final version of the dataset.
8. Optionally, you can calculate the resulting stats like this: `python bin/get_stats.py extruct_out.jsonlines.bz2 every_prompt_stats.csv`

## Advice

If you want to recreate the results:

* Get yourself a server or VPS with enough space (80GB should be enough).
* Look at the code. You'd probably want to make changes here and there.
* All the Python scripts have extra parameters to control the number of threads and the chunk size. They also accept compressed input and output files with the help of the smart_open library.

## License

**Code** of the project has an MIT license. Copyright: [Dmytro Chaplynskyi](https://twitter.com/dchaplinsky), [lang-uk project](https://lang.org.ua), 2023
Provider:
lang-uk
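Summary of original information

Dataset overview

Dataset name

  • Every Prompt

Dataset description

  • Every Prompt is a data-driven approach to mining instructions from the web. It contains over a million FAQs and HowTos from around the world in a structured format. The dataset has undergone basic pre-processing that calculates the length of the useful text and identifies the language of the text using GCLD3.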

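Dataset contents

  • Contains structured data of the FAQ and HowTo types.
  • The data comes from the Web Data Commons dataset (October 2022), filtered for sites that have HowTo and FAQPage items.

Dataset size

  • Size category: 1M<n<10M.
  • Contains 1,266,926 JSON documents.

Multilinguality

  • The dataset is multilingual.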

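Data processing pipeline

  1. Download 1.6TB of structured data from Web Data Commons and filter out 1,985,925 seed pages.
  2. Crawl the seed pages and extract structured data with the extruct package, leaving 1,358,638 valid pages.
  3. Use jmespath to extract the relevant structured data of the HowTo/FAQPage type.
  4. Extract the textual information from the structure and identify the text's language, the text length, and the text/data ratio.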
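The last step can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: it assumes that "text" means every string value in the structure except those stored under JSON-LD keyword keys (such as `@type`), and that the text/data ratio is the text length divided by the length of the serialized JSON. Both definitions, and the names `iter_strings` and `text_stats`, are assumptions. Language identification (done with GCLD3 in the original) is left out.

```python
import json
from typing import Any, Iterator


def iter_strings(node: Any) -> Iterator[str]:
    """Recursively yield string values, skipping keys that start with "@"."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            if key.startswith("@"):
                continue  # assumption: JSON-LD keywords are markup, not text
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def text_stats(doc: Any) -> dict:
    """Return text length, serialized length, and their ratio for one document."""
    text = " ".join(s.strip() for s in iter_strings(doc) if s.strip())
    data_length = len(json.dumps(doc, ensure_ascii=False))
    ratio = len(text) / data_length if data_length else 0.0
    return {
        "text_length": len(text),
        "data_length": data_length,
        "text_data_ratio": ratio,
    }
```

A low ratio under this definition would point at documents that are mostly markup and keys, which is one plausible reason to keep the ratio next to the raw length.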

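Usage notes

  • Users need to convert the structured data into instructions themselves.
  • Extra cleansing and evaluation are required, because the quality of content on the internet varies widely.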
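As an illustration of the conversion step, the sketch below turns one FAQPage object into (question, answer) pairs. It assumes the object follows the common schema.org layout (`mainEntity` holding `Question` items with `name` and `acceptedAnswer.text`). Since the format in the dataset varies, single objects are accepted where lists are expected and malformed entries are skipped. The function names are hypothetical, and the tag stripping is deliberately crude.

```python
import html
import re
from typing import Any, List, Tuple

TAG_RE = re.compile(r"<[^>]+>")


def clean(text: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return " ".join(html.unescape(without_tags).split())


def as_list(value: Any) -> list:
    """Wrap a single value in a list; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def faq_to_pairs(faq_page: dict) -> List[Tuple[str, str]]:
    """Extract (question, answer) pairs from a schema.org-style FAQPage dict."""
    pairs = []
    for question in as_list(faq_page.get("mainEntity")):
        if not isinstance(question, dict):
            continue
        name = question.get("name")
        for answer in as_list(question.get("acceptedAnswer")):
            text = answer.get("text") if isinstance(answer, dict) else None
            if isinstance(name, str) and isinstance(text, str):
                q, a = clean(name), clean(text)
                if q and a:  # drop pairs that are empty after cleaning
                    pairs.append((q, a))
    return pairs
```

Each pair can then be mapped onto whatever instruction template is needed, with further quality filtering applied on top.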

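Detailed statistics

| Language | FAQPage count | FAQPage text length | HowTo count | HowTo text length | Items count | Text length |
| --- | --- | --- | --- | --- | --- | --- |
| en | 592730 | 1186748927 | 29017 | 77135350 | 621747 | 1263884277 |
| de | 83184 | 213931486 | 3370 | 13905977 | 86554 | 227837463 |
| es | 63237 | 113906536 | 6466 | 30517773 | 69703 | 144424309 |
| fr | 65081 | 141638675 | 3672 | 21632272 | 68753 | 163270947 |
| ... | ... | ... | ... | ... | ... | ... |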
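A quick sanity check on how the columns relate: the sketch below assumes that the items count is the sum of the FAQPage and HowTo counts, and that the total text length is the sum of the two text lengths. The source does not state this explicitly, but the four rows shown satisfy it. The names `ROWS` and `is_consistent` are made up for this example.

```python
# Columns: FAQPage count, FAQPage text length, HowTo count,
#          HowTo text length, items count, text length
ROWS = {
    "en": (592730, 1186748927, 29017, 77135350, 621747, 1263884277),
    "de": (83184, 213931486, 3370, 13905977, 86554, 227837463),
    "es": (63237, 113906536, 6466, 30517773, 69703, 144424309),
    "fr": (65081, 141638675, 3672, 21632272, 68753, 163270947),
}


def is_consistent(row: tuple) -> bool:
    """Check that the totals equal the sum of the FAQPage and HowTo columns."""
    faq_n, faq_len, howto_n, howto_len, items, text_len = row
    return faq_n + howto_n == items and faq_len + howto_len == text_len


for lang, row in ROWS.items():
    print(lang, "consistent" if is_consistent(row) else "MISMATCH")
```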

License

  • The dataset is released under the MIT license.