lang-uk/every_prompt
Source: Hugging Face. Updated 2023-04-20; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/lang-uk/every_prompt
Resource description:
---
license: mit
task_categories:
- question-answering
pretty_name: Every Prompt
size_categories:
- 1M<n<10M
multilinguality:
- multilingual
---
## Every Prompt
Every Prompt is a data-driven approach to mining instructions from the web.
It contains over a million FAQs and HowTos from around the world in a structured format.
It also includes basic pre-processing that calculates the length of the useful text and identifies the language of that text with the help of [GCLD3](https://github.com/google/cld3).
It relies on the [Web Data Commons](http://webdatacommons.org) dataset (from October 2022) to find the seed list of sites with [**HowTo**](https://schema.org/HowTo) and [**FAQPage**](https://schema.org/FAQPage) items.
The general pipeline looks like this:
* Download 1.6TB of structured data from Web Data Commons to identify the pages with the structured data we need (wget/parallel). That gives us 1,985,925 seed pages.
* Crawl the seed pages and try to extract structured data using the [extruct](https://pypi.org/project/extruct/#description) package. That leaves 1,358,638 pages which are alive and well-formed.
* Extract only the relevant structured data of the HowTo/FAQPage type with the help of jmespath. That boils down to 1,266,926 JSON documents.
* Extract the textual information from the structure to identify the text's language, the length of the textual data, and the text/data ratio.
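
The last two steps can be sketched in plain Python. This is only an illustration: the real pipeline uses a jmespath expression and GCLD3, and the helper names (`select_relevant`, `collect_text`, `text_stats`) as well as the exact definitions of text length and text/data ratio are assumptions, not the project's code. Language detection is left out.

```python
import json

WANTED_TYPES = {"HowTo", "FAQPage"}


def select_relevant(jsonld_items):
    """Keep only JSON-LD items whose @type is HowTo or FAQPage."""
    selected = []
    for item in jsonld_items:
        types = item.get("@type", [])
        if isinstance(types, str):  # @type may be a string or a list
            types = [types]
        if WANTED_TYPES.intersection(types):
            selected.append(item)
    return selected


def collect_text(node):
    """Recursively gather string values, skipping JSON-LD keywords such as @type."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, list):
        return [text for child in node for text in collect_text(child)]
    if isinstance(node, dict):
        return [
            text
            for key, value in node.items()
            if not key.startswith("@")
            for text in collect_text(value)
        ]
    return []


def text_stats(item):
    """Return the text length and text/data ratio (assumed definitions)."""
    text = " ".join(collect_text(item))
    data_length = len(json.dumps(item, ensure_ascii=False))
    return {"text_length": len(text), "text_data_ratio": len(text) / data_length}


# Hypothetical structured data extracted from one page
page_items = [
    {
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": "What is it?",
                "acceptedAnswer": {"@type": "Answer", "text": "A dataset."},
            }
        ],
    },
    {"@type": "Organization", "name": "Example"},
]
relevant = select_relevant(page_items)
stats = text_stats(relevant[0])
```

A low text/data ratio suggests an item that is mostly markup and metadata, which is one reason to filter on it later.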
You can use the resulting dataset by filtering for the language and amount of the text. You need to convert the structured data into instructions yourself.
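
A minimal sketch of such a filter is shown here. The field names `language` and `text_length` and the threshold are assumptions made for illustration; check the actual records for the real keys before relying on them.

```python
def filter_records(records, language="uk", min_text_length=200):
    """Yield records in one language that carry enough text.

    The keys "language" and "text_length" are hypothetical field names.
    """
    for record in records:
        if record.get("language") != language:
            continue
        if record.get("text_length", 0) < min_text_length:
            continue
        yield record


# Hypothetical records, shaped only for this example
sample = [
    {"language": "uk", "text_length": 950, "type": "FAQPage"},
    {"language": "uk", "text_length": 40, "type": "HowTo"},
    {"language": "en", "text_length": 5000, "type": "FAQPage"},
]
kept = list(filter_records(sample))
```

Because the function is a generator, it can be applied to a stream of records without loading the whole dataset into memory.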
You'll need to apply extra cleansing/evaluation to the instructions you get because, you know, the internet is still full of crap.
**Caveat emptor**: the format of the FAQs and HowTos in the dataset might vary greatly. Account for that. To understand potential pitfalls, look at the jmespath expression in `export_structured_data.py`.
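
As an illustration of accounting for that variation, here is a hedged sketch that turns a FAQPage item into question/answer pairs. It assumes the item follows the schema.org layout (`mainEntity` holding `Question` items with `name` and `acceptedAnswer.text`); the helper names are hypothetical, and the layout of the records in this dataset may differ.

```python
def as_list(value):
    """Wrap a single value in a list; treat None as an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def text_of(value):
    """Return a stripped string, or an empty string for non-string values."""
    return value.strip() if isinstance(value, str) else ""


def faq_to_pairs(faq_page):
    """Turn a schema.org-style FAQPage item into (question, answer) pairs."""
    pairs = []
    # mainEntity and acceptedAnswer may each be a single object or a list
    for question in as_list(faq_page.get("mainEntity")):
        if not isinstance(question, dict):
            continue
        prompt = text_of(question.get("name"))
        for answer in as_list(question.get("acceptedAnswer")):
            if not isinstance(answer, dict):
                continue
            response = text_of(answer.get("text"))
            if prompt and response:  # skip incomplete entries
                pairs.append((prompt, response))
    return pairs


faq = {
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": " How do I reset my password? ",
            "acceptedAnswer": {"@type": "Answer", "text": "Use the reset link."},
        },
        {"@type": "Question", "name": "Broken entry"},  # no answer, skipped
    ],
}
pairs = faq_to_pairs(faq)
```

Entries with a missing question or answer are dropped silently here; in practice you may prefer to count them to see how much of the data a given rule discards.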
## Detailed stats (with breakdown by language and data type)
| language | FAQPage count | FAQPage text length | HowTo count | HowTo text length | items count | text length |
| --- | --- | --- | --- | --- | --- | --- |
| en | 592730 | 1186748927 | 29017 | 77135350 | 621747 | 1263884277 |
| de | 83184 | 213931486 | 3370 | 13905977 | 86554 | 227837463 |
| es | 63237 | 113906536 | 6466 | 30517773 | 69703 | 144424309 |
| fr | 65081 | 141638675 | 3672 | 21632272 | 68753 | 163270947 |
| ja | 55439 | 46231152 | 1402 | 1678468 | 56841 | 47909620 |
| ru | 41271 | 70947161 | 2403 | 12805308 | 43674 | 83752469 |
| nl | 34066 | 102719276 | 2007 | 11078079 | 36073 | 113797355 |
| it | 23076 | 43968063 | 2465 | 13696136 | 25541 | 57664199 |
| vi | 23115 | 38603954 | 720 | 3224051 | 23835 | 41828005 |
| zh | 22496 | 21111729 | 1112 | 1513344 | 23608 | 22625073 |
| pl | 19424 | 41446645 | 306 | 419787 | 19730 | 41866432 |
| fa | 17263 | 31294557 | 1819 | 1915117 | 19082 | 33209674 |
| tr | 13619 | 20040069 | 722 | 418695 | 14341 | 20458764 |
| und | 12256 | 1032156 | 322 | 8941 | 12578 | 1041097 |
| pt | 10784 | 26163387 | 1775 | 8295306 | 12559 | 34458693 |
| ro | 10536 | 16405628 | 75 | 89946 | 10611 | 16495574 |
| id | 8256 | 14353165 | 1871 | 13055561 | 10127 | 27408726 |
| ko | 8348 | 7624222 | 616 | 1533830 | 8964 | 9158052 |
| sv | 8007 | 15926376 | 390 | 638054 | 8397 | 16564430 |
| ar | 6950 | 10240266 | 1241 | 7517175 | 8191 | 17757441 |
| da | 7691 | 15277244 | 408 | 450176 | 8099 | 15727420 |
| cs | 7546 | 13201121 | 480 | 2471544 | 8026 | 15672665 |
| fi | 7767 | 14468764 | 199 | 170138 | 7966 | 14638902 |
| hi | 4517 | 4307716 | 683 | 4294129 | 5200 | 8601845 |
| hu | 4866 | 10639836 | 125 | 61118 | 4991 | 10700954 |
| el | 4600 | 10555382 | 103 | 55576 | 4703 | 10610958 |
| no | 4357 | 8426887 | 179 | 354796 | 4536 | 8781683 |
| uk | 4401 | 6925331 | 90 | 37285 | 4491 | 6962616 |
| iw | 4056 | 7723904 | 36 | 35305 | 4092 | 7759209 |
| bg | 3620 | 10154727 | 41 | 31268 | 3661 | 10185995 |
| sk | 2639 | 4394140 | 65 | 32527 | 2704 | 4426667 |
| th | 1877 | 3823867 | 613 | 3171583 | 2490 | 6995450 |
| mr | 2002 | 2274197 | 57 | 75906 | 2059 | 2350103 |
| mt | 1886 | 3761332 | 14 | 5443 | 1900 | 3766775 |
| cy | 1524 | 3171667 | 25 | 11641 | 1549 | 3183308 |
| bs | 1366 | 2031881 | 34 | 23298 | 1400 | 2055179 |
| et | 1299 | 1694117 | 5 | 2005 | 1304 | 1696122 |
| ms | 989 | 1927545 | 174 | 720492 | 1163 | 2648037 |
| ca | 1068 | 1614073 | 62 | 34072 | 1130 | 1648145 |
| lt | 1056 | 2272916 | 44 | 57169 | 1100 | 2330085 |
| ne | 966 | 771410 | 29 | 28569 | 995 | 799979 |
| hr | 796 | 1394174 | 15 | 10191 | 811 | 1404365 |
| fy | 743 | 633705 | 24 | 5823 | 767 | 639528 |
| lb | 703 | 1133527 | 18 | 3985 | 721 | 1137512 |
| gl | 628 | 1159618 | 34 | 9049 | 662 | 1168667 |
| mn | 644 | 1174921 | 11 | 3592 | 655 | 1178513 |
| la | 635 | 363380 | 13 | 2009 | 648 | 365389 |
| af | 577 | 444351 | 38 | 14403 | 615 | 458754 |
| sl | 451 | 1708497 | 50 | 50361 | 501 | 1758858 |
| ht | 455 | 223768 | 13 | 4406 | 468 | 228174 |
| lv | 317 | 1017694 | 32 | 31983 | 349 | 1049677 |
| gd | 273 | 295170 | 52 | 20374 | 325 | 315544 |
| sr | 287 | 367782 | 23 | 5177 | 310 | 372959 |
| co | 288 | 284629 | 12 | 3530 | 300 | 288159 |
| az | 268 | 273548 | 9 | 13011 | 277 | 286559 |
| fil | 210 | 165520 | 63 | 77100 | 273 | 242620 |
| jv | 244 | 153411 | 14 | 75932 | 258 | 229343 |
| sn | 239 | 175459 | 10 | 8890 | 249 | 184349 |
| bn | 190 | 301199 | 42 | 23451 | 232 | 324650 |
| ga | 198 | 263174 | 30 | 12905 | 228 | 276079 |
| mg | 201 | 53082 | 18 | 6141 | 219 | 59223 |
| hi-Latn | 194 | 250495 | 4 | 33091 | 198 | 283586 |
| hmn | 173 | 793850 | 16 | 5902 | 189 | 799752 |
| ka | 162 | 262305 | 8 | 3427 | 170 | 265732 |
| ig | 136 | 129243 | 10 | 2941 | 146 | 132184 |
| is | 139 | 236415 | 4 | 1277 | 143 | 237692 |
| ta | 129 | 155042 | 12 | 4079 | 141 | 159121 |
| kk | 102 | 152629 | 28 | 11885 | 130 | 164514 |
| eu | 118 | 130847 | 10 | 3522 | 128 | 134369 |
| eo | 121 | 69071 | 6 | 1885 | 127 | 70956 |
| ur | 93 | 259680 | 33 | 20499 | 126 | 280179 |
| so | 112 | 203877 | 6 | 2151 | 118 | 206028 |
| tg | 99 | 73437 | 16 | 5539 | 115 | 78976 |
| mk | 29 | 62730 | 84 | 391780 | 113 | 454510 |
| be | 100 | 88386 | 8 | 2193 | 108 | 90579 |
| sm | 100 | 1309239 | 8 | 2778 | 108 | 1312017 |
| uz | 93 | 116820 | 7 | 2987 | 100 | 119807 |
| zu | 84 | 136023 | 9 | 2744 | 93 | 138767 |
| haw | 81 | 59685 | 6 | 822 | 87 | 60507 |
| sq | 74 | 120593 | 12 | 6205 | 86 | 126798 |
| ny | 78 | 19403 | 6 | 2046 | 84 | 21449 |
| hy | 66 | 81675 | 10 | 3613 | 76 | 85288 |
| ha | 44 | 84457 | 19 | 68032 | 63 | 152489 |
| ru-Latn | 60 | 40266 | 1 | 61 | 61 | 40327 |
| el-Latn | 57 | 55657 | 4 | 342 | 61 | 55999 |
| zh-Latn | 58 | 27522 | 1 | 66 | 59 | 27588 |
| sd | 52 | 51341 | 7 | 2044 | 59 | 53385 |
| su | 50 | 17291 | 7 | 2358 | 57 | 19649 |
| ku | 47 | 23147 | 6 | 1998 | 53 | 25145 |
| bg-Latn | 48 | 15419 | 1 | 414 | 49 | 15833 |
| st | 25 | 65162 | 19 | 6346 | 44 | 71508 |
| yo | 37 | 103685 | 6 | 1790 | 43 | 105475 |
| ceb | 41 | 72950 | 1 | 107 | 42 | 73057 |
| ky | 30 | 23062 | 10 | 3679 | 40 | 26741 |
| te | 32 | 42803 | 7 | 2558 | 39 | 45361 |
| yi | 32 | 227267 | 7 | 2443 | 39 | 229710 |
| mi | 26 | 10132 | 11 | 2915 | 37 | 13047 |
| gu | 25 | 37857 | 10 | 4608 | 35 | 42465 |
| ja-Latn | 33 | 17560 | 2 | 88 | 35 | 17648 |
| sw | 26 | 17579 | 8 | 2726 | 34 | 20305 |
| xh | 28 | 46466 | 4 | 1409 | 32 | 47875 |
| ml | 16 | 33198 | 6 | 2721 | 22 | 35919 |
| ps | 10 | 7671 | 12 | 2642 | 22 | 10313 |
| am | 6 | 8017 | 8 | 1987 | 14 | 10004 |
| kn | 5 | 22197 | 9 | 3523 | 14 | 25720 |
| km | 7 | 8936 | 6 | 1879 | 13 | 10815 |
| pa | 10 | 26617 | 3 | 1100 | 13 | 27717 |
| si | 5 | 24000 | 5 | 1722 | 10 | 25722 |
| lo | 1 | 6204 | 7 | 2115 | 8 | 8319 |
| my | 3 | 14663 | 3 | 1179 | 6 | 15842 |
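
In the rows above, the last two columns appear to be the sums of the FAQPage and HowTo columns. The sketch below checks that for two rows copied from the table and derives an average text length per item; `parse_row` is a hypothetical helper, not part of the project.

```python
def parse_row(line):
    """Split one Markdown table row into a language code and six integers."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return cells[0], [int(cell) for cell in cells[1:]]


# Two rows copied from the table above
rows = [
    "| en | 592730 | 1186748927 | 29017 | 77135350 | 621747 | 1263884277 |",
    "| uk | 4401 | 6925331 | 90 | 37285 | 4491 | 6962616 |",
]

averages = {}
for line in rows:
    language, (faq_n, faq_len, howto_n, howto_len, items, length) = parse_row(line)
    # The totals should equal the sum of the two data types
    assert faq_n + howto_n == items
    assert faq_len + howto_len == length
    averages[language] = length / items  # average text length per item
```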
## Recreating the results
1. Clone the repo without the LFS files.
2. Install requirements from `requirements.txt`.
3. Install `pv` and `parallel`.
4. Run `bin/get_seed_urls.sh` to filter the URLs of interest out of 1.6TB of compressed data. Don't worry about disk space. Worry about the traffic. That will take around 5h on a decent connection.
5. Run the scrapy spider like this: `scrapy crawl webdatacommons_org -s WEB_DATA_COMMONS=web_data_commons_urls_sample.txt -L INFO -o webdatacommons.jsonlines`, with `WEB_DATA_COMMONS` pointing to the list of seed URLs from step 4. That might take up to a few weeks.
6. Run `python bin/extract_relevant_structured_data.py --num-threads 12 webdatacommons.jsonlines relevant.jsonlines.bz2`. That's fast, probably around 30 minutes.
7. Run `python bin/export_structured_data.py relevant.jsonlines.bz2 extruct_out.jsonlines.bz2` to obtain the final version of the dataset.
8. Optionally, you can calculate the resulting stats like this: `python bin/get_stats.py extruct_out.jsonlines.bz2 every_prompt_stats.csv`
## Advice
If you want to recreate the results:
* Get yourself a server or VPS with enough space (80GB should be enough).
* Look at the code. You'd probably want to make changes here and there.
* All the Python scripts have extra parameters to control the number of threads and the chunk size. They also accept compressed input and output files with the help of the smart_open library.
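
If you only want to inspect the compressed files, the standard library is enough. The sketch below uses `bz2` as a stand-in for smart_open and assumes, based on the `.jsonlines.bz2` file names, that each line holds one JSON document; the function names and the sample records are hypothetical.

```python
import bz2
import json
import os
import tempfile


def write_jsonlines(path, records):
    """Write records as bz2-compressed JSON Lines, one document per line."""
    with bz2.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonlines(path):
    """Lazily yield one parsed JSON document per non-empty line."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Round trip on a throwaway file to show the two helpers together
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "sample.jsonlines.bz2")
    write_jsonlines(path, [{"language": "uk"}, {"language": "en"}])
    languages = [record["language"] for record in read_jsonlines(path)]
```

Reading line by line keeps memory use flat, which matters for files with over a million documents.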
## License
The **code** of the project is released under the MIT license.
Copyright: [Dmytro Chaplynskyi](https://twitter.com/dchaplinsky), [lang-uk project](https://lang.org.ua), 2023
Provided by: lang-uk