
ryanrebel/my-dataset

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ryanrebel/my-dataset
Resource description:
---
license: other
task_categories:
- text-generation
- text-classification
- image-classification
- question-answering
- summarization
- feature-extraction
language:
- en
tags:
- web-scraping
- dataset
- webscraper-pro
- curated
- citations
pretty_name: my-dataset
size_categories:
- 1K<n<10K
---

# my-dataset

> Collected with [WebScraper Pro](https://github.com/minerofthesoal/Scraper) v0.7.2.2

## Dataset Description

This dataset was collected using [WebScraper Pro](https://github.com/minerofthesoal/Scraper), an open-source Firefox extension and CLI tool for structured web data collection with automatic scroll-first pagination, MLA/APA citations, and HuggingFace integration.

### Dataset Summary

| Metric | Value |
|--------|-------|
| **Total Records** | 2622 |
| **Total Words** | 130692 |
| **Images** | 1156 |
| **Links** | 3167 |
| **Audio Files** | 0 |
| **Pages Scraped** | 66 |
| **Unique Sources** | 10 |
| **Unique Domains** | 7 |
| **Unique Authors** | 1 |
| **Collection Date** | 20 Apr. 2026 |
| **Last Updated** | 2026-04-20T21:23:03.096Z |

### Intended Uses

- **Text Generation**: Training or fine-tuning language models on web content
- **Text Classification**: Categorizing web content by topic, sentiment, or type
- **Summarization**: Generating summaries from scraped articles
- **Question Answering**: Building QA datasets from structured web content
- **Image Classification**: Training image models on web-sourced images
- **Link Analysis**: Web graph construction and analysis
- **Audio Transcription**: Processing audio files (converted to .wav)
- **Citation Analysis**: Studying citation patterns and web attribution
- **Information Retrieval**: Building search indices from web content
- **Dataset Curation**: As a base for creating refined, domain-specific datasets

### Out-of-Scope Uses

- This dataset should NOT be used to train models for generating deceptive content
- Content should not be re-published without proper attribution
- Individual source licenses may restrict certain commercial uses

### Data Format

| File | Format | Description |
|------|--------|-------------|
| `data/text_data.jsonl` | JSONL | Scraped text content with full metadata and citations |
| `data/images.jsonl` | JSONL | Image references with alt text and dimensions |
| `data/links.jsonl` | JSONL | Extracted hyperlinks with anchor text |
| `data/audio.jsonl` | JSONL | Audio/video file references |
| `data/citations.jsonl` | JSONL | MLA + APA citation records per source |

### Data Fields

Each JSONL text record contains:

```json
{
  "id": "unique-record-id",
  "type": "text",
  "text": "scraped text content",
  "tag": "html-element-tag",
  "source_url": "https://example.com/page",
  "source_title": "Page Title",
  "author": "Original Author",
  "site_name": "example.com",
  "scraped_at": "2024-01-01T12:00:00Z",
  "citation_mla": "MLA 9th edition formatted citation",
  "citation_apa": "APA 7th edition formatted citation"
}
```

## Data Collection

Data was collected using WebScraper Pro's scroll-first auto-scan approach:

1. The scraper first scrolls down each page to determine its full length and trigger lazy-loaded content
2. It then scrolls back up and scrapes viewport by viewport, deduplicating across viewports
3. After fully scraping the current page, it looks for "Next" buttons or pagination links
4. All sources are automatically cited in both MLA 9th and APA 7th edition formats

### Collection Configuration

- **Scroll-First Mode:** Enabled (checks page length before scraping)
- **Auto-scroll:** Enabled
- **Auto-next page:** Enabled
- **Robots.txt:** Respected
- **Export Format:** csv
- **Citation Format:** MLA 9th + APA 7th

## Source Domains

- www.tampermonkey.net
- pastebin.com
- www.egifter.com
- greasyfork.org
- www.giftcardgranny.com
- www.walmart.ca
- www.google.com

## Sources & Citations

### Source Summary

| # | Source | Author | License | Content Type |
|---|--------|--------|---------|-------------|
| 1 | [Userscripts \| Tampermonkey](https://www.tampermonkey.net/scripts.php) | Unknown | See source | Web page |
| 2 | [Pastebin.com - #1 paste tool since 2002!](https://pastebin.com/) | Unknown | See source | Web page |
| 3 | [Online Gift Cards, Visa & Group Gifting \| eGifter](https://www.egifter.com/en-ca/) | eGifter | See source | WebPage |
| 4 | [Scrape-Giftcards - Pastebin.com](https://pastebin.com/5gbedVh0) | Unknown | See source | Web page |
| 5 | [Libraries](https://greasyfork.org/en/scripts/libraries) | Unknown | See source | Web page |
| 6 | [Buy Sephora Gift Cards \| GiftCardGranny](https://www.giftcardgranny.com/buy-gift-cards/sephora/) | Unknown | See source | WebPage |
| 7 | [FAQ \| Tampermonkey](https://www.tampermonkey.net/faq.php?q=Q600#Q600) | Unknown | See source | Web page |
| 8 | [FAQ \| Tampermonkey](https://www.tampermonkey.net/faq.php?q=Q300) | Unknown | See source | Web page |
| 9 | [Online Shopping Canada: Everyday Low Prices at Walmart.ca!](https://www.walmart.ca/en) | Unknown | See source | WebSite |
| 10 | [webscraper-pro/ryandelawski - Google Search](https://www.google.com/search?client=firefox-b-m&q=webscraper-pro%2Fryandelawski) | Unknown | See source | Web page |

### MLA 9th Edition Citations

1. "Userscripts | Tampermonkey." *www.tampermonkey.net*, https://www.tampermonkey.net/scripts.php. Accessed 20 Apr. 2026.
2. "Pastebin.com - #1 paste tool since 2002!." *Pastebin*, https://pastebin.com/. Accessed 20 Apr. 2026.
3. eGifter. "Online Gift Cards, Visa & Group Gifting | eGifter." *eGifter*, https://www.egifter.com/en-ca/. Accessed 20 Apr. 2026.
4. "Scrape-Giftcards - Pastebin.com." *Pastebin*, https://pastebin.com/5gbedVh0. Accessed 20 Apr. 2026.
5. "Libraries." *greasyfork.org*, https://greasyfork.org/en/scripts/libraries. Accessed 20 Apr. 2026.
6. "Buy Sephora Gift Cards | GiftCardGranny." *GiftCardGranny.com*, https://www.giftcardgranny.com/buy-gift-cards/sephora/. Accessed 20 Apr. 2026.
7. "FAQ | Tampermonkey." *www.tampermonkey.net*, https://www.tampermonkey.net/faq.php?q=Q600#Q600. Accessed 20 Apr. 2026.
8. "FAQ | Tampermonkey." *www.tampermonkey.net*, https://www.tampermonkey.net/faq.php?q=Q300. Accessed 20 Apr. 2026.
9. "Online Shopping Canada: Everyday Low Prices at Walmart.ca!." *Walmart.ca*, https://www.walmart.ca/en. Accessed 20 Apr. 2026.
10. "webscraper-pro/ryandelawski - Google Search." *www.google.com*, https://www.google.com/search?client=firefox-b-m&q=webscraper-pro%2Fryandelawski. Accessed 20 Apr. 2026.

### APA 7th Edition Citations

1. (n.d.). *Userscripts | Tampermonkey*. https://www.tampermonkey.net/scripts.php
2. Pastebin. (n.d.). *Pastebin.com - #1 paste tool since 2002!*. Pastebin. https://pastebin.com/
3. eGifter. (n.d.). *Online Gift Cards, Visa & Group Gifting | eGifter*. https://www.egifter.com/en-ca/
4. Pastebin. (n.d.). *Scrape-Giftcards - Pastebin.com*. Pastebin. https://pastebin.com/5gbedVh0
5. (n.d.). *Libraries*. https://greasyfork.org/en/scripts/libraries
6. GiftCardGranny.com. (n.d.). *Buy Sephora Gift Cards | GiftCardGranny*. GiftCardGranny.com. https://www.giftcardgranny.com/buy-gift-cards/sephora/
7. (n.d.). *FAQ | Tampermonkey*. https://www.tampermonkey.net/faq.php?q=Q600#Q600
8. (n.d.). *FAQ | Tampermonkey*. https://www.tampermonkey.net/faq.php?q=Q300
9. Walmart.ca. (n.d.). *Online Shopping Canada: Everyday Low Prices at Walmart.ca!*. Walmart.ca. https://www.walmart.ca/en
10. (n.d.). *webscraper-pro/ryandelawski - Google Search*. https://www.google.com/search?client=firefox-b-m&q=webscraper-pro%2Fryandelawski

## Licensing

### Uni-S License v3.0 (Universal Scraping License)

This dataset and the tool that collected it are governed by the **[Uni-S License v3.0](https://github.com/minerofthesoal/Scraper/blob/main/LICENSE)**.

**Key points:**

1. **We do NOT own any of this data.** All rights to scraped content belong to the original authors, creators, publishers, and rights holders.
2. **The Software (WebScraper Pro) is open source.** Standalone scraper forks must stay open source. Library use in other projects is unrestricted.
3. **Compatible with MIT, Apache 2.0, BSD, ISC, and MPL 2.0**: other projects can freely use this code.
4. **Users are solely responsible** for ensuring they have the legal right to scrape, store, and redistribute any content they collect.
5. **Citations are provided to assist attribution**, not to grant permission to use content.

### Source Content Licenses

Individual content items retain their original licensing from their respective sources. Users of this dataset MUST verify and comply with the licensing terms of each individual source before use.

**The dataset maintainer (minerofthesoal / ray0rf1re) explicitly does NOT claim ownership of any scraped content. All rights remain with original creators.**

Source licenses should be verified individually at the original URLs.
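
Because each source's license has to be checked at its original URL, it can help to list which sources actually appear in the records before using them. The following is a minimal sketch, assuming the `data/text_data.jsonl` file and the `source_url` field described under Data Fields. The helper names `iter_jsonl` and `sources_to_verify` are hypothetical and are not part of WebScraper Pro.

```python
import json
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def sources_to_verify(records):
    """Return (source_url, record_count) pairs, most frequent source first.

    Records without a `source_url` value are ignored (an assumption; the
    dataset card does not say whether the field can be missing).
    """
    counts = Counter(r["source_url"] for r in records if r.get("source_url"))
    return counts.most_common()


if __name__ == "__main__":
    # Path taken from the Data Format table; adjust to the local copy.
    data_file = Path("data/text_data.jsonl")
    if data_file.exists():
        for url, n in sources_to_verify(iter_jsonl(data_file)):
            print(f"{n:5d}  {url}")
```

The output is only a checklist of URLs to review by hand; it says nothing about what each source's license permits.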

### Attribution Requirements

- All original authors and sources are cited in both MLA 9th and APA 7th edition formats
- When using content from this dataset, you MUST cite the original source
- Citation data is available in `data/citations.jsonl`
- Any rights holder may request removal of their content by opening an issue at [github.com/minerofthesoal/Scraper](https://github.com/minerofthesoal/Scraper/issues)

## Ethical Considerations

- All data was collected from publicly accessible web pages
- Original authors and sources are cited using MLA 9th and APA 7th edition formats
- This dataset respects `robots.txt` directives
- No paywalled or login-required content was collected
- Users of this dataset should verify licensing of individual sources
- Personal information should be handled in accordance with applicable privacy laws

## Additional Information

### Collection Tool

- **Tool:** [WebScraper Pro](https://github.com/minerofthesoal/Scraper) v0.6.6.1
- **Type:** Firefox Extension + Python CLI + GUI
- **Features:** Area selection, scroll-first auto-scan, MLA/APA citations, HuggingFace upload
- **Owner Dataset:** [ray0rf1re/Site.scraped](https://huggingface.co/datasets/ray0rf1re/Site.scraped)

### Contact

For questions about this dataset, please open an issue at [github.com/minerofthesoal/Scraper](https://github.com/minerofthesoal/Scraper/issues).

---

*Generated by [WebScraper Pro](https://github.com/minerofthesoal/Scraper) v0.6.6.1*
Provider:
ryanrebel