
TheSkullery/Aether-Lite-PurHyDe

Hugging Face | Updated 2024-07-22 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/TheSkullery/Aether-Lite-PurHyDe
Resource description:
---
license: cc-by-4.0
---
<style>
  body, html {
    height: 100%; /* Ensure the full height of the page is used */
    margin: 0;
    padding: 0;
    font-family: 'Quicksand', sans-serif;
    background: linear-gradient(135deg, #2E3440 0%, #1A202C 100%);
    color: #D8DEE9;
    font-size: 16px;
  }
  .container {
    width: 100%; /* Full width */
    height: 100%; /* Full height */
    padding: 20px;
    margin: 0; /* Remove margin to fill the entire area */
    background-color: rgba(255, 255, 255, 0.02);
    border-radius: 12px;
    box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2);
    backdrop-filter: blur(10px);
    border: 1px solid rgba(255, 255, 255, 0.1);
  }
  .header h1 {
    font-size: 28px;
    color: #5F9EA0;
    margin: 0 0 20px 0;
    text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3);
  }
  .update-section h2 {
    font-size: 24px;
    color: #88C0D0;
  }
  .update-section p {
    font-size: 16px;
    line-height: 1.6;
    color: #ECEFF4;
  }
  .info img {
    width: 100%;
    border-radius: 10px;
    margin-bottom: 15px;
  }
  a {
    color: #88C0D0;
    text-decoration: none;
  }
  a:hover {
    color: #A3BE8C;
  }
  .button {
    display: inline-block;
    background-color: #5E81AC;
    color: #E5E9F0;
    padding: 10px 20px;
    border-radius: 5px;
    cursor: pointer;
    text-decoration: none;
  }
  .button:hover {
    background-color: #81A1C1;
  }
  pre {
    background-color: #2E3440;
    padding: 10px;
    border-radius: 5px;
    overflow-x: auto;
  }
  code {
    font-family: 'Courier New', monospace;
    color: #D8DEE9;
  }
</style>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Data Card</title>
  <link href="https://fonts.googleapis.com/css2?family=Quicksand:wght@400;500;600&display=swap" rel="stylesheet">
</head>
<body>
<div class="container">
  <div class="header">
    <h1>Aether Lite Dataset</h1>
  </div>
  <div class="info">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/64545af5ec40bbbd01242ca6/j-qmFohQosE_N5hAIB2dv.webp" alt="Aether Lite Dataset Image">
    <p><strong>Creator:</strong> <a href="https://huggingface.co/Steelskull" target="_blank">SteelSkull</a></p>
    <p><strong>About
    Aether-Lite-PurHyDe:</strong> The Aether-Lite dataset is designed to balance creative writing, Slop, and intelligence.</p>
    <p><strong>What's New?</strong></p>
    <p>Aether-Lite-PurHyDe</p>
    <p>This dataset is a HEAVILY cleaned and filtered version of Aether-Lite: English ONLY, with any and all AI-isms (Claude, GPT, Gemma) stripped out and aggressive fuzzy dedupe applied.</p>
    <p>Fuzzy deduplication was set to a 90% threshold.</p>
    <p>Plots were not generated, as plotting is being reworked.</p>
    <pre><code><strong>Model Name Legend =</strong>
"Pur" = "Isms-Purged"
"HyDe" = "Hyper Dedupe"</code></pre>
    <p></p>
    <p><strong>Dataset Processing Stats:</strong></p>
    <ul>
      <li>Max CPUs Used: 22/24</li>
      <li>Max RAM Used: 75GB</li>
      <li>Max Offloaded Mem Used: 100GB</li>
      <li>Overall Time: ~14 hours</li>
    </ul>
    <p><strong>Dataset Format:</strong></p>
    <ul>
      <pre><code>|-- conversations: array
|----[from: string]
|----[value: string]
|-- system: string
|-- tools: string
|-- origin: string
|-- script_version: string
|-- human_token_count: int
|-- gpt_token_count: int
|-- token_distribution: json
|-- processing_time_ms: double</code></pre>
    </ul>
    <p><strong>Dataset Summary and Usage (Processed / Removed / % Used):</strong></p>
    <ul>
      <li>jondurbin/airoboros-3.2: 53010 / 5699 / 100%</li>
      <li>jtatman/medical-sci-instruct-100k-sharegpt: 88996 / 7561 / 30%</li>
      <li>Doctor-Shotgun/no-robots-sharegpt: 9763 / 237 / 100%</li>
      <li>QuietImpostor/Sao10K-Claude-3-Opus-Instruct-15K-ShareGPT: 5284 / 4168 / 100%</li>
      <li>mrfakename/Pure-Dove-ShareGPT: 2379 / 1478 / 100%</li>
      <li>PJMixers/grimulkan_theory-of-mind-ShareGPT: 533 / 6 / 100%</li>
      <li>PJMixers/grimulkan_physical-reasoning-ShareGPT: 895 / 4 / 100%</li>
      <li>TheSkullery/WizardLM_evol_instruct_v2_Filtered_Fuzzy_Dedup_ShareGPT: 117663 / 146 / 30%</li>
      <li>MinervaAI/Aesir-Preview: 601 / 399 / 100%</li>
      <li>TheSkullery/Gryphe-Opus-WritingPrompts-merged: 2319 / 3703 / 100%</li>
      <li>mpasila/LimaRP-PIPPA-Mix-8K-Context: 861 / 1786 / 100%</li>
      <li>Alignment-Lab-AI/RPGuild-sharegpt-filtered: 5863 / 21190 / 100%</li>
    </ul>
    <p><strong>Phrase Lists to Remove:</strong></p>
    <ul>
      <li>Phrase List 1: General Dataset</li>
      <li>Phrase List 2: RP/ERP Dataset</li>
    </ul>
    <p><strong>Filtered Datatypes:</strong></p>
    <ul>
      <li>function-call</li>
      <li>function-response</li>
      <li>assistant</li>
    </ul>
    <p><strong>Fuzzy Deduplication Stats:</strong></p>
    <ul>
      <li>Starting row count: 143415</li>
      <li>Final row count: 107175</li>
      <li>Rows removed: 36240</li>
    </ul>
    <p><strong>Dataset Creation Process:</strong></p>
    <p>This dataset was created through a meticulous process involving chunking, processing, cleaning, fuzzy deduplication, and the removal of specific robot phrases. Below is a step-by-step explanation of the entire process:</p>
    <ol>
      <li><strong>Model and Tokenizer Preparation:</strong>
        <ul>
          <li>Language Model: A pre-trained FastText language model is downloaded and loaded to detect the language of the dataset entries.</li>
        </ul>
      </li>
      <li><strong>Data Filtering and Transformation:</strong>
        <ul>
          <li>Token Distribution: Initializes a token distribution dictionary to keep track of token counts in various ranges.</li>
          <li>Regex Pattern Creation: Generates regular expressions to identify and remove unwanted phrases from the dataset.</li>
          <li>Text Sanitization: Cleans up text by removing or replacing newline characters.</li>
          <li>Conversation Filtering: Filters out entire conversations if the language of the first human message is not acceptable, or if any message contains specific filtered data or matches the regex pattern.</li>
          <li>Record Transformation: Transforms each record by updating token counts and token distribution, and retains only relevant conversations.</li>
        </ul>
      </li>
      <li><strong>Chunk Processing and File Writing:</strong>
        <ul>
          <li>Chunk Processing: Processes each data chunk by applying filtering and transformation rules, accumulating token statistics, and writing the processed data to Parquet files.</li>
          <li>File
          Writing: Saves the processed chunk data into specified directories for further analysis and merging.</li>
        </ul>
      </li>
      <li><strong>Deduplication and Merging:</strong>
        <ul>
          <li>Spark Session Initialization: A Spark session is initialized to handle large-scale data processing.</li>
          <li>Schema Adaptation: Checks and adapts the schema of the Spark DataFrame if necessary.</li>
          <li>Text Embeddings: Text data is encoded into embeddings using a pre-trained model, and these embeddings are used to calculate cosine similarity for deduplication.</li>
          <li>Cosine Similarity Calculation: Calculates cosine similarity between embeddings to identify and remove duplicate entries.</li>
          <li>Plot Creation: Generates visualizations of the embeddings before and after deduplication using PCA, t-SNE, and UMAP (skipped for this release while plotting is reworked).</li>
          <li>Data Shuffling: Randomizes the order of the dataset rows to ensure a diverse and unbiased dataset.</li>
          <li>Data Sampling: Samples a percentage of each dataset based on predefined usage percentages.</li>
          <li>Schema Inspection: Inspects and prints the schema of the final dataset to ensure it meets the expected format.</li>
          <li>Final Deduplication: Deduplicates the final dataset based on cosine similarity and saves the cleaned data.</li>
        </ul>
      </li>
      <li><strong>Final Output:</strong>
        <ul>
          <li>Merged Dataset: The processed, filtered, deduplicated, and shuffled dataset is saved as a single Parquet file.</li>
        </ul>
      </li>
    </ol>
  </div>
</div>
</div>
</body>
</html>
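The conversation filtering step described in the card can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's script: the phrase list is hypothetical (the card's two phrase lists are not reproduced here), the filtered datatypes are assumed to be values of each message's `from` field, and `is_acceptable_language` is a hypothetical stand-in for the FastText language check.

```python
import re

# Hypothetical phrases for illustration; the card's real phrase lists differ.
PHRASES = ["as an AI language model", "I cannot fulfill"]

# Datatypes the card lists as filtered (assumed to appear in the "from" field).
FILTERED_TYPES = {"function-call", "function-response", "assistant"}


def build_pattern(phrases):
    """Compile one case-insensitive alternation over the escaped phrases."""
    if not phrases:
        raise ValueError("at least one phrase is required")
    return re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)


def keep_conversation(record, pattern, is_acceptable_language=lambda text: True):
    """Return True only if the whole conversation passes every filter."""
    messages = record["conversations"]
    # The language check looks only at the first human message.
    first_human = next((m for m in messages if m["from"] == "human"), None)
    if first_human is not None and not is_acceptable_language(first_human["value"]):
        return False
    # Any filtered datatype or phrase match drops the entire conversation.
    for message in messages:
        if message["from"] in FILTERED_TYPES or pattern.search(message["value"]):
            return False
    return True
```

A single matching message removes the whole conversation rather than just that turn, which matches the card's wording ("filters out entire conversations").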
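The token distribution dictionary mentioned in the card tracks how many records fall into each token-count range. The card does not state the ranges, so the bucket boundaries and labels below are hypothetical; the sketch only shows one way such a dictionary might be initialized and updated.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries; the card does not specify the real ranges.
BOUNDS = [512, 1024, 2048, 4096]
LABELS = ["0-511", "512-1023", "1024-2047", "2048-4095", "4096+"]


def new_distribution():
    """Create a distribution with every bucket starting at zero."""
    return {label: 0 for label in LABELS}


def record_tokens(distribution, token_count):
    """Increment the bucket that contains token_count."""
    index = bisect_right(BOUNDS, token_count)
    distribution[LABELS[index]] += 1
```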
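The fuzzy deduplication step compares entries by cosine similarity against a 90% threshold. The sketch below shows that idea in plain Python. It swaps in a word-count vector for the pre-trained embedding model (which the card does not name), uses a simple greedy pass instead of Spark, and assumes that a similarity at or above the threshold counts as a duplicate.

```python
import math
from collections import Counter


def vectorize(text):
    """Stand-in for an embedding model: a lowercase word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuzzy_dedupe(texts, threshold=0.9):
    """Keep a text only if it stays below the threshold against every kept text."""
    kept, kept_vectors = [], []
    for text in texts:
        vector = vectorize(text)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept
```

This greedy pass is quadratic in the number of kept rows, so it only illustrates the thresholding logic; a dataset of the size reported in the card would need a batched or approximate similarity search.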
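The sampling and shuffling steps can be sketched in the same spirit. The usage fractions correspond to the "% Used" column in the card (for example 30% or 100%), but the function name, the seed, and the rounding rule are assumptions made for illustration.

```python
import random


def sample_and_shuffle(rows_by_source, usage, seed=0):
    """Sample each source at its usage fraction, then shuffle the merged rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    merged = []
    for source, rows in rows_by_source.items():
        fraction = usage.get(source, 1.0)  # default: use the whole source
        count = round(len(rows) * fraction)
        merged.extend(rng.sample(rows, count))
    rng.shuffle(merged)
    return merged
```

For instance, a source with 10 rows and a usage of 0.3 contributes 3 rows, while a source at 1.0 contributes all of its rows.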
Provider:
TheSkullery