Name: TheSkullery/Aether-Lite-PurHyDe
Creator: TheSkullery
Published: 2024-07-22 02:08:38
License: No description available

Download link:

https://hf-mirror.com/datasets/TheSkullery/Aether-Lite-PurHyDe

Resource overview:

--- license: cc-by-4.0 --- <style> body, html { height: 100%; /* Ensure the full height of the page is used */ margin: 0; padding: 0; font-family: 'Quicksand', sans-serif; background: linear-gradient(135deg, #2E3440 0%, #1A202C 100%); color: #D8DEE9; font-size: 16px; } .container { width: 100%; /* Full width */ height: 100%; /* Full height */ padding: 20px; margin: 0; /* Remove margin to fill the entire area */ background-color: rgba(255, 255, 255, 0.02); border-radius: 12px; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); backdrop-filter: blur(10px); border: 1px solid rgba(255, 255, 255, 0.1); } .header h1 { font-size: 28px; color: #5F9EA0; margin: 0 0 20px 0; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); } .update-section h2 { font-size: 24px; color: #88C0D0; } .update-section p { font-size: 16px; line-height: 1.6; color: #ECEFF4; } .info img { width: 100%; border-radius: 10px; margin-bottom: 15px; } a { color: #88C0D0; text-decoration: none; } a:hover { color: #A3BE8C; } .button { display: inline-block; background-color: #5E81AC; color: #E5E9F0; padding: 10px 20px; border-radius: 5px; cursor: pointer; text-decoration: none; } .button:hover { background-color: #81A1C1; } pre { background-color: #2E3440; padding: 10px; border-radius: 5px; overflow-x: auto; } code { font-family: 'Courier New', monospace; color: #D8DEE9; } </style> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Data Card</title> <link href="https://fonts.googleapis.com/css2?family=Quicksand:wght@400;500;600&display=swap" rel="stylesheet"> </head> <body> <div class="container"> <div class="header"> <h1>Aether Lite Dataset</h1> </div> <div class="info"> <img src="https://cdn-uploads.huggingface.co/production/uploads/64545af5ec40bbbd01242ca6/j-qmFohQosE_N5hAIB2dv.webp" alt="Aether Lite Dataset Image"> Creator: <a href="https://huggingface.co/Steelskull" target="_blank">SteelSkull</a> About Aether-Lite-PurHyDe: The Aether-Lite 
dataset is designed to balance creative writing, slop, and intelligence. What's New?: Aether-Lite-PurHyDe. This dataset is a HEAVILY cleaned and filtered version of Aether-Lite. It is English ONLY, all AI-isms (Claude, GPT, Gemma) were stripped out, and aggressive fuzzy deduplication was applied. Fuzzy deduplication was set to a 90% threshold. Plots were not generated, as plot generation is being reworked.
<pre><code>Model Name Legend:
"Pur" = "Isms-Purged"
"HyDe" = "Hyper Dedupe"</code></pre>
Dataset Processing Stats: <ul> <li>Max CPUs Used: 22/24</li> <li>Max RAM Used: 75GB</li> <li>Max Offloaded Mem Used: 100GB</li> <li>Overall Time: ~14 hours</li> </ul>
Dataset Format:
<pre><code>|-- conversations: array
|----[from: string]
|----[value: string]
|-- system: string
|-- tools: string
|-- origin: string
|-- script_version: string
|-- human_token_count: int
|-- gpt_token_count: int
|-- token_distribution: json
|-- processing_time_ms: double</code></pre>
Datasets Used (Processed / Removed / % Used): <ul> <li>jondurbin/airoboros-3.2: 53010 / 5699 / 100%</li> <li>jtatman/medical-sci-instruct-100k-sharegpt: 88996 / 7561 / 30%</li> <li>Doctor-Shotgun/no-robots-sharegpt: 9763 / 237 / 100%</li> <li>QuietImpostor/Sao10K-Claude-3-Opus-Instruct-15K-ShareGPT: 5284 / 4168 / 100%</li> <li>mrfakename/Pure-Dove-ShareGPT: 2379 / 1478 / 100%</li> <li>PJMixers/grimulkan_theory-of-mind-ShareGPT: 533 / 6 / 100%</li> <li>PJMixers/grimulkan_physical-reasoning-ShareGPT: 895 / 4 / 100%</li> <li>TheSkullery/WizardLM_evol_instruct_v2_Filtered_Fuzzy_Dedup_ShareGPT: 117663 / 146 / 30%</li> <li>MinervaAI/Aesir-Preview: 601 / 399 / 100%</li> <li>TheSkullery/Gryphe-Opus-WritingPrompts-merged: 2319 / 3703 / 100%</li> <li>mpasila/LimaRP-PIPPA-Mix-8K-Context: 861 / 1786 / 100%</li> <li>Alignment-Lab-AI/RPGuild-sharegpt-filtered: 5863 / 21190 / 100%</li> </ul>
Phrase Lists to Remove: <ul> <li>Phrase List 1: General Dataset</li> <li>Phrase List 2: RP/ERP Dataset</li> </ul>
Filtered Datatypes: <ul>
<li>function-call</li> <li>function-response</li> <li>assistant</li> </ul> Fuzzy Deduplication Stats: <ul> <li>Starting row count: 143415 </li> <li>Final row count: 107175 </li> <li>Rows removed: 36240</li> </ul> Dataset Creation Process: This dataset was created through a meticulous process involving chunking, processing, cleaning, fuzzy deduplication, and the removal of specific robot phrases. Below is a step-by-step explanation of the entire process: <ol> <li>Model and Tokenizer Preparation: <ul> <li>Language Model: A pre-trained FastText language model is downloaded and loaded to detect the language of the dataset entries.</li> </ul> </li> <li>Data Filtering and Transformation: <ul> <li>Token Distribution: Initializes a token distribution dictionary to keep track of token counts in various ranges.</li> <li>Regex Pattern Creation: Generates regular expressions to identify and remove unwanted phrases from the dataset.</li> <li>Text Sanitization: Cleans up text by removing or replacing newline characters.</li> <li>Conversation Filtering: Filters out entire conversations if the language of the first human message is not acceptable, or if any message contains specific filtered data or matches the regex pattern.</li> <li>Record Transformation: Transforms each record by updating token counts and token distribution, and retains only relevant conversations.</li> </ul> </li> <li>Chunk Processing and File Writing: <ul> <li>Chunk Processing: Processes each data chunk by applying filtering and transformation rules, accumulating token statistics, and writing the processed data to Parquet files.</li> <li>File Writing: Saves the processed chunk data into specified directories for further analysis and merging.</li> </ul> </li> <li>Deduplication and Merging: <ul> <li>Spark Session Initialization: A Spark session is initialized to handle large-scale data processing.</li> <li>Schema Adaptation: Checks and adapts the schema of the Spark DataFrame if necessary.</li> <li>Text 
Embeddings: Text data is encoded into embeddings using a pre-trained model, and these embeddings are used to calculate cosine similarity for deduplication.</li> <li>Cosine Similarity Calculation: Calculates cosine similarity between embeddings to identify and remove duplicate entries.</li> <li>Plot Creation: Generates visualizations of the embeddings before and after deduplication using PCA, t-SNE, and UMAP.</li> <li>Data Shuffling: Randomizes the order of the dataset rows to ensure a diverse and unbiased dataset.</li> <li>Data Sampling: Samples a percentage of each dataset based on predefined usage percentages.</li> <li>Schema Inspection: Inspects and prints the schema of the final dataset to ensure it meets the expected format.</li> <li>Final Deduplication: Deduplicates the final dataset based on cosine similarity and saves the cleaned data.</li> </ul> </li> <li>Final Output: <ul> <li>Merged Dataset: The processed, filtered, deduplicated, and shuffled dataset is saved as a single Parquet file.</li> </ul> </li> </ol> </div> </div> </div> </body> </html>
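
The conversation-filtering step described in the creation process can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script used to build the dataset: the function names, the `human` role label, the `detect_language` callable (a stand-in for the FastText language model), and any phrases passed in are hypothetical. Only the three filtered roles and the overall rule (drop the whole conversation on a language, role, or phrase match) come from the card.

```python
import re
from typing import Callable, Iterable

# Roles listed under "Filtered Datatypes" in the card.
FILTERED_ROLES = {"function-call", "function-response", "assistant"}


def build_phrase_pattern(phrases: Iterable[str]) -> re.Pattern:
    """Compile one case-insensitive alternation from a phrase list."""
    # Longest phrases first so a longer phrase wins over its own prefix.
    escaped = sorted((re.escape(p) for p in phrases), key=len, reverse=True)
    if not escaped:
        # "(?!)" never matches, so an empty phrase list filters nothing.
        return re.compile(r"(?!)")
    return re.compile("|".join(escaped), re.IGNORECASE)


def keep_conversation(
    record: dict,
    pattern: re.Pattern,
    detect_language: Callable[[str], str],
    allowed_languages: frozenset = frozenset({"en"}),
) -> bool:
    """Return True if the whole conversation passes every filter."""
    turns = record["conversations"]
    # Assumption: human turns are labelled "human" (ShareGPT convention).
    first_human = next((t for t in turns if t["from"] == "human"), None)
    if first_human is None:
        # Assumption: a conversation with no human turn is dropped.
        return False
    if detect_language(first_human["value"]) not in allowed_languages:
        return False
    for turn in turns:
        if turn["from"] in FILTERED_ROLES:
            return False
        if pattern.search(turn["value"]):
            return False
    return True
```

In use, `detect_language` would wrap whatever language-identification model is loaded, and `pattern` would be built once from the phrase lists and reused for every record in a chunk.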

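
The deduplication step compares embeddings by cosine similarity against the 90% threshold mentioned above. The sketch below shows one way such a pass could work on a small list of vectors. It is an assumption-laden illustration: the greedy keep-first strategy, the `>=` comparison at the threshold, and the function names are not taken from the card, and the actual pipeline ran on Spark with embeddings from a pre-trained model rather than on plain Python lists.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def dedupe_by_similarity(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.90
) -> List[int]:
    """Return indices of rows to keep, dropping near-duplicates.

    Greedy pass: a row is kept only if its similarity to every
    previously kept row is below the threshold. This is O(n^2) and
    is meant for illustration, not for datasets of this size.
    """
    kept: List[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine_similarity(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

With this strategy the first occurrence of each near-duplicate group survives, so the result depends on row order; shuffling before or after deduplication changes which copy is retained.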
Application scenarios: