
Synth_Usernames

ModelScope community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/SicariusSicariiStuff/Synth_Usernames
Resource description:
# Synthetic Usernames

## Dataset Description

This dataset comprises **450,000** synthetically generated usernames mimicking Discord naming patterns across various communities. The corpus captures **authentic-appearing, diverse username structures** without any privacy concerns, as all entries are artificially generated using AI rather than scraped from real users.

The dataset underwent a rigorous process of **deduplication**, further cleaning and trimming, and augmentation. It was then **evaluated by multiple individuals**, who were asked to tell **authentic** usernames from **synthetically** generated ones. Interestingly, the synthetic usernames were often perceived as **more human-like** than the **actual** human usernames.

<img src="https://huggingface.co/datasets/SicariusSicariiStuff/Synth_Usernames/resolve/main/Images/Synth_Usernames.png" alt="Synthetic Usernames" style="width: 120%; min-width: 500px; display: block; margin: auto;">

The dataset contains:

- 10,000 general Discord usernames
- 70,000 NSFW female-presenting usernames
- 70,000 NSFW male-presenting usernames
- 80,000 fantasy/roleplay female-presenting usernames
- 80,000 fantasy/roleplay male-presenting usernames
- 70,000 Linux community male-presenting usernames
- 70,000 Linux community female-presenting usernames

## Intended Uses

This dataset serves several critical purposes for **improving AI language models**:

1. **Enhanced Identity Recognition**: Contemporary language models often struggle with understanding the sociolinguistic signals embedded in usernames. This dataset provides examples of how identity markers manifest in online self-representation across different communities.
2. **Mitigation of Demographic Bias**: Many current models exhibit biases in how they process and respond to different username types. This dataset helps models develop more balanced response patterns across diverse virtual identities.
3. **Subcultural Contextual Understanding**: The dataset contains rich examples of community-specific naming conventions that represent significant portions of online identity formation but are underrepresented in curated training datasets.
4. **Inclusive Data Creation**: The synthetic nature of this dataset addresses representational gaps in existing datasets, particularly for underrepresented demographics like female Linux users. By providing balanced representation across categories that are typically skewed in real-world data, this dataset enables more inclusive AI development without requiring extraction of data from vulnerable or underrepresented groups.
5. **Domain Knowledge Integration**: The specialized subcategories (Linux users, fantasy roleplay communities) provide valuable training examples for recognizing domain-specific knowledge signals from minimal textual input.
6. **Pattern Recognition Enhancement**: The dataset facilitates improved recognition of naming patterns that indicate specific community affiliations, helping models understand implicit social cues.
7. **Organic Placeholder Population**: Many AI templates and synthetic datasets use generic placeholders like `{{char}}` or `{{user}}` that require realistic usernames for testing and development. This dataset provides a privacy-preserving source of authentic-appearing usernames to populate such placeholders without risking exposure of real user identities.

## Ethical Considerations

This dataset is entirely synthetic, eliminating privacy concerns while preserving the linguistic structures and signaling patterns that make usernames valuable for understanding online identity formation.

Researchers should implement appropriate validation to ensure the synthetic generation process has not inadvertently reproduced existing usernames or created patterns that could be deemed offensive or harmful when deployed in real-world applications.
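One possible shape for the validation mentioned above is a simple screening pass. The sketch below is an illustration under stated assumptions, not the dataset author's process: `screen_usernames`, the `known_usernames` reference set, and the `blocked_terms` list are all hypothetical names, and plain substring matching is only a rough stand-in for a real offensive-content policy.

```python
from __future__ import annotations

from typing import Iterable


def screen_usernames(
    candidates: Iterable[str],
    known_usernames: set[str],
    blocked_terms: Iterable[str],
) -> dict[str, list[str]]:
    """Sort candidates into accepted names and names flagged for review.

    known_usernames and blocked_terms are hypothetical inputs that the
    caller must supply; the dataset itself does not ship them.
    """
    # Compare case-insensitively so "Alice" and "alice" count as the same name.
    known = {name.casefold() for name in known_usernames}
    blocked = [term.casefold() for term in blocked_terms]

    report: dict[str, list[str]] = {"accepted": [], "collision": [], "blocked": []}
    for raw in candidates:
        name = raw.strip()
        if not name:
            continue  # ignore blank entries
        folded = name.casefold()
        if folded in known:
            # Exact (case-insensitive) match with an existing username.
            report["collision"].append(name)
        elif any(term in folded for term in blocked):
            # Contains a term the caller considers unacceptable.
            report["blocked"].append(name)
        else:
            report["accepted"].append(name)
    return report
```

Names in the `collision` and `blocked` buckets would still need human review; the function only narrows down what to look at.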
The dataset's synthetic nature provides a solution to the ethical challenges of training on real user data, while still enabling models to develop sophisticated understanding of online identity markers.

## Disclaimer and Limitations of Liability

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS, CONTRIBUTORS, OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.

This dataset, while synthetic, contains username patterns that may reflect sensitive domains. Users of this dataset acknowledge that:

1. No guarantees are made regarding the absence of offensive or inappropriate content in the synthetic generation
2. The creators and distributors assume no responsibility for how this dataset is used or implemented
3. Users bear sole responsibility for ensuring appropriate safeguards when incorporating this data
4. No claims are made regarding the dataset's suitability for any particular application

## Citation Information

```
@dataset{Synth_Usernames,
  author = {SicariusSicariiStuff},
  title = {Synthetic Usernames},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/SicariusSicariiStuff/Synth_Usernames}
}
```

## Dataset Structure

The dataset is provided as plain text files with one username per line:

- `general.txt`: Contains 10,000 general Discord usernames
- `nsfw_female.txt`: Contains 70,000 NSFW female-presenting usernames
- `nsfw_male.txt`: Contains 70,000 NSFW male-presenting usernames
- `fantasy_female.txt`: Contains 80,000 fantasy/roleplay female-presenting usernames
- `fantasy_male.txt`: Contains 80,000 fantasy/roleplay male-presenting usernames
- `linux_female.txt`: Contains 70,000 Linux community female-presenting usernames
- `linux_male.txt`: Contains 70,000 Linux community male-presenting usernames
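Because each file holds one username per line, reading the data takes only a few lines of code. The following is a minimal sketch that also shows the `{{user}}` / `{{char}}` placeholder use described under Intended Uses. The helper names `load_usernames` and `fill_placeholders` are hypothetical, and UTF-8 encoding is an assumption, since the dataset card does not state the encoding.

```python
from __future__ import annotations

from pathlib import Path


def load_usernames(path: str | Path) -> list[str]:
    """Read a one-username-per-line text file, skipping blank lines.

    UTF-8 is assumed here; the dataset card does not specify an encoding.
    """
    with open(path, encoding="utf-8") as handle:
        # Strip only the line terminator so the username itself is untouched.
        stripped = (line.rstrip("\r\n") for line in handle)
        return [name for name in stripped if name]


def fill_placeholders(template: str, user: str, char: str) -> str:
    """Substitute the {{user}} and {{char}} placeholders in a template."""
    return template.replace("{{user}}", user).replace("{{char}}", char)
```

With the files downloaded locally, something like `fill_placeholders("{{user}} greets {{char}}.", names[0], names[1])`, where `names = load_usernames("general.txt")`, would produce a test prompt populated with synthetic usernames instead of real ones.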

Provider: maas
Created: 2025-11-20