five

UBW_Tapestries

收藏
魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/SicariusSicariiStuff/UBW_Tapestries
下载链接
链接失效反馈
官方服务:
资源简介:
# Underwater Basket Weaving Tapestries ## Dataset Description This dataset comprises **1,000** conversations from various boards on **4chan**, collected between 2018-2024. The corpus captures **authentic, unfiltered communication patterns** from one of the internet's most distinctive subcultural spaces. <img src="https://huggingface.co/datasets/SicariusSicariiStuff/UBW_Tapestries/resolve/main/Images/UBW_Tapestries.GIF" alt="UBW_Tapestries" style="width: 120%; min-width: 500px; display: block; margin: auto;"> ## Intended Uses This dataset serves several critical purposes for **improving AI language models**: 1. **Reduction of Positivity Bias**: Contemporary language models often exhibit excessive positivity and agreeableness. This dataset provides natural examples of disagreement, criticism, and authentic negative sentiment expression that can help models learn more balanced response patterns. 2. **Mitigation of "AI-Generated" Text Markers**: Many current models produce content with telltale stylistic markers that reveal their non-human origin. The linguistic patterns in this dataset help models incorporate more diverse sentence structures, informal language patterns, and natural conversational flow. 3. **Enhanced Vernacular Understanding**: The dataset contains rich examples of internet-specific language, neologisms, slang, and communication patterns that represent significant portions of online discourse but are underrepresented in curated training datasets. 4. **Improved Contextual Understanding**: 4chan's thread structure and communication patterns require sophisticated contextual understanding to follow conversation threads that contain multiple references, callbacks, and implicit knowledge. 5. **Sarcasm and Irony Detection**: The dataset provides numerous examples of complex linguistic phenomena like sarcasm, irony, and hyperbole in natural contexts, helping models develop a better understanding of these challenging language features. 6. **Cultural Reference Recognition**: The corpus contains rich examples of internet culture, memes, and intertextuality that can enhance models' ability to recognize and understand cultural references. 7. **Authentic Argumentation Patterns**: Unlike curated debate datasets, this collection demonstrates how disagreements and persuasion actually occur in unmoderated spaces, providing valuable training examples for understanding real-world argumentation. ## Ethical Considerations This dataset requires careful processing to remove excessive **unwanted** toxcism, illegal content, and extreme content that violates ethical AI training practices. Researchers should implement appropriate content filtering while preserving the linguistic structures and communication patterns that make the dataset valuable. The dataset should be used with a clear awareness of the potential risks and misuse of exposing language models to unfiltered internet content, balanced against the necessity of training models that can understand and appropriately respond to the full spectrum of human communication. ## Disclaimer and Limitations of Liability THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS, CONTRIBUTORS, OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET. This dataset contains content that may be offensive, disturbing, or inappropriate. Users of this dataset acknowledge that: 1. No filtering guarantees are made regarding the absence of harmful, offensive, or illegal content 2. The creators and distributors assume no responsibility for how this dataset is used or implemented 3. Users bear sole responsibility for ensuring appropriate safeguards when incorporating this data 4. No claims are made regarding the dataset's suitability for any particular application ## Citation Information ``` @dataset{UBW_Tapestries, author = {SicariusSicariiStuff}, title = {Underwater Basket Weaving Tapestries}, year = {2025}, publisher = {Hugging Face}, url = {https://huggingface.co/datasets/SicariusSicariiStuff/UBW_Tapestries} } ``` ## Dataset Structure The dataset is provided in JSON format with the following fields: - `system`: system prompt, includes board and topic (where applicable) specification - `human`: each contains a single message by anon - `gpt`: each contains a single message by anon

# 水下编篮挂毯(Underwater Basket Weaving Tapestries) ## 数据集描述 本数据集收录了2018年至2024年间从4chan各板块抓取的共计**1000组**对话语料。该语料库保留了互联网最具特色的亚文化空间之一的**真实、未经过滤的沟通模式**。 ![UBW_Tapestries](https://huggingface.co/datasets/SicariusSicariiStuff/UBW_Tapestries/resolve/main/Images/UBW_Tapestries.GIF) ## 预期用途 本数据集为**优化大语言模型(Large Language Model,LLM)**提供了多项关键用途: 1. **降低正向偏差**:当前大语言模型常表现出过度的正向性与迎合性。本数据集提供了分歧、批评与真实负面情绪表达的自然示例,可助力模型学习更均衡的回复模式。 2. **缓解AI生成文本特征**:多数现有模型生成的内容带有可识别的风格标记,暴露其非人类生成的属性。本数据集包含的语言模式可帮助模型融入更多样化的句式结构、非正式语言风格与自然对话流程。 3. **增强网络通俗用语理解能力**:本数据集包含大量互联网专属语言、新造词、俚语与沟通模式示例,这些内容是在线话语的重要组成部分,但在精选训练数据集中占比不足。 4. **提升上下文理解能力**:4chan的帖子结构与沟通模式要求具备复杂的上下文理解能力,才能读懂包含多重引用、呼应与隐性知识的对话线程。 5. **讽刺与反语识别**:本数据集提供了大量自然语境下的复杂语言现象示例,包括讽刺、反语与夸张手法,可助力模型更好地理解这类高难度语言特征。 6. **文化典故识别能力**:该语料库包含丰富的互联网文化、梗与互文性示例,可提升模型识别与理解各类文化典故的能力。 7. **还原真实论证模式**:与精选辩论数据集不同,本集合展示了无监管空间中分歧与说服的真实发生方式,为理解现实世界的论证逻辑提供了宝贵的训练示例。 ## 伦理考量 本数据集需经过严格处理,移除过度的**有害**攻击性内容、非法内容与违反伦理AI训练准则的极端内容。研究人员应采用合适的内容过滤手段,同时保留赋予本数据集价值的语言结构与沟通模式。 本数据集的使用需明确意识到:将语言模型暴露于未经过滤的互联网内容中存在潜在风险与滥用可能,需与训练模型理解并恰当回应全范围人类交流的必要性之间取得平衡。 ## 免责声明与责任限制 本数据集按“现状”提供,不附带任何明示或默示的担保,包括但不限于适销性、特定用途适用性与不侵权的担保。在任何情况下,作者、贡献者或版权持有人均不对因本数据集或本数据集的使用或其他相关操作而产生的任何索赔、损害或其他责任承担责任,无论该责任源于合同、侵权或其他事由。 本数据集包含可能具有冒犯性、令人不安或不适当的内容。数据集使用者需知晓: 1. 未对数据集不存在有害、冒犯或非法内容作出任何过滤保证 2. 创作者与分发者不对本数据集的使用或实施方式承担任何责任 3. 使用者需独自承担确保在整合该数据时采取适当防护措施的责任 4. 未对本数据集是否适用于任何特定应用作出任何声明 ## 引用信息 @dataset{UBW_Tapestries, author = {SicariusSicariiStuff}, title = {Underwater Basket Weaving Tapestries}, year = {2025}, publisher = {Hugging Face}, url = {https://huggingface.co/datasets/SicariusSicariiStuff/UBW_Tapestries} } ## 数据集结构 本数据集以JSON格式提供,包含以下字段: - `system`: 系统提示,包含板块与主题(如适用)说明 - `human`: 每条对应匿名用户的单条消息 - `gpt`: 每条对应匿名用户的单条消息
提供机构:
maas
创建时间:
2025-11-20
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作