Chat2Find/Chat2Find-Corpus

Name: Chat2Find/Chat2Find-Corpus
Creator: Chat2Find
Published: 2026-04-11 05:11:13
License: MIT (per the dataset card)

Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Chat2Find/Chat2Find-Corpus

Resource description:

---
license: mit
language:
- si
- ta
- en
tags:
- conversational
- trilingual
- sri-lanka
- chat2find
pretty_name: Chat2Find Corpus
size_categories:
- 100K<n<1M
---

# Chat2Find Corpus

The **Chat2Find Corpus** is a high-quality trilingual conversational dataset derived from real-world interactions on the [Chat2Find](https://chat2find.com) platform. It contains approximately **255 Million tokens** in **Sinhala (සිංහල)**, **Tamil (தமிழ்)**, and **English**, including significant instances of **Singlish** and **Tanglish** (transliterated and code-mixed speech). This makes it an ideal resource for Continual Pre-training (CPT) and Supervised Fine-Tuning (SFT) of Large Language Models in low-resource and multilingual contexts.

---

## New: Chat2Find Reasoning & Tool Datasets

We have just released the instruction-tuned and reasoning subsets, designed to teach foundation models complex logic, chain-of-thought, and API tool-calling in Sinhala, Tamil, and English.

* **[Free Preview (5,000 Records)](https://huggingface.co/datasets/Chat2Find/Chat2Find-Instruct-Reasoning-Sample):** A robust sample containing heavily mixed single-turn instructions and deep multi-turn agentic workflows.
* **[Full Dataset (279k Records / 1.8 GB)](https://huggingface.co/datasets/Chat2Find/Chat2Find-Instruct-Reasoning-Dataset):** The massive, complete instruction dataset available under a commercial/advanced research license.

---

## Dataset Details

- **Origin:** Real conversations and tool-assisted interactions from Chat2Find.
- **Languages:** Trilingual (Sinhala, Tamil, English) + **Singlish** & **Tanglish**.
- **Format:** JSON Lines (`.jsonl`).
- **Volume:** Approximately 255 Million tokens (~279,248 records).
- **Content:** Information seeking, role-playing, and cultural/local knowledge specific to the South Asian region (primarily Sri Lanka and India).

## Key Features

- **Non-Synthetic:** Unlike many large-scale datasets, this corpus originates from real platform usage, capturing natural language patterns, code-switching, and local nuances.
- **Trilingual & Code-Mixed:** Seamless transitions between Sinhala, Tamil, and English. Naturally includes Singlish and Tanglish as used by the community.
- **Foundation for Future Models:** This corpus is the primary training data for Chat2Find's upcoming open-weights model suite.
- **Metadata:** Each record includes a `source` tag and a unique `record_id`.

## Intended Use

Specifically designed for:

1. **Continual Pre-training (CPT):** To enhance the multilingual capabilities of base models (e.g., Qwen, Llama, Gemma).
2. **Domain Adaptation:** Improving model performance on Sri Lankan/South Asian cultural, logistical, and linguistic queries.
3. **Research:** Exploring code-switching and multilingual information retrieval.

## Dataset Structure

The dataset is partitioned into multiple JSON Lines (`.jsonl`) files located in the `data/` directory for better accessibility and streaming. Each line is a JSON object:

```json
{
  "text": "...",
  "metadata": {
    "source": "chat2find.com",
    "record_id": 12345
  }
}
```

## Licensing

This dataset is released under the **MIT License**.

## Upcoming Model Releases (Coming Soon)

Chat2Find is actively developing a suite of models trained on this corpus. These will be released with **Open Weights** to the community:

1. **Chat2Find Base:** A foundational trilingual model.
2. **Chat2Find Instruct:** Optimised for following complex instructions in Sinhala, Tamil, and English.
3. **Chat2Find Reasoning:** A high-logic model designed for complex problem-solving and chain-of-thought reasoning in a multilingual context.

**Stay tuned to this repository and chat2find.com for updates.**
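The record layout shown under Dataset Structure can be checked with a small parser. The following is a minimal sketch, not the publisher's tooling: the `Record` class and `parse_record` function are hypothetical names introduced here, and it assumes every row carries exactly the `text`, `metadata.source`, and `metadata.record_id` fields shown in the card.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical container mirroring the fields shown in the dataset card."""
    text: str
    source: str
    record_id: int


def parse_record(line: str) -> Record:
    """Parse one JSON Lines row; raise ValueError if the shape is unexpected."""
    obj = json.loads(line)  # malformed JSON raises JSONDecodeError, a ValueError subclass
    try:
        meta = obj["metadata"]
        return Record(
            text=obj["text"],
            source=meta["source"],
            record_id=int(meta["record_id"]),
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"unexpected record shape: {exc!r}") from exc


# The sample row from the card, written on a single line as JSON Lines requires.
sample = '{"text": "...", "metadata": {"source": "chat2find.com", "record_id": 12345}}'
record = parse_record(sample)
print(record.record_id, record.source)
```

Failing loudly on a missing field, instead of silently skipping the row, makes schema drift between shards visible early.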

Provider:

Chat2Find

Aggregated Summary

Dataset Overview

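Construction

The Chat2Find-Corpus dataset is built from real user conversations and tool-assisted interactions on the Chat2Find platform; the text is collected natural language, not synthetic output. It covers information seeking, role-playing, and cultural knowledge specific to South Asia, and organizes roughly 255 million tokens of text in JSON Lines format, preserving native language use and realistic scenarios.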
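As a quick sanity check on scale, the two published figures (about 255 million tokens and 279,248 records) imply an average record length. The sketch below only divides those rounded numbers; the tokenizer behind the token count is not stated in the source, so the result is an order-of-magnitude estimate and the constant names are illustrative.

```python
# Figures quoted on the dataset card; the token count is a rounded approximation.
TOTAL_TOKENS = 255_000_000
TOTAL_RECORDS = 279_248

# Mean tokens per record implied by the two published numbers.
avg_tokens_per_record = TOTAL_TOKENS / TOTAL_RECORDS
print(f"about {avg_tokens_per_record:.0f} tokens per record on average")
```

An average in the high hundreds of tokens suggests that a record holds more than a single short message, although the card does not say how records are delimited.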

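Usage

The dataset is mainly used for continual pre-training and supervised fine-tuning of large language models, and is particularly suited to improving domain adaptation for Sri Lankan and South Asian cultural and linguistic queries. Researchers can also use it to study code-switching and multilingual information retrieval. The data is stored as sharded JSON Lines files, which supports streaming reads and efficient processing.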
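Because the data is split into JSON Lines shards, it can be processed one record at a time without loading a whole file. The following is a minimal sketch under stated assumptions: the shards have been downloaded to a local directory, they match the glob pattern `*.jsonl` (the actual shard file names are not given in the source), and `iter_records` and `count_characters` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(data_dir: str) -> Iterator[dict]:
    """Lazily yield one parsed record at a time from every .jsonl shard in data_dir."""
    # Sorting gives a deterministic shard order across runs.
    for shard in sorted(Path(data_dir).glob("*.jsonl")):
        with shard.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)


def count_characters(data_dir: str) -> int:
    """Sum len(text) over all records while holding only one record in memory."""
    return sum(len(rec["text"]) for rec in iter_records(data_dir))
```

Calling `count_characters("data")` on a local copy would walk every shard in order; the generator keeps memory use flat regardless of corpus size.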

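Background and Challenges

Background Overview

The Chat2Find-Corpus dataset was recently built by the Chat2Find platform to address the under-representation of low-resource South Asian languages in natural language processing. Drawn from real conversations, it covers trilingual interactions in Sinhala, Tamil, and English and includes code-mixed varieties such as Singlish and Tanglish, making it a key resource for research on cross-lingual understanding and cultural adaptation in large language models. Its design focuses on information retrieval, role-playing, and the expression of local knowledge, with the aim of improving how well models adapt to the language ecosystems of Sri Lanka and India.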

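Current Challenges

The dataset targets the challenges of building multilingual dialogue systems in low-resource settings, including modeling code-mixing, cross-lingual semantic alignment, and integrating culture-specific knowledge. Building it involved privacy and ethical considerations in collecting real conversations, quality validation of the trilingual data, and the difficulty of annotating non-standard varieties such as Singlish and Tanglish, all of which complicate data cleaning and standardization.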

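Common Scenarios

Typical Use Cases

In low-resource multilingual natural language processing, Chat2Find-Corpus is well suited to continual pre-training and supervised fine-tuning because of its real-conversation origin and its trilingual, code-mixed character. By capturing natural switches among Sinhala, Tamil, and English, including mixed varieties such as Singlish and Tanglish, it gives researchers training material that reflects real language use, and it is particularly suited to improving robustness on South Asian language understanding and generation tasks.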

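Research Problems Addressed

The dataset addresses core challenges in developing models for low-resource languages, such as data scarcity and the modeling of language mixing. By providing large-scale, non-synthetic trilingual conversations, it supports research on cross-lingual transfer learning, code-switching analysis, and culture-specific knowledge embedding, offers an empirical basis for multilingual information retrieval and domain adaptation methods, and supports more inclusive language technology in the Global South.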
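A first pass at code-switching analysis can profile which writing systems appear in a record. The sketch below is a rough heuristic, not a method described in the source: it counts characters in the Unicode Sinhala block (U+0D80 to U+0DFF), the Tamil block (U+0B80 to U+0BFF), and ASCII letters. The function names are hypothetical. Note the limitation: romanized Singlish and Tanglish are written in Latin script, so script counts alone cannot separate them from English.

```python
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Map a character to a coarse script label, or None if it is not a tracked letter."""
    code = ord(ch)
    if 0x0D80 <= code <= 0x0DFF:  # Unicode Sinhala block
        return "sinhala"
    if 0x0B80 <= code <= 0x0BFF:  # Unicode Tamil block
        return "tamil"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None  # digits, punctuation, whitespace, other scripts


def script_profile(text: str) -> Counter:
    """Count characters per tracked script in a piece of text."""
    return Counter(label for label in map(script_of, text) if label is not None)


def is_script_mixed(text: str) -> bool:
    """True when more than one tracked script occurs in the text."""
    return len(script_profile(text)) > 1


print(is_script_mixed("සිංහල and தமிழ்"))
```

Records flagged this way are candidates for closer inspection; detecting transliterated code-mixing would need a language-identification step beyond this sketch.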

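Practical Applications

In deployment, Chat2Find-Corpus can support conversational systems and information service platforms aimed at Sri Lanka and South Asia. Its regional cultural knowledge and localized expressions can improve the multilingual interaction of virtual assistants, educational tools, and customer-service bots, helping such solutions reach linguistically diverse communities and making digital services more accessible and useful.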

Recent Research on the Dataset