skML/lexisyn-zh-en-1.0
Source: Hugging Face. Updated 2024-10-07; indexed 2025-11-03.
Download link:
https://hf-mirror.com/datasets/skML/lexisyn-zh-en-1.0
Resource description:
---
license: cc-by-4.0
task_categories:
- translation
language:
- zh
- en
pretty_name: LexiSyn zh-en 1.0
size_categories:
- 100K<n<1M
---
# LexiSyn 1.0
A synthetic Chinese-English parallel corpus, released by sparkastML.
It contains 146,917 entries in total, or about 300,000 sentences.
## Synthetic data sources
About 43% of the raw text was scraped from the Internet by our crawler in September 2024.
The rest was randomly sampled from the `part-00185-6f0afd98-d375-4d7f-8299-ac5e070bf4fc-c000.jsonl` file in [CCI3-HQ](https://huggingface.co/datasets/BAAI/CCI3-HQ).
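The card does not say how the random sampling was carried out. As a minimal sketch under that caveat, reservoir sampling is one common way to draw a uniform sample from a large JSONL file without loading it into memory. The function name `sample_jsonl_lines` and its parameters are hypothetical, not part of the authors' pipeline.

```python
import json
import random


def sample_jsonl_lines(lines, k, seed=0):
    """Reservoir-sample up to k parsed records from an iterable of JSONL lines.

    Hypothetical helper: illustrates uniform sampling in one pass,
    not the method actually used to build the dataset.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of non-empty records read so far
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = record
    return reservoir
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `sample_jsonl_lines(open(path, encoding="utf-8"), 1000)`, where `path` points to a local copy of the JSONL file.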
## Synthetic method
We used LLMs to segment the raw text and translate it, turning the raw data into sentence pairs.
The prompt given to the LLM was:
```
The user will provide some text. Please parse the text into segments, each segment contains 1 to 5 sentences. Translate each sentence into the corresponding language. If the input is in Chinese, return the English translation, and vice versa.
IMPORTANT:
1. Segment should not be too long, each segment should be under 100 English words or 180 Chinese characters.
2. For segments or sentences that appear multiple times in the original text, they are only output **once** in the returned translation.
3. **For content with obvious semantic differences, such as different components on a web page, no matter how short it is, it should be divided into a separate segment.**
4. **Information such as web page headers, footers, and other fixed text, such as copyright notices, website or company names, and conventional link text (such as "About Us", "Privacy Policy", etc.) will be **ignored and not translated**
5. If the provided text lacks proper punctuation, please add proper punctuation to both the source text and the translated text in the output.
EXAMPLE INPUT:
法律之前人人平等,并有权享受法律的平等保护,不受任何歧视。人人有权享受平等保护,以免受违反本宣言的任何歧视行为以及煽动这种歧视的任何行为之害。
EXAMPLE JSON OUTPUT:
{
"segments": [
{"chinese": "法律之前人人平等,并有权享受法律的平等保护,不受任何歧视。", "english": "All are equal before the law and are entitled without any discrimination to equal protection of the law."},
{"chinese": "人人有权享受平等保护,以免受违反本宣言的任何歧视行为以及煽动这种歧视的任何行为之害。", "english": "All are entitled to equal protection against any discrimination in violation of this Declaration and against any incitement to such discrimination."}
]
}
```
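A reply in the JSON shape shown in the prompt can be turned into sentence pairs with a few lines of Python. The following is a minimal sketch, not the authors' post-processing code: the helper names `parse_segments` and `within_limits` are hypothetical, and applying both length limits to every pair is an assumption, since the prompt states them as "100 English words or 180 Chinese characters".

```python
import json

# Limits taken from rule 1 of the prompt; treating them as strict
# upper bounds on both sides is an assumption.
MAX_EN_WORDS = 100
MAX_ZH_CHARS = 180


def parse_segments(raw_json):
    """Parse a model reply into a list of (chinese, english) pairs.

    Segments missing either side are skipped.
    """
    data = json.loads(raw_json)
    pairs = []
    for seg in data.get("segments", []):
        if not isinstance(seg, dict):
            continue
        zh = str(seg.get("chinese", "")).strip()
        en = str(seg.get("english", "")).strip()
        if zh and en:
            pairs.append((zh, en))
    return pairs


def within_limits(zh, en):
    """Check a pair against the length limits stated in the prompt."""
    return len(en.split()) < MAX_EN_WORDS and len(zh) < MAX_ZH_CHARS
```

A typical use would be `[p for p in parse_segments(reply) if within_limits(*p)]`, where `reply` is the raw JSON string returned by the model.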
Provider:
skML



