alwaysgood/en-econ-explanation-article-512-32
Source: Hugging Face · Updated 2026-03-31 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/alwaysgood/en-econ-explanation-article-512-32
Dataset description:
---
language:
- en
license: other
tags:
- economics
- finance
- education
- news
- cpt
- qwen
size_categories:
- 100K<n<1M
---
# External EN Econ Explanation + Article CPT (512/32)
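An English CPT (continued pre-training) corpus built by merging, cleaning, and chunking four external sources (explanation-style and article-style).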
## Sources
- `KadamParth/NCERT_Economics_12th` (Explanation)
- `KadamParth/NCERT_Business_Studies_12th` (Explanation)
- `KadamParth/NCERT_Accounting_12th` (Explanation)
- `XJCEO/bloomberg_financial_news_120k` (Article)
## Processing Pipeline
1. Merge target fields only (`Explanation`, `Article`)
2. External cleaning:
   - `<|endoftext|>` artifact split/cleanup
   - Bloomberg contact boilerplate tail removal
   - table/separator noise removal (`====`, `YOY%`, etc.)
   - leading/trailing sentence-fragment trimming
3. Exact dedup
4. Sentence-aware chunking
   - tokenizer: `Qwen/Qwen3.5-4B`
   - `max_tokens=512`, `overlap_tokens=32`
   - docs with any `hard_split` dropped
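
The chunking step can be illustrated with a minimal sketch. This is not the script used to build the corpus: the regex sentence splitter, the whitespace token counter (a stand-in for the `Qwen/Qwen3.5-4B` tokenizer), and the function names `split_sentences` and `chunk_document` are all assumptions made for illustration. It packs whole sentences into chunks of at most `max_tokens`, carries trailing sentences worth up to `overlap_tokens` into the next chunk, and returns `None` for a document that would need a hard split.

```python
import re
from typing import Callable, List, Optional, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def whitespace_count(text: str) -> int:
    # Stand-in token counter (assumption); the card names a Qwen tokenizer.
    return len(text.split())


def chunk_document(
    text: str,
    max_tokens: int = 512,
    overlap_tokens: int = 32,
    count_tokens: Callable[[str], int] = whitespace_count,
) -> Optional[List[str]]:
    sentences = split_sentences(text)
    sizes = [count_tokens(s) for s in sentences]
    # A single sentence longer than max_tokens would require a hard split,
    # so the whole document is dropped (signalled here by None).
    if any(n > max_tokens for n in sizes):
        return None

    chunks: List[str] = []
    current: List[Tuple[str, int]] = []
    current_size = 0
    for sent, n in zip(sentences, sizes):
        if current and current_size + n > max_tokens:
            chunks.append(" ".join(s for s, _ in current))
            # Carry trailing sentences into the next chunk as overlap,
            # without exceeding overlap_tokens or the chunk budget.
            carried: List[Tuple[str, int]] = []
            carried_size = 0
            for s, m in reversed(current):
                if carried_size + m > overlap_tokens:
                    break
                if carried_size + m + n > max_tokens:
                    break
                carried.insert(0, (s, m))
                carried_size += m
            current, current_size = carried, carried_size
        current.append((sent, n))
        current_size += n
    if current:
        chunks.append(" ".join(s for s, _ in current))
    return chunks
```

With a real subword tokenizer, the token count of a joined chunk can differ slightly from the sum of its sentence counts, so a production version would likely re-check each finished chunk against `max_tokens`.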
## Latest Build (2026-03-31)
From `external_econ_clean_summary.json` and `external_econ_chunks_sentence_512_32_summary.json`:
- docs_after_exact_dedup: **123,336**
- docs_chunked: **123,164**
- docs_split: **48,191**
- chunks_total: **205,614**
- avg_chunks_per_doc: **1.6694**
- docs_dropped_hard_split: **172**
- hard_split_chunks in output: **0**
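
The figures above are internally consistent, which a short check can confirm. The sketch below hard-codes the reported numbers rather than reading the summary JSON files, whose exact key layout is not shown here. The function name `check_build_stats` is hypothetical, and the last rule assumes that `docs_split` counts documents that produced more than one chunk.

```python
from typing import Dict, List


def check_build_stats(stats: Dict[str, float]) -> List[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems: List[str] = []

    # Documents dropped for hard splits account for the gap between
    # deduplicated and chunked document counts.
    expected_chunked = stats["docs_after_exact_dedup"] - stats["docs_dropped_hard_split"]
    if stats["docs_chunked"] != expected_chunked:
        problems.append("docs_chunked does not equal dedup count minus dropped docs")

    # The reported average is total chunks divided by chunked documents.
    avg = round(stats["chunks_total"] / stats["docs_chunked"], 4)
    if avg != stats["avg_chunks_per_doc"]:
        problems.append(f"avg_chunks_per_doc mismatch: computed {avg}")

    # Assumption: a split document yields at least two chunks and every
    # other chunked document yields exactly one.
    if stats["chunks_total"] < stats["docs_chunked"] + stats["docs_split"]:
        problems.append("chunks_total is too small for the reported docs_split")

    return problems


reported = {
    "docs_after_exact_dedup": 123_336,
    "docs_chunked": 123_164,
    "docs_split": 48_191,
    "chunks_total": 205_614,
    "avg_chunks_per_doc": 1.6694,
    "docs_dropped_hard_split": 172,
}
issues = check_build_stats(reported)
```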
## Files
- `external_econ_dedup.jsonl`
- `external_econ_clean_summary.json`
- `external_econ_chunks_sentence_512_32.jsonl`
- `external_econ_chunks_sentence_512_32_summary.json`
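
The `.jsonl` files can be streamed one record at a time with the standard library alone. The record schema is not documented here, so this sketch makes no assumption about field names and only reports which keys appear; the helper names `iter_jsonl` and `collect_keys` are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, Set


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def collect_keys(path: str, limit: int = 1000) -> Set[str]:
    """Collect the field names seen in the first `limit` records."""
    keys: Set[str] = set()
    for index, record in enumerate(iter_jsonl(path)):
        if index >= limit:
            break
        keys.update(record.keys())
    return keys
```

For example, `collect_keys("external_econ_chunks_sentence_512_32.jsonl")` would list the fields present in a local copy of the chunk file before any training code depends on them.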
## Notes
- License and usage terms may differ for each original source.
- Before any actual training or distribution, be sure to check each source's license and usage policy.
Provider:
alwaysgood



