
alwaysgood/en-econ-explanation-article-512-32

Hugging Face, updated 2026-03-31, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/alwaysgood/en-econ-explanation-article-512-32
Resource description:

---
language:
- en
license: other
tags:
- economics
- finance
- education
- news
- cpt
- qwen
size_categories:
- 100K<n<1M
---

# External EN Econ Explanation + Article CPT (512/32)

An English CPT corpus built by merging, cleaning, and chunking four external sources (explanation-style and article-style).

## Sources

- `KadamParth/NCERT_Economics_12th` (Explanation)
- `KadamParth/NCERT_Business_Studies_12th` (Explanation)
- `KadamParth/NCERT_Accounting_12th` (Explanation)
- `XJCEO/bloomberg_financial_news_120k` (Article)

## Processing Pipeline

1. Merge target fields only (`Explanation`, `Article`)
2. External cleaning:
   - `<|endoftext|>` artifact split/cleanup
   - Bloomberg contact boilerplate tail removal
   - table/separator noise removal (`====`, `YOY%`, etc.)
   - leading/trailing sentence-fragment trimming
3. Exact dedup
4. Sentence-aware chunking
   - tokenizer: `Qwen/Qwen3.5-4B`
   - `max_tokens=512`, `overlap_tokens=32`
   - docs with any `hard_split` dropped

## Latest Build (2026-03-31)

From `external_econ_clean_summary.json` and `external_econ_chunks_sentence_512_32_summary.json`:

- docs_after_exact_dedup: **123,336**
- docs_chunked: **123,164**
- docs_split: **48,191**
- chunks_total: **205,614**
- avg_chunks_per_doc: **1.6694**
- docs_dropped_hard_split: **172**
- hard_split_chunks in output: **0**

## Files

- `external_econ_dedup.jsonl`
- `external_econ_clean_summary.json`
- `external_econ_chunks_sentence_512_32.jsonl`
- `external_econ_chunks_sentence_512_32_summary.json`

## Notes

- License and usage terms may differ for each original source.
- Before any actual training or distribution, be sure to check each source's license and usage policy.
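
The exact-dedup and sentence-aware chunking steps listed in the card could look roughly like the sketch below. It is an illustration under stated assumptions, not the author's actual pipeline: the function names (`exact_dedup`, `split_sentences`, `chunk_document`), the regex sentence splitter, and the whitespace token counter are all hypothetical stand-ins. The card names `Qwen/Qwen3.5-4B` as the tokenizer, so real token counts would differ from the word counts used here.

```python
import hashlib
import re
from typing import Callable, Iterable, List, Optional


def count_tokens_whitespace(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def exact_dedup(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each character-identical document."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_document(
    text: str,
    max_tokens: int = 512,
    overlap_tokens: int = 32,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> Optional[List[str]]:
    """Pack whole sentences into chunks of at most max_tokens each.

    Returns None when a single sentence exceeds max_tokens, i.e. when the
    document would need a hard split; the caller can then drop the document.
    Chunk length is estimated as the sum of per-sentence counts.
    """
    sentences = split_sentences(text)
    if any(count_tokens(s) > max_tokens for s in sentences):
        return None

    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing whole sentences forward, up to overlap_tokens.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if carried_len + m > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += m
            # Skip the overlap if it leaves no room for the next sentence.
            if carried_len + n > max_tokens:
                carried, carried_len = [], 0
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    docs = exact_dedup(["A b c. D e f. G h i.", "A b c. D e f. G h i."])
    for doc in docs:
        print(chunk_document(doc, max_tokens=6, overlap_tokens=3))
```

In this sketch the overlap is made of whole sentences only, so a chunk may carry fewer than `overlap_tokens` tokens from its predecessor. To approximate the card's settings more closely, one could pass a `count_tokens` function backed by the named tokenizer in place of the whitespace counter, assuming it is available in the local environment.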
Provider:
alwaysgood