
daikin-industries-ltd/ja-fineweb-2-hvac-fastText-filtered-v2

Source: Hugging Face. Updated 2025-12-22; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/daikin-industries-ltd/ja-fineweb-2-hvac-fastText-filtered-v2
Description:
This is a Japanese text dataset on HVAC (heating, ventilation, and air conditioning). It was built by running a fastText classifier over the Japanese subset of FineWeb2 and keeping texts related to air conditioning, ventilation, and heating/cooling equipment. The dataset contains 1,319,300 records, totaling 1,789,635,419 characters and 1,898,101,943 tokens. Each record averages 1,356.5 characters and 1,438.7 tokens. The fastText classification scores range from 0.5 to 1.0, with a mean of 0.8010. Each record carries the following fields: text content, original URL, Common Crawl dump name, crawl date, language code, language detection score, and fastText classification score. The content covers air conditioning product information, ventilation systems, installation and maintenance of heating/cooling equipment, insulated and energy-efficient housing, ZEH (Net Zero Energy House), heat pump technology, commercial air conditioning equipment, air purification, dehumidification, and humidification.
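
The per-record averages quoted above follow directly from the totals, and the score range suggests that a stricter threshold than 0.5 could be applied downstream. The sketch below recomputes the averages and shows one way such a filter might look on in-memory records. The field name `fasttext_score` and the helper names are assumptions made for illustration; the listing does not give the dataset's actual column names.

```python
# Totals quoted in the dataset description.
NUM_RECORDS = 1_319_300
TOTAL_CHARS = 1_789_635_419
TOTAL_TOKENS = 1_898_101_943


def per_record_average(total: int, count: int) -> float:
    """Return the mean per record, rounded to one decimal place."""
    return round(total / count, 1)


avg_chars = per_record_average(TOTAL_CHARS, NUM_RECORDS)
avg_tokens = per_record_average(TOTAL_TOKENS, NUM_RECORDS)


def filter_by_score(records, min_score=0.9, score_key="fasttext_score"):
    """Keep records whose classifier score is at least min_score.

    score_key is a hypothetical field name; check the real schema
    before using this on the actual dataset.
    """
    return [r for r in records if r.get(score_key, 0.0) >= min_score]


if __name__ == "__main__":
    print(f"average characters per record: {avg_chars}")
    print(f"average tokens per record: {avg_tokens}")

    # Toy records standing in for dataset rows (not real data).
    sample = [
        {"text": "example A", "fasttext_score": 0.95},
        {"text": "example B", "fasttext_score": 0.62},
    ]
    print(filter_by_score(sample))
```

Raising the threshold trades coverage for precision: every record already scores at least 0.5, so a cutoff above the 0.8010 mean would discard a substantial share of the corpus.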
Provider:
daikin-industries-ltd