Company Visitor User Insight Data for Large-Model Training Scenarios
Zhejiang Data Intellectual Property Registration Platform; updated 2025-12-13; indexed 2025-12-16
Download link:
https://www.zjip.org.cn/home/announce/info
Resource description:
The core value of this company-visitor insight dataset for large language model (LLM) training is that billion-scale device usage behavior (for example, the behavior of a user's mobile apps after the phone connects to a company's WiFi) is de-identified and aggregated into macro-level group profiles of the company's visitors. Such high-quality statistical data lets LLMs learn real user distributions, social common sense, and market trends, which in turn can improve reasoning, calibrate outputs, and reduce hallucinations. It can also serve as a "fact benchmark" and a "cognitive graph", reusable across LLM pre-training, supervised fine-tuning, and evaluation.
Its detailed group-level statistical dimensions serve as key features and benchmarks for LLM training and optimization. For pre-training and knowledge enhancement, fields such as age supply sociodemographic and consumer-behavior knowledge. Once integrated into the pre-training corpus, the data helps the model understand real users more accurately, so that its answers to prompts such as "design a product for young people" are more consistent with business logic.
For dialogue and recommendation, age shares and TGI (Target Group Index) values quantify group preferences, and instruction samples for fine-tuning vertical-domain models can be built from them. For example, the instruction "analyze the typical user profile of visitors to a wealth management company" would have the expected output "mainly aged 25-40 in first-tier cities, TGI=135". Samples of this kind help the model develop "quantitative analysis" thinking and make vertical-domain dialogue more professional.
In addition, the data can be used to calibrate model outputs and evaluate hallucinations, and it can be integrated into Retrieval-Augmented Generation (RAG) systems so that the model's answers to user-profile questions are grounded in real data.
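The instruction-sample idea described above can be illustrated with a minimal Python sketch. It assumes the conventional definition of TGI (the share of a segment within the target group divided by its share in the baseline population, times 100). The function names, the record layout, and the two share values are hypothetical and chosen only to reproduce the TGI=135 figure from the example; they are not taken from the dataset or from the provider's tooling.

```python
import json


def tgi(group_share: float, baseline_share: float) -> float:
    """Target Group Index: 100 * (segment share in target group) / (segment share in baseline)."""
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return 100.0 * group_share / baseline_share


def build_sft_sample(company_type: str, segment: str,
                     group_share: float, baseline_share: float) -> dict:
    """Build one instruction/output pair from an aggregated statistic (hypothetical layout)."""
    index = round(tgi(group_share, baseline_share))
    return {
        "instruction": f"Analyze the typical user profile of visitors to a {company_type}.",
        "output": f"Mainly {segment}, TGI={index}",
    }


# Hypothetical shares: the segment makes up 54% of visitors vs. 40% of the baseline population.
sample = build_sft_sample(
    company_type="wealth management company",
    segment="aged 25-40 in first-tier cities",
    group_share=0.54,
    baseline_share=0.40,
)
print(json.dumps(sample, ensure_ascii=False))
```

A TGI above 100 indicates that the segment is over-represented among visitors relative to the baseline, which is why the figure is a compact way to express group preference in a training sample.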
Provider:
每日互动股份有限公司
Created:
2025-12-13
Collected Summary
Dataset Introduction

Background and Challenges
Background Overview
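This dataset actually contains several medical examination datasets (such as colonoscopy and gastroscopy examinations). It was applied for by Taizhou Cancer Hospital (Wenling Second People's Hospital), and the data source is authorized public data. The datasets use a quantitative assessment method, based on the Analytic Hierarchy Process (AHP) and a scoring system, to stratify intestinal and gastric lesions by risk (high, medium, or low). The aim is to support screening, early diagnosis, and clinical decision-making for colorectal and rectal cancer, and to improve the standardization of diagnosis and treatment management. The dataset name "Company Visitor User Insight Data for Large-Model Training Scenarios" does not match this content and may be an erroneous or misleading label.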
The above content was collected and summarized by 遇见数据集.



