EcomGPT eval|电子商务数据集|语言模型评估数据集
收藏EcomGPT数据集概述
数据集基本信息
- 名称: EcomInstruct
- 规模: 250万条指令数据
- 特点: 首个电子商务领域指令数据集,通过构建原子任务扩展数据规模和任务多样性
- 数据类型: 电子商务基础数据类型(产品信息、用户评论等)
数据集构成
- 评估数据集: 12个(已开源)
- 语言: 英文(EN)和中文(ZH)
- 任务类型:
- 命名实体识别
- 实体跨度检测
- 抽取式问答
- 评论主题分类
- 属性值识别
- 属性值检测
- 产品选择
- 产品对齐
- 标题属性匹配
- 细粒度产品分类
- 粗粒度产品分类
- 标题生成
文件结构
. ├── [Dataset Name] │ └── tasks │ └── [task name] │ ├── meta-info.json │ └── test.json ...
性能表现
- 评估结果: EcomGPT在12个电子商务保留数据集上的人工评估中,表现优于或与ChatGPT相当
相关模型
- 模型名称: EcomGPT (7b1)
- 模型来源: ModelScope (https://www.modelscope.cn/models/damo/nlp_ecomgpt_multilingual-7B-ecom/summary)
- 基础模型: BLOOMZ
引用
bigquery @article{li2023ecomgpt, title={EcomGPT: Instruction-tuning Large Language Models with Chain-of-Task Tasks for E-commerce}, author={Li, Yangning and Ma, Shirong and Wang, Xiaobin and Huang, Shen and Jiang, Chengyue and Zheng, Hai-Tao and Xie, Pengjun and Huang, Fei and Jiang, Yong}, journal={arXiv preprint arXiv:2308.06966}, year={2023} }

Materials Project 在线材料数据库
Materials Project 是一个由伯克利加州大学和劳伦斯伯克利国家实验室于 2011 年共同发起的大型开放式在线材料数据库。这个项目的目标是利用高通量第一性原理计算,为超过百万种无机材料提供全面的性能数据、结构信息和计算模拟结果,以此加速新材料的发现和创新过程。数据库中的数据不仅包括晶体结构和能量特性,还涵盖了电子结构和热力学性质等详尽信息,为研究人员提供了丰富的材料数据资源。相关论文成果为「Commentary: The Materials Project: A materials genome approach to accelerating materials innovation」。
超神经 收录
yahoo-finance-data
该数据集包含从Yahoo! Finance、Nasdaq和U.S. Department of the Treasury获取的财务数据,旨在用于研究和教育目的。数据集包括公司详细信息、高管信息、财务指标、历史盈利、股票价格、股息事件、股票拆分、汇率和每日国债收益率等。每个数据集都有其来源、简要描述以及列出的列及其数据类型和描述。数据定期更新,并以Parquet格式提供,可通过DuckDB进行查询。
huggingface 收录
Population and Housing Census of 2007 - Ethiopia
Geographic coverage --------------------------- National coverage Analysis unit --------------------------- Household Person Housing unit Universe --------------------------- The census has counted people on dejure and defacto basis. The dejure population comprises all the persons who belong to a given area at a given time by virtue of usual residence, while under defacto approach people were counted as the residents of the place where they found. In the census, a person is said to be a usual resident of a household (and hence an area) if he/she has been residing in the household continuously for at least six months before the census day or intends to reside in the household for six months or longer. Thus, visitors are not included with the usual (dejure) population. Homeless persons were enumerated in the place where they spent the night on the enumeration day. The 2007 census counted foreign nationals who were residing in the city administration. On the other hand all Ethiopians living abroad were not counted. Kind of data --------------------------- Census/enumeration data [cen] Mode of data collection --------------------------- Face-to-face [f2f] Research instrument --------------------------- Two type sof questionnaires were used to collect census data: i) Short questionnaire ii) Long questionnaire Unlike the previous censuses, the contents of the short and long questionnaires were similar both for the urban and rural areas as well as for the entire city. But the short and the long questionnaires differ by the number of variables they contained. That is, the short questionnaire was used to collect basic data on population characteristics, such as population size, sex, age, language, ethnic group, religion, orphanhood and disability. Whereas the long questionnaire includes information on marital status, education, economic activity, migration, fertility, mortality, as well as housing stocks and conditions in addition to those questions contained in a short questionnaire.
catalog.ihsn.org 收录
LinkedIn Salary Insights Dataset
LinkedIn Salary Insights Dataset 提供了全球范围内的薪资数据,包括不同职位、行业、地理位置和经验水平的薪资信息。该数据集旨在帮助用户了解薪资趋势和市场行情,支持职业规划和薪资谈判。
www.linkedin.com 收录
CE-CSL
CE-CSL数据集是由哈尔滨工程大学智能科学与工程学院创建的中文连续手语数据集,旨在解决现有数据集在复杂环境下的局限性。该数据集包含5,988个从日常生活场景中收集的连续手语视频片段,涵盖超过70种不同的复杂背景,确保了数据集的代表性和泛化能力。数据集的创建过程严格遵循实际应用导向,通过收集大量真实场景下的手语视频材料,覆盖了广泛的情境变化和环境复杂性。CE-CSL数据集主要应用于连续手语识别领域,旨在提高手语识别技术在复杂环境中的准确性和效率,促进聋人与听人社区之间的无障碍沟通。
arXiv 收录