LongEval | Long-Text Processing Dataset | Model Evaluation Dataset
LongChat Dataset Overview
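Dataset Introduction
- Official repository: supports training and evaluating chatbots built on long-context LLMs
- Consists of two main components, LongChat and LongEval
- Related findings are described in the accompanying blog post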
Latest News
- August 2023: released LongChat v1.5, based on Llama 2, with support for a 32K context length
Model Resources
- Pretrained models:
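Training Configuration
- The example training script uses 8xA100 GPUs
- Key parameters:
  - Maximum model length: 16384
  - Training epochs: 3
  - Learning rate: 2e-5
  - Batch size: 1 (training) / 4 (evaluation)
- Uses FlashAttention to handle very long sequences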
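
The parameters listed above can be collected into a small configuration object to check the resulting batch arithmetic. This is a minimal sketch, not the repository's training script: the class name `TrainConfig`, its field names, and the assumptions that the batch sizes are per device and that no gradient accumulation is used are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the values listed in this section."""

    num_gpus: int = 8               # the example script uses 8xA100
    model_max_length: int = 16384   # maximum sequence length in tokens
    num_train_epochs: int = 3
    learning_rate: float = 2e-5
    train_batch_size: int = 1       # assumed to be per device
    eval_batch_size: int = 4        # assumed to be per device
    grad_accum_steps: int = 1       # assumption: not stated in the source

    def global_batch_size(self) -> int:
        """Sequences consumed per optimizer step across all GPUs."""
        return self.num_gpus * self.train_batch_size * self.grad_accum_steps

    def max_tokens_per_step(self) -> int:
        """Upper bound on tokens per optimizer step (every sequence at full length)."""
        return self.global_batch_size() * self.model_max_length


cfg = TrainConfig()
print(cfg.global_batch_size())
print(cfg.max_tokens_per_step())
```

Under these assumptions each optimizer step sees at most 8 sequences of up to 16384 tokens each. With sequences this long and a per-device batch of 1, attention memory is the main constraint, which is consistent with the use of FlashAttention.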
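Evaluation
- Provides two evaluation tasks:
  - Coarse-grained topic retrieval (topics)
  - Line retrieval (lines)
- The evaluation script supports custom models and tasks
- Includes test-case generation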
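
To make the line-retrieval idea concrete, the sketch below generates one synthetic test case and scores an answer. It is an illustration under stated assumptions, not the official generator: the function names `make_line_retrieval_case` and `score`, the wording of each line, and the scoring rule are all hypothetical.

```python
import random
import re


def make_line_retrieval_case(num_lines: int, seed: int = 0):
    """Build one hypothetical line-retrieval test case.

    Returns (prompt, expected_value). The line wording is an assumption
    made for illustration, not the format used by the official scripts.
    """
    rng = random.Random(seed)
    values = [rng.randint(1, 50000) for _ in range(num_lines)]
    lines = [
        f"line {i}: REGISTER_CONTENT is <{v}>"
        for i, v in enumerate(values, start=1)
    ]
    target = rng.randint(1, num_lines)  # 1-based index of the line to retrieve
    question = f"What is the REGISTER_CONTENT in line {target}?"
    prompt = "\n".join(lines + [question])
    return prompt, values[target - 1]


def score(answer: str, expected: int) -> bool:
    """Count the answer as correct if its last number equals the expected value."""
    numbers = re.findall(r"\d+", answer)
    return bool(numbers) and int(numbers[-1]) == expected


prompt, expected = make_line_retrieval_case(num_lines=20, seed=1)
print(prompt.splitlines()[-1])
print(score(f"The value is <{expected}>", expected))
```

Increasing `num_lines` lengthens the prompt, so the same generator can probe accuracy at different context lengths.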
Citation
@misc{longchat2023,
    title = {How Long Can Open-Source LLMs Truly Promise on Context Length?},
    url = {https://lmsys.org/blog/2023-06-29-longchat},
    author = {Dacheng Li*, Rulin Shao*, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang},
    month = {June},
    year = {2023}
}

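OpenPose
The OpenPose dataset contains data for human pose estimation and is mainly used to train and evaluate human pose detection algorithms. It includes multi-view images and videos annotated with human keypoint locations, and is suited to research on human pose recognition and action analysis.
Indexed on github.com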
Yahoo Finance
A dataset of financial data related to the stock market.
Indexed on kaggle
China National Digital Geological Map (Public Version at 1∶200 000 Scale) Spatial Database
The only one of its kind, the China National Digital Geological Map (Public Version at 1∶200 000 scale) Spatial Database (CNDGM-PVSD) is based on the measured results of China's former nationwide regional geological survey at 1∶200 000 scale, and is one of the nationwide basic geoscience spatial databases jointly completed by multiple Chinese organizations. Spatially, it comprises 1 163 geological map sheets (at 1∶200 000 scale) in both MapGIS and ArcGIS formats, covering 72% of China's territory with a total data volume of 90 GB. Its main sources are 1∶200 000 regional geological survey reports, geological maps, and mineral resources maps, with an original time span from the mid-1950s to the early 1990s. Approved by the relevant state agencies, it meets the technical qualification requirements and standards issued by the China Geological Survey for data integrity, logical consistency, location accuracy, attribution fineness, and collation precision, and is therefore of reliable quality. The CNDGM-PVSD is an important component of China's national spatial database categories, serving as a spatial digital platform for the informatization of the national economy and providing an information backbone for national and provincial economic planning, geohazard monitoring, geological survey, mineral resources exploration, and macro-level decision-making.
Indexed on DataCite Commons
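中指 (Zhongzhi) Database, Property Management Edition
The Property Management Edition addresses property management companies' pressing need to "find projects" and "find industry and company data". It provides efficient market-expansion channels, the latest industry news, and multi-dimensional data on competitors to support sound business decisions.
Indexed on 西部数据交易中心 (Western Data Exchange Center)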
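PDT Dataset
The PDT dataset is a UAV object-detection dataset jointly developed by the Shandong Computer Science Center (National Supercomputing Center in Jinan) and Qilu University of Technology (Shandong Academy of Sciences), designed specifically for detecting tree pests and diseases. It comes in high-resolution and low-resolution versions, totaling 5775 images of both healthy pine trees and pine trees affected by pests and diseases. The dataset was built through field collection, data preprocessing, and manual annotation, with the aim of providing high-accuracy object-detection support for precision spraying by UAVs in agriculture. Its main application area is agricultural UAV technology, where it is intended to improve target-recognition accuracy in plant protection and address the shortcomings of conventional detection models in real-world use.
Indexed on arXiv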