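TaoLi_data | Chinese education dataset | natural language processing dataset
Overview of the dataset for "Taoli" (桃李) 1.0, a large language model for international Chinese education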
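Dataset background
- A large language model built for the field of international Chinese education
- Aims to address the limited effectiveness of general-purpose large models in vertical domains
- Built from more than 500 international Chinese education textbooks and teaching aids, Chinese proficiency test (HSK) questions, and learner's dictionaries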
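Dataset contents
General instruction-tuning data
- Alpaca-GPT4 data: 52k Chinese, 52k English
International Chinese education instruction-tuning data
Grammatical error correction data
- Source: YACLC development set (minimal-edit/fluency data) + HSK essay scoring data (document level)
- Example: correcting grammatical errors in learner text
Definition generation data
- Source: the Contemporary Chinese Dictionary (现代汉语词典) and dictionaries of Chinese as a foreign language
- Example: explaining the meaning of a word in a specific context
Text simplification data
- Source: Multi-Reference Chinese Text Simplification Dataset
- Size: 723 structurally complex sentences (each with multiple reference simplifications)
- Example: simplifying specialized text so that it suits non-specialist readers
Controlled text generation data
- Source: International Chinese Education Dynamic Corpus (汉语国际教育动态语料库, CTC)
- Example: showing how a specific grammar point is used in a sentence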
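To make the task types above more concrete, here is a minimal sketch of how one grammatical error correction item could be stored as an instruction-tuning record. It assumes the `instruction` / `input` / `output` field layout used by Alpaca-style data; the page does not document the actual Taoli schema, and the helper name `make_record` and the sentence pair are hypothetical, written only for illustration.

```python
import json


def make_record(instruction: str, input_text: str, output: str) -> dict:
    """Build one record in the Alpaca-style layout (assumed, not confirmed for Taoli)."""
    return {"instruction": instruction, "input": input_text, "output": output}


# Hypothetical learner sentence with a word-order error and its correction.
record = make_record(
    instruction="Correct the grammatical errors in the following learner sentence.",
    input_text="我学习中文三年了已经。",
    output="我已经学习中文三年了。",
)

# One JSON object per line (JSON Lines); ensure_ascii=False keeps the
# Chinese characters readable instead of escaping them as \uXXXX.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

The same three-field layout would also fit the other task types (definition generation, text simplification, controlled generation) by changing only the instruction text.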
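Data scale
- 88,000 high-quality international Chinese education question-answer items in total
- Including:
  - 9k grammatical error correction items
  - 4k definition generation items
  - 6k text simplification items
  - 6k controlled text generation items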
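As a quick sanity check on the figures above, the short sketch below adds up the four listed subsets and compares the sum with the stated total. The dictionary keys are illustrative labels, not names from the dataset. The listed subsets account for only part of the 88,000 items, and the page does not say how the remainder is composed.

```python
# Subset sizes in thousands, as listed on the page (keys are illustrative labels).
subsets_k = {
    "grammatical_error_correction": 9,
    "definition_generation": 4,
    "text_simplification": 6,
    "controlled_text_generation": 6,
}

stated_total = 88_000
listed_total = sum(subsets_k.values()) * 1000
unaccounted = stated_total - listed_total

print(f"listed subsets: {listed_total}")       # 25000
print(f"stated total:   {stated_total}")       # 88000
print(f"not broken down on the page: {unaccounted}")  # 63000
```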
Model information
- Base model: LLaMA 7B
- Current version: taoli-llama-7b-1.0
- Training: instruction fine-tuning on top of Chinese-LLaMA-7B
Performance
Exam ability test (HSK levels 4-6)
| Exam level | Taoli 1.0 score | GPT-4 score |
|---|---|---|
| HSK4 | 55 | 78 |
| HSK5 | 60 | 85 |
| HSK6 | 42 | 76 |
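The gap between the two models can be read directly off the table; the sketch below just makes the arithmetic explicit. It uses only the six scores shown above, and the variable names are illustrative.

```python
# (Taoli 1.0 score, GPT-4 score) per exam level, copied from the table.
scores = {
    "HSK4": (55, 78),
    "HSK5": (60, 85),
    "HSK6": (42, 76),
}

# Point gap per level: GPT-4 score minus Taoli 1.0 score.
gaps = {level: gpt4 - taoli for level, (taoli, gpt4) in scores.items()}
widest = max(gaps, key=gaps.get)

for level, gap in gaps.items():
    print(f"{level}: GPT-4 leads by {gap} points")
print(f"widest gap: {widest}")
```

On these numbers the gap is 23 points at HSK4, 25 at HSK5, and 34 at HSK6, so the difference is largest at the highest exam level.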
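Partner institutions
- Beijing Language and Culture University
- Tsinghua University
- Northeastern University (东北大学)
- Beijing Jiaotong University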
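Usage restrictions
- For academic research use only
- Commercial use is prohibited
- Generated content may contain errors and should be verified independently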
Citation
@misc{Taoli-LLama,
  author={Jingsi Yu et al.},
  title={Taoli Llama},
  year={2023},
  howpublished={\url{https://github.com/blcuicall/taoli}},
}

Billboard-Hot-100
This dataset contains historical data for every Billboard Hot 100 chart since 1958, recording each song's rank, date, performer, and other details.
Indexed on GitHub
Population and Housing Census of 2007 - Ethiopia
Geographic coverage
National coverage
Analysis unit
- Household
- Person
- Housing unit
Universe
The census counted people on both a de jure and a de facto basis. The de jure population comprises all persons who belong to a given area at a given time by virtue of usual residence, while under the de facto approach people were counted as residents of the place where they were found. In the census, a person is considered a usual resident of a household (and hence of an area) if he or she has been residing in the household continuously for at least six months before the census day, or intends to reside in the household for six months or longer. Visitors are therefore not included in the usual (de jure) population. Homeless persons were enumerated in the place where they spent the night on the enumeration day. The 2007 census counted foreign nationals who were residing in the city administration. Ethiopians living abroad, on the other hand, were not counted.
Kind of data
Census/enumeration data [cen]
Mode of data collection
Face-to-face [f2f]
Research instrument
Two types of questionnaires were used to collect census data:
i) Short questionnaire
ii) Long questionnaire
Unlike in previous censuses, the contents of the short and long questionnaires were the same for urban and rural areas, as well as for the entire city. The two questionnaires differ in the number of variables they contain. The short questionnaire was used to collect basic data on population characteristics, such as population size, sex, age, language, ethnic group, religion, orphanhood, and disability. The long questionnaire contains the same questions and additionally covers marital status, education, economic activity, migration, fertility, mortality, and housing stocks and conditions.
Indexed on catalog.ihsn.org
Yahoo Finance
A dataset of finance data related to the stock market.
Indexed on Kaggle
PDT Dataset
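The PDT dataset is a UAV (drone) object detection dataset jointly developed by the Shandong Computer Science Center (National Supercomputer Center in Jinan) and Qilu University of Technology (Shandong Academy of Sciences), built specifically for detecting tree pests and diseases. It comes in high-resolution and low-resolution versions with 5,775 images in total, covering both healthy pine trees and pine trees affected by pests and diseases. The dataset was created through field collection, data preprocessing, and manual annotation, and is meant to provide high-precision object detection support for precision spraying by agricultural UAVs. Its main application area is agricultural UAV technology, where it aims to improve target recognition accuracy in plant protection and to address the shortcomings of conventional detection models in real-world use.
Indexed on arXiv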
中国气象数据
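This dataset contains meteorological data for China from January to November 2023, including key measurements such as sunshine duration, rainfall, temperature, and wind speed. The data can be used to study how weather affects different regions and, with visualization tools, to show China's temperature distribution, precipitation, and wind speed trends.
Indexed on GitHub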