vishalp23/subject-classification
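Dataset Card for "subject"
Dataset Description
Dataset Summary
Subject is a text dataset extracted from the following sources:
Supported Tasks and Leaderboards
Languages
Dataset Structure
Data Instances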
An example looks as follows:

```json
{
  "text": "Getting Started This chapter will be about getting started with Git. We will begin by explaining some background on version control tools, then move on to how to get Git running on your system and finally how to get it set up to start working with. At the end of this chapter you should understand why Git is around, why you should use it and you should be all set up to do so. About Version Control What is “version control”, and why should you care? Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer. If you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use.",
  "label": 2
}
```
Data Fields
The data fields are:
- text: a string feature.
- label: a classification label, with possible values biology: 0, chemistry: 1, computer: 2, maths: 3, physics: 4, social sciences: 5.
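The integer labels are easier to read once mapped back to subject names. The following is a minimal sketch of such a mapping, built only from the values listed above; the names `LABEL_NAMES`, `ID_TO_LABEL`, `LABEL_TO_ID`, and `decode_label` are hypothetical helpers for illustration and are not part of the dataset itself.

```python
# Subject names in label-id order, as listed in the data fields.
LABEL_NAMES = ["biology", "chemistry", "computer", "maths", "physics", "social sciences"]

# Lookup tables in both directions (hypothetical helper names).
ID_TO_LABEL = dict(enumerate(LABEL_NAMES))
LABEL_TO_ID = {name: idx for idx, name in ID_TO_LABEL.items()}


def decode_label(example: dict) -> dict:
    """Return a copy of an example with a human-readable `label_name` added."""
    return {**example, "label_name": ID_TO_LABEL[example["label"]]}


# A shortened version of the data instance shown in this card.
example = {"text": "Getting Started This chapter will be about getting started with Git.", "label": 2}
print(decode_label(example)["label_name"])  # computer
```

The same function could be applied row by row to the full dataset once it has been loaded with whatever tooling you prefer.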
Data Splits
The dataset has 1 configuration:
- split: 338,683 examples in total, divided into a training set and a test set.
| name | train | test |
|---|---|---|
| split | 230601 | 108082 |
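As a quick sanity check on the table, the split sizes can be summed and turned into proportions. This is a small arithmetic sketch using only the numbers in the table; the variable names are illustrative.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 230601, "test": 108082}

total = sum(SPLIT_SIZES.values())
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

print(total)  # 338683
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")  # train: 68.1%, test: 31.9%
```

The split is therefore roughly 68/32 between training and test data.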
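Dataset Creation
The dataset was created by chunking textbooks and merging the chunks according to each book's overall category. For example, take the textbook "Relativity Lite: A Pictorial Translation of Einstein’s Theories of Motion and Gravity" by Jack C. Straton, Portland State University:
- We remove the cover page, table of contents, and preface.
- Extract all text from the PDF.
- Split the text into chunks with a maximum length of 512 using the chunkipy package and a BERT tokenizer.
- Record the label according to the book. For the book above, the label is 4, since it is a physics textbook.
- Merge all chunked files into a single CSV file.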
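The chunk-label-merge steps above can be sketched as follows. This is not the authors' pipeline: it assumes the text has already been extracted from the PDF, it assumes the 512 limit is counted in tokens, and it uses plain whitespace splitting as a stand-in for the chunkipy package and the BERT tokenizer. The function names `chunk_text` and `write_rows` are hypothetical.

```python
import csv
import io

MAX_TOKENS = 512  # maximum chunk length from the steps above (assumed to be in tokens)


def chunk_text(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split text into consecutive chunks of at most `max_tokens` tokens.

    Assumption: whitespace-separated words stand in for BERT word pieces.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]


def write_rows(books: list[tuple[str, int]], out) -> int:
    """Write one (text, label) CSV row per chunk and return the number of chunks."""
    writer = csv.writer(out)
    writer.writerow(["text", "label"])
    count = 0
    for text, label in books:
        for chunk in chunk_text(text):
            writer.writerow([chunk, label])
            count += 1
    return count


# Toy input: 1,100 placeholder tokens labelled 4 (physics).
books = [("word " * 1100, 4)]
buffer = io.StringIO()
print(write_rows(books, buffer))  # 3 chunks: 512 + 512 + 76 tokens
```

A real BERT tokenizer produces sub-word pieces, so chunk boundaries and counts would differ from this whitespace approximation.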
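Curation Rationale
Additional Information
Dataset Curators
Licensing Information
This dataset should be used for educational and research purposes only.
Citation Information
If you use this dataset, please cite:
References
Work in progress
Contributions
Work in progress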




