E-NER | Legal text dataset | Named entity recognition dataset
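Dataset overview
- Data source: 52 documents from the US SEC EDGAR database.
- Annotation: the named entity labels were annotated by hand.
Named entity categories
- Full scheme: named entities are divided into 7 categories: Person, Court, Business, Government, Location, Legislation/Act, Miscellaneous (plus the category "Outside" for tokens that are not named entities).
- Simplified scheme: in the file "edgar_4.csv", the categories are reduced to 4: Person, Organization, Location, Miscellaneous. Court, Business and Government are merged into Organization, and Legislation/Act and Miscellaneous are merged into Miscellaneous.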
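The merge from the full scheme to the simplified one can be sketched as a lookup table. This is a minimal illustration, assuming the label strings are spelled exactly like the category names listed above; the tag strings actually used in the files may differ, and `FULL_TO_SIMPLIFIED` and `simplify_label` are hypothetical names introduced here.

```python
# Mapping from the 7 full categories to the 4 simplified ones, as described
# in the text. ASSUMPTION: label spellings match the category names above;
# the strings used in the actual files may be different.
FULL_TO_SIMPLIFIED = {
    "Person": "Person",
    "Court": "Organization",
    "Business": "Organization",
    "Government": "Organization",
    "Location": "Location",
    "Legislation/Act": "Miscellaneous",
    "Miscellaneous": "Miscellaneous",
    "Outside": "Outside",  # non-entity tokens are left unchanged
}


def simplify_label(label: str) -> str:
    """Return the simplified category for a full-scheme category."""
    return FULL_TO_SIMPLIFIED[label]
```

An unknown label raises `KeyError`, which makes any mismatch between these assumed spellings and the real tag set visible immediately.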
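File description
- all.csv: the annotated documents, one word per line, each followed by its named entity label; the word and the label are separated by a tab.
- edgar_4.csv: the same data as "all.csv", but with the named entity categories reduced to 4.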
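Given the one-word-per-line, tab-separated layout described above, reading a file into (word, label) pairs might look like the sketch below. It assumes blank lines carry no token and can be skipped, which the description does not state; `read_tagged_tokens` and the sample lines are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, List, Tuple


def read_tagged_tokens(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse lines of the form 'word<TAB>label' into (word, label) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ASSUMPTION: blank lines hold no token and are skipped
        # Split on the last tab so the label is always the final field.
        word, sep, label = line.rpartition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got {line!r}")
        pairs.append((word, label))
    return pairs


# Made-up sample lines in the described layout (not real dataset content).
sample = ["John\tPerson\n", "Smith\tPerson\n", "\n", "sued\tOutside\n"]
tokens = read_tagged_tokens(sample)
```

To read a real file, the same function can be given an open file object, for example `open("all.csv", encoding="utf-8")`; the encoding is also an assumption.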
License

- E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text. Department of Computer Science, University College London, 2022.
LFW
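Face dataset. LFW contains 13,233 face images, each labeled with the name of the person shown, covering 5,749 people, most of whom have only one image. Each image is 250×250 pixels; most are in color, with a small number of black-and-white face images. URL: http://vis-www.cs.umass.edu/lfw/index.html#download
Indexed by AI_Studio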
GME Data
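Data on the 2021 GameStop stock activity, including daily consolidated GME short volume data, daily fails-to-deliver data, shares available to borrow, option chain data, and open/high/low/close/volume bars at different time frames.
Indexed by GitHub
URPC series datasets, S-URPC2019, UDD
The URPC series covers URPC2017 through URPC2020DL and is mainly used for underwater object detection and classification. S-URPC2019 focuses on specific detection tasks in underwater environments. The UDD dataset is not described in detail in the README.
Indexed by GitHub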
Materials Project
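The Materials Project is a collection of chemical compounds labeled with various properties. Dataset links: MP 2018.6.1 (69,239 materials), MP 2019.4.1 (133,420 materials)
Indexed by OpenDataLab
China National Digital Geological Map (Public Version at 1∶200 000 scale) Spatial Database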
As the only one of its kind, the China National Digital Geological Map (Public Version at 1∶200 000 scale) Spatial Database (CNDGM-PVSD) is based on China's former nationwide measured results of regional geological survey at 1∶200 000 scale, and is also one of the nationwide basic geosciences spatial databases jointly accomplished by multiple organizations of China. Spatially, it embraces 1 163 geological map-sheets (at scale 1∶200 000) in both MapGIS and ArcGIS formats, covering 72% of China's whole territory with a total data volume of 90 GB. Its main sources are 1∶200 000 regional geological survey reports, geological maps, and mineral resources maps, with an original time span from the mid-1950s to the early 1990s. Approved by the State's related agencies, it meets all the related technical qualification requirements and standards issued by China Geological Survey in data integrity, logic consistency, location accuracy, attribution fineness, and collation precision, and is hence of excellent and reliable quality. The CNDGM-PVSD is an important component of China's national spatial database categories, serving as a spatial digital platform for the information construction of the State's national economy, and providing information backbones to national and provincial economic planning, geohazard monitoring, geological survey, mineral resources exploration, and macro decision-making.
Indexed by DataCite Commons