
WiHArD: Wikipedia based Hierarchical Arabic Dataset

Source: Mendeley Data. Updated: 2024-03-27. Indexed: 2024-06-27.
Download link:
https://data.mendeley.com/datasets/kdkryh5rs2
Description:
WiHArD (Wikipedia based Hierarchical Arabic Dataset) is a hierarchical Arabic dataset of 6027 texts extracted from Wikipedia. It is structured into three "level 1" classes and nine "level 2" classes:

• "Level 1" classes are Culture (ثقافة), History (تاريخ) and Math (رياضيات). Texts at this level describe general notions related to these domains.
• "Level 2" classes are Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة). Texts at this level describe specific notions related to these sub-domains.

Four files are shared for the benefit of the NLP and AI communities, especially researchers working on the Arabic language:

1. WiHArD_Directory_Hierarchy.zip contains the directory hierarchy.
2. WiHArD.csv, a CSV file with three columns: the "text" column contains the Arabic texts; the "category_path" and "category_code" columns contain the category path and the category code, respectively.
3. WiHArD_Level1.csv, a CSV file restricted to the texts of the first level, namely Culture (ثقافة), History (تاريخ) and Math (رياضيات).
4. WiHArD_Level2.csv, a CSV file restricted to the texts of the second level, namely Clothes (ملابس), Food_drinks (طعام و شراب), Tourism (سياحة), Events (أحداث), Inventions (اختراعات), Monuments (أثار), Algebra (جبر), Analysis (تحليل) and Geometry (هندسة).
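
As a rough illustration of how the three-column CSV layout described above might be consumed, the sketch below counts texts per category using only the Python standard library. The column names "text", "category_path" and "category_code" come from the description; the helper name `count_by_category`, the sample rows, and the slash-separated path format are assumptions made for illustration only, since the listing does not show the actual values.

```python
import csv
import io
from collections import Counter


def count_by_category(csv_file, column="category_path"):
    """Count rows per distinct value of `column` in a WiHArD-style CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for WiHArD.csv.
# The rows, codes, and "Parent/Child" path format are invented for this sketch.
sample = io.StringIO(
    "text,category_path,category_code\n"
    "نص أول,Culture/Food_drinks,1\n"
    "نص ثان,Culture/Food_drinks,1\n"
    "نص ثالث,Math/Algebra,2\n"
)

counts = count_by_category(sample)
for category, n in counts.items():
    print(category, n)
```

With the downloaded file, the same helper could presumably be called on `open("WiHArD.csv", newline="", encoding="utf-8")`; UTF-8 is an assumption here and may need adjusting to the file's actual encoding.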
Created:
2024-01-23