
CAMeL-Lab/BAREC-10M

Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/CAMeL-Lab/BAREC-10M
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-classification
language:
- ar
tags:
- readability
pretty_name: BAREC-10M Corpus v1.0
size_categories:
- 1M<n<10M
---

# BAREC-10M Corpus v1.0

## Corpus Summary

**BAREC-10M** is an expanded version of the [Balanced Arabic Readability Evaluation Corpus (BAREC)](https://huggingface.co/datasets/CAMeL-Lab/BAREC-Corpus-v1.0), scaling from 1 million to 10 million words and broadening its scope to include balanced, multi-domain coverage. Each text is labeled by **domain**, **genre**, and **readership level**, and enriched with automatic **morphological**, **syntactic**, and **readability** analysis using state-of-the-art tools.

---

## Available Annotations

The corpus includes both document-level and sentence-level annotations.

**Document-level annotations** (manually labeled):

- **Domain**: `Arts & Humanities`, `Social Sciences`, or `STEM`
- **Readership Group**: `Foundational`, `Advanced`, or `Specialized`
- **Text Category**: `Educational Materials`, `Literature, Art & Music`, `Media & Culture`, `Academic`, `Encyclopedic`, or `Religion & Philosophy`

**Sentence-level annotations** (automatically generated):

- **Morphological analysis**
- **Syntactic parsing**
- **Readability leveling**

---

## Languages

- **Arabic** (Modern Standard Arabic)

---

## Corpus Details

The structure of the dataset directory is as follows:

```
.
├── Data/
│   ├── Metadata.xlsx
│   ├── Raw.zip
│   ├── Morphology_and_Readability.zip
│   ├── Syntax_CATiB.zip
│   └── Syntax_UD.zip
└── README.md
```

### Metadata

The metadata file contains the following fields:

- **Document**: Document file name (without extension)
- **Directory**: Document directory
- **Source**: Document source
- **Book**: Book title
- **Author**: Author name
- **Domain**
- **Readership Level**
- **Text Category**
- **Word Count**: Number of words in the document
- **Sentence Count**: Number of sentences in the document
- **In BAREC Corpus?**: Indicates whether the document originates from the original BAREC corpus (`Yes` or `No`)

### Raw Sentences

The corpus includes 20,535 `.txt` files containing raw sentences, organized into multiple directories according to the metadata.

### Morphology and Readability

The corpus includes 20,535 `.json` files containing morphological and readability annotations, organized into multiple directories according to the metadata. Each JSON file represents a document and contains the following key-value pairs:

**Sentence-level features:**

- `raw_sents`: Raw sentences (list of strings)
- `sents_word_count`: Number of words per sentence (list of integers)
- `sents_RL`: Sentence-level readability scores (list of integers from 1 to 19). The value `###` indicates problematic sentences in documents originating from the BAREC corpus.
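The sentence-level fields above can be read with the standard `json` module. The following is a minimal sketch, not the authors' tooling: the sample document is invented, and it assumes that `###` appears as a string inside the otherwise integer `sents_RL` list. The helper name `sentence_levels` is hypothetical.

```python
import json

# Hypothetical document shaped like the sentence-level fields listed above.
# The values are invented placeholders, not real corpus data.
sample_json = json.dumps({
    "raw_sents": ["sentence one", "sentence two", "sentence three"],
    "sents_word_count": [2, 2, 2],
    "sents_RL": [3, "###", 12],  # assumption: '###' is stored as a string
})


def sentence_levels(doc):
    """Pair each raw sentence with its readability level (None when marked '###')."""
    pairs = []
    for sent, level in zip(doc["raw_sents"], doc["sents_RL"]):
        pairs.append((sent, None if level == "###" else int(level)))
    return pairs


doc = json.loads(sample_json)
pairs = sentence_levels(doc)

# Keep only the levels that are usable for downstream statistics.
valid_levels = [level for _, level in pairs if level is not None]
```

Mapping `###` to `None` keeps the sentence list aligned with `sents_word_count`, so problematic sentences can be skipped later without shifting indices.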
**Word-level features:**

- `word`: Tokenized words for all sentences (list of lists of strings)
- `lex`: Lemmas of all words (list of lists of strings)
- `pos`: Part-of-speech tags (list of lists of strings)
- `RL`: Readability levels of lemmas (list of lists of integers)
- `num`, `gen`, `mod`, etc.: Additional [CAMeL Morph](https://github.com/CAMeL-Lab/camel_morph) features of all words (list of lists of strings)

### Syntax

We provide syntactic annotations in both the [Columbia Arabic Treebank (CATiB)](https://aclanthology.org/P09-2056/) and [Universal Dependencies (UD)](https://aclanthology.org/W17-1320/) schemes. The corpus includes 20,535 `.conllx` files per annotation scheme, each containing syntactic annotations and organized into multiple directories according to the metadata. We recommend using the [Palmyra tool](https://camel-lab.github.io/palmyra/index.html) for visualization and analysis of these files.

---

## Usage

You can download the files manually using the Hub's user interface, or use `snapshot_download` to download all files at once.

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="CAMeL-Lab/BAREC-10M",
    repo_type="dataset",
    local_dir="path/to/local/dir",
    allow_patterns=["Data/*"]
)
```

---

## Citation

If you use BAREC-10M in your work, please cite the following paper:

```
@inproceedings{elmadani2026large,
  author    = {Elmadani, Khalid N. and Wizani, Adel Mahmoud and Taha-Thomure, Hanada and Habash, Nizar},
  title     = {A Large and Balanced Multi-Domain Arabic Corpus Annotated for Morphology, Syntax, and Readability},
  booktitle = {Proceedings of the International Conference on Language Resources and Evaluation (LREC 2026)},
  year      = {2026},
  address   = {Palma, Mallorca, Spain}
}
```
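As a supplement to the word-level features listed under Morphology and Readability: because `word`, `lex`, `pos`, and `RL` are parallel lists of lists, one way to work with them is to walk them together and build one record per token. This is a sketch under stated assumptions. The sample values are invented placeholders, the helper name `iter_tokens` is hypothetical, and the code assumes every feature has the same number of sentences and the same number of tokens per sentence.

```python
# Hypothetical document fragment with invented placeholder values.
doc = {
    "word": [["w1", "w2"], ["w3"]],
    "lex": [["l1", "l2"], ["l3"]],
    "pos": [["noun", "verb"], ["noun"]],
    "RL": [[1, 4], [2]],
}


def iter_tokens(doc, keys=("word", "lex", "pos", "RL")):
    """Yield (sentence_index, token_record) from parallel per-sentence lists."""
    columns = [doc[key] for key in keys]
    # First zip aligns the sentences across features ...
    for sent_idx, sent_columns in enumerate(zip(*columns)):
        # ... second zip aligns the tokens within one sentence.
        for values in zip(*sent_columns):
            yield sent_idx, dict(zip(keys, values))


tokens = list(iter_tokens(doc))
```

Other morphological features such as `num` or `gen` can be included by passing them in `keys`, provided they are present in the loaded document.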
Provided by:
CAMeL-Lab