
ThatiAR

ModelScope community, updated 2025-08-29, indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/QCRI/ThatiAR
Resource description:
# ThatiAR: Subjectivity Detection in Arabic News Sentences

Along with the paper, we release the dataset and other experimental resources. The directory structure is described below.

### Files Description

- **data/**
  - `subjectivity_2024_dev.tsv`: Development set for subjectivity detection in Arabic news sentences.
  - `subjectivity_2024_test.tsv`: Test set for subjectivity detection in Arabic news sentences.
  - `subjectivity_2024_train.tsv`: Training set for subjectivity detection in Arabic news sentences.
- **instruction_explanation_dataset/**
  - `subjectivity_2024_instruct_dev.json`: Development set with instruction explanations.
  - `subjectivity_2024_instruct_test.json`: Test set with instruction explanations.
  - `subjectivity_2024_instruct_train.json`: Training set with instruction explanations.
- `licenses_by-nc-sa_4.0_legalcode.txt`: License information for the dataset, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- `README.md`: This readme file, containing information about the dataset and its structure.

## License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can view the full license in the `licenses_by-nc-sa_4.0_legalcode.txt` file.

## Usage

To use this dataset, load the TSV or JSON files into your data processing pipeline.

### Example (Python)

```python
import json

import pandas as pd


def load_tsv(file_path):
    """Read a tab-separated file into a pandas DataFrame."""
    return pd.read_csv(file_path, sep='\t')


def load_json(file_path):
    """Read a standard JSON file (not JSON Lines) with json.load()."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.load(file)


# Load training data
train_data_tsv = load_tsv('data/subjectivity_2024_train.tsv')
train_data_json = load_json('instruction_explanation_dataset/subjectivity_2024_instruct_train.json')

# Load development data
dev_data_tsv = load_tsv('data/subjectivity_2024_dev.tsv')
dev_data_json = load_json('instruction_explanation_dataset/subjectivity_2024_instruct_dev.json')

# Load test data
test_data_tsv = load_tsv('data/subjectivity_2024_test.tsv')
test_data_json = load_json('instruction_explanation_dataset/subjectivity_2024_instruct_test.json')
```

### Data splits

We split the dataset in a stratified manner, allocating 70%, 10%, and 20% for training, development, and testing, respectively.

## Citation

```
@article{ThatiAR2024,
  title = {{ThatiAR}: Subjectivity Detection in Arabic News Sentences},
  author = {Suwaileh, Reem and Hasanain, Maram and Hubail, Fatema and Zaghouani, Wajdi and Alam, Firoj},
  year = {2024},
  journal = {arXiv: 2406.05559},
}
```
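For readers unfamiliar with the stratified 70/10/20 split mentioned under "Data splits", the idea can be illustrated with a minimal sketch. This is not the authors' splitting script: the function `stratified_split`, the seed, and the toy label names `SUBJ` and `OBJ` are assumptions introduced here for illustration, and the released files are already split, so nothing below needs to be run to use the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, dev, test) index lists that keep each label's share.

    Hypothetical helper for illustration; the README does not describe
    the exact procedure used to build the released splits.
    """
    rng = random.Random(seed)

    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Split each label group separately so proportions are preserved.
        n_train = round(fractions[0] * len(indices))
        n_dev = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        dev.extend(indices[n_train:n_train + n_dev])
        test.extend(indices[n_train + n_dev:])  # remainder goes to test
    return train, dev, test


# Toy labels (assumed names, not read from the dataset files).
toy_labels = ["SUBJ"] * 30 + ["OBJ"] * 70
train_idx, dev_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Because each label group is divided on its own, the ratio of subjective to objective sentences stays roughly the same in all three splits, which is the property a stratified split is meant to guarantee.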

Provider:
maas
Created:
2025-06-17