ThatiAR
Source: ModelScope community (updated 2025-08-29, indexed 2025-06-21)
Download link: https://modelscope.cn/datasets/QCRI/ThatiAR
# ThatiAR: Subjectivity Detection in Arabic News Sentences
Along with the paper, we release the dataset and other experimental resources. Please find the attached directory structure below.
### Files Description
- **data/**
  - `subjectivity_2024_dev.tsv`: Development set for subjectivity detection in Arabic news sentences.
  - `subjectivity_2024_test.tsv`: Test set for subjectivity detection in Arabic news sentences.
  - `subjectivity_2024_train.tsv`: Training set for subjectivity detection in Arabic news sentences.
- **instruction_explanation_dataset/**
  - `subjectivity_2024_instruct_dev.json`: Development set with instruction explanations.
  - `subjectivity_2024_instruct_test.json`: Test set with instruction explanations.
  - `subjectivity_2024_instruct_train.json`: Training set with instruction explanations.
- `licenses_by-nc-sa_4.0_legalcode.txt`: License information for the dataset, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- `README.md`: This readme file, containing information about the dataset and its structure.
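As a quick sanity check after downloading, the layout listed above can be verified with a short script. This is a minimal sketch: the helper name `missing_files` and the assumption that the script runs from the dataset root are illustrative and not part of the release.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = [
    "data/subjectivity_2024_dev.tsv",
    "data/subjectivity_2024_test.tsv",
    "data/subjectivity_2024_train.tsv",
    "instruction_explanation_dataset/subjectivity_2024_instruct_dev.json",
    "instruction_explanation_dataset/subjectivity_2024_instruct_test.json",
    "instruction_explanation_dataset/subjectivity_2024_instruct_train.json",
    "licenses_by-nc-sa_4.0_legalcode.txt",
    "README.md",
]


def missing_files(root):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Assumption: the current working directory is the dataset root.
missing = missing_files(".")
print("All files present." if not missing else f"Missing: {missing}")
```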
## License
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can view the full license in the `licenses_by-nc-sa_4.0_legalcode.txt` file.
## Usage
To use this dataset, load the TSV or JSON files into your data processing pipeline.
### Example (Python)
```python
import json

import pandas as pd


def load_tsv(file_path):
    """Read a tab-separated file into a DataFrame."""
    return pd.read_csv(file_path, sep='\t')


def load_json(file_path):
    """Read a standard JSON file (not JSON Lines) with json.load()."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.load(file)


# Load training data
train_data_tsv = load_tsv('data/subjectivity_2024_train.tsv')
train_data_json = load_json('instruction_explanation_dataset/subjectivity_2024_instruct_train.json')

# Load development data
dev_data_tsv = load_tsv('data/subjectivity_2024_dev.tsv')
dev_data_json = load_json('instruction_explanation_dataset/subjectivity_2024_instruct_dev.json')

# Load test data
test_data_tsv = load_tsv('data/subjectivity_2024_test.tsv')
test_data_json = load_json('instruction_explanation_dataset/subjectivity_2024_instruct_test.json')
```
### Data splits
We split the dataset in a stratified manner, allocating 70%, 10%, and 20% for training, development, and testing, respectively.
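To illustrate what a stratified 70/10/20 split means, the sketch below partitions a toy label list so that each label keeps roughly the same proportion in every split. It is not the authors' procedure: the function name `stratified_split`, the seed, the rounding rule, and the toy label values `OBJ` and `SUBJ` are all assumptions made for illustration. The released files are already split, so this is not needed to use the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, dev, test) index lists, splitting each label group separately.

    The test split receives whatever remains after the train and dev
    portions are taken, so the third fraction is implied by the first two.
    """
    rng = random.Random(seed)

    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_dev = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        dev.extend(indices[n_train:n_train + n_dev])
        test.extend(indices[n_train + n_dev:])
    return train, dev, test


# Hypothetical toy data: 60 objective and 40 subjective sentences.
toy_labels = ["OBJ"] * 60 + ["SUBJ"] * 40
train_idx, dev_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Splitting within each label group, rather than shuffling the whole dataset once, is what keeps the label ratio stable across the three splits.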
## Citation
```
@article{ThatiAR2024,
title = {{ThatiAR}: Subjectivity Detection in Arabic News Sentences},
author = {Suwaileh, Reem and Hasanain, Maram and Hubail, Fatema and Zaghouani, Wajdi and Alam, Firoj},
year = {2024},
journal = {arXiv: 2406.05559},
}
```
Provider: maas
Created: 2025-06-17



