
CIIRC-NLP/czech-subjectivity-en

Source: Hugging Face. Updated 2024-09-03. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/CIIRC-NLP/czech-subjectivity-en
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  - name: label_text
    dtype: string
  splits:
  - name: train
    num_bytes: 1081305
    num_examples: 7443
  - name: validation
    num_bytes: 72570
    num_examples: 500
  - name: test
    num_bytes: 288457
    num_examples: 2000
  download_size: 862284
  dataset_size: 1442332
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- text-classification
language:
- en
pretty_name: Czech Subjectivity Dataset - English translation
license: cc-by-nc-sa-4.0
size_categories:
- 1K<n<10K
---

# Czech Subjectivity Dataset - English translation

This is an English translation of the original [Subj-CS](https://huggingface.co/datasets/pauli31/czech-subjectivity-dataset) dataset, created using the [WMT 21 X-En](https://huggingface.co/facebook/wmt21-dense-24-wide-x-en) model. The translation was completed for use within the [Czech-Bench](https://gitlab.com/jirkoada/czech-bench) evaluation framework. The script used for translation can be reviewed [here](https://gitlab.com/jirkoada/czech-bench/-/blob/main/benchmarks/dataset_translation.py?ref_type=heads).

## Citation

Original dataset:

```bibtex
@article{pib2022czech,
  title={Czech Dataset for Cross-lingual Subjectivity Classification},
  author={Pavel Přibáň and Josef Steinberger},
  year={2022},
  eprint={2204.13915},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

English translation:

```bibtex
@masterthesis{jirkovsky-thesis,
  author = {Jirkovský, Adam},
  title = {Benchmarking Techniques for Evaluation of Large Language Models},
  school = {Czech Technical University in Prague, Faculty of Electrical Engineering},
  year = 2024,
  URL = {https://dspace.cvut.cz/handle/10467/115227}
}
```
Provider:
CIIRC-NLP
Summary of original information

Dataset overview

Dataset name

Czech Subjectivity Dataset - English translation

Dataset features

  • text: string
  • label: integer (int64)
  • label_text: string
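
As a minimal sketch of what one record looks like under the three features listed above, the following Python snippet checks a row's field names and types. The helper `is_valid_row` and the sample row are hypothetical and used only for illustration; the card does not show actual rows or say which integer corresponds to which `label_text` value.

```python
from typing import Any, Mapping

# Field names and Python types corresponding to the features listed in the card.
EXPECTED_FIELDS = {"text": str, "label": int, "label_text": str}


def is_valid_row(row: Mapping[str, Any]) -> bool:
    """Return True if the row has exactly the expected fields with the expected types."""
    if set(row) != set(EXPECTED_FIELDS):
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    return all(
        isinstance(row[name], expected) and not isinstance(row[name], bool)
        for name, expected in EXPECTED_FIELDS.items()
    )


# Hypothetical row for illustration only; not taken from the dataset.
sample = {"text": "An example sentence.", "label": 0, "label_text": "example-label"}
print(is_valid_row(sample))
```

A check like this can be run over a few rows after loading to confirm that the data matches the declared schema.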

Dataset splits

  • Training set: 7443 examples, 1081305 bytes
  • Validation set: 500 examples, 72570 bytes
  • Test set: 2000 examples, 288457 bytes
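
To make the relative size of each split concrete, the short Python sketch below computes each split's share of the total from the example counts stated above. The variable names are illustrative only.

```python
# Example counts per split, as stated in the dataset card.
SPLIT_EXAMPLES = {"train": 7443, "validation": 500, "test": 2000}

total = sum(SPLIT_EXAMPLES.values())
shares = {name: count / total for name, count in SPLIT_EXAMPLES.items()}

print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The counts add up to 9943 examples, split roughly 75% / 5% / 20% across training, validation, and test.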

Dataset size

  • Download size: 862284 bytes
  • Total dataset size: 1442332 bytes
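
The reported sizes can be cross-checked with simple arithmetic. The Python sketch below, using only figures stated in the card, verifies that the per-split byte counts add up to the total dataset size and computes the ratio of download size to dataset size. The variable names are illustrative.

```python
# Per-split byte counts and totals, as stated in the dataset card.
SPLIT_BYTES = {"train": 1081305, "validation": 72570, "test": 288457}
DATASET_SIZE = 1442332
DOWNLOAD_SIZE = 862284

split_total = sum(SPLIT_BYTES.values())
sizes_consistent = split_total == DATASET_SIZE
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"sum of split sizes: {split_total} bytes (matches total: {sizes_consistent})")
print(f"download / dataset size: {download_ratio:.2f}")
```

The split sizes sum exactly to the stated total, and the download is about 60% of the dataset size.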

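Configuration

  • Default configuration: contains the paths to the training, validation, and test data files (data/train-*, data/validation-*, data/test-*)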
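
Because the card defines a single default configuration with three splits, the dataset can presumably be loaded with the Hugging Face `datasets` library without naming a configuration. The following is a minimal sketch under that assumption; it requires the `datasets` package and network access, and the helper names `load_splits` and `split_sizes` are hypothetical.

```python
REPO_ID = "CIIRC-NLP/czech-subjectivity-en"
EXPECTED_SPLITS = ("train", "validation", "test")


def load_splits(repo_id: str = REPO_ID):
    """Load the default configuration from the Hugging Face Hub (needs network access)."""
    # Imported lazily so this module can be imported without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id)


def split_sizes(dataset) -> dict[str, int]:
    """Return the number of examples in each expected split."""
    return {name: len(dataset[name]) for name in EXPECTED_SPLITS}
```

Calling `split_sizes(load_splits())` should, if the hosted files match the card, report 7443, 500, and 2000 examples for the three splits.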

Task category

  • Text classification

Language

  • English

License

  • cc-by-nc-sa-4.0

Size category

  • 1K<n<10K