
CZLC/sumeczech_downsampled

Source: Hugging Face. Updated 2024-08-21. Indexed 2025-04-12.
Download link: https://hf-mirror.com/datasets/CZLC/sumeczech_downsampled
Resource summary:
---
license: other
license_name: mixed
license_link: LICENSE
task_categories:
- summarization
pretty_name: SumeCzech
language:
- cs
---

# Dataset Card for SumeCzech (Downsampled)

This is a downsampled version of the SumeCzech summarization dataset. The train, dev, and test sets each have 1500 samples.

The original SumeCzech is a 1-million-document dataset of Czech news. Each document consists of:

- headline;
- abstract (visually distinguished first paragraph);
- rest of the text.

SumeCzech was developed at https://ufal.mff.cuni.cz.

## Citation

If you use this resource, please cite the following work:

```bibtex
@inproceedings{straka-etal-2018-sumeczech,
    title = "{S}ume{C}zech: Large {C}zech News-Based Summarization Dataset",
    author = "Straka, Milan and Mediankin, Nikita and Kocmi, Tom and {\v{Z}}abokrtsk{\'y}, Zden{\v{e}}k and Hude{\v{c}}ek, Vojt{\v{e}}ch and Haji{\v{c}}, Jan",
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Hasida, Koiti and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios and Tokunaga, Takenobu",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1551",
}
```

## Licensing Information

This dataset was collected by downloading specific webpages of Czech news outlets from CommonCrawl (CC). Please refer to the [CC Terms of Use](https://commoncrawl.org/terms-of-use) for specific details. The scripts used to extract the dataset are published under [MPL 2.0](https://opensource.org/license/MPL-2.0).

Members of CZLC do not own the copyright of the news contents (headlines, abstracts, and contents) included in SumeCzech. We are not responsible for their content or meaning.

## Dataset Details

### Dataset Description

- **Language(s) (NLP):** Czech
- **License:** Mozilla Public License 2.0 for the download and evaluation code; the data is not for distribution

### Dataset Sources

- **Repository:** https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2615
- **Homepage:** https://ufal.mff.cuni.cz/sumeczech
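
The card states that each split was reduced to 1500 samples but does not say how the samples were chosen. The sketch below shows one plausible approach, assuming uniform random sampling without replacement with a fixed seed. The `downsample` helper, the seed value, and the `headline`/`abstract`/`text` key names are hypothetical and only mirror the three document parts listed above; the data is synthetic.

```python
import random
from typing import Dict, List

# One document, with keys named after the three parts listed in the card.
# These key names are an assumption made for illustration.
Document = Dict[str, str]


def downsample(split: List[Document], size: int = 1500, seed: int = 0) -> List[Document]:
    """Draw `size` documents uniformly at random, without replacement.

    A split that already has `size` documents or fewer is returned unchanged.
    The fixed seed makes the selection reproducible.
    """
    if len(split) <= size:
        return list(split)
    rng = random.Random(seed)
    return rng.sample(split, size)


# Synthetic stand-in for a full split; not real SumeCzech data.
full_train = [
    {"headline": f"Headline {i}", "abstract": f"Abstract {i}", "text": f"Body {i}"}
    for i in range(10_000)
]

small_train = downsample(full_train)
print(len(small_train))  # prints 1500
```

Using a private `random.Random` instance rather than the module-level functions keeps the selection independent of any other random state in the program, so the same seed always yields the same subset.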
Provider: CZLC