DDSC/partial-danish-gigaword-small-test-sample

Name: DDSC/partial-danish-gigaword-small-test-sample
Creator: DDSC
Published: 2023-01-09 13:11:16
License: 暂无描述

Hugging Face2023-01-09 更新2026-02-07 收录

下载链接：

https://hf-mirror.com/datasets/DDSC/partial-danish-gigaword-small-test-sample

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: text dtype: string - name: source dtype: string - name: doc_id dtype: string - name: LICENSE dtype: string - name: uri dtype: string - name: date_built dtype: string splits: - name: train num_bytes: 23816547.04337273 num_examples: 2411 download_size: 11686492 dataset_size: 23816547.04337273 language: - da pretty_name: Danish Gigaword Test Sample --- # Dataset Card for "Danish Gigaword Test Sample" This is a small sample of the dataset `DDSC/partial-danish-gigaword-no-twitter`. It is meant as a small dataset for testing code. It is constructed using the following code: ```python from datasets import concatenate_datasets, load_dataset # download dataset from huggingface dataset = load_dataset("DDSC/partial-danish-gigaword-no-twitter") # All of the dataset is available in the train split - we can simply: dataset = dataset["train"] # downsample it to three domains legal = dataset.filter(lambda x: x["source"] == "retsinformationdk") news = dataset.filter(lambda x: x["source"] == "tv2r") speech = dataset.filter(lambda x: x["source"] == "spont") # downsample to 1000 samples legal = legal.select(range(1000)) news = news.select(range(1000)) # combine the three domains dataset = concatenate_datasets([legal, news, speech]) # upload to hub dataset.push_to_hub("DDSC/partial-danish-gigaword-small-test-sample") ```

提供机构：

DDSC

5,000+

优质数据集

54 个

任务类型

进入经典数据集