
danish-foundation-models/danish-gigaword

Source: Hugging Face. Updated 2024-12-14, indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/danish-foundation-models/danish-gigaword
Resource description:
---
license: other
configs:
- config_name: default
  data_files:
  - split: train
    path: '*/*.parquet'
- config_name: retsinformationdk
  data_files:
  - split: train
    path: retsinformationdk/*.parquet
- config_name: ep
  data_files:
  - split: train
    path: ep/*.parquet
- config_name: ft
  data_files:
  - split: train
    path: ft/*.parquet
- config_name: wikisource
  data_files:
  - split: train
    path: wikisource/*.parquet
- config_name: spont
  data_files:
  - split: train
    path: spont/*.parquet
- config_name: tv2r
  data_files:
  - split: train
    path: tv2r/*.parquet
- config_name: adl
  data_files:
  - split: train
    path: adl/*.parquet
- config_name: hest
  data_files:
  - split: train
    path: hest/*.parquet
- config_name: skat
  data_files:
  - split: train
    path: skat/*.parquet
- config_name: dannet
  data_files:
  - split: train
    path: dannet/*.parquet
- config_name: retspraksis
  data_files:
  - split: train
    path: retspraksis/*.parquet
- config_name: wikibooks
  data_files:
  - split: train
    path: wikibooks/*.parquet
- config_name: jvj
  data_files:
  - split: train
    path: jvj/*.parquet
- config_name: gutenberg
  data_files:
  - split: train
    path: gutenberg/*.parquet
- config_name: botxt
  data_files:
  - split: train
    path: botxt/*.parquet
- config_name: depbank
  data_files:
  - split: train
    path: depbank/*.parquet
- config_name: naat
  data_files:
  - split: train
    path: naat/*.parquet
- config_name: synne
  data_files:
  - split: train
    path: synne/*.parquet
- config_name: wiki
  data_files:
  - split: train
    path: wiki/*.parquet
- config_name: relig
  data_files:
  - split: train
    path: relig/*.parquet
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
language:
- da
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
pretty_name: Danish Gigaword
language_bcp47:
- da
- da-bornholm
- da-synnejyl
---

# Danish Gigaword Corpus

*Version*: 1.0.0

*License*: See the respective dataset

## Table of Contents

- [Danish Gigaword Corpus](#danish-gigaword-corpus)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Loading the dataset](#loading-the-dataset)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Source Data](#source-data)
  - [Additional Information](#additional-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://gigaword.dk
- **Paper:** http://www.derczynski.com/papers/dagw.pdf

### Dataset Summary

The Danish Gigaword Corpus contains text spanning several domains and forms. This version does *not* include the sections containing tweets ("General Discussions" and "Parliament Elections"), "danavis", "Common Crawl" and "OpenSubtitles" due to potential privacy, quality and copyright concerns.

### Loading the dataset

```py
from datasets import load_dataset

name = "danish-foundation-models/danish-gigaword"

# load the full train split into memory/disk cache
ds = load_dataset(name, split="train")
sample = ds[1]  # see "Data Instances" below

# or load by streaming the data
ds = load_dataset(name, split="train", streaming=True)
sample = next(iter(ds))
```

## Dataset Structure

The dataset contains text from different sources, which are described in [Source Data](#source-data). See the [homepage](https://gigaword.dk) or [paper](https://aclanthology.org/2021.nodalida-main.46.pdf) for more information.

### Data Instances

Each entry in the dataset consists of a single text with associated metadata:

```py
{
    'text': 'Vimoutiers er en kommune i departementet Orne i Basse-Normandie regionen i det nordvestlige Frankrig.\nCykelløbet Paris-Camembert slutter i Vimoutiers.\nHistorie.\nDen 14. juni 1944, under invasionen i Normandiet blev Vimoutiers bombarderet af allierede styrker. Landsbyen blev ødelagt og 220 civile dræbt.\nPersonligheder.\nPolitikeren Joseph Laniel (1889-1975) var født i Vomoutiers.',
    'source': 'wiki',
    'id': 'wiki_366127',
    'added': '2021-03-28',
    'created': '2019-01-01, 2021-01-01',
    'metadata': {
        'domain': 'Wiki & Books',
        'license': 'Creative Commons Legal Code\n\nCC0 1.0 Universal',
        'source-pretty': 'Wikipedia'
    }
}
```

### Data Fields

An entry in the dataset consists of the following fields:

- `text` (`str`): The content of the document.
- `source` (`str`): The source of the document (see [Source Data](#source-data)).
- `id` (`str`): A unique identifier for each document.
- `added` (`str`): The date when the document was added to this collection.
- `created` (`str`): The date range in which the document was originally created.
- `metadata/license` (`str`): The license of the document. The licenses vary according to the source.
- `metadata/domain` (`str`): The domain of the source.
- `metadata/source-pretty` (`str`): The long-form version of the short-form source name.

### Data Splits

The entire corpus is provided in the `train` split.

## Dataset Creation

### Source Data

Below is a brief overview of the sources in the corpus along with their individual licenses.

| Source | License |
| ----------------- | ------- |
| adl | Creative Commons Legal Code 1.0 Universal |
| botxt | Creative Commons Legal Code 1.0 Universal |
| dannet | [dannet license](https://cst.ku.dk/projekter/dannet/license.txt) |
| depbank | Attribution-ShareAlike 4.0 International |
| ep | Creative Commons Legal Code 1.0 Universal |
| ft | Creative Commons Legal Code 1.0 Universal |
| gutenberg | [gutenberg license](https://www.gutenberg.org/policy/license.html) |
| hest | Creative Commons Legal Code 1.0 Universal |
| jvj | Attribution-ShareAlike 4.0 International |
| naat | Creative Commons Legal Code 1.0 Universal |
| relig | Creative Commons Legal Code 1.0 Universal |
| retsinformationdk | Danish Copyright law at https://www.retsinformation.dk/forms/r0710.aspx?id=164796 states "§ 9. Love, administrative forskrifter, retsafgørelser og lignende offentlige aktstykker er ikke genstand for ophavsret. Stk. 2. Bestemmelsen i stk. 1 gælder ikke for værker, der fremtræder som selvstændige bidrag i de i stk. 1 nævnte aktstykker. Sådanne værker må dog gengives i forbindelse med aktstykket. Retten til videre udnyttelse afhænger af de i øvrigt gældende regler." |
| retspraksis | Creative Commons Legal Code 1.0 Universal |
| skat | Creative Commons Legal Code 1.0 Universal |
| spont | Creative Commons Legal Code 1.0 Universal |
| synne | Creative Commons Legal Code 1.0 Universal |
| tv2r | The owner of this content is TV2 Regionerne, Denmark. Creative Commons Attribution 4.0 International |
| wiki | Creative Commons Legal Code 1.0 Universal |
| wikibooks | Creative Commons Legal Code 1.0 Universal |
| wikisource | Creative Commons Legal Code 1.0 Universal |

These sources correspond to the following top-level domains in the dataset:

```python
# mapping from source (short name) to top-level domain
domain_mapping_dict = {
    "retsinformationdk": "Legal",
    "skat": "Legal",
    "retspraksis": "Legal",
    "hest": "Social Media",
    "cc": "Web",
    "adl": "Wiki & Books",
    "botxt": "Other",
    "danavis": "News",
    "dannet": "dannet",
    "depbank": "Other",
    "ep": "Conversation",
    "ft": "Conversation",
    "gutenberg": "Wiki & Books",
    "jvj": "Wiki & Books",
    "naat": "Conversation",
    "opensub": "Conversation",
    "relig": "Wiki & Books",
    "spont": "Conversation",
    "synne": "Other",
    "tv2r": "News",
    "wiki": "Wiki & Books",
    "wikibooks": "Wiki & Books",
    "wikisource": "Wiki & Books",
    "twfv19": "Social Media",  # not present in this version of the dataset
}
```

The following mapping translates between the short form and the long form of the source name:

```python
# mapping from source (short name) to its long name format
longname_mapping_dict = {
    "retsinformationdk": "retsinformation.dk (Danish legal information)",
    "skat": "Skat (Danish tax authority)",
    "retspraksis": "retspraksis (Danish legal information)",
    "hest": "Hestenettet (Danish debate forum)",
    "cc": "Common Crawl",
    "adl": " Archive for Danish Literature",
    "botxt": "Bornholmsk (Danish dialect)",
    "danavis": "Danish daily newspapers",
    "dannet": "DanNet (Danish WordNet)",
    "depbank": "Danish Dependency Treebank",
    "ep": "European Parliament",
    "ft": "Folketinget (Danish Parliament)",
    "gutenberg": "Gutenberg",
    "jvj": "Johannes V. Jensen (Danish poet)",
    "naat": "NAAT",
    "opensub": "Open Subtitles",
    "relig": "Religious texts",
    "spont": "Spontaneous speech",
    "synne": "Synderjysk (Danish dialect)",
    "tv2r": "TV 2 Radio (Danish news)",
    "wiki": "Wikipedia",
    "wikibooks": "Wikibooks",
    "wikisource": "Wikisource",
    "twfv19": "Twitter Folketingsvalget 2019 (Danish election tweets)",  # not present in this version of the dataset
}
```

## Additional Information

### Citation Information

Sample attributions:

In a press release:

> Modellen er præ-trænet på et datasæt fra The Danish Gigaword Project (https://gigaword.dk), der er udviklet af forskere fra IT-Universitetet i København

> The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen

In academic writing:

> Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).

```
@inproceedings{dagw,
    title = {{The Danish Gigaword Corpus}},
    author = {Leon Derczynski and Manuel R. Ciosici and Rebekah Baglini and Morten H. Christiansen and Jacob Aarup Dalsgaard and Riccardo Fusaroli and Peter Juel Henrichsen and Rasmus Hvingelby and Andreas Kirkedal and Alex Speed Kjeldsen and Claus Ladefoged and Finn Årup Nielsen and Jens Madsen and Malte Lau Petersen and Jonathan Hvithamar Rystrøm and Daniel Varab},
    year = 2021,
    booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics},
    publisher = {NEALT}
}
```

In a software product, tool, or service:

> Denne service er lavet med data fra The Danish Gigaword Corpus

### Contributions

Dataset created by Derczynski et al. (2021). Thanks to [@HLasse](https://github.com/HLasse), [@KennethEnevoldsen](https://github.com/kennethenevoldsen), and [Jan Kostkan](https://github.com/jankounchained) for adding this dataset to the Hugging Face Hub.
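
As a supplement to the loading snippet and the field and domain descriptions in the card above, the following is a minimal sketch of how one might stream a single source and work with its records. It assumes that the config names listed in the card's YAML header (for example `wiki`) can be passed as the second argument to `load_dataset`. The helper names `stream_subset`, `flatten_record`, `count_by_domain` and the constant `DOMAIN_MAPPING_EXCERPT` are hypothetical and not part of the dataset or of the `datasets` library; the mapping is only an excerpt of the card's `domain_mapping_dict`.

```python
from collections import Counter
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

# Excerpt of the card's domain_mapping_dict (illustrative, not the full mapping).
DOMAIN_MAPPING_EXCERPT: Dict[str, str] = {
    "retsinformationdk": "Legal",
    "hest": "Social Media",
    "ep": "Conversation",
    "ft": "Conversation",
    "tv2r": "News",
    "wiki": "Wiki & Books",
}


def stream_subset(config_name: str, limit: int) -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` records from one source, streamed from the Hub.

    Assumption: `config_name` is one of the config names in the card's YAML
    header. The import is local so the other helpers work without `datasets`.
    """
    from datasets import load_dataset

    ds = load_dataset(
        "danish-foundation-models/danish-gigaword",
        config_name,
        split="train",
        streaming=True,
    )
    return islice(ds, limit)


def flatten_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Lift nested metadata into 'metadata/<key>' keys, as named under Data Fields."""
    flat = {key: value for key, value in record.items() if key != "metadata"}
    for key, value in record.get("metadata", {}).items():
        flat[f"metadata/{key}"] = value
    return flat


def count_by_domain(
    records: Iterable[Dict[str, Any]], mapping: Dict[str, str]
) -> Counter:
    """Count records per top-level domain; unmapped sources count as 'Unknown'."""
    counts: Counter = Counter()
    for record in records:
        counts[mapping.get(record["source"], "Unknown")] += 1
    return counts


# A record shaped like the one under Data Instances (text shortened here).
example_record = {
    "text": "Vimoutiers er en kommune i departementet Orne ...",
    "source": "wiki",
    "id": "wiki_366127",
    "added": "2021-03-28",
    "created": "2019-01-01, 2021-01-01",
    "metadata": {
        "domain": "Wiki & Books",
        "license": "Creative Commons Legal Code\n\nCC0 1.0 Universal",
        "source-pretty": "Wikipedia",
    },
}

flat_example = flatten_record(example_record)
domain_counts = count_by_domain([example_record], DOMAIN_MAPPING_EXCERPT)
```

With network access and the `datasets` package installed, `count_by_domain(stream_subset("wiki", 100), DOMAIN_MAPPING_EXCERPT)` would tally the first 100 streamed records of the `wiki` source. Streaming avoids downloading the whole corpus just to inspect a few documents.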

Provider:
danish-foundation-models