
danish-foundation-models/swedish-dynaword

Hugging Face | Updated 2026-04-14 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/danish-foundation-models/swedish-dynaword
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
language:
- sv
- swe
license: cc0-1.0
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- text-corpus
- continual-development
- community-collaboration
pretty_name: Swedish Dynaword
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*/*.parquet
- config_name: dalpilen-1860
  data_files:
  - split: train
    path: data/dalpilen-1860/*.parquet
- config_name: lag1800
  data_files:
  - split: train
    path: data/lag1800/*.parquet
language_bcp47:
- swe
---

# 🧨 Swedish Dynaword

<!-- START README TABLE -->
|              |     |
| ------------ | --- |
| **Version**  | 0.0.2 ([Changelog](/CHANGELOG.md)) |
| **Language** | Swedish (sv, swe) |
| **License**  | Openly licensed; see the respective dataset |
| **Models**   | There are currently no models trained on this dataset |
| **Contact**  | If you have a question about this project, please create an issue [here](https://huggingface.co/datasets/danish-foundation-models/swedish-dynaword/discussions) |
<!-- END README TABLE -->

## Table of Contents
- [🧨 Swedish Dynaword](#-swedish-dynaword)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Loading the dataset](#loading-the-dataset)
    - [Languages](#languages)
    - [Domains](#domains)
    - [Licensing](#licensing)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Annotations](#annotations)
    - [Source Data](#source-data)
    - [Data Collection and Processing](#data-collection-and-processing)
    - [Dataset Statistics](#dataset-statistics)
    - [Contributing to the dataset](#contributing-to-the-dataset)
  - [Citation Information](#citation-information)
  - [License information](#license-information)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
    - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
    - [Notice and takedown policy](#notice-and-takedown-policy)

## Dataset Description

<!-- START-DESC-STATS -->
- **Number of samples**: 2.63K
- **Number of tokens (Llama 3)**: 18.43M
- **Average document length in tokens (min, max)**: 7.01K (21, 80.24K)
<!-- END-DESC-STATS -->

### Dataset Summary

The Swedish Dynaword is a collection of Swedish free-form text datasets from various domains. All of the datasets in the Swedish Dynaword are openly licensed and deemed permissible for training large language models.

Swedish Dynaword is continually developed, which means that the dataset will actively be updated as new datasets become available. If you would like to contribute a dataset, see the [contribute section](#contributing-to-the-dataset).

### Loading the dataset

```py
from datasets import load_dataset

name = "danish-foundation-models/swedish-dynaword"
ds = load_dataset(name, split="train")
sample = ds[1]  # see "Data Instances" below
```

or load it by streaming the data:

```py
ds = load_dataset(name, split="train", streaming=True)
dataset_iter = iter(ds)
sample = next(dataset_iter)
```

You can also load a single subset at a time:

```py
ds = load_dataset(name, "lag1800", split="train")
```

Because Swedish Dynaword is continually expanded and curated, you can make sure that you get the same dataset every time by specifying the revision:

```py
ds = load_dataset(name, revision="{desired revision}")
```

### Languages

This dataset includes the following languages:

- Swedish (swe-Latn)

In addition, it likely contains small amounts of English due to code-switching, and of other Scandinavian languages that are misclassified as Swedish because of their similarity.

Language is denoted using [BCP-47](https://en.wikipedia.org/wiki/IETF_language_tag), using the language code [ISO 639-3](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) and the script code [ISO 15924](https://en.wikipedia.org/wiki/ISO_15924).

<!-- START-LANGUAGE TABLE -->
| Language  | Sources                    | N. Tokens |
|:----------|:---------------------------|:----------|
| sv        | [dalpilen-1860], [lag1800] | 18.43M    |
| **Total** |                            | 18.43M    |

[dalpilen-1860]: data/dalpilen-1860/dalpilen-1860.md
[lag1800]: data/lag1800/lag1800.md
<!-- END-LANGUAGE TABLE -->

### Domains

This Dynaword consists of data from various domains (e.g., legal, books, social media). The following table and figure give an overview of the relative distributions of these domains. To see a full overview of the sources, check out the [source data section](#source-data).

<div style="display: flex; gap: 20px; align-items: flex-start;">

<div style="flex: 1;">

<!-- START-DOMAIN TABLE -->
| Domain    | Sources         | N. Tokens |
|:----------|:----------------|:----------|
| News      | [dalpilen-1860] | 17.51M    |
| Legal     | [lag1800]       | 921.60K   |
| **Total** |                 | 18.43M    |

[dalpilen-1860]: data/dalpilen-1860/dalpilen-1860.md
[lag1800]: data/lag1800/lag1800.md
<!-- END-DOMAIN TABLE -->

</div>

<div style="flex: 1;">

<p align="center">
<img src="./images/domain_distribution.png" width="400" style="margin-right: 10px;" />
</p>

</div>

</div>

### Licensing

The following gives an overview of the licensing in the Dynaword. To get the exact license of the individual datasets, check out the [overview table](#source-data).
These licenses apply to the constituent data, i.e., the text. The collection of datasets (metadata, quality control, etc.) is licensed under [CC-0](https://creativecommons.org/publicdomain/zero/1.0/legalcode.en).

<!-- START-LICENSE TABLE -->
| License   | Sources                    | N. Tokens |
|:----------|:---------------------------|:----------|
| CC-BY 4.0 | [dalpilen-1860], [lag1800] | 18.43M    |
| **Total** |                            | 18.43M    |

[dalpilen-1860]: data/dalpilen-1860/dalpilen-1860.md
[lag1800]: data/lag1800/lag1800.md
<!-- END-LICENSE TABLE -->

## Dataset Structure

The dataset contains text from different sources, which are described in detail in [Source Data](#source-data).

### Data Instances

Each entry in the dataset consists of a single text with associated metadata.

<!-- START-SAMPLE -->
```py
{
  "id": "lag1800_0001",
  "text": "Författningssamling 1800 Låssa kyrkas arkiv\nSvensk författningssamling 1800 Författningssamling för [...]",
  "source": "lag1800",
  "added": "2026-04-10",
  "created": "1800-01-01, 1800-12-31",
  "token_count": 19244
}
```

### Data Fields

An entry in the dataset consists of the following fields:

- `id` (`str`): A unique identifier for each document.
- `text` (`str`): The content of the document.
- `source` (`str`): The source of the document (see [Source Data](#source-data)).
- `added` (`str`): The date when the document was added to this collection.
- `created` (`str`): The date range when the document was originally created.
- `token_count` (`int`): The number of tokens in the sample computed using the Llama 3 tokenizer.
<!-- END-SAMPLE -->

### Data Splits

The entire corpus is provided in the `train` split.

## Dataset Creation

### Curation Rationale

These datasets were collected and curated with the intention of making openly licensed Swedish data available. While the data was collected with the intention of developing language models, it is likely to have other uses, such as examining language development and differences across domains.

### Annotations

This data generally contains no annotation besides the metadata attached to each sample, such as what domain it belongs to.

### Source Data

Below follows a brief overview of the sources in the corpus along with their individual licenses.
To get more information about an individual dataset, click the hyperlink in the table.

<details>
<summary><b>Overview Table (click to unfold)</b></summary>

You can learn more about each dataset by pressing the link in the first column.

<!-- START-MAIN TABLE -->
| Source          | Description | Domain | N. Tokens | License     |
|:----------------|:------------|:-------|:----------|:------------|
| [dalpilen-1860] | Historical Swedish newspaper issues from Språkbanken Text's [Dalpilen 1860's](https://spraakbanken.gu.se/en/resources/kubhist2-dalpilen-1860) | News | 17.51M | [CC-BY 4.0] |
| [lag1800]       | Historical Swedish legal texts from Språkbanken Text's [Laws from the 1800's](https://spraakbanken.gu.se/en/resources/lag1800) | Legal | 921.60K | [CC-BY 4.0] |
| **Total**       |             |        | 18.43M    |             |

[dalpilen-1860]: data/dalpilen-1860/dalpilen-1860.md
[lag1800]: data/lag1800/lag1800.md

[CC-0]: https://creativecommons.org/publicdomain/zero/1.0/legalcode.en
[CC-BY-SA 4.0]: https://creativecommons.org/licenses/by-sa/4.0/deed.en
[CC-BY 4.0]: https://creativecommons.org/licenses/by/4.0/deed.en
[Apache 2.0]: https://www.apache.org/licenses/LICENSE-2.0
<!-- END-MAIN TABLE -->

</details>

### Data Collection and Processing

This Dynaword is continually developed, which means that the dataset will actively be updated as new datasets become available. This means that the size of the Dynaword increases over time, as seen in the following plot:

<p align="center">
<img src="./images/tokens_over_time.svg" width="600" style="margin-right: 10px;" />
</p>

The data collection and processing varies depending on the dataset and is documented in the individual datasheets, which are linked in the table above. If possible, the collection is documented both in the datasheet and in the reproducible script (`data/{dataset}/create.py`).

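The automated checks this section describes (correctly formatted columns, unique IDs, duplicate and empty-string detection) can be illustrated with a minimal sketch. It is not the project's actual test suite: `check_records` and `parse_created` are hypothetical helpers, the expected field types are taken from the "Data Fields" list, and the `created` range is assumed to always have the form `YYYY-MM-DD, YYYY-MM-DD` as in the sample instance.

```python
from collections import Counter
from datetime import date

# Field names and types as listed under "Data Fields".
EXPECTED_FIELDS = {
    "id": str,
    "text": str,
    "source": str,
    "added": str,
    "created": str,
    "token_count": int,
}


def parse_created(created: str) -> tuple[date, date]:
    """Parse a `created` range such as "1800-01-01, 1800-12-31" (assumed format)."""
    start, end = (date.fromisoformat(part.strip()) for part in created.split(","))
    return start, end


def check_records(records: list[dict]) -> list[str]:
    """Hypothetical helper: return a list of problems found in `records`."""
    problems = []
    for i, rec in enumerate(records):
        # Formatting check: every expected column is present with the right type.
        for field, expected in EXPECTED_FIELDS.items():
            if not isinstance(rec.get(field), expected):
                problems.append(f"row {i}: '{field}' missing or not {expected.__name__}")
        # Quality check: the text must not be empty or whitespace-only.
        text = rec.get("text")
        if isinstance(text, str) and not text.strip():
            problems.append(f"row {i}: empty text")
        # Formatting check: `created` must be a valid, ordered date range.
        created = rec.get("created")
        if isinstance(created, str):
            try:
                start, end = parse_created(created)
                if start > end:
                    problems.append(f"row {i}: created range is reversed")
            except ValueError:
                problems.append(f"row {i}: malformed created range")
    # Uniqueness checks across the whole collection.
    id_counts = Counter(rec.get("id") for rec in records)
    problems += [f"duplicate id: {k}" for k, n in id_counts.items() if n > 1]
    text_counts = Counter(rec.get("text") for rec in records)
    problems += [f"duplicate text in {n} rows" for t, n in text_counts.items() if n > 1 and t]
    return problems


sample = {
    "id": "lag1800_0001",
    "text": "Författningssamling 1800 [...]",
    "source": "lag1800",
    "added": "2026-04-10",
    "created": "1800-01-01, 1800-12-31",
    "token_count": 19244,
}
print(check_records([sample]))  # prints [] for this well-formed record
```

Returning a list of messages rather than raising on the first failure lets a contributor see every problem in a new source at once.
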
In addition to data-specific processing, we also run a series of automated checks: formatting checks (e.g. correctly formatted columns and unique IDs), quality checks (e.g. duplicate and empty-string detection), and datasheet documentation checks. These checks are there to ensure a high quality of documentation and a minimal level of data quality. To allow for the development of novel cleaning methodologies, we do not provide more extensive cleaning.

### Dataset Statistics

The following plot(s) are intended to give an overview of document lengths in the various sources.

<p align="center">
<img src="./images/dataset_size_plot.svg" width="600" style="margin-right: 10px;" />
</p>

### Contributing to the dataset

We welcome contributions to the dataset, including new sources, improved data filtering, and other enhancements. To get started on contributing, please see [the contribution guidelines](CONTRIBUTING.md).

## Citation Information

If you use this work, please cite the [scientific article](https://arxiv.org/abs/2508.02271) introducing the Dynaword approach, along with the [Swedish Gigaword](https://spraakbanken.gu.se/en/resources/gigaword), which provides large parts of the datasets:

> Enevoldsen, K.C., Jensen, K.N., Kostkan, J., Szabó, B.I., Kardos, M., Vad, K., Heinsen, J., Núñez, A.B., Barmina, G., Nielsen, J., Larsen, R., Vahlstrup, P.B., Dalum, P.M., Elliott, D., Galke, L., Schneider-Kamp, P., & Nielbo, K.L. (2025). Dynaword: From One-shot to Continuously Developed Datasets.
>
> Rødven-Eide, Stian (2016). The Swedish Culturomics Gigaword Corpus (updated: 2016-06-07). [Data set]. Språkbanken Text. https://doi.org/10.23695/3wmv-1z09

```
@article{enevoldsen2025dynaword,
  title={Dynaword: From One-shot to Continuously Developed Datasets},
  author={Enevoldsen, Kenneth and Jensen, Kristian N{\o}rgaard and Kostkan, Jan and Szab{\'o}, Bal{\'a}zs and Kardos, M{\'a}rton and Vad, Kirten and N{\'u}{\~n}ez, Andrea Blasi and Barmina, Gianluca and Nielsen, Jacob and Larsen, Rasmus and others},
  journal={arXiv preprint arXiv:2508.02271},
  year={2025}
}

@misc{gigaword,
  doi = {10.23695/3wmv-1z09},
  url = {https://spraakbanken.gu.se/resurser/gigaword},
  author = {Rødven-Eide, Stian},
  keywords = {Language Technology (Computational Linguistics)},
  language = {swe},
  title = {The Swedish Culturomics Gigaword Corpus},
  publisher = {Språkbanken Text},
  year = {2016}
}
```

Additionally, we recommend citing the relevant source datasets as well. See the individual datasheets for more information.

## License information

The license for each constituent dataset is supplied in the [Source data](#source-data) table. This license is applied to the constituent data, i.e., the text. The collection of datasets (metadata, quality control, etc.) is licensed under [CC-0](https://creativecommons.org/publicdomain/zero/1.0/legalcode.en).

### Personal and Sensitive Information

As far as we are aware, the dataset does not contain information identifying sexual orientation, political beliefs, religion, or health that is connected with a personal identifier of any non-public or non-historic figure.

### Bias, Risks, and Limitations

Certain works in this collection are historical works and thus reflect the linguistic, cultural, and ideological norms of their time.
As such, the collection includes perspectives, assumptions, and biases characteristic of the period, which may be considered offensive or exclusionary by contemporary standards.

### Notice and takedown policy

We redistribute files shared with us under a license permitting such redistribution.
If you have concerns about the licensing of these files, please [contact us](https://huggingface.co/datasets/danish-foundation-models/swedish-dynaword/discussions/new). If you consider that the data contains material that infringes your copyright, please:

- Clearly identify yourself with detailed contact information such as an address, a telephone number, or an email address at which you can be contacted.
- Clearly reference the original work claimed to be infringed.
- Clearly identify the material claimed to be infringing and information reasonably sufficient to allow us to locate the material.

You can contact us through this channel.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

---

<h3 style="display: flex; align-items: center;">
    <a href="https://www.foundationmodels.dk">
        <img src="./docs/icon.png" width="30" style="margin-right: 10px;" />
    </a>
    A&nbsp;<a href=https://www.foundationmodels.dk>Danish Foundation Models</a>&nbsp;dataset
</h3>
Provided by:
danish-foundation-models