
scoris/en-lt-merged-data

Source: Hugging Face. Updated: 2024-02-20. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/scoris/en-lt-merged-data
Description:
---
language:
- lt
- en
license: cc-by-2.5
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: translation
    struct:
    - name: en
      dtype: string
    - name: lt
      dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 850721410
    num_examples: 4948879
  - name: validation
    num_bytes: 8586743
    num_examples: 49989
  download_size: 643159722
  dataset_size: 859308153
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

![Scoris logo](https://scoris.lt/logo_smaller.png)

The data set is a merge of other open datasets:

- [wmt19](https://huggingface.co/datasets/wmt19) (lt-en)
- [opus100](https://huggingface.co/datasets/opus100) (en-lt)
- [sentence-transformers/parallel-sentences](https://huggingface.co/datasets/sentence-transformers/parallel-sentences)
  - Europarl-en-lt-train.tsv.gz
  - JW300-en-lt-train.tsv.gz
  - OpenSubtitles-en-lt-train.tsv.gz
  - Talks-en-lt-train.tsv.gz
  - Tatoeba-en-lt-train.tsv.gz
  - WikiMatrix-en-lt-train.tsv.gz
- Custom [Scoris](https://scoris.lt) data set translated using DeepL.

Basic clean-up and deduplication were applied when creating this set.

This can be used to train Lithuanian-English and English-Lithuanian seq2seq machine translation (MT) models.

Made by the [Scoris](https://scoris.lt) team.

You can use this in the following way:

```python
from datasets import load_dataset

dataset_name = "scoris/en-lt-merged-data"

# Load the dataset (both the train and validation splits)
dataset = load_dataset(dataset_name)

# Display the first example from the training set
print("First training example:", dataset["train"][0])

# Display the first example from the validation set
print("First validation example:", dataset["validation"][0])

# Iterate through a few examples from the training set
for i, example in enumerate(dataset["train"]):
    if i >= 5:
        break
    print(f"Training example {i}:", example)

# If you want to use the dataset in a machine learning model, you can directly
# iterate over the dataset or convert it to a pandas DataFrame for analysis.

# Convert the training set to a pandas DataFrame (requires pandas to be installed)
train_df = dataset["train"].to_pandas()
print(train_df.head())
```
Provider:
scoris
Summary of original information

Dataset overview

Language

  • Lithuanian (lt)
  • English (en)

License

  • CC-BY-2.5

Dataset size category

  • 1M < n < 10M

Dataset information

Features

  • translation
    • en: string
    • lt: string
  • __index_level_0__: integer (int64)
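
Because the description mentions training in both translation directions, each record can be expanded into an English-to-Lithuanian pair and a Lithuanian-to-English pair. The following is a minimal sketch, assuming records follow the feature layout listed above. The function name `to_bidirectional_pairs`, the sample sentence, and the `<en-lt>` / `<lt-en>` direction tags are hypothetical illustrations and are not prescribed by the dataset.

```python
from typing import Dict, List, Tuple


def to_bidirectional_pairs(record: Dict) -> List[Tuple[str, str]]:
    """Expand one record into (source, target) pairs for both directions."""
    en = record["translation"]["en"]
    lt = record["translation"]["lt"]
    return [
        (f"<en-lt> {en}", lt),  # English source, Lithuanian target
        (f"<lt-en> {lt}", en),  # Lithuanian source, English target
    ]


# Hypothetical record, shaped like the features listed above.
sample = {
    "translation": {"en": "Good morning.", "lt": "Labas rytas."},
    "__index_level_0__": 0,
}

for source, target in to_bidirectional_pairs(sample):
    print(source, "->", target)
```

Prefixing the source with a direction tag is one common way to let a single model serve both directions; training two separate models on the untagged pairs would work as well.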

Splits

  • train
    • Bytes: 850721410
    • Examples: 4948879
  • validation
    • Bytes: 8586743
    • Examples: 49989

Download and dataset size

  • Download size: 643159722 bytes
  • Dataset size: 859308153 bytes
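
The size figures above can be cross-checked with a little arithmetic. This sketch only uses the numbers reported in the metadata; the variable names are illustrative.

```python
# Figures copied from the dataset metadata above.
splits = {
    "train": {"num_bytes": 850_721_410, "num_examples": 4_948_879},
    "validation": {"num_bytes": 8_586_743, "num_examples": 49_989},
}
download_size = 643_159_722
dataset_size = 859_308_153

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())

# The per-split byte counts should add up to the reported dataset size.
assert total_bytes == dataset_size

val_share = splits["validation"]["num_examples"] / total_examples
avg_bytes = {name: s["num_bytes"] / s["num_examples"] for name, s in splits.items()}
download_ratio = download_size / dataset_size

print(f"validation share: {val_share:.2%}")
for name, value in avg_bytes.items():
    print(f"{name}: {value:.1f} bytes per example")
print(f"download / dataset size: {download_ratio:.2f}")
```

By these figures the validation split is roughly 1% of all examples, each example averages about 172 bytes, and the download is about three quarters of the reported dataset size.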

Configuration

  • default
    • train
      • Path: data/train-*
    • validation
      • Path: data/validation-*
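
The paths in the default configuration are glob patterns, so each split can be backed by several shard files. The sketch below illustrates the idea with the standard-library `fnmatch` module. The shard file names are hypothetical, since the actual files are not listed on this page, and the `datasets` library uses its own resolver, which may differ in detail.

```python
from fnmatch import fnmatch

# Split-to-pattern mapping from the default configuration above.
data_files = {"train": "data/train-*", "validation": "data/validation-*"}

# Hypothetical shard names, used for illustration only.
candidate_files = [
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
    "data/validation-00000-of-00001.parquet",
]

# Group the candidate files by the split whose pattern they match.
files_by_split = {
    split: [f for f in candidate_files if fnmatch(f, pattern)]
    for split, pattern in data_files.items()
}
print(files_by_split)
```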