scoris/en-lt-merged-data
Hugging Face · Updated 2024-02-20 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/scoris/en-lt-merged-data
Description:
---
language:
- lt
- en
license: cc-by-2.5
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: translation
    struct:
    - name: en
      dtype: string
    - name: lt
      dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 850721410
    num_examples: 4948879
  - name: validation
    num_bytes: 8586743
    num_examples: 49989
  download_size: 643159722
  dataset_size: 859308153
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

The dataset is a merge of other open datasets:
- [wmt19](https://huggingface.co/datasets/wmt19) (lt-en)
- [opus100](https://huggingface.co/datasets/opus100) (en-lt)
- [sentence-transformers/parallel-sentences](https://huggingface.co/datasets/sentence-transformers/parallel-sentences)
  - Europarl-en-lt-train.tsv.gz
  - JW300-en-lt-train.tsv.gz
  - OpenSubtitles-en-lt-train.tsv.gz
  - Talks-en-lt-train.tsv.gz
  - Tatoeba-en-lt-train.tsv.gz
  - WikiMatrix-en-lt-train.tsv.gz
- Custom [Scoris](https://scoris.lt) dataset translated using DeepL.

Basic clean-up and deduplication were applied when creating this set.
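
The card does not say how the clean-up and deduplication were done. As a rough illustration only, the sketch below assumes whitespace normalization, removal of pairs with an empty side, and exact-match deduplication; the helper names `normalize` and `clean_and_deduplicate` and the sample sentences are hypothetical and are not taken from the Scoris pipeline.

```python
import re
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (English sentence, Lithuanian sentence)


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_and_deduplicate(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield normalized pairs, skipping empty sides and exact duplicates."""
    seen = set()
    for en, lt in pairs:
        en, lt = normalize(en), normalize(lt)
        if not en or not lt:
            continue  # drop pairs where either side is empty
        key = (en, lt)
        if key in seen:
            continue  # drop exact duplicates (after normalization)
        seen.add(key)
        yield en, lt


# Made-up sample pairs, for illustration only
raw_pairs = [
    ("Good morning.", "Labas rytas."),
    ("Good  morning. ", "Labas rytas."),  # duplicate once whitespace is normalized
    ("", "Ačiū."),                        # empty English side
    ("Thank you.", "Ačiū."),
]

cleaned = list(clean_and_deduplicate(raw_pairs))
print(cleaned)
```

Keeping the first occurrence and streaming the result through a generator keeps the sketch simple; the set of seen pairs still grows with the number of unique pairs, so a hashing or on-disk approach may suit a multi-million-row merge better.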
This can be used to train Lithuanian-to-English and English-to-Lithuanian seq2seq machine translation (MT) models.

Made by the [Scoris](https://scoris.lt) team.

You can use this in the following way:
```python
import pandas as pd
from datasets import load_dataset

dataset_name = "scoris/en-lt-merged-data"

# Load the dataset (the train and validation splits are downloaded on first use)
dataset = load_dataset(dataset_name)

# Display the first example from each split
print("First training example:", dataset["train"][0])
print("First validation example:", dataset["validation"][0])

# Iterate through the first five examples of the training set
for i, example in enumerate(dataset["train"]):
    if i >= 5:
        break
    print(f"Training example {i}:", example)

# For analysis, convert a split to a pandas DataFrame.
# Dataset.to_pandas() avoids building the frame row by row,
# but it still loads the whole split into memory.
train_df = dataset["train"].to_pandas()
print(train_df.head())
```
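
Each record stores both languages under a single `translation` struct, so seq2seq training code usually needs them split into a source and a target string first. The following is a minimal sketch of that step; the function name `to_pair`, the output keys `source` and `target`, and the sample record are assumptions made for illustration and are not part of the dataset.

```python
from typing import Any, Dict


def to_pair(
    example: Dict[str, Any],
    source_lang: str = "en",
    target_lang: str = "lt",
) -> Dict[str, str]:
    """Flatten one record's `translation` struct into source/target strings."""
    translation = example["translation"]
    return {
        "source": translation[source_lang].strip(),
        "target": translation[target_lang].strip(),
    }


# Made-up record that follows the dataset's schema
sample = {
    "translation": {"en": "Good morning.", "lt": "Labas rytas."},
    "__index_level_0__": 0,
}

en_to_lt = to_pair(sample)
lt_to_en = to_pair(sample, source_lang="lt", target_lang="en")
print(en_to_lt)
print(lt_to_en)
```

With the loaded dataset, the same function could be applied to a whole split with `dataset["train"].map(to_pair, remove_columns=dataset["train"].column_names)`; swapping the two language arguments gives the opposite translation direction from the same data.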
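Provider:
scoris

Original information summary

Dataset overview

Language
- Lithuanian (lt)
- English (en)

License
- CC-BY-2.5

Dataset size category
- 1M < n < 10M

Dataset information

Features
- translation
  - en: string
  - lt: string
- `__index_level_0__`: integer (int64)

Splits
- train
  - Size in bytes: 850721410
  - Number of examples: 4948879
- validation
  - Size in bytes: 8586743
  - Number of examples: 49989

Download and dataset size
- Download size: 643159722 bytes
- Dataset size: 859308153 bytes

Configuration
- default
  - train
    - Path: data/train-*
  - validation
    - Path: data/validation-*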
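
The figures in the metadata can be cross-checked with a little arithmetic. The sketch below copies the numbers from the card and derives a few convenience values; the variable names are illustrative. It confirms that the two split sizes add up to the reported `dataset_size` and that the validation split holds about 1% of all examples.

```python
# Figures copied from the dataset card metadata
splits = {
    "train": {"num_bytes": 850_721_410, "num_examples": 4_948_879},
    "validation": {"num_bytes": 8_586_743, "num_examples": 49_989},
}
download_size = 643_159_722  # bytes
dataset_size = 859_308_153   # bytes

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())
validation_share = splits["validation"]["num_examples"] / total_examples
avg_bytes_per_example = total_bytes / total_examples

print("split bytes sum to dataset_size:", total_bytes == dataset_size)
print(f"total examples: {total_examples:,}")
print(f"validation share: {validation_share:.2%}")
print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download size: {download_size / 1e6:.0f} MB")
```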



