recursal/Europarl-Translation-Instruct
Dataset Card for Europarl-Translation-Instruct
Dataset Details
Dataset Description
- Curated by: M8than
- Funded by: Recursal.ai
- Shared by: M8than
- Language(s) (NLP): English instructions, with source text in various languages
- License: cc-by-sa-4.0
Dataset Sources
- Source Data: https://www.statmt.org/europarl/ (Transcript source)
Processing and Filtering
- Prerequisite: Download the source dataset from https://www.statmt.org/europarl/.
- Scripts: Extract every translation of the Europarl transcripts and match the translations to one another to create the various translation instruct datasets.
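The matching step above can be pictured with a minimal sketch. This is not the curator's actual script: the function names `make_conversation` and `to_jsonl` are hypothetical, the input is assumed to be already-aligned (source, target) text pairs, and the system prompt wording is simply copied from the example record in this card.

```python
import json

# Assumption: prompt wording copied from the example record in this card;
# the real scripts may build the prompt differently.
SYSTEM_TEMPLATE = (
    "You will be given some text and you must respond only with the text "
    "if spoken by someone who speaks {lang}"
)


def make_conversation(source_text, target_text, target_lang):
    """Wrap one aligned translation pair in the conversation record layout."""
    return {
        "conversation": [
            {"sender": "system", "message": SYSTEM_TEMPLATE.format(lang=target_lang)},
            {"sender": "user", "message": source_text},
            {"sender": "assistant", "message": target_text},
        ]
    }


def to_jsonl(pairs, target_lang):
    """Serialize aligned (source, target) pairs as JSONL, one record per line."""
    return "\n".join(
        # ensure_ascii=False keeps characters such as "ü" readable in the file.
        json.dumps(make_conversation(src, tgt, target_lang), ensure_ascii=False)
        for src, tgt in pairs
    )


if __name__ == "__main__":
    pairs = [
        ("Ich halte dies für ein ganz legitimes Ansinnen",
         "I think it is a fairly legitimate request"),
    ]
    print(to_jsonl(pairs, "en"))
```

Aligning the Europarl files themselves (sentence, paragraph, or full-transcript level) is left out here, since the card does not describe how that is done.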
Format
- Dataset files: JSONL, with each line representing one conversation.
- Example: {"conversation":[{"sender":"system","message":"You will be given some text and you must respond only with the text if spoken by someone who speaks en"},{"sender":"user","message":"Ich halte dies für ein ganz legitimes Ansinnen"},{"sender":"assistant","message":"I think it is a fairly legitimate request"}]}
- Structure: Each line is an object keyed by "conversation", which contains an array of message dictionaries with sender and message keys.
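A record in this layout can be read with the standard `json` module alone. The sketch below is illustrative: the helper names `iter_conversations` and `to_triplet` are hypothetical, and it assumes each conversation holds at most one message per sender, as in the example record.

```python
import io
import json


def iter_conversations(lines):
    """Yield the message list from each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)["conversation"]


def to_triplet(conversation):
    """Return (system, user, assistant) messages; missing senders give None."""
    # Assumption: one message per sender, as in the card's example record.
    by_sender = {m["sender"]: m["message"] for m in conversation}
    return by_sender.get("system"), by_sender.get("user"), by_sender.get("assistant")


if __name__ == "__main__":
    # In practice this would be an open file handle on one of the JSONL files.
    sample = io.StringIO(
        '{"conversation":[{"sender":"system","message":"You will be given some '
        'text and you must respond only with the text if spoken by someone who '
        'speaks en"},{"sender":"user","message":"Ich halte dies für ein ganz '
        'legitimes Ansinnen"},{"sender":"assistant","message":"I think it is a '
        'fairly legitimate request"}]}\n'
    )
    for conv in iter_conversations(sample):
        system, user, assistant = to_triplet(conv)
        print(user, "->", assistant)
```

Because `iter_conversations` accepts any iterable of lines, it works on an open file handle without loading the whole file into memory.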
Data Splits
- sentences: Contains sentence translation conversations.
- paragraphs: Contains paragraph translation conversations.
- full: Contains full transcript translations.
Licensing Information
- Content: This release contains content from Europarl, transformed into a conversational instruction dataset.
- Waifus: Recursal Waifus (the banner image) are licensed under CC-BY-SA. They do not represent the related websites in any official capacity unless otherwise announced by the website. You may use them as a banner image, but you must always link back to the dataset.
Citation Information
@ONLINE{europarl-translation-instruct,
  title = {europarl-translation-instruct},
  author = {M8than, recursal.ai},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/recursal/europarl-translation-instruct}},
}



