
openfoodfacts/spellcheck-dataset

Source: Hugging Face. Updated 2024-07-30. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/openfoodfacts/spellcheck-dataset
Resource description:
---
language:
- fr
- en
- ro
- de
- es
- it
- bg
- du
- el
- gr
- pl
- pt
- sk
size_categories: 1K<n<10K
task_categories:
- text2text-generation
pretty_name: Spellcheck Training Dataset
tags:
- natural-language-processing
- spellcheck
- v5.2
dataset_size: 3000
dataset_info:
  features:
  - name: original
    dtype: string
    id: field
  - name: reference
    dtype: string
  - name: is_truncated
    dtype: int64
  - name: lang
    dtype: string
  - name: code
    dtype: 'null'
  splits:
  - name: train
    num_bytes: 3003786.9
    num_examples: 5391
  - name: test
    num_bytes: 333754.1
    num_examples: 599
  download_size: 2142379
  dataset_size: 3337541.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Spellcheck dataset

This dataset is used to train a Seq2Seq model designed to fix ingredient lists of Open Food Facts products.

Products were extracted from the Open Food Facts database ([JSONL](https://world.openfoodfacts.org/data)) along with their language and list of ingredients. These products were selected according to the following criteria:

* 20 to 40% unknown ingredients, as computed during the [Ingredient Extraction Analysis](https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis),
* No duplicate in the list of ingredients,
* No duplicate with the [spellcheck-benchmark](https://huggingface.co/datasets/openfoodfacts/spellcheck-benchmark)

Once the products were extracted, corrections were generated using the best closed-source LLM on the benchmark: **OpenAI-GPT-3.5-Turbo**.

The following prompt was used to generate the corrections:

```plaintext
You are a spellcheck assistant designed to fix typos and errors in a list \
of ingredients in different languages extracted from product packagings. We want to \
extract the ingredients from this list using our algorithms. However, it is possible some typos or \
errors slipped into the list.

Your task is to correct those errors following a guideline I provide you.

Correction guideline:
* If you recognize an ingredient and notice a typo, fix the typo. If you're not sure, don't correct;
* Line breaks in the package list of ingredients leads to this error: "<subword1> - <subword2>". Join them into a single <word>;
* Some ingredients are enclosed within underscores, such as _milk_ or _cacahuetes_, to denote ingredients that are allergens. Keep them!
* In the same way, some ingredients are characterized with *, such as "cane sugar*". You need to keep them as well;
* Punctuation such as "," is used to separate 2 ingredients from the list. If the punctuation is missing between 2 ingredients, add one. Otherwise, don't;
* Perform uppercase to lowercase changes, and vice-versa, only after a period (.) or for proper names;
* Never try to predict percentages in case of OCR bad parsing. Just keep it as it is;
* Some additives (such as E124, E150c, etc...) are badly parsed by the OCR. Don't try to correct them;
* Keep the same structure, words and whitespaces as much as possible. Focus only on the previous cited rules;

Here's a list of examples:

###List of ingredients:
24, 36 % chocolat noir 63% origine non ue (cacao, scre, beurre de cacao, émulsifiant léci - thine de colza, vanille bourbon gousse), œuf, farine de blé, beurre , sucre, mel, sucre perlé, levure chimique, zeste de citron

###Corrected list of ingredients:
24, 36 % chocolat noir 63% origine non ue (cacao, sucre, beurre de cacao, émulsifiant lécithine de colza, vanille bourbon gousse), œuf, farine de blé, beurre , sucre, miel, sucre perlé, levure chimique, zeste de citron

###List of ingredients:
eau de source,Potassium Calcium Sulfates Magnesium Sodium Chlorures Nitrates Nitrites Fluor Résidus secs Bicarbonate.

###Corrected list of ingredients:
eau de source,Potassium, Calcium, Sulfates, Magnesium, Sodium, Chlorures, Nitrates, Nitrites, Fluor, Résidus secs, Bicarbonate.

###List of ingredients:
BASIL (50%), EXTRA VIRGIN OLIVE OIL (32 %), PINE NUTS (4%), Bamboo Fibre, Sugar, Garlic,PECORINO ROMANO PDO CHEESE (196) (Milk), Salt,

###Corrected list of ingredients:
BASIL (50%), EXTRA VIRGIN OLIVE OIL (32 %), PINE NUTS (4%), Bamboo Fibre, Sugar, Garlic,PECORINO ROMANO PDO CHEESE (196) (Milk), Salt,

###List of ingredients:
Κάθετη μονάδα παραγωγής και επεξεργασίας Συστατικά: Πολτός ελληνικού φυστικιού (70%), υδρογονωμένο φοινικέλαιο, ζάχαρη, καραμέλα (396, κομμάτια ψημένου ελληνικού φυστικιού (2%) , αρωματικές ύλες, αλάτι. Διατηρείται σε δροσερό και σκιερό μέρος. Η παρουσία λαδιού στην επιφάνεια είναι φυσικό φαινόμενο. Ανα κατέψτε καλά Πριν από κάθε χρήση. Παράγεται και συσκευάζεται στην Ελλάδα από : Χρήστος Αγριανίδης. φυστικιού . Αμμουδιά Σερρν,

###Corrected list of ingredients:
Κάθετη μονάδα παραγωγής και επεξεργασίας Συστατικά: Πολτός ελληνικού φυστικιού (70%), υδρογονωμένο φοινικέλαιο, ζάχαρη, καραμέλα (396, κομμάτια ψημένου ελληνικού φυστικιού (2%) , αρωματικές ύλες, αλάτι. Διατηρείται σε δροσερό και σκιερό μέρος. Η παρουσία λαδιού στην επιφάνεια είναι φυσικό φαινόμενο. Ανα κατέψτε καλά Πριν από κάθε χρήση. Παράγεται και συσκευάζεται στην Ελλάδα από : Χρήστος Αγριανίδης. φυστικιού .
Αμμουδιά Σερρν,

###List of ingredients:
Bauturà racoritoare carbogazoasă cu aroma de cata, Ingrediente: apa, zahär, dioxid de carbon, colorant (caramel (ETSod)), acidifiant (acid fosforic), arome, cafeina, Declaratie nytritionala per 100 ml: Valoare energetica 182 kJ/

###Corrected list of ingredients:
Bautură răcoritoare carbogazoasă cu aromă de cata, Ingrediente: apă, zahăr, dioxid de carbon, colorant (caramel (ETSod)), acidifiant (acid fosforic), arome, cafeină, Declaratie nutritionala per 100 ml: Valoare energetică 182 kJ/

###List of ingredients:
Eau œufs entiers, sucre, farine de blé, sirop de gfucose-fructose, beurre concentré (506) (contient du lait), lait en poudre, pâte de cacao, odifié de manioc, cacao maigre en poudre, beurre de bres végétales, poudre de cacao, émulsifiant : lécithine de el, gélifiant : pectine arômes. Dont chocolat 8%. races éventuelles de fruits à coques. nballage avec un sachet absorbeur d'oxygène : ne pas consommer rapidement aprè.souverture. consommer jusqu'au : voir? liledessus de l'emballage

###Corrected list of ingredients:
Eau, œufs entiers, sucre, farine de blé, sirop de glucose-fructose, beurre concentré (506) (contient du lait), lait en poudre, pâte de cacao, odifié de manioc, cacao maigre en poudre, beurre de bres végétales, poudre de cacao, émulsifiant : lécithine de sel, gélifiant : pectine arômes. Dont chocolat 8%. Traces éventuelles de fruits à coques. emballage avec un sachet absorbeur d'oxygène : ne pas consommer rapidement après ouverture. consommer jusqu'au : voir sur le dessus de l'emballage

###List of ingredients:
mand beans (black eyed beans, chickpeas, pea beans, pinto beans, red kidney beans, adzuki beans, water, frming agent: calcium chloride,

###Corrected list of ingredients:
mand beans (black eyed beans, chickpeas, pea beans, pinto beans, red kidney beans, adzuki beans, water, firming agent: calcium chloride,

###List of ingredients:
Lait écrémé pasteurisé, crème pasteurisée, ferments lactiques présure.
Lait et crème origine France.

###Corrected list of ingredients:
Lait écrémé pasteurisé, crème pasteurisée, ferments lactiques, présure. Lait et crème origine France.

###List of ingredients:
Farine de BLE, eau, ernmental (LAIT) 19%, margarine (huiles ct matières grasses végétales (palme, colza), eau, émulsifiants : EOI, acidifiant : E320), levure, sel, GLUTEN de BLE, levain de BLE, herbes de Provence, érnuj%ifiant F471, farine de BLE malté, agent de traitement de la farine [300

###Corrected list of ingredients:
Farine de BLÉ, eau, emmental (LAIT) 19%, margarine (huiles et matières grasses végétales (palme, colza), eau, émulsifiants : EOI, acidifiant : E320), levure, sel, GLUTEN de BLÉ, levain de BLÉ, herbes de Provence, émulsifiant F471, farine de BLÉ malté, agent de traitement de la farine [300
```

## Version notes

### v5

* **v5.2**
  * 1850 examples reviewed on Argilla
* **v5.1**
  * 1316 examples reviewed on Argilla
* **v5.0**
  * Reviewed examples on Argilla (~1000 examples) + non-reviewed examples (GPT-3.5 generated)

### v4

* **v4.1**
  * Added reviewed examples: ~1000 examples
* **v4.0**
  * Manual correction of the dataset in Argilla. The dataset contains only reviewed examples: ~400

### v3

* **v3.1**
  * Alignment of oe, œ between text & reference
    * Example:
      * text = "oeuf, bœuf",
      * prediction = "œuf, boeuf",
      * aligned = "oeuf, bœuf"
  * Alignment of whitespace between number and %
    * Example:
      * text = "escargot 14 %, olives 28 %",
      * prediction = "escargot 14%, olives 28%",
      * aligned = "escargot 14 %, olives 28 %"
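
The two v3.1 alignment steps can be illustrated with a minimal Python sketch. The dataset card does not say how the alignment was implemented, so the function names `align_oe` and `align_percent_whitespace` and the pairing strategy (match occurrences by position, and skip alignment when the counts differ) are assumptions made for illustration only.

```python
import re

# Hypothetical helpers: the card only describes the expected behaviour,
# not the code that produced it.

OE_PATTERN = re.compile(r"oe|œ")
PERCENT_PATTERN = re.compile(r"(\d)(\s*)%")


def align_oe(text: str, prediction: str) -> str:
    """Give each oe/œ in `prediction` the form used at the same position in `text`."""
    text_forms = OE_PATTERN.findall(text)
    pred_forms = OE_PATTERN.findall(prediction)
    # Assumption: only align when occurrences can be paired one-to-one.
    if len(text_forms) != len(pred_forms):
        return prediction
    forms = iter(text_forms)
    return OE_PATTERN.sub(lambda _match: next(forms), prediction)


def align_percent_whitespace(text: str, prediction: str) -> str:
    """Restore the whitespace between a digit and % as it appears in `text`."""
    text_gaps = [match.group(2) for match in PERCENT_PATTERN.finditer(text)]
    pred_count = len(PERCENT_PATTERN.findall(prediction))
    # Assumption: leave the prediction untouched if the counts disagree.
    if len(text_gaps) != pred_count:
        return prediction
    gaps = iter(text_gaps)
    return PERCENT_PATTERN.sub(
        lambda match: f"{match.group(1)}{next(gaps)}%", prediction
    )


print(align_oe("oeuf, bœuf", "œuf, boeuf"))
print(align_percent_whitespace("escargot 14 %, olives 28 %", "escargot 14%, olives 28%"))
```

Both calls reproduce the `aligned` values listed in the v3.1 notes. Aligning the reference to the original in this way keeps purely typographic differences out of the training signal, so the model is not rewarded for changing `oe` to `œ` or for removing a space before `%`.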

Spellcheck Training Dataset
Provider:
openfoodfacts
Summary of original information

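Dataset overview

Basic information

  • Name: Spellcheck Training Dataset
  • Version: v3
  • Tags: natural language processing, spellcheck
  • Languages: French, English, Romanian, German, Spanish, Italian, Bulgarian, Dutch, Greek, Polish, Portuguese, Slovak
  • Size category: 1K<n<10K
  • Task category: text-to-text generation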

Dataset details

  • Dataset size: 3000 records
  • Download size: 2209605 bytes
  • Total dataset size: 3402319.0 bytes

Data structure

  • Features:
    • code: integer (int64)
    • lang: string
    • text: string
    • known_ingredients_n: integer (int64)
    • label: string

Data splits

  • Training set:
    • Number of examples: 5400
    • Bytes: 3062087.1
  • Test set:
    • Number of examples: 600
    • Bytes: 340231.9

Configuration

  • Default configuration:
    • Training data path: data/train-*
    • Test data path: data/test-*