ifeval-th

Name: ifeval-th
Creator: maas
Published: 2025-07-16 16:29:50
License: No description available

ModelScope Community. Updated 2025-07-16; indexed 2025-05-24.

Download link:

https://modelscope.cn/datasets/scb10x/ifeval-th


Resource description:

# Dataset Card for IFEval-TH

IFEval-TH is a Thai version of IFEval. The original English instructions (https://huggingface.co/datasets/google/IFEval) were translated into Thai using GPT-4, then manually verified and corrected to ensure accuracy and content consistency. Rows with poor translation quality or with context irrelevant to Thai were removed from the dataset.

### IFEval code modification

To use this dataset, you need to modify the IFEval code (https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/ifeval/instructions_util.py) to include support for the Thai language.

```python
import langdetect
import nltk
from pythainlp import sent_tokenize, word_tokenize


def count_words(text):
    """Counts the number of words."""
    try:
        # Use the Thai word tokenizer for Thai text; otherwise fall back to the regex tokenizer.
        if langdetect.detect(text) == 'th':
            tokens = word_tokenize(text)
        else:
            tokenizer = nltk.tokenize.RegexpTokenizer(r"\w+")
            tokens = tokenizer.tokenize(text)
        num_words = len(tokens)
        return num_words
    except Exception:
        # Any detection or tokenization failure counts as zero words.
        return 0


def count_sentences(text):
    """Count the number of sentences."""
    try:
        if langdetect.detect(text) == 'th':
            tokenized_sentences = sent_tokenize(text)
        else:
            # _get_en_sentence_tokenizer() is expected to come from the original instructions_util.py.
            tokenizer = _get_en_sentence_tokenizer()
            tokenized_sentences = tokenizer.tokenize(text)
        return len(tokenized_sentences)
    except Exception:
        # Any detection or tokenization failure counts as zero sentences.
        return 0
```

### Licensing Information

The dataset is available under the Apache 2.0 license.
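The code modification in the card routes each text to a Thai or English tokenizer based on `langdetect.detect(text)`, whose result can vary for short or ambiguous inputs. As a rough illustration only, the routing decision could also be made from the script of the characters themselves. The sketch below uses just the Python standard library and assumes that a majority of characters in the Thai Unicode block (U+0E00 to U+0E7F) is a good enough signal. The names `thai_char_ratio` and `looks_thai` and the 0.5 threshold are hypothetical and are not part of IFEval, lm-evaluation-harness, or this dataset.

```python
def thai_char_ratio(text: str) -> float:
    """Return the share of letter-like characters that lie in the Thai Unicode block."""
    thai = 0
    other = 0
    for ch in text:
        if "\u0e00" <= ch <= "\u0e7f":
            # Thai consonants, vowels, tone marks, and digits all live in this block.
            thai += 1
        elif ch.isalpha():
            # Any other alphabetic character counts as non-Thai.
            other += 1
    total = thai + other
    return thai / total if total else 0.0


def looks_thai(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical routing check: True when at least `threshold` of the letters are Thai."""
    return thai_char_ratio(text) >= threshold
```

In `count_words` and `count_sentences`, a check like `looks_thai(text)` could stand in for `langdetect.detect(text) == 'th'`. Whether that changes the resulting scores is untested here.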


Provider:

maas

Created:

2025-05-23

Summary

Dataset introduction