ifeval-th

ModelScope community. Updated 2025-07-16; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/scb10x/ifeval-th
Description:
Dataset Card for IFEval-TH

IFEval-TH is a Thai version of IFEval. The original English instructions (https://huggingface.co/datasets/google/IFEval) were translated into Thai using GPT-4, then manually verified and corrected to ensure accuracy and content consistency. Rows with poor translation quality, or whose context is irrelevant in Thai, were removed from the dataset.

### IFEval code modification

To use this dataset, you need to modify the IFEval code (https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/ifeval/instructions_util.py) so that word and sentence counting support the Thai language.

```python
import langdetect
import nltk
from pythainlp import sent_tokenize, word_tokenize


def count_words(text):
    """Counts the number of words."""
    try:
        # Thai is written without spaces between words, so use PyThaiNLP's
        # word tokenizer when the text is detected as Thai.
        if langdetect.detect(text) == "th":
            tokens = word_tokenize(text)
        else:
            tokenizer = nltk.tokenize.RegexpTokenizer(r"\w+")
            tokens = tokenizer.tokenize(text)
        return len(tokens)
    except Exception:
        # Language detection or tokenization failed; report zero words.
        return 0


def count_sentences(text):
    """Count the number of sentences."""
    try:
        if langdetect.detect(text) == "th":
            tokenized_sentences = sent_tokenize(text)
        else:
            # _get_en_sentence_tokenizer() is expected to be defined already
            # in instructions_util.py.
            tokenizer = _get_en_sentence_tokenizer()
            tokenized_sentences = tokenizer.tokenize(text)
        return len(tokenized_sentences)
    except Exception:
        # Language detection or tokenization failed; report zero sentences.
        return 0
```

### Licensing Information

The dataset is available under the Apache 2.0 license.
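
One property of the snippet above is worth keeping in mind: any exception raised by `langdetect.detect` (for example, on text with no detectable features) makes the count fall back to 0, and `langdetect` describes its own algorithm as non-deterministic. The following is a minimal, standard-library-only sketch of a deterministic script check that could serve as a cross-check or fallback. It is not part of the dataset card or of lm-evaluation-harness; the names `is_mostly_thai` and `count_words_with`, and the 0.5 threshold, are hypothetical choices made for illustration.

```python
import re

# Unicode block for the Thai script (U+0E00 to U+0E7F).
THAI_BLOCK = ("\u0e00", "\u0e7f")


def _is_thai_char(ch):
    """True if the character falls inside the Thai Unicode block."""
    return THAI_BLOCK[0] <= ch <= THAI_BLOCK[1]


def is_mostly_thai(text, threshold=0.5):
    """Hypothetical helper: True if at least `threshold` of the letters are Thai.

    Digits, punctuation, and whitespace are ignored, so text with no
    letters at all is reported as not Thai.
    """
    thai = sum(1 for ch in text if _is_thai_char(ch))
    other = sum(1 for ch in text if ch.isalpha() and not _is_thai_char(ch))
    total = thai + other
    return total > 0 and thai / total >= threshold


def count_words_with(text, thai_tokenizer=None):
    """Hypothetical helper: count words, using `thai_tokenizer` for Thai text.

    `thai_tokenizer` is any callable mapping a string to a list of tokens.
    Without one, or for non-Thai text, fall back to a regex word count.
    """
    if thai_tokenizer is not None and is_mostly_thai(text):
        return len(thai_tokenizer(text))
    return len(re.findall(r"\w+", text))
```

In practice one would pass `pythainlp.word_tokenize` as `thai_tokenizer`; the sketch keeps it injectable so the script check can be tried without installing PyThaiNLP.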

Provider: maas
Created: 2025-05-23