ifeval-th
Source: ModelScope community. Updated 2025-07-16; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/scb10x/ifeval-th
Description:
Dataset Card for IFEval-TH
IFEval-TH is a Thai version of IFEval. The original English instructions (https://huggingface.co/datasets/google/IFEval)
were translated into Thai using GPT-4, followed by a manual verification and correction process to ensure accuracy and content consistency.
Rows with poor translation quality or irrelevant context in Thai were removed from the dataset.
### IFEval code modification
To use this dataset, you need to modify the IFEval code (https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/ifeval/instructions_util.py) to include support for the Thai language.
```python
import langdetect
import nltk
from pythainlp import sent_tokenize, word_tokenize

# Replacements for count_words and count_sentences in instructions_util.py.
# _get_en_sentence_tokenizer is the helper already defined in that module.


def count_words(text):
    """Counts the number of words."""
    try:
        if langdetect.detect(text) == 'th':
            # Thai is written without spaces between words, so a regex
            # on \w+ cannot split it; use the PyThaiNLP tokenizer instead.
            tokens = word_tokenize(text)
        else:
            tokenizer = nltk.tokenize.RegexpTokenizer(r"\w+")
            tokens = tokenizer.tokenize(text)
        return len(tokens)
    except Exception:
        # Language detection can fail (e.g. on empty text); count as zero.
        return 0


def count_sentences(text):
    """Count the number of sentences."""
    try:
        if langdetect.detect(text) == 'th':
            tokenized_sentences = sent_tokenize(text)
        else:
            tokenizer = _get_en_sentence_tokenizer()
            tokenized_sentences = tokenizer.tokenize(text)
        return len(tokenized_sentences)
    except Exception:
        return 0
```
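The functions above return 0 whenever language detection fails, and `langdetect` can give varying answers on very short or ambiguous text unless it is seeded. As a rough illustration of an alternative dispatch, the sketch below picks the tokenizer from the share of characters in the Thai Unicode block instead of calling `langdetect`. This is not part of IFEval-TH or the evaluation harness: `thai_ratio`, `count_words_by_script`, and the 0.5 threshold are hypothetical names and values chosen for this example, and the Thai tokenizer is passed in as a parameter so the sketch runs with the standard library alone.

```python
import re
from typing import Callable, List

# Thai Unicode block (U+0E00 to U+0E7F), including vowel and tone marks.
_THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")
_WORD = re.compile(r"\w+")


def thai_ratio(text: str) -> float:
    """Fraction of letter-like characters that lie in the Thai block."""
    # Thai combining marks are not alphabetic per str.isalpha(), so they
    # are admitted explicitly through the block check.
    letters = [ch for ch in text if ch.isalpha() or _THAI_CHAR.match(ch)]
    if not letters:
        return 0.0
    thai = sum(1 for ch in letters if _THAI_CHAR.match(ch))
    return thai / len(letters)


def count_words_by_script(
    text: str,
    thai_tokenize: Callable[[str], List[str]],
    threshold: float = 0.5,  # hypothetical cut-off, not from the dataset card
) -> int:
    """Count words, choosing the tokenizer by script rather than langdetect."""
    if thai_ratio(text) >= threshold:
        tokens = thai_tokenize(text)
    else:
        tokens = _WORD.findall(text)
    return len(tokens)
```

With PyThaiNLP installed, the call would look like `count_words_by_script(text, thai_tokenize=word_tokenize)`. Scores computed this way may differ from those produced by the `langdetect`-based functions on mixed-language responses, so the two should not be mixed within one evaluation run.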
### Licensing Information
The dataset is available under the Apache 2.0 license.
Provider:
maas
Created:
2025-05-23



