agentlans/c4-en-tokenized
Source: Hugging Face. Updated 2025-01-19. Indexed 2025-02-15.
Download link:
https://hf-mirror.com/datasets/agentlans/c4-en-tokenized
Resource summary:
This dataset contains English samples drawn from C4 (Colossal Clean Crawled Corpus) and tokenized with spaCy's en_core_web_sm model. Each sample provides the original text, the tokenized text, the number of tokens, and the number of punctuation tokens. It is suited to natural language processing (NLP) tasks such as text classification, language modeling, and sentiment analysis.
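The features listed above could be derived roughly as in the sketch below. It is an illustration only: a simple regex tokenizer stands in for spaCy's en_core_web_sm pipeline so the example runs without a model download, and the field names (`text`, `tokenized`, `num_tokens`, `num_punct_tokens`) as well as the space-joined form of the tokenized text are assumptions, not the dataset's confirmed schema.

```python
import re
import string

# Stand-in tokenizer (NOT spaCy): a token is a run of word characters
# or a single non-space symbol. spaCy's real rules are more elaborate.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def describe(text: str) -> dict:
    """Build one row with hypothetical field names for illustration."""
    tokens = _TOKEN_RE.findall(text)
    # Count tokens made up entirely of ASCII punctuation characters.
    n_punct = sum(
        1 for tok in tokens if all(ch in string.punctuation for ch in tok)
    )
    return {
        "text": text,                    # original text
        "tokenized": " ".join(tokens),   # assumed space-joined tokens
        "num_tokens": len(tokens),
        "num_punct_tokens": n_punct,
    }


row = describe("Hello, world! This is C4.")
print(row)
```

With spaCy installed, the same counts would presumably come from `spacy.load("en_core_web_sm")` and each token's `is_punct` attribute, and the results may differ from this regex stand-in on contractions, URLs, and similar cases.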
Provider:
agentlans



