mlfoundations-dev/organic_chemistry_train_fasttext
Source: Hugging Face. Updated: 2025-03-27. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/mlfoundations-dev/organic_chemistry_train_fasttext
Description:
The dataset provides the following feature fields: the n-gram count before deduplication, a whole-page language ID represented with fastText vectors, metadata (content length, content type, WARC-related information, and so on), the previous-word count, the text content, the URL, and WARC information. It has a single training split of 1,048 samples. The download size is 1,227,764 bytes, and the reported total size is 4,371,914.642901542 bytes.
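The sizes quoted above are easier to read as per-sample and ratio figures. The following is a minimal sketch, not part of the original listing: the helper names `summarize_sizes` and `load_train_split` are hypothetical, the repository ID is taken from the download link, and loading assumes the Hugging Face `datasets` library is installed and the repository is reachable.

```python
# Figures copied from the listing; the helper names below are illustrative only.
REPO_ID = "mlfoundations-dev/organic_chemistry_train_fasttext"
NUM_SAMPLES = 1048
DOWNLOAD_BYTES = 1_227_764
TOTAL_BYTES = 4_371_914.642901542  # reported as a fractional value in the listing


def summarize_sizes(num_samples=NUM_SAMPLES,
                    download_bytes=DOWNLOAD_BYTES,
                    total_bytes=TOTAL_BYTES):
    """Derive the average sample size and the total-to-download size ratio."""
    return {
        "avg_bytes_per_sample": total_bytes / num_samples,
        "total_to_download_ratio": total_bytes / download_bytes,
    }


def load_train_split():
    """Load the training split (assumes `datasets` is installed and network access)."""
    from datasets import load_dataset  # imported lazily so the arithmetic runs without it
    return load_dataset(REPO_ID, split="train")


if __name__ == "__main__":
    for name, value in summarize_sizes().items():
        print(f"{name}: {value:.2f}")
```

Calling `load_train_split()` and inspecting the returned object's `features` attribute is one way to confirm the actual column names, which the listing describes only in prose.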
Provider:
mlfoundations-dev



