julia-lukasiewicz-pater/GPT-wiki-intro-features
Source: Hugging Face. Updated 2023-06-11; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/julia-lukasiewicz-pater/GPT-wiki-intro-features
Resource description:
---
license: cc
task_categories:
- text-classification
language:
- en
size_categories:
- 100K<n<1M
---
# GPT-wiki-intro-features dataset
This dataset is based on [aadityaubhat/GPT-wiki-intro](https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro).
It contains 150k short texts from Wikipedia (label 0) and 150k corresponding texts generated by ChatGPT (label 1), 300k texts in total.
For each text, several complexity measures were calculated, such as readability and lexical diversity.
The dataset can be used for text classification or for analyzing the linguistic features of human-generated and ChatGPT-generated texts.
For a smaller version, check out [julia-lukasiewicz-pater/small-GPT-wiki-intro-features](https://huggingface.co/datasets/julia-lukasiewicz-pater/small-GPT-wiki-intro-features).
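
As a rough illustration of the feature-analysis use case, the sketch below groups records by the `class` column and averages one feature per class. It assumes each record is a plain dictionary keyed by the column names listed in this card; the helper name `mean_feature_by_class` and the sample rows are hypothetical and are not real dataset values.

```python
from statistics import mean


def mean_feature_by_class(rows, feature):
    """Return {class_label: mean feature value} for dict-shaped records."""
    grouped = {}
    for row in rows:
        # "class" is the label column: 0 = Wikipedia, 1 = ChatGPT.
        grouped.setdefault(row["class"], []).append(row[feature])
    return {label: mean(values) for label, values in grouped.items()}


# Made-up rows for illustration only; these are NOT values from the dataset.
rows = [
    {"class": 0, "mtld": 80.0},
    {"class": 0, "mtld": 100.0},
    {"class": 1, "mtld": 60.0},
    {"class": 1, "mtld": 70.0},
]

for label, value in sorted(mean_feature_by_class(rows, "mtld").items()):
    print(label, value)
```

The same grouping could be applied to any numeric column, for example to check which features differ most between the two classes before training a classifier.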
## Dataset structure
Features were calculated using several Python libraries: NLTK, [readability-metrics](https://pypi.org/project/py-readability-metrics/), [lexical-diversity](https://pypi.org/project/lexical-diversity/),
and [TextDescriptives](https://hlasse.github.io/TextDescriptives/). The table below lists every feature together with the library used to compute it:
| Column | Description |
| ------ | ----------- |
| text | human- or ChatGPT-generated text; taken from aadityaubhat/GPT-wiki-intro |
| normalized_bigram_entropy | bigram entropy normalized with estimated maximum entropy; nltk |
| mean_word_length | mean word length; nltk |
| mean_sent_length | mean sentence length; nltk |
| fog | Gunning-Fog; readability-metrics |
| ari | Automated Readability Index; readability-metrics |
| dale_chall | Dale Chall Readability; readability-metrics |
| hdd | Hypergeometric distribution diversity (HD-D); lexical-diversity |
| mtld | Measure of Textual Lexical Diversity (MTLD); lexical-diversity |
| mattr | Moving average type-token ratio; lexical-diversity |
| number_of_ADJ | proportion of adjectives per word; nltk |
| number_of_ADP | proportion of adpositions per word; nltk |
| number_of_ADV | proportion of adverbs per word; nltk |
| number_of_CONJ | proportion of conjunctions per word; nltk |
| number_of_DET | proportion of determiners per word; nltk |
| number_of_NOUN | proportion of nouns per word; nltk |
| number_of_NUM | proportion of numerals per word; nltk |
| number_of_PRT | proportion of particles per word; nltk |
| number_of_PRON | proportion of pronouns per word; nltk |
| number_of_VERB | proportion of verbs per word; nltk |
| number_of_DOT | proportion of punctuation marks per word; nltk |
| number_of_X | proportion of POS tag 'Other' per word; nltk |
| class | binary class, 0 stands for Wikipedia, 1 stands for ChatGPT |
| spacy_perplexity | text perplexity; TextDescriptives |
| entropy | text entropy; TextDescriptives |
| automated_readability_index | Automated Readability Index; TextDescriptives |
| per_word_spacy_perplexity | text perplexity per word; TextDescriptives |
| dependency_distance_mean | mean distance from each token to its dependent; TextDescriptives |
| dependency_distance_std | standard deviation of the distance from each token to its dependent; TextDescriptives |
| first_order_coherence | cosine similarity between consecutive sentences; TextDescriptives |
| second_order_coherence | cosine similarity between sentences that are two sentences apart; TextDescriptives |
| smog | SMOG; TextDescriptives |
| prop_adjacent_dependency_relation_mean | mean proportion adjacent dependency relations; TextDescriptives |
| prop_adjacent_dependency_relation_std | standard deviation of proportion adjacent dependency relations; TextDescriptives |
| syllables_per_token_mean | mean of syllables per token; TextDescriptives |
| syllables_per_token_median | median of syllables per token; TextDescriptives |
| token_length_std | standard deviation of token length; TextDescriptives |
| token_length_median | median of token length; TextDescriptives |
| sentence_length_median | median of sentence length; TextDescriptives |
| syllables_per_token_std | standard deviation of syllables per token; TextDescriptives |
| proportion_unique_tokens | proportion of unique tokens; TextDescriptives |
| top_ngram_chr_fraction_3 | fraction of characters in a document that are contained within the top n-grams, for n = 3; TextDescriptives |
| top_ngram_chr_fraction_2 | fraction of characters in a document that are contained within the top n-grams, for n = 2; TextDescriptives |
| top_ngram_chr_fraction_4 | fraction of characters in a document that are contained within the top n-grams, for n = 4; TextDescriptives |
| proportion_bullet_points | proportion of lines in a document that start with a bullet point; TextDescriptives |
| flesch_reading_ease | Flesch Reading Ease; TextDescriptives |
| flesch_kincaid_grade | Flesch Kincaid grade; TextDescriptives |
| gunning_fog | Gunning-Fog; TextDescriptives |
| coleman_liau_index | Coleman-Liau Index; TextDescriptives |
| oov_ratio | out-of-vocabulary ratio; TextDescriptives |
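
To make a few of the tabulated measures concrete, here is a minimal standard-library sketch of `mean_word_length`, `mattr`, and the Automated Readability Index. It is not the code used to build the dataset, which relied on NLTK, lexical-diversity, and readability-metrics; the regex tokenizer, the function names, and the default window of 50 tokens are assumptions made for illustration, so values computed this way may differ from the dataset columns.

```python
import re


def tokenize(text):
    """Lowercase and extract runs of letters/apostrophes (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def mean_word_length(tokens):
    """Average number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Moving-average type-token ratio over all windows of `window` tokens.

    Falls back to the plain type-token ratio when the text is shorter
    than the window (an assumption of this sketch).
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def automated_readability_index(n_chars, n_words, n_sentences):
    """Standard ARI formula from character, word, and sentence counts."""
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sentences) - 21.43


tokens = tokenize("The cat sat on the mat. The dog sat on the log.")
print(mean_word_length(tokens))
print(mattr(tokens, window=5))
print(automated_readability_index(sum(len(t) for t in tokens), len(tokens), 2))
```

Passing raw counts to `automated_readability_index` keeps sentence splitting out of the sketch, since the result depends on how sentences and words are segmented.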
## Code
The code used to generate this dataset can be found on [GitHub](https://github.com/julia-lukasiewicz-pater/gpt-wiki-features/tree/main).
Provider:
julia-lukasiewicz-pater