julia-lukasiewicz-pater/GPT-wiki-intro-features

Name: julia-lukasiewicz-pater/GPT-wiki-intro-features
Creator: julia-lukasiewicz-pater
Published: 2023-06-11 14:41:17
License: no description available

Hugging Face · Updated 2023-06-11 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/julia-lukasiewicz-pater/GPT-wiki-intro-features

Resource description:

---
license: cc
task_categories:
- text-classification
language:
- en
size_categories:
- 100K<n<1M
---

# GPT-wiki-intro-features dataset

This dataset is based on [aadityaubhat/GPT-wiki-intro](https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro). It contains 150k short texts from Wikipedia (label 0) and the corresponding texts generated by ChatGPT (label 1), 300k texts in total. For each text, various complexity measures were calculated, including readability and lexical diversity. The dataset can be used for text classification or for analyzing the linguistic features of human-generated and ChatGPT-generated texts.

For a smaller version, check out [julia-lukasiewicz-pater/small-GPT-wiki-intro-features](https://huggingface.co/datasets/julia-lukasiewicz-pater/small-GPT-wiki-intro-features).

## Dataset structure

Features were calculated using several Python libraries: NLTK, [readability-metrics](https://pypi.org/project/py-readability-metrics/), [lexical-diversity](https://pypi.org/project/lexical-diversity/), and [TextDescriptives](https://hlasse.github.io/TextDescriptives/). All features and their sources are listed below:

| Column | Description |
| ------ | ----------- |
| text | human- or ChatGPT-generated text; taken from aadityaubhat/GPT-wiki-intro |
| normalized_bigram_entropy | bigram entropy normalized with estimated maximum entropy; nltk |
| mean_word_length | mean word length; nltk |
| mean_sent_length | mean sentence length; nltk |
| fog | Gunning-Fog; readability-metrics |
| ari | Automated Readability Index; readability-metrics |
| dale_chall | Dale-Chall readability; readability-metrics |
| hdd | hypergeometric distribution D (HD-D); lexical-diversity |
| mtld | Measure of Textual Lexical Diversity; lexical-diversity |
| mattr | moving-average type-token ratio; lexical-diversity |
| number_of_ADJ | proportion of adjectives per word; nltk |
| number_of_ADP | proportion of adpositions per word; nltk |
| number_of_ADV | proportion of adverbs per word; nltk |
| number_of_CONJ | proportion of conjunctions per word; nltk |
| number_of_DET | proportion of determiners per word; nltk |
| number_of_NOUN | proportion of nouns per word; nltk |
| number_of_NUM | proportion of numerals per word; nltk |
| number_of_PRT | proportion of particles per word; nltk |
| number_of_PRON | proportion of pronouns per word; nltk |
| number_of_VERB | proportion of verbs per word; nltk |
| number_of_DOT | proportion of punctuation marks per word; nltk |
| number_of_X | proportion of POS tag 'Other' per word; nltk |
| class | binary class: 0 stands for Wikipedia, 1 stands for ChatGPT |
| spacy_perplexity | text perplexity; TextDescriptives |
| entropy | text entropy; TextDescriptives |
| automated_readability_index | Automated Readability Index; TextDescriptives |
| per_word_spacy_perplexity | text perplexity per word; TextDescriptives |
| dependency_distance_mean | mean distance from each token to its dependent; TextDescriptives |
| dependency_distance_std | standard deviation of the distance from each token to its dependent; TextDescriptives |
| first_order_coherence | cosine similarity between consecutive sentences; TextDescriptives |
| second_order_coherence | cosine similarity between sentences that are two sentences apart; TextDescriptives |
| smog | SMOG; TextDescriptives |
| prop_adjacent_dependency_relation_mean | mean proportion of adjacent dependency relations; TextDescriptives |
| prop_adjacent_dependency_relation_std | standard deviation of the proportion of adjacent dependency relations; TextDescriptives |
| syllables_per_token_mean | mean of syllables per token; TextDescriptives |
| syllables_per_token_median | median of syllables per token; TextDescriptives |
| token_length_std | standard deviation of token length; TextDescriptives |
| token_length_median | median of token length; TextDescriptives |
| sentence_length_median | median of sentence length; TextDescriptives |
| syllables_per_token_std | standard deviation of syllables per token; TextDescriptives |
| proportion_unique_tokens | proportion of unique tokens; TextDescriptives |
| top_ngram_chr_fraction_3 | fraction of characters in a document that are contained within the top n-grams, for a specified n-gram range; TextDescriptives |
| top_ngram_chr_fraction_2 | fraction of characters in a document that are contained within the top n-grams, for a specified n-gram range; TextDescriptives |
| top_ngram_chr_fraction_4 | fraction of characters in a document that are contained within the top n-grams, for a specified n-gram range; TextDescriptives |
| proportion_bullet_points | proportion of lines in a document that start with a bullet point; TextDescriptives |
| flesch_reading_ease | Flesch Reading Ease; TextDescriptives |
| flesch_kincaid_grade | Flesch-Kincaid grade; TextDescriptives |
| gunning_fog | Gunning-Fog; TextDescriptives |
| coleman_liau_index | Coleman-Liau Index; TextDescriptives |
| oov_ratio | out-of-vocabulary ratio; TextDescriptives |

## Code

The code used to generate this dataset can be found on [GitHub](https://github.com/julia-lukasiewicz-pater/gpt-wiki-features/tree/main).
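
As a rough illustration of the `normalized_bigram_entropy` column, the sketch below computes the Shannon entropy of a text's bigrams and divides it by an assumed maximum. The card does not say how the maximum entropy was estimated, so the choice made here (log2 of the number of bigrams, which is the entropy if every bigram occurred exactly once) is an assumption, as are the function name and the whitespace tokenization. The linked repository remains the reference for the actual computation.

```python
import math
from collections import Counter
from typing import Sequence


def normalized_bigram_entropy(tokens: Sequence[str]) -> float:
    """Shannon entropy of the bigram distribution, divided by an assumed maximum.

    Assumption: the maximum is log2(number of bigrams), i.e. the entropy
    reached when every bigram in the text is distinct.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    if n < 2:
        # With zero or one bigram the assumed maximum is 0, so define the score as 0.
        return 0.0
    counts = Counter(bigrams)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)


if __name__ == "__main__":
    # Hypothetical sample text with naive whitespace tokenization.
    sample = "the cat sat on the mat and the cat slept".split()
    print(round(normalized_bigram_entropy(sample), 3))
```

Under this assumption the score lies between 0 (one bigram repeated throughout) and 1 (no bigram repeated), so lower values indicate more repetitive text.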

Provider:

julia-lukasiewicz-pater

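Summary of original information

GPT-wiki-intro-features dataset

Overview

This dataset is based on aadityaubhat/GPT-wiki-intro. It contains 150k texts from Wikipedia (label 0) and the corresponding texts generated by ChatGPT (label 1), 300k texts in total. Several complexity measures, such as readability and lexical diversity, were calculated for each text. The dataset can be used for text classification or for analyzing the linguistic features of human-generated and ChatGPT-generated texts.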
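
For classification, the `class` column serves as the label and the numeric columns serve as features. The following is a minimal sketch, assuming the rows are already available as plain dictionaries keyed by column name. The helper name and the toy values are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Sequence, Tuple


def split_features_and_label(
    rows: Sequence[Dict[str, object]],
    label_col: str = "class",
    skip_cols: Tuple[str, ...] = ("text",),
) -> Tuple[List[str], List[List[float]], List[int]]:
    """Separate numeric feature columns from the binary label.

    Returns the feature names (sorted for a stable column order),
    the feature matrix, and the label vector.
    """
    feature_names = sorted(
        name for name in rows[0] if name != label_col and name not in skip_cols
    )
    features = [[float(row[name]) for name in feature_names] for row in rows]
    labels = [int(row[label_col]) for row in rows]
    return feature_names, features, labels


if __name__ == "__main__":
    # Toy rows with made-up values; real rows carry many more columns.
    toy_rows = [
        {"text": "human text", "class": 0, "fog": 10.0, "ari": 8.5},
        {"text": "generated text", "class": 1, "fog": 12.0, "ari": 9.0},
    ]
    names, X, y = split_features_and_label(toy_rows)
    print(names, X, y)
```

The resulting matrix and label vector can be passed to any classifier that accepts lists of numbers.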

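Dataset structure

Features were calculated using several Python libraries: NLTK, readability-metrics, lexical-diversity, and TextDescriptives. All features and their sources are listed below:

Column	Description
text	human- or ChatGPT-generated text; taken from aadityaubhat/GPT-wiki-intro
normalized_bigram_entropy	normalized bigram entropy; nltk
mean_word_length	mean word length; nltk
mean_sent_length	mean sentence length; nltk
fog	Gunning-Fog index; readability-metrics
ari	Automated Readability Index; readability-metrics
dale_chall	Dale-Chall readability; readability-metrics
hdd	hypergeometric distribution D (HD-D); lexical-diversity
mtld	Measure of Textual Lexical Diversity; lexical-diversity
mattr	moving-average type-token ratio; lexical-diversity
number_of_ADJ	proportion of adjectives per word; nltk
number_of_ADP	proportion of adpositions per word; nltk
number_of_ADV	proportion of adverbs per word; nltk
number_of_CONJ	proportion of conjunctions per word; nltk
number_of_DET	proportion of determiners per word; nltk
number_of_NOUN	proportion of nouns per word; nltk
number_of_NUM	proportion of numerals per word; nltk
number_of_PRT	proportion of particles per word; nltk
number_of_PRON	proportion of pronouns per word; nltk
number_of_VERB	proportion of verbs per word; nltk
number_of_DOT	proportion of punctuation marks per word; nltk
number_of_X	proportion of POS tag 'Other' per word; nltk
class	binary class: 0 stands for Wikipedia, 1 stands for ChatGPT
spacy_perplexity	text perplexity; TextDescriptives
entropy	text entropy; TextDescriptives
automated_readability_index	Automated Readability Index; TextDescriptives
per_word_spacy_perplexity	text perplexity per word; TextDescriptives
dependency_distance_mean	mean distance from each token to its dependent; TextDescriptives
dependency_distance_std	standard deviation of the distance from each token to its dependent; TextDescriptives
first_order_coherence	cosine similarity between consecutive sentences; TextDescriptives
second_order_coherence	cosine similarity between sentences that are two sentences apart; TextDescriptives
smog	SMOG index; TextDescriptives
prop_adjacent_dependency_relation_mean	mean proportion of adjacent dependency relations; TextDescriptives
prop_adjacent_dependency_relation_std	standard deviation of the proportion of adjacent dependency relations; TextDescriptives
syllables_per_token_mean	mean of syllables per token; TextDescriptives
syllables_per_token_median	median of syllables per token; TextDescriptives
token_length_std	standard deviation of token length; TextDescriptives
token_length_median	median of token length; TextDescriptives
sentence_length_median	median of sentence length; TextDescriptives
syllables_per_token_std	standard deviation of syllables per token; TextDescriptives
proportion_unique_tokens	proportion of unique tokens; TextDescriptives
top_ngram_chr_fraction_3	fraction of characters in a document that are contained within the top n-grams; TextDescriptives
top_ngram_chr_fraction_2	fraction of characters in a document that are contained within the top n-grams; TextDescriptives
top_ngram_chr_fraction_4	fraction of characters in a document that are contained within the top n-grams; TextDescriptives
proportion_bullet_points	proportion of lines in a document that start with a bullet point; TextDescriptives
flesch_reading_ease	Flesch Reading Ease; TextDescriptives
flesch_kincaid_grade	Flesch-Kincaid grade; TextDescriptives
gunning_fog	Gunning-Fog index; TextDescriptives
coleman_liau_index	Coleman-Liau Index; TextDescriptives
oov_ratio	out-of-vocabulary ratio; TextDescriptives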
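
To make the `mattr` column more concrete, here is a minimal pure-Python sketch of a moving-average type-token ratio. The dataset itself used the lexical-diversity library; the function below, its window size, and its handling of short texts are illustrative assumptions and may differ from that library's behavior.

```python
from typing import Sequence


def mattr(tokens: Sequence[str], window: int = 50) -> float:
    """Moving-average type-token ratio over a sliding window of tokens.

    Assumptions: an empty text scores 0.0, and a text shorter than the
    window falls back to the plain type-token ratio of the whole text.
    """
    if not tokens:
        return 0.0
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    # Type-token ratio of every full window, moved one token at a time.
    ratios = [
        len(set(tokens[start:start + window])) / window
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical sample text with naive whitespace tokenization.
    sample = "the cat sat on the mat and the dog sat on the rug".split()
    print(round(mattr(sample, window=5), 3))
```

Averaging over fixed-size windows makes the score less sensitive to text length than a plain type-token ratio, which tends to fall as texts get longer.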
