julia-lukasiewicz-pater/GPT-wiki-intro-features

Source: Hugging Face. Updated 2023-06-11; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/julia-lukasiewicz-pater/GPT-wiki-intro-features
Description:
---
license: cc
task_categories:
- text-classification
language:
- en
size_categories:
- 100K<n<1M
---

# Small-GPT-wiki-intro-features dataset

This dataset is based on [aadityaubhat/GPT-wiki-intro](https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro). It contains 150k short texts from Wikipedia (label 0) and the corresponding texts generated by ChatGPT (label 1), 300k texts in total. For each text, various complexity measures were calculated, including readability and lexical diversity. It can be used for text classification or for analysing the linguistic features of human-written and ChatGPT-generated texts.

For a smaller version, check out [julia-lukasiewicz-pater/small-GPT-wiki-intro-features](https://huggingface.co/datasets/julia-lukasiewicz-pater/small-GPT-wiki-intro-features).

## Dataset structure

Features were calculated using the following Python libraries: NLTK, [readability-metrics](https://pypi.org/project/py-readability-metrics/), [lexical-diversity](https://pypi.org/project/lexical-diversity/), and [TextDescriptives](https://hlasse.github.io/TextDescriptives/).

The list of all features and their corresponding sources can be found below:

| Column | Description |
| ------ | ----------- |
| text | human- or ChatGPT-generated text; taken from aadityaubhat/GPT-wiki-intro |
| normalized_bigram_entropy | bigram entropy normalized by the estimated maximum entropy; nltk |
| mean_word_length | mean word length; nltk |
| mean_sent_length | mean sentence length; nltk |
| fog | Gunning-Fog index; readability-metrics |
| ari | Automated Readability Index; readability-metrics |
| dale_chall | Dale-Chall readability; readability-metrics |
| hdd | hypergeometric distribution diversity (HD-D); lexical-diversity |
| mtld | measure of textual lexical diversity; lexical-diversity |
| mattr | moving-average type-token ratio; lexical-diversity |
| number_of_ADJ | proportion of adjectives per word; nltk |
| number_of_ADP | proportion of adpositions per word; nltk |
| number_of_ADV | proportion of adverbs per word; nltk |
| number_of_CONJ | proportion of conjunctions per word; nltk |
| number_of_DET | proportion of determiners per word; nltk |
| number_of_NOUN | proportion of nouns per word; nltk |
| number_of_NUM | proportion of numerals per word; nltk |
| number_of_PRT | proportion of particles per word; nltk |
| number_of_PRON | proportion of pronouns per word; nltk |
| number_of_VERB | proportion of verbs per word; nltk |
| number_of_DOT | proportion of punctuation marks per word; nltk |
| number_of_X | proportion of POS tag 'Other' per word; nltk |
| class | binary class: 0 stands for Wikipedia, 1 stands for ChatGPT |
| spacy_perplexity | text perplexity; TextDescriptives |
| entropy | text entropy; TextDescriptives |
| automated_readability_index | Automated Readability Index; TextDescriptives |
| per_word_spacy_perplexity | text perplexity per word; TextDescriptives |
| dependency_distance_mean | mean distance from each token to its dependent; TextDescriptives |
| dependency_distance_std | standard deviation of the distance from each token to its dependent; TextDescriptives |
| first_order_coherence | cosine similarity between consecutive sentences; TextDescriptives |
| second_order_coherence | cosine similarity between sentences that are two sentences apart; TextDescriptives |
| smog | SMOG; TextDescriptives |
| prop_adjacent_dependency_relation_mean | mean proportion of adjacent dependency relations; TextDescriptives |
| prop_adjacent_dependency_relation_std | standard deviation of the proportion of adjacent dependency relations; TextDescriptives |
| syllables_per_token_mean | mean of syllables per token; TextDescriptives |
| syllables_per_token_median | median of syllables per token; TextDescriptives |
| token_length_std | standard deviation of token length; TextDescriptives |
| token_length_median | median of token length; TextDescriptives |
| sentence_length_median | median of sentence length; TextDescriptives |
| syllables_per_token_std | standard deviation of syllables per token; TextDescriptives |
| proportion_unique_tokens | proportion of unique tokens; TextDescriptives |
| top_ngram_chr_fraction_3 | fraction of characters in a document that are contained within the top n-grams, for n = 3; TextDescriptives |
| top_ngram_chr_fraction_2 | fraction of characters in a document that are contained within the top n-grams, for n = 2; TextDescriptives |
| top_ngram_chr_fraction_4 | fraction of characters in a document that are contained within the top n-grams, for n = 4; TextDescriptives |
| proportion_bullet_points | proportion of lines in a document that start with a bullet point; TextDescriptives |
| flesch_reading_ease | Flesch Reading Ease; TextDescriptives |
| flesch_kincaid_grade | Flesch-Kincaid grade; TextDescriptives |
| gunning_fog | Gunning-Fog index; TextDescriptives |
| coleman_liau_index | Coleman-Liau Index; TextDescriptives |
| oov_ratio | out-of-vocabulary ratio; TextDescriptives |

## Code

Code that was used to generate this dataset can be found on [GitHub](https://github.com/julia-lukasiewicz-pater/gpt-wiki-features/tree/main).
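
To give a feel for what columns such as `mean_word_length` and `mattr` measure, here is a minimal pure-Python sketch. It is not the code used to build the dataset (that lives in the repository linked above, which relies on NLTK and lexical-diversity). The regex tokenizer, the function names, and the default window of 50 tokens are all assumptions made for illustration, so values computed this way may differ from those stored in the dataset.

```python
import re


def tokenize(text):
    """Lowercase regex tokenizer; an illustrative stand-in for NLTK tokenization."""
    return re.findall(r"[a-z0-9']+", text.lower())


def mean_word_length(tokens):
    """Average number of characters per token (0.0 for an empty list)."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Moving-average type-token ratio.

    Slides a fixed-size window over the tokens, computes the share of
    unique tokens in each window, and averages those shares. Texts shorter
    than the window fall back to the plain type-token ratio (an assumption
    of this sketch).
    """
    if not tokens:
        return 0.0
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat on the log."
    toks = tokenize(sample)
    print("mean_word_length:", mean_word_length(toks))
    print("mattr (window=5):", mattr(toks, window=5))
```

A small window is used in the usage lines only because the sample text is short; for texts of Wikipedia-intro length a larger window is more typical.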
Provider:
julia-lukasiewicz-pater