
melll-uff/diplomatrixbr-gen

Source: Hugging Face. Updated: 2025-09-28. Indexed: 2026-05-10.
Download link:
https://hf-mirror.com/datasets/melll-uff/diplomatrixbr-gen
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- pt
---

# Dataset Card for Diplomatrix-BR

## Dataset Summary

Diplomatrix-BR is a parallel corpus of argumentative essays written in response to questions from the Brazilian Diplomatic Career Admission Contest (CACD) from 2013 to 2023. The corpus contains 478 essays in total: 390 generated by 13 different large language models (LLMs) and 88 written by approved human candidates. It enables systematic comparisons between texts produced by human candidates and by various language models in the context of Brazilian diplomatic writing.

The dataset was developed as part of the research work **"Diplomatrix-BR: A Parallel Corpus of Human-Authored and LLM Essays in the Brazilian Diplomacy Contest"** and includes comprehensive linguistic metrics, automatic evaluation metrics (BLEU, BERT-Score, ROUGE, CTC), and correlation analyses between automatic metrics and human evaluation.

## Supported Tasks and Leaderboards

- **text_generation**: The dataset can be used for training and evaluating text generation models on diplomatic writing tasks
- **text_classification**: Can be used for quality assessment and automatic scoring of argumentative essays
- **comparative_analysis**: Enables comparison between human and AI-generated texts in academic contexts
- **linguistic_analysis**: Contains extensive linguistic metrics for stylometric and complexity analysis

## Languages

Portuguese (pt-BR) - Brazilian Portuguese

## Dataset Structure

### Data Instances

An instance from the human candidates dataset:

```json
{
  "Name": "author1",
  "Score": 52.5,
  "Essay": "Ao final da Guerra Fria, o economista Samuel Huntington previu a ocorrência de um \"choque de civilizações\" na sociedade internacional...",
  "Linguistic_Metrics": {
    "adjective_ratio": 0.12719,
    "adverbs": 0.04769,
    "content_words": 0.60413,
    "flesch": 4.9116,
    "function_words": 0.39587,
    "sentences_per_paragraph": 26.0,
    "syllables_per_content_word": 3.48684,
    "words_per_sentence": 24.19231,
    "words": 629,
    "sentences": 26,
    "paragraphs": 1
  }
}
```

An instance from the LLM-generated dataset:

```json
{
  "Essay": "A discussão sobre o comércio internacional frequentemente revela um paradoxo intrínseco...",
  "Linguistic_Metrics": {
    "adjective_ratio": 0.11311,
    "adverbs": 0.0395,
    "content_words": 0.58169,
    "flesch": 0.46486,
    "function_words": 0.41831,
    "words": 557,
    "sentences": 21,
    "paragraphs": 6
  }
}
```

### Data Fields

#### For Human Candidates Essays (`Candidates_Essays`):

- `Maximum_Score`: Maximum possible score for the essay (float)
- `Question_Statement`: The original question/prompt from the CACD exam (string)
- `Candidates`: List of candidate essays, each containing:
  - `Name`: Candidate's name (string)
  - `Score`: Score assigned by human evaluators (float, 0-60)
  - `Essay`: The complete essay text (string)
  - `Linguistic_Metrics`: Comprehensive linguistic analysis (dict)

#### For LLM-Generated Essays (`Models_Essays`):

- `Prompt`: The structured prompt used for LLM generation (string)
- `Models`: Dictionary containing essays from different models:
  - Model names include: `gpt4o`, `command-r-plus`, `gemma-27b`, `gemma-9b`, `llama-405b`, `llama-8b`, `mixtral-8x22b`, `mixtral-8x7b`, `phi-3-mini`, `phi-4`, `qwen2-72b`, `qwen2-7b`, `sabia`
  - Each model has essays generated at different temperatures (0.3, 0.5, 0.7)
  - Each essay contains:
    - `Essay`: The generated essay text (string)
    - `Linguistic_Metrics`: Comprehensive linguistic analysis (dict)

#### Linguistic Metrics Include:

- **Basic Statistics**: word count, sentence count, paragraph count
- **Lexical Diversity**: TTR (Type-Token Ratio), Brunet Index, Honoré Statistic
- **Syntactic Complexity**: words per sentence, sentences per paragraph
- **Readability**: Flesch Reading Ease score
- **Word Frequency**: content word frequency, minimum word frequency
- **Part-of-Speech Ratios**: nouns, verbs, adjectives, adverbs, pronouns
- **Discourse Markers**: connectives, logical operators, negation markers
- **Semantic Features**: hypernym ratios, personal pronouns usage

### Data Splits

The dataset covers essays from 2013 to 2023 (10 exam editions), organized by:

- **Years**: 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020-2021, 2022, 2023
- **Source Type**: Human candidates vs. LLM-generated
- **Models**: 13 different language models for generated essays
- **Temperatures**: Multiple generation temperatures (0.3, 0.5, 0.7) for LLM essays

## Dataset Creation

### Source Data

#### Human Essays

- Collected from approved candidates in the Brazilian Diplomatic Career Admission Contest (CACD)
- Covers 10 exam editions (2013-2023) of official examinations
- Essays were scored by professional evaluators on a 0-60 scale
- Represents high-quality argumentative writing in diplomatic contexts

#### LLM-Generated Essays

- Generated using 13 different state-of-the-art language models
- Models include both proprietary (ChatGPT-4o, Command R+) and open-source (Llama, Mixtral, Gemma, etc.) models
- Multiple temperatures used to capture generation variability
- Same prompts and evaluation criteria as human candidates

### Language Models Used

1. **ChatGPT-4o** (OpenAI)
2. **Command R+** (Cohere)
3. **Gemma-27b** (Google)
4. **Gemma-9b** (Google)
5. **Llama-405b** (Meta)
6. **Llama-8b** (Meta)
7. **Mixtral-8x22b** (Mistral AI)
8. **Mixtral-8x7b** (Mistral AI)
9. **Phi-3-Mini** (Microsoft)
10. **Phi-4** (Microsoft)
11. **Qwen2-72b** (Alibaba)
12. **Qwen2-7b** (Alibaba)
13. **Sabia** (Maritaca AI)

### Preprocessing

- Comprehensive linguistic analysis using NLP tools
- Automatic metric calculation (BLEU, BERT-Score, ROUGE, CTC)
- Correlation analysis between automatic metrics and human scores
- Quality control and data validation

### Citation Information

```bibtex
@article{diplomatrixbr2024,
  title={Diplomatrix-BR: A Parallel Corpus of Human-Authored and LLM Essays in the Brazilian Diplomacy Contest},
  author={João, Rodrigo Cavalcanti and Casini, Gabriela and Assis, Gabriel and Real, Livy and Vianna, Daniela and Mann, Paulo and Paes, Aline},
  year={2024}
}
```
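
### Sanity-Checking the Card's Numbers

The essay counts and ratio metrics reported in this card can be cross-checked with a few lines of Python. The sketch below is illustrative only. It copies the metric values from the human-essay instance shown above and rests on two assumptions the card does not state explicitly: that the ratio fields are derived from the raw counts (`words`, `sentences`, `paragraphs`), and that the 390 LLM essays correspond to 13 models times 3 temperatures times 10 exam editions. The helper names `expected_llm_essays` and `derived_ratios` are hypothetical and are not part of the dataset or of any library.

```python
from typing import Dict

# Figures quoted in the card.
N_MODELS = 13
TEMPERATURES = (0.3, 0.5, 0.7)
N_EDITIONS = 10  # 2013 ... 2019, 2020-2021, 2022, 2023
N_HUMAN = 88

# Metrics copied from the human-essay instance in the card.
human_metrics: Dict[str, float] = {
    "content_words": 0.60413,
    "function_words": 0.39587,
    "sentences_per_paragraph": 26.0,
    "words_per_sentence": 24.19231,
    "words": 629,
    "sentences": 26,
    "paragraphs": 1,
}


def expected_llm_essays(n_models: int, n_temperatures: int, n_editions: int) -> int:
    """Essay count under the ASSUMED one-essay-per-(model, temperature, edition) design."""
    return n_models * n_temperatures * n_editions


def derived_ratios(metrics: Dict[str, float]) -> Dict[str, float]:
    """Recompute ratio metrics from raw counts (ASSUMED derivation)."""
    return {
        "words_per_sentence": metrics["words"] / metrics["sentences"],
        "sentences_per_paragraph": metrics["sentences"] / metrics["paragraphs"],
    }


llm_total = expected_llm_essays(N_MODELS, len(TEMPERATURES), N_EDITIONS)
corpus_total = llm_total + N_HUMAN
ratios = derived_ratios(human_metrics)

print("LLM essays:", llm_total)
print("All essays:", corpus_total)
print("Recomputed ratios:", {k: round(v, 5) for k, v in ratios.items()})
```

Under these assumptions the recomputed values agree with the figures in the card, and `content_words` plus `function_words` sums to 1, which suggests the two fields partition the tokens of each essay.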
Provider:
melll-uff