Simplifying Administrative Texts for Plain Language using LLM: a Comparative Analysis: Results for Morphosyntactic and Readability Indexes on Public Notices and their AI Generated Plain Language Versions
Source: Figshare. Updated: 2026-02-05. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Simplifying_Administrative_Texts_for_Plain_Language_using_LLM_a_Comparative_Analysis_Results_for_Morphosyntactic_and_Readability_Indexes_on_Public_Notices_and_their_AI_Generated_Plain_Language_Versions/29376692
Description:
This dataset compiles the results reported in the paper Simplifying Administrative Texts for Plain Language using LLM: a Comparative Analysis. In the paper, we evaluated whether LLMs can simplify texts while following plain language guides. We compared the original and simplified texts on statistical readability indexes, such as Flesch Reading Ease and Gunning Fog Index, together with morphosyntactic metrics from the NILC-Metrix project (https://doi.org/10.1007/s10579-023-09693-w).

The following models were evaluated:
gemini-2.5-flash-preview-04-17
gemini-2.5-pro-preview-05-06
phi4:14.7b
phi3:3.8b
cow/gemma2_tools:2b
llama3.2:3.2b
gemma3:4b
qwen2.5:14b
deepseek-r1:14b
granite3-dense:2b
granite3-dense:8b

The code is available at: https://github.com/Joao-Pedro-P-Holanda/text-simplification/tree/SBSI-2026

Data Cleaning

Pre-Generation Cleaning

To keep the focus on the text itself, we performed the following processing steps before sending the text content to the LLMs:
1. Extracted the Markdown text from the original PDF with the prompt "Convert this document to markdown" on Google Gemini 2.5-pro;
2. Stripped dates and document numbering from page headers;
3. Edited section headings to match the PDF metadata exactly;
4. Edited paragraphs and line breaks to match the PDF structure;
5. Added images that could not be translated to text effectively as references in Markdown format;
6. Removed or added bullet points to match the original PDF structure; bullet points that used icons were replaced by simple ones;
7. Removed digital signatures while preserving the author names and positions.

Cleaning for PDF Generation

After saving the LLM responses in Markdown files, combining all chunks into a single document, we removed only the Deepseek tag from the texts and then generated PDF files for the original and AI-generated versions. The exact command to generate a file was:
pandoc -o -Vgeometry:margin=1in --pdfengine=lualatex -V header-includes="\usepackage{fontspec}" -V mainfont=Times New Roman

Cleaning for Metric Calculation

After generating the PDF files, we applied two extra steps to the Markdown files to ensure that the readability indexes and morphosyntactic metrics were computed adequately:
1. Removal of Markdown tables, extra whitespace and enumeration markers at line starts (I., a), etc.);
2. Addition of new lines between sentences separated by a dot (.).

Artifacts

The file prompt_simplify_document.txt contains the prompt given to all models during our text simplification process. The files anonymized_readability_metrics_results.csv and anonymized_morphosyntactic_results.csv contain the raw results for all files in each model.

The confidence_intervals csv files describe the confidence interval of the mean value across all documents for each model in each metric. NILC-Metrix metrics that are directly proportional to complexity were separated from the inversely proportional ones (in this work, only personal pronoun ratio). The same was done for the statistical indexes, where grade-level metrics were separated from metrics on the 0-100 scale.

foreign_word_ratio was not described in the NILC-Metrix paper; we propose its usage in our paper, using the "Foreign" feature from Universal Dependencies.

For the readability indexes, we split words into syllables using Pyphen (https://doc.courtbouillon.org/pyphen/stable/) and considered all words outside the 5000 most common words from Linguateca (https://www.linguateca.pt/acesso/tokens/formas.todos.txt) as complex.
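
The two steps listed under Cleaning for Metric Calculation could look roughly like the sketch below. It is a minimal illustration, not the code from the repository: the function name clean_for_metrics and all three regular expressions are assumptions, and the authors' actual rules may differ.

```python
import re

# Hypothetical patterns: the description does not give the exact rules used.
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")  # Markdown table rows and separators
ENUM_START = re.compile(r"^\s*(?:[IVXLCDM]+\.|[a-z]\)|\d+[.)])\s+")  # "I. ", "a) ", "1. "
SENTENCE_GAP = re.compile(r"(?<=\.)\s+(?=[A-ZÀ-Ý])")  # whitespace after a dot, before a capital


def clean_for_metrics(markdown: str) -> str:
    """Drop tables, enumeration markers and extra whitespace, one sentence per line."""
    kept = []
    for line in markdown.splitlines():
        if TABLE_ROW.match(line):
            continue  # step 1: remove Markdown tables
        line = ENUM_START.sub("", line)  # step 1: remove enumeration line starts
        line = re.sub(r"\s+", " ", line).strip()  # step 1: collapse extra whitespace
        if line:
            kept.append(line)
    # Step 2: put a new line between dot-separated sentences.
    return SENTENCE_GAP.sub("\n", "\n".join(kept))


sample = "I. O prazo   termina hoje. Envie o formulário.\n| a | b |\n|---|---|\n\na) Segundo item."
print(clean_for_metrics(sample))
```

These patterns are heuristics. For instance, an abbreviation such as "Dr." followed by a capitalized name would be split as if it ended a sentence, so a real pipeline would likely need extra rules.
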
All indexes had their values adapted to Brazilian Portuguese, following the paper ALT: UM SOFTWARE PARA ANÁLISE DE LEGIBILIDADE DE TEXTOS EM LÍNGUA PORTUGUESA (https://revistas.ufrj.br/index.php/policromias/article/view/54352).

We used UDPipe to perform tokenization, PoS tagging, lemmatization and dependency parsing with the treebank portuguese-porttinari-ud-2.15-241121.

All morphosyntactic metrics used the CoNLL-U result files from UDPipe. We excluded the decontracted tokens and considered only the contracted word: for example, "do" was decontracted into the preposition "de" and the definite article "o", but only the word "do" was used in our implementation.
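
The contraction handling described above can be sketched as follows. In CoNLL-U, a contraction appears as a multiword token line with a range ID (for example 3-4), followed by one line per decontracted word. The helper names surface_words and row are hypothetical, and the sample sentence is made up for illustration. The sketch only recovers surface forms and does not reproduce how the authors computed each metric.

```python
def surface_words(conllu: str) -> list[str]:
    """Return word forms, keeping contractions and skipping their decontracted parts."""
    words = []
    skip_until = 0  # highest word ID covered by the last multiword token
    for line in conllu.splitlines():
        if not line.strip():
            skip_until = 0  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        if "-" in token_id:
            # Multiword token such as "3-4 do": keep the contracted form.
            skip_until = int(token_id.split("-")[1])
            words.append(form)
        elif "." in token_id:
            continue  # empty node, not a surface word
        elif int(token_id) > skip_until:
            words.append(form)
    return words


def row(token_id: str, form: str, lemma: str = "_", upos: str = "_") -> str:
    """Build a 10-column CoNLL-U line, leaving unused columns as underscores."""
    return "\t".join([token_id, form, lemma, upos, "_", "_", "_", "_", "_", "_"])


SAMPLE = "\n".join([
    "# text = O prazo do edital",
    row("1", "O", "o", "DET"),
    row("2", "prazo", "prazo", "NOUN"),
    row("3-4", "do"),
    row("3", "de", "de", "ADP"),
    row("4", "o", "o", "DET"),
    row("5", "edital", "edital", "NOUN"),
])

print(surface_words(SAMPLE))  # ['O', 'prazo', 'do', 'edital']
```

Note that the range line carries no lemma or PoS tag of its own in CoNLL-U, so any metric that needs a tag for the contracted word has to choose how to derive one from its parts.
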
Created:
2026-02-05



