PleIAs/BSF_Redline

Name: PleIAs/BSF_Redline
Creator: PleIAs
Published: 2026-02-27 16:59:26
License: No description available

Hugging Face | Updated: 2026-02-27 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/PleIAs/BSF_Redline

Description:

# Fighting Conflict-Related Sexual Violence With Specialized AI Assistants

Pleias, in collaboration with Bibliothèques Sans Frontières and the Dr. Denis Mukwege Foundation, has developed a series of small language models to act as specialized AI assistants. Our goal is to place critical legal knowledge directly into the hands of conflict-related sexual violence (CRSV) survivor networks and their advocates.

## Curated Data, the core of an efficient RAG system

The source material, the Red Line Initiative’s Guidebook (https://www.endcrsv.org/guidebook/), is an extremely useful resource, but its legal density can make it hard to use. An AI assistant based on RAG (Retrieval Augmented Generation) is well suited to this problem, but first we needed to transform the raw Guidebook text into high-quality RAG sources by applying the following processes:

- Semantic Chunking: Moving beyond simple size-based splitting to create logically complete segments that accurately represent the different Guidebook sections.
- Context Injection: Adding context annotations to help the model identify the precise jurisdiction and framework corresponding to each chunk.
- Entity Normalization: Standardizing acronyms and terminology across the dataset.
- Markdown Structuring: Formatting data to ensure the LLM can parse hierarchy and emphasis correctly.

## State-of-the-art AI on a Raspberry Pi

Leveraging the architecture of our SOTA Baguettotron model (https://huggingface.co/PleIAs/Baguettotron), we have developed a specialized RAG assistant optimized for edge deployment. At only 321M parameters, this model retains the ability to perform complex reasoning and citation tasks while running locally on hardware as accessible as a Raspberry Pi. This capability is critical for deployment in difficult, conflict-ravaged environments, which often lack reliable internet infrastructure.
By processing data locally, we not only solve the connectivity challenge but also guarantee that sensitive data concerning sexual violence remains secure on the user's device.

### How to run the application locally

The application runs a Flask API server for the Pleias RAG system. It retrieves relevant sources from a LanceDB vector database and generates sourced answers using a local GGUF model.

You can run it directly with Docker:

```bash
docker build -t pleias-rag .
docker run -p 8081:8081 pleias-rag
```

You can also install the dependencies manually and start the server:

```bash
pip install flask lancedb pandas llama-cpp-python
python -m src.main --port 8081
```

### API Endpoints

#### `POST /chat`: single response

Returns the complete generated answer in a single JSON response.

```bash
curl -X POST http://localhost:8081/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is CRSV?", "lang": "en"}'
```

#### `POST /stream_chat`: streaming response

Streams the response in real time as newline-delimited JSON chunks.

```bash
curl -N -X POST http://localhost:8081/stream_chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is CRSV?", "lang": "en"}'
```

> The `-N` flag disables curl's output buffering so chunks appear as they arrive.

The available languages are `"en"` (English) and `"fr"` (French).

### Streaming response format

The response is a sequence of newline-delimited JSON objects, sent in three stages:

**1. Metadata**, sent immediately after source retrieval:

```json
{
  "formatted_prompt": "<|query_start|>What is CRSV<|query_end|>\n\n<|source_start|>...<|source_end|>",
  "language": "en",
  "query": "What is CRSV",
  "source_limit": 3,
  "source_urls": [
    "https://www.endcrsv.org/guidebook/introduction/",
    "https://www.endcrsv.org/guidebook/unsc/#toc-i",
    "https://www.endcrsv.org/guidebook/cat/#toc-i-2"
  ],
  "sources_count": 3
}
```

**2. Source analysis**, streamed token by token inside a single JSON object:

```json
{"source_analysis": "**Query decomposition:** \"What is CRSV\" → definitional/conceptual understanding..."}
```

**3. Answer**, streamed token by token, with inline `<ref>` citations:

```json
{"answer": "\n\nCRSV (Conflict-Related Sexual Violence) is a term used to describe acts of a sexual nature committed against any person under coercive circumstances ... <ref name=\"1\">From Source 1-- For States, conflict-related sexual violence (CRSV) is regulated through international humanitarian law (IHL)...</ref>"}
```

### Single response format

The `/chat` endpoint returns the complete generated answer in a single JSON response. The response contains the same metadata fields as the streaming endpoint (`formatted_prompt`, `language`, `query`, `source_limit`, `source_urls`, `sources_count`), plus these additional fields:

- `generated_text`: the full raw model output with section tags (`<|source_analysis_start|>`, `<|answer_start|>`, etc.)
- `parsed_sections`: the generated text decomposed into named sections: `source_analysis`, `answer`, `draft`, `language_detected`, `query_analysis`, `query_report`, `source_report`
- `generation_time`: total inference time in seconds

## Training a RAG model with user and expert feedback

For the online version of the assistant, we took the open-source Gemma 3-12B as a base and applied multiple iterations of fine-tuning using datasets built from questions and answers collected during in-person workshops (Ukraine and Nigeria), as well as feedback from legal experts in the Bibliothèques Sans Frontières network and the University of Cincinnati. We also included training on a dataset produced with our SYNTH pipeline (https://huggingface.co/datasets/PleIAs/SYNTH) to prevent over-specialization of the model.
The result is an assistant that goes beyond basic RAG functions: rather than limiting its responses to retrieved information chunks, it can provide structured reasoning before drafting its final answer, as well as broader legal context on the applicable regulatory framework, all expressed in clear, useful, and accessible language for the end user.

### Prompt format

The query and the sources must follow this format:

```text
### Query ###
Are states responsible for providing emergency services to victims of sexual violence in conflict?

### Source ###
<source_1> ... </source_1>
<source_2> ... </source_2>
<source_3> ... </source_3>

### Draft ###
```

The model will then generate step-by-step reasoning and a final answer, accompanied by the necessary references:

```text
1. **Deconstruct the Query**: The core question is whether states have a legal responsibility to provide emergency services for victims of sexual violence during conflict, specifically mentioning "emergency services."
2. **Analyze Sources for "Emergency Services"**: (...)

### Answer ###
Yes, states have obligations under international law to provide victims of sexual violence in conflict with appropriate care and emergency services<ref name="source_1">States should ensure that support of victims/survivors of CRSV (Conflict-Related Sexual Violence) includes timely care, safety, non-maleficence, confidentiality, privacy, informed consent, and respect for the wishes, rights and dignity of the victim/survivor.</ref>. This support includes:

* **Timely and Specialized Medical Care**: States must ensure victims have unimpeded access to good-quality, timely medical care (...)
```
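The prompt layout described above can be assembled programmatically. The following is a minimal Python sketch, not the project's own code: `build_prompt`, `split_draft_and_answer`, and `extract_citations` are hypothetical helper names, and the exact whitespace between sections is an assumption, since the example only shows the section markers and the `<source_N>` tags. The citation pattern assumes the `<ref name="...">...</ref>` form seen in the sample answer.

```python
import re

# Assumed citation form, taken from the sample answer: <ref name="source_1">quoted text</ref>
REF_PATTERN = re.compile(r'<ref name="([^"]+)">(.*?)</ref>', re.DOTALL)


def build_prompt(query: str, sources: list[str]) -> str:
    """Assemble a prompt using the Query / Source / Draft section markers.

    The blank lines between sections are an assumption; only the markers
    and the <source_N> tags come from the documented format.
    """
    source_blocks = [
        f"<source_{i}>\n{text.strip()}\n</source_{i}>"
        for i, text in enumerate(sources, start=1)
    ]
    parts = [
        "### Query ###",
        query.strip(),
        "",
        "### Source ###",
        *source_blocks,
        "",
        "### Draft ###",
        "",  # trailing newline so generation starts on a fresh line
    ]
    return "\n".join(parts)


def split_draft_and_answer(generated: str) -> tuple[str, str]:
    """Split model output into the reasoning draft and the final answer."""
    draft, _, answer = generated.partition("### Answer ###")
    return draft.strip(), answer.strip()


def extract_citations(answer: str) -> list[tuple[str, str]]:
    """Return (source_name, quoted_text) pairs for each <ref> in the answer."""
    return [(name, quote.strip()) for name, quote in REF_PATTERN.findall(answer)]
```

Calling `build_prompt(question, retrieved_chunks)` yields the text to send to the model. `split_draft_and_answer` followed by `extract_citations` turns the output into a displayable answer plus a list of the passages it cites, which can then be checked against the retrieved sources.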
