
TheJeanneCompany/french-senate-session-reports

Hugging Face | Updated 2025-03-06 | Indexed 2025-11-01
Download link:
https://hf-mirror.com/datasets/TheJeanneCompany/french-senate-session-reports
Resource description:
---
license: cc-by-nd-4.0
task_categories:
- text-classification
- question-answering
- token-classification
- summarization
- text-generation
- text2text-generation
- zero-shot-classification
language:
- fr
tags:
- legal
- politics
- france
pretty_name: French Senate (senat.fr) Session Reports — 508M tokens
size_categories:
- 1K<n<10K
---

<img src="https://upload.wikimedia.org/wikipedia/fr/6/63/Logo_S%C3%A9nat_%28France%29_2018.svg" width=240 height=240>

# 🏛️ French Senate Session Reports Dataset

*A dataset of parliamentary debates and session reports from the French Senate.*

508,647,861 tokens of high-quality French text transcribed manually from Senate sessions

## Description

This dataset consists of **all** session reports from the French Senate debates, crawled from the official website `senat.fr`. It provides high-quality text of parliamentary discussions covering a wide range of political, economic, and social topics debated in the upper house of the French Parliament. The dataset is suited to natural language processing (NLP) tasks such as text classification, question answering, summarization, named entity recognition (NER), and text generation.

## Data gathering and processing workflow

1. Initial data was crawled from the [French Senate Official Website](https://senat.fr) using a custom web-archiving crawler that produces `WARC` files [(spec)](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/).
2. Text was then cleaned and post-processed from the `WARC` files into the `parquet` file, ensuring the quality of the extracted text.

## Details

- **Source**: [French Senate Official Website](https://senat.fr)
- **Language**: French 🇫🇷
- **Time Period**: every report from 1996 to March 4th, 2025
- **Number of Reports**: `3314`
- **Tokens**: `508,647,861`
- **Size**: `436MB`
- **Format**: `parquet`

## Contact

[@JeanneCompany](https://x.com/JeanneCompany) on x.com.
[contact (at) the-jeanne-company (dot) com](mailto:contact@the-jeanne-company.com) via mail.

Don't hesitate to contact us if you want the original WARC files or the data in a different format. We also do contracting for web crawling, software, and infrastructure engineering, and are ready to help you with all your needs.

## Licensing & Legal Considerations

The dataset is derived from public domain parliamentary reports, which are not subject to copyright protection (Article L.122-5 of the French Intellectual Property Code). However, users must adhere to the following conditions from senat.fr:

1. **Attribution**: Cite "www.senat.fr" as the source.
2. **Integrity**: The text must remain unaltered.
3. **No Commercial Use for Videos, Images and Audios**: This dataset does not include video, image, or audio content due to copyright restrictions.

The header logotype is part of a French registered trademark and is used according to French fair-use (Article L.711-1 of the French Intellectual Property Code).

## Potential Applications

- **Political Science Research**: Analyze trends in legislative debates.
- **Journalism**: Quickly extract summaries and key quotes.
- **AI & NLP Training**: Improve models for French language understanding.
- **Civic Engagement**: Provide accessible insights into government discussions.

## Limitations & Ethical Considerations

- **Bias & Representation**: The dataset reflects the political landscape of the French Senate and may contain biased narratives.
- **Privacy Concerns**: While the reports are public, contextual data (e.g., speaker metadata) should be used responsibly.
- **Misuse Risk**: Any NLP application using this dataset should avoid misinformation or out-of-context misrepresentation.
## Citation

If you find the data useful, please cite:

```
@misc{FrenchSenateTranscripts2025,
  author = {French Senate (Processed by TheJeanneCompany)},
  title = {French Senate Debate Reports Dataset},
  year = {2025},
  url = {https://huggingface.co/datasets/TheJeanneCompany/french-senate-session-reports}
}
```

---

[the-jeanne-company.com](https://the-jeanne-company.com)
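## Appendix: reading WARC records (illustrative sketch)

The workflow section mentions that the crawl is stored as `WARC` files before being converted to `parquet`. The sketch below is not the authors' crawler or post-processing code. It is a minimal, standard-library-only illustration of how records in an uncompressed WARC stream can be iterated, assuming each record starts with a `WARC/` version line, carries a `Content-Length` header, and is followed by two CRLF pairs. The helper names `read_warc_records` and `make_record`, the `example.org` URIs, and the sample payloads are hypothetical, and the sample records omit headers that the specification marks as mandatory (such as `WARC-Record-ID` and `WARC-Date`).

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")

        # Header lines are "Name: value" pairs terminated by an empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(uri, body):
    """Build a simplified WARC record for demonstration (hypothetical helper)."""
    head = (
        "WARC/1.1\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two placeholder records; the content is made up for the example.
sample = make_record("https://example.org/a", "Bonjour".encode("utf-8"))
sample += make_record("https://example.org/b", "Sénat".encode("utf-8"))

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload), "bytes")
```

WARC files are often gzip-compressed; in that case the stream would need to be decompressed first (for example with Python's `gzip` module) before being passed to a reader like this one. For real workloads a dedicated WARC library is likely a better choice than a hand-written parser.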
Provider:
TheJeanneCompany