
eckendoerffer/news_fr

Hugging Face | Updated 2023-10-06 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/eckendoerffer/news_fr
Resource description:
---
license: cc-by-3.0
task_categories:
- text-generation
language:
- fr
tags:
- news
- media
- Press
size_categories:
- 1M<n<10M
---

# NEWS FR

There is an open-access [dataset on BnF / Gallica](https://transfert.bnf.fr/link/3a04ea3f-dbe8-4a4a-a302-913a89c3a7a8) comprising nearly a hundred newspapers from the print media spanning almost 100 years. Unfortunately, for this dataset, only 85% of the text is transcribed accurately.

## DATASET

This dataset compiles 1M online articles from nearly 100 Francophone media outlets. This dataset is intended for research purposes and non-commercial use. It includes 1,140,000 lines for model training, and 63,500 lines each for the test and validation files.

Included with this dataset are scripts to extract and process the article text from the same sources. The scripts are somewhat rough around the edges, but they are functional and commented.

### Format

- **Type**: Text
- **File Extension**: `.txt`

The text has been standardized for consistent formatting and line length. Additionally, the dataset has been filtered using the `langid` library to include only text in French.

### Structure

The dataset is divided into the following splits:

- `train.txt`: 2.2 GB - 1,140,000 rows - 90%
- `test.txt` : 122 MB - 63,500 rows - 5%
- `valid.txt`: 122 MB - 63,500 rows - 5%

### Exploring the Dataset

You can use the `explore_dataset.py` script to explore the dataset by randomly displaying a certain number of lines from it. The script creates and saves an index based on the line breaks, enabling faster data retrieval and display.

### Additional Information

This dataset is a subset of a larger 10GB French dataset, which also contains several thousand books and theses in French, Wikipedia, as well as several hundred thousand Francophone news articles.

## EXTRACT NEWS FR

The "NEWS FR" module allows for the extraction of online press articles from over a hundred different sources.

## Installation

To set up the module, follow the steps below:

1. **Database Setup**:
   - Create a database and incorporate the two tables present in `database.sql`.
2. **Database Configuration**:
   - Update your MySQL connection information in the `config.py` file.
3. **Dependencies Installation**:
   - Install the required packages with pip:
   ```
   pip install aiohttp mysql-connector-python beautifulsoup4 chardet colorama pyquery
   ```

## Usage

### 1_extract_rss.py:

This script fetches RSS feeds from various media outlets and adds URLs for further extraction.

```bash
python 1_extract_rss.py
```

### 2_extract_news.py:

This script retrieves the sources of articles for subsequent local processing.

```bash
python 2_extract_news.py
```

### 3_extract_news_txt.py:

This script extracts the text content of press articles and saves it (title + text) to a `.txt` file.

```bash
python 3_extract_news_txt.py
```

After completing this step, you can use the Python script located at `/dataset/2_cleaning_txt.py` to standardize the text for your dataset.

### 4_extract_news_url.py:

This script extracts links to other articles from the locally stored article sources. This makes it possible to retrieve many past articles quickly, rather than only the most recent ones.

```bash
python 4_extract_news_url.py
```

After using this script, you'll need to run `2_extract_news.py` again to retrieve the sources of the new articles, as well as `3_extract_news_txt.py` to extract the text from these articles.

---
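The run order described in the Usage section can be sketched as a small driver. This is only an illustration of the sequence, not part of the published scripts: the function names and the `backfill_rounds` parameter are hypothetical, and the sketch assumes the four scripts sit in the current working directory and that the database from the Installation steps is already configured.

```python
import subprocess
import sys
from typing import List


def pipeline_order(backfill_rounds: int = 1) -> List[str]:
    """Return the script names in the order the Usage section describes.

    `backfill_rounds` (hypothetical parameter) is how many times to harvest
    extra article links with script 4 and then re-run scripts 2 and 3.
    """
    steps = ["1_extract_rss.py", "2_extract_news.py", "3_extract_news_txt.py"]
    for _ in range(backfill_rounds):
        steps += ["4_extract_news_url.py", "2_extract_news.py", "3_extract_news_txt.py"]
    return steps


def run_pipeline(backfill_rounds: int = 1) -> None:
    """Run each script with the current interpreter, stopping on the first failure."""
    for script in pipeline_order(backfill_rounds):
        subprocess.run([sys.executable, script], check=True)
```

Calling `run_pipeline(backfill_rounds=2)` would harvest links twice, each time followed by a fresh download and text extraction pass. The optional cleaning step (`/dataset/2_cleaning_txt.py`) is left out because the README presents it as a separate step to run afterwards.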
Provider:
eckendoerffer
Summary of original information

Dataset overview

Dataset information

  • License: cc-by-3.0
  • Task categories:
    • Text generation
  • Language:
    • French
  • Tags:
    • News
    • Media
    • Press
  • Size:
    • 1M<n<10M

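Dataset details

  • Source: The dataset contains 1 million online articles from nearly 100 French-language media outlets.
  • Purpose: The dataset is intended for research purposes and non-commercial use.
  • Contents:
    • Training data: 1,140,000 lines
    • Test and validation data: 63,500 lines each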

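Data format

  • Type: Text
  • File extension: .txt
  • Standardization: The text has been standardized for consistent formatting and line length.
  • Language filtering: Filtered with the langid library so that only French text is included.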
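The language filtering step can be illustrated with a short sketch. The source only says that `langid` was used; how the authors applied it (per line, per article, with or without a confidence threshold) is not stated, so the per-line approach below is an assumption. `keep_french` is a hypothetical helper that takes the classifier as an argument, which keeps the sketch runnable without the package installed.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A classifier maps a text to (language code, score); `langid.classify`
# from the langid package has this shape.
Classifier = Callable[[str], Tuple[str, float]]


def keep_french(lines: Iterable[str], classify: Classifier) -> Iterator[str]:
    """Yield only the non-empty lines that the classifier labels as French."""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of classifying them
        language, _score = classify(text)
        if language == "fr":
            yield text
```

With the `langid` package installed, `keep_french(lines, langid.classify)` would apply the real model to each line. Very short lines tend to be classified less reliably, so a real pipeline might classify whole articles instead.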

Data structure

  • Training set:
    • File: train.txt
    • Size: 2.2 GB
    • Rows: 1,140,000
    • Share: 90%
  • Test set:
    • File: test.txt
    • Size: 122 MB
    • Rows: 63,500
    • Share: 5%
  • Validation set:
    • File: valid.txt
    • Size: 122 MB
    • Rows: 63,500
    • Share: 5%
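The stated shares follow from the row counts, which a few lines of arithmetic can confirm. The row counts are taken from the list above; `split_shares` is a hypothetical helper name.

```python
from typing import Dict

# Row counts as listed for each split file.
SPLIT_ROWS: Dict[str, int] = {
    "train.txt": 1_140_000,
    "test.txt": 63_500,
    "valid.txt": 63_500,
}


def split_shares(rows: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the total number of rows."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


if __name__ == "__main__":
    for name, share in split_shares(SPLIT_ROWS).items():
        print(f"{name}: {share:.1%}")
```

The three files add up to 1,267,000 rows, so the training split is just under 90% and the other two just over 5% each, matching the rounded figures.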

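Data exploration

  • Exploration script: explore_dataset.py
  • Function: Randomly displays a chosen number of lines from the dataset, and creates and saves an index based on line breaks for faster retrieval and display.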
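The idea of an index based on line breaks can be sketched as follows. This is not the code of `explore_dataset.py`, whose internals are not shown here. It is a minimal illustration under the assumption that the index stores the byte offset of each line start, and all function names are hypothetical.

```python
import random
from typing import List


def build_line_index(path: str) -> List[int]:
    """Return the byte offset at which each line of the file starts."""
    offsets: List[int] = []
    position = 0
    with open(path, "rb") as handle:
        for raw_line in handle:
            offsets.append(position)
            position += len(raw_line)
    return offsets


def read_line(path: str, offsets: List[int], line_number: int) -> str:
    """Seek straight to one line (0-based) instead of scanning the whole file."""
    with open(path, "rb") as handle:
        handle.seek(offsets[line_number])
        return handle.readline().decode("utf-8").rstrip("\n")


def sample_lines(path: str, offsets: List[int], count: int, seed: int = 0) -> List[str]:
    """Pick `count` distinct random lines using the index."""
    rng = random.Random(seed)
    picks = rng.sample(range(len(offsets)), min(count, len(offsets)))
    return [read_line(path, offsets, number) for number in picks]
```

Building the index requires one full pass over the file, but after that each lookup is a single seek, which matters for a 2.2 GB `train.txt`. Saving the offsets to disk, for example with `json.dump`, would avoid rebuilding them on every run.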

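Additional information

  • Subset: This dataset is a subset of a larger 10 GB French dataset, which also contains several thousand French books and theses, Wikipedia, and several hundred thousand French-language news articles.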