
raeidsaqur/Hansard

Source: Hugging Face · Updated: 2024-01-01 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/raeidsaqur/Hansard
Resource description:
---
license: mit
language:
- en
- fr
task_categories:
- translation
pretty_name: hansard
size_categories:
- 100K<n<1M
---

<h1>
<img alt="RH" src="./icon.png" style="display:inline-block; vertical-align:middle" />
Pedagogical Machine Translation (Dialect) dataset: the filtered Canadian Hansard Dataset.
</h1>

The Canadian [Hansard](https://www.ourcommons.ca/documentviewer/en/35-2/house/hansard-index) is an archive of parliamentary sessions in Canada's two official languages, English and French.

## 📋 Table of Contents

- [🧩 Hansard Dataset](#-hansard-dataset)
  - [📋 Table of Contents](#-table-of-contents)
  - [📖 Usage](#-usage)
    - [Downloading the dataset](#downloading-the-dataset)
    - [Dataset structure](#dataset-structure)
    - [Loading the dataset](#loading-the-dataset)
    <!--- [Evaluating](#evaluating) - [Running the baselines](#running-the-baselines) - [Word Embeddings and Pre-trained Language Models](#word-embeddings-and-pre-trained-language-models) - [Large Language Models](#large-language-models) -->
  - [✍️ Contributing](#️-contributing)
  - [📝 Citing](#-citing)
  - [🙏 Acknowledgements](#-acknowledgements)

## 📖 Usage

### Downloading the dataset

The Hansard dataset can be downloaded from [here](https://www.cs.toronto.edu/~raeidsaqur/hansard/hansard.tar.gz) or with a bash script:

```bash
bash download_hansard.sh
```

### Dataset structure

The dataset is provided as CSV (and parquet) files, one for each partition: `train.[csv|parquet]` and `test.csv`. We also provide a `hansard.[csv|parquet]` file that contains all examples across all splits. The splits are sized as follows:

<!--
| Split   | # Walls |
|:--------|:-------:|
| `train` | 311K    |
| `test`  | 49K     |

Here is an example of the dataset's structure:

```csv
```
-->

### Loading the dataset

Each partition can be loaded the same way as any other CSV file. For example, using Python:

```python
import csv

# Read each split into a list of row dictionaries (assumes a header row)
dataset = {}
for split in ("train", "test"):
    with open(f"./Hansard/{split}.csv", "r", encoding="utf-8", newline="") as f:
        dataset[split] = list(csv.DictReader(f))
```

However, it is likely easiest to work with the dataset using the [HuggingFace Datasets](https://huggingface.co/datasets) library:

```python
# pip install datasets
from datasets import load_dataset

# The dataset can be used like any other HuggingFace dataset
dataset = load_dataset("raeidsaqur/hansard")
```

<!-- > __Note__ -->

<!--
### Evaluating

We provide a script for evaluating the performance of a model on the dataset. Before running, make sure you have installed the requirements and package:

```bash
pip install -r requirements.txt
pip install -e .
```

To run the evaluation script:

### Running the baselines
-->

## ✍️ Contributing

We welcome contributions to this repository (noticed a typo? a bug?). To propose a change:

```
git clone https://github.com/raeidsaqur/hansard
cd hansard
git checkout -b my-branch
pip install -r requirements.txt
pip install -e .
```

Once your changes are made, make sure to lint and format the code (addressing any warnings or errors):

```
isort .
black .
flake8 .
```

Then, submit your change as a pull request.

## 📝 Citing

If you use the Canadian Hansard dataset in your work, please consider citing our paper:

```
@article{raeidsaqur2024Hansard,
    title = {The Canadian Hansard Dataset for Analyzing Dialect Efficiencies in Language Models},
    author = {Raeid Saqur},
    year = 2024,
    journal = {ArXiv},
    url =
}
```

## 🙏 Acknowledgements

The entire CSC401/2511 teaching team at the Dept. of Computer Science at the University of Toronto.
Provider:
raeidsaqur
Summary of original information

Dataset overview

Basic information

  • License: MIT
  • Languages: English, French
  • Task category: Translation
  • Dataset name: hansard
  • Dataset size: 100K<n<1M

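Dataset description

The Canadian Hansard is an archive of Canadian parliamentary sessions in the country's two official languages, English and French.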

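Usage guide

Downloading the dataset

The dataset can be downloaded from https://www.cs.toronto.edu/~raeidsaqur/hansard/hansard.tar.gz or with the following bash script:

    bash download_hansard.sh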
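The download step can also be sketched in Python. This is a minimal sketch, not the contents of `download_hansard.sh` (which are not shown on this page): it assumes the archive URL from the dataset card is still reachable, and `download_archive` and `extract_archive` are hypothetical helper names introduced here for illustration.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the dataset card; whether it is still live is an assumption.
HANSARD_URL = "https://www.cs.toronto.edu/~raeidsaqur/hansard/hansard.tar.gz"


def download_archive(url: str, archive_path: Path) -> Path:
    """Download `url` to `archive_path`, skipping the download if the file exists."""
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract_archive(archive_path: Path, dest_dir: Path) -> list[str]:
    """Extract a .tar.gz archive into `dest_dir` and return the member names."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support a safer extraction filter.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names


# Example usage (performs a network request, so it is left commented out):
# archive = download_archive(HANSARD_URL, Path("downloads/hansard.tar.gz"))
# print(extract_archive(archive, Path("downloads")))
```

Keeping the download and the extraction in separate functions means an archive that is already on disk can be unpacked again without touching the network.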

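Dataset structure

The dataset is provided as CSV and parquet files, one per partition: train.[csv|parquet] and test.csv. A hansard.[csv|parquet] file containing all examples across all splits is also provided.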
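Because the split sizes are not listed here, one way to check them locally is to count the rows in each CSV file. The following is a rough sketch under two assumptions: the files sit in a single directory under the names given above, and each file starts with a header row. `count_rows` and `split_sizes` are hypothetical helpers, not part of the dataset's tooling.

```python
import csv
from pathlib import Path


def count_rows(csv_path: Path) -> int:
    """Count data rows in a CSV file, assuming the first row is a header."""
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


def split_sizes(data_dir: Path, splits=("train", "test")) -> dict[str, int]:
    """Return {split name: row count} for each split file present in `data_dir`."""
    sizes = {}
    for split in splits:
        path = data_dir / f"{split}.csv"
        if path.exists():
            sizes[split] = count_rows(path)
    return sizes


# Example usage (the directory name is an assumption):
# print(split_sizes(Path("./Hansard")))
```

The rows are streamed rather than loaded into a list, so the count stays cheap in memory even for the larger training split.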

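Loading the dataset

The dataset can be loaded with Python, for example:

    import csv

    # Read each split into a list of row dictionaries (assumes a header row)
    dataset = {}
    for split in ("train", "test"):
        with open(f"./Hansard/{split}.csv", "r", encoding="utf-8", newline="") as f:
            dataset[split] = list(csv.DictReader(f))

Or with the HuggingFace Datasets library:

    # pip install datasets
    from datasets import load_dataset

    dataset = load_dataset("raeidsaqur/hansard")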
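Since the column names are not documented on this page, a quick inspection after loading can help. The sketch below assumes only that the loaded object maps split names to indexable collections of row dictionaries. That holds for a plain dict of row lists built with `csv.DictReader`, and it is expected (but not verified here) to hold for the object returned by `load_dataset`. `summarize_splits` is a hypothetical helper.

```python
def summarize_splits(dataset):
    """Return {split: {"rows": row count, "columns": column names}} for each split.

    `dataset` is assumed to map split names to sequences of row dictionaries.
    """
    summary = {}
    for name, rows in dataset.items():
        n = len(rows)
        # Column names are read from the first row; an empty split has none.
        columns = list(rows[0].keys()) if n else []
        summary[name] = {"rows": n, "columns": columns}
    return summary


# Example usage (requires `pip install datasets` and network access):
# from datasets import load_dataset
# print(summarize_splits(load_dataset("raeidsaqur/hansard")))
```

Reading the column names from the first row avoids hard-coding field names that this page does not specify.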

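Contributing

Contributions to this repository are welcome (noticed a typo? a bug?). To propose a change:

    git clone https://github.com/raeidsaqur/hansard
    cd hansard
    git checkout -b my-branch
    pip install -r requirements.txt
    pip install -e .

Once your changes are made, make sure the code is correctly formatted, then submit a pull request.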

Citing

If you use the Canadian Hansard dataset in your work, please consider citing our paper:

    @article{raeidsaqur2024Hansard,
        title = {The Canadian Hansard Dataset for Analyzing Dialect Efficiencies in Language Models},
        author = {Raeid Saqur},
        year = 2024,
        journal = {ArXiv},
        url =
    }

Acknowledgements

Thanks to all members of the CSC401/2511 teaching team at the Department of Computer Science, University of Toronto.
