GATT_library
Source: ModelScope community. Updated 2025-12-05; indexed 2025-06-21.
Download link:
https://modelscope.cn/datasets/PleIAs/GATT_library
Resource description:
# GATT Library Dataset Card
## Dataset Overview
**Dataset Name:** GATT Library
**Description:**
The GATT Library dataset is a collection of documents related to the General Agreement on Tariffs and Trade (GATT), dated from January 1, 1946, to September 6, 1996. The full text of each document, together with metadata extracted from the original files, is stored in a single Parquet file. The original files are distributed alongside it in a tar.gz archive, organized into folders.
**Languages Covered (based on a sample of 10,000 entries):**
- English (en): 7,878 documents
- French (fr): 1,357 documents
- Unknown: 486 documents
- Spanish (es): 261 documents
- Catalan (ca): 12 documents
- Portuguese (pt): 4 documents
- German (de): 2 documents
**Total Number of Documents:** 60,182
**Total Number of Words:** 133,196,586
**Average Number of Words per Document:** 2,213.23
**Total Number of Characters:** 941,277,211
**Average Number of Characters per Document:** 15,640.51
**Date Range:** January 1, 1946 to September 6, 1996
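The figures above can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers printed in this card (the variable names are illustrative); it confirms that the language sample sums to 10,000 entries and recomputes the two per-document averages from the totals.

```python
# Figures copied from the dataset card; nothing here reads the dataset itself.
sample_language_counts = {
    "en": 7_878,
    "fr": 1_357,
    "Unknown": 486,
    "es": 261,
    "ca": 12,
    "pt": 4,
    "de": 2,
}
total_documents = 60_182
total_words = 133_196_586
total_characters = 941_277_211

# The language breakdown is based on a sample, so shares are relative to it.
sample_size = sum(sample_language_counts.values())
language_shares = {
    lang: count / sample_size for lang, count in sample_language_counts.items()
}

# Averages are simply totals divided by the number of documents.
avg_words = total_words / total_documents
avg_characters = total_characters / total_documents

print(f"sample size: {sample_size}")
for lang, share in language_shares.items():
    print(f"{lang}: {share:.2%}")
print(f"average words per document: {avg_words:.2f}")
print(f"average characters per document: {avg_characters:.2f}")
```

Because the language counts come from a sample, the shares are estimates for the full collection rather than exact proportions.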
## Dataset Structure
The dataset is stored in a single Parquet file, which contains the following fields for each document:
- `source`: The source of the document.
- `document_id`: A unique identifier for the document.
- `title`: The title of the document.
- `short_title`: A shorter version of the document title.
- `author`: The author of the document.
- `date`: The publication date of the document.
- `type_of_document`: The type of document.
- `identifier`: An identifier for the document.
- `link`: A link to the document.
- `file`: The file name.
- `folder`: The folder where the document is stored.
- `word_count`: Total number of words in the document.
- `character_count`: Total number of characters in the document.
- `text`: Full text content of the document.
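As a minimal sketch of how the fields listed above might be used, the following helper checks a column list against the documented schema and loads the metadata without the large `text` column. It assumes pandas with a Parquet engine (such as pyarrow) is installed; the function names are illustrative, and the Parquet file name is not given in this card, so any path passed in is the caller's own.

```python
# Column names as documented in this card.
EXPECTED_COLUMNS = [
    "source", "document_id", "title", "short_title", "author", "date",
    "type_of_document", "identifier", "link", "file", "folder",
    "word_count", "character_count", "text",
]


def missing_columns(columns):
    """Return the documented columns that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def load_metadata(parquet_path):
    """Load every documented column except `text` into a DataFrame.

    Skipping `text` keeps memory use low when only metadata is needed.
    """
    # Imported lazily so the schema check above works without pandas.
    import pandas as pd

    wanted = [name for name in EXPECTED_COLUMNS if name != "text"]
    return pd.read_parquet(parquet_path, columns=wanted)
```

A typical call would be `df = load_metadata("path/to/file.parquet")` followed by `missing_columns(df.columns)`, which should then report only `text` as missing.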
The Parquet file is accompanied by the original XML files, collected into a single tar.gz archive.
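To get an overview of the original files without unpacking them, the archive can be scanned with Python's standard `tarfile` module. This is a sketch under assumptions: the archive name and its exact folder layout are not specified in this card, so the function below simply groups regular files by their top-level folder.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_files_per_folder(archive_path):
    """Count regular files in a tar.gz archive, grouped by top-level folder.

    Files stored at the archive root are counted under ".".
    """
    counts = Counter()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other entries
            parts = PurePosixPath(member.name).parts
            top_level = parts[0] if len(parts) > 1 else "."
            counts[top_level] += 1
    return counts
```

The folder names returned this way can then be compared with the values in the `folder` column of the Parquet file.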
## Usage
The dataset can be used for various purposes including, but not limited to:
- Analyzing historical trade documents and agreements.
- Studying the evolution and impact of GATT on international trade.
- Developing natural language processing models to analyze the content of trade documents.
- Extracting insights related to trade policies and their implementation over time.
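As one illustration of time-based analysis, the sketch below counts documents per year within the dataset's stated date range. It assumes the `date` field holds ISO-formatted strings (`YYYY-MM-DD`), which this card does not confirm; the sample rows are hypothetical and not taken from the dataset.

```python
from collections import Counter
from datetime import date


def documents_per_year(rows, start=date(1946, 1, 1), end=date(1996, 9, 6)):
    """Count rows per year, keeping only dates inside [start, end]."""
    counts = Counter()
    for row in rows:
        try:
            day = date.fromisoformat(row["date"])
        except (KeyError, TypeError, ValueError):
            continue  # skip missing or unparseable dates
        if start <= day <= end:
            counts[day.year] += 1
    return dict(sorted(counts.items()))


# Hypothetical rows for illustration only.
sample_rows = [
    {"document_id": "doc-1", "date": "1947-10-30"},
    {"document_id": "doc-2", "date": "1947-11-05"},
    {"document_id": "doc-3", "date": "1994-04-15"},
    {"document_id": "doc-4", "date": "unknown"},
    {"document_id": "doc-5", "date": "1999-01-01"},
]
yearly_counts = documents_per_year(sample_rows)
print(yearly_counts)
```

If the real `date` values use another format, only the parsing line needs to change.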
## Source
The dataset is collected from the General Agreement on Tariffs and Trade (GATT) archive. The original files are processed and converted into Parquet format for efficient storage and analysis.
## Dataset Citation
If you use this dataset in your research, please cite it as follows:
```
@dataset{GATT_Library_2024,
title={GATT Library Dataset},
author={Pleias},
year={2024},
description={Collection of GATT-related documents, organized by date and language, with detailed metadata extracted from the original files.}
}
```
**Note:** The dataset is presented and maintained by Pleias. All rights reserved.
Provider: maas
Created: 2025-06-19



