
GATT_library

ModelScope community | Updated 2025-12-05 | Indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/PleIAs/GATT_library
Resource description:
# GATT Library Dataset Card

## Dataset Overview

**Dataset Name:** GATT Library

**Description:** The GATT Library dataset is a collection of documents related to the General Agreement on Tariffs and Trade (GATT), spanning January 1, 1946, to September 6, 1996. The dataset is organized into a single Parquet file that holds the text of each document together with metadata extracted from the original files. The original files are stored in a tar.gz archive, organized into folders.

**Languages Covered (based on a sample of 10,000 entries):**

- English (en): 7,878 documents
- French (fr): 1,357 documents
- Unknown (Unknown): 486 documents
- Spanish (es): 261 documents
- Catalan (ca): 12 documents
- Portuguese (pt): 4 documents
- German (de): 2 documents

**Total Number of Documents:** 60,182

**Total Number of Words:** 133,196,586

**Average Number of Words per Document:** 2,213.23

**Total Number of Characters:** 941,277,211

**Average Number of Characters per Document:** 15,640.51

**Date Range:** January 1, 1946 to September 6, 1996

## Dataset Structure

The dataset is stored in a single Parquet file, which contains the following fields for each document:

- `source`: The source of the document.
- `document_id`: A unique identifier for the document.
- `title`: The title of the document.
- `short_title`: A shorter version of the document title.
- `author`: The author of the document.
- `date`: The publication date of the document.
- `type_of_document`: The type of document.
- `identifier`: An identifier for the document.
- `link`: A link to the document.
- `file`: The file name.
- `folder`: The folder where the document is stored.
- `word_count`: Total number of words in the document.
- `character_count`: Total number of characters in the document.
- `text`: Full text content of the document.

The Parquet file is accompanied by the original XML files, collected into a tar.gz archive.
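The field list and the headline statistics above can be checked against each other. The following is a minimal sketch, assuming each document is available as a plain Python dict keyed by the field names in the card (for instance via `pandas.read_parquet(path).to_dict("records")`, where `path` is wherever the Parquet file was downloaded; the card does not give the file name). The helper names and the sample records are hypothetical and only illustrate the calculation.

```python
# Field names as listed in the dataset card.
EXPECTED_COLUMNS = [
    "source", "document_id", "title", "short_title", "author", "date",
    "type_of_document", "identifier", "link", "file", "folder",
    "word_count", "character_count", "text",
]

# Headline figures quoted in the dataset card.
CARD_TOTAL_DOCUMENTS = 60_182
CARD_TOTAL_WORDS = 133_196_586
CARD_TOTAL_CHARACTERS = 941_277_211


def missing_columns(record):
    """Return the card's field names that are absent from one record."""
    return [name for name in EXPECTED_COLUMNS if name not in record]


def summarize(records):
    """Recompute totals and per-document averages from the count fields."""
    records = list(records)
    n = len(records)
    total_words = sum(r["word_count"] for r in records)
    total_chars = sum(r["character_count"] for r in records)
    return {
        "documents": n,
        "total_words": total_words,
        "total_characters": total_chars,
        "avg_words": round(total_words / n, 2) if n else 0.0,
        "avg_characters": round(total_chars / n, 2) if n else 0.0,
    }


def make_record(**overrides):
    """Build a hypothetical record with every expected field present."""
    record = {name: "" for name in EXPECTED_COLUMNS}
    record.update({"word_count": 0, "character_count": 0})
    record.update(overrides)
    return record


# Hypothetical sample data, not taken from the dataset.
SAMPLE_RECORDS = [
    make_record(document_id="doc-a", word_count=100, character_count=600),
    make_record(document_id="doc-b", word_count=301, character_count=1801),
]

if __name__ == "__main__":
    print(missing_columns(SAMPLE_RECORDS[0]))
    print(summarize(SAMPLE_RECORDS))
    # The card's averages are simply total / number of documents.
    print(round(CARD_TOTAL_WORDS / CARD_TOTAL_DOCUMENTS, 2))
    print(round(CARD_TOTAL_CHARACTERS / CARD_TOTAL_DOCUMENTS, 2))
```

Running `summarize` over the full set of records should, if the card's figures are accurate, give the same totals and averages as those listed in the overview.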
## Usage

The dataset can be used for various purposes including, but not limited to:

- Analyzing historical trade documents and agreements.
- Studying the evolution and impact of GATT on international trade.
- Developing natural language processing models to analyze the content of trade documents.
- Extracting insights related to trade policies and their implementation over time.

## Source

The dataset is collected from the General Agreement on Tariffs and Trade (GATT) archive. The original files are processed and converted into Parquet format for efficient storage and analysis.

## Dataset Citation

If you use this dataset in your research, please cite it as follows:

```
@dataset{GATT_Library_2024,
  title={GATT Library Dataset},
  author={Pleias},
  year={2024},
  description={Collection of GATT-related documents, organized by date and language, with detailed metadata extracted from the original files.}
}
```

**Note:** The dataset is presented and maintained by Pleias. All rights reserved.
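As one illustration of the time-based analyses mentioned under Usage, the sketch below counts documents per year within the card's stated date range. It assumes records are plain dicts and that the `date` field starts with an ISO 8601 date (`YYYY-MM-DD`); the card does not specify the date format, so that is an assumption, and values that do not parse are skipped. The function name and sample values are hypothetical.

```python
from collections import Counter
from datetime import date

# Date range stated in the dataset card.
RANGE_START = date(1946, 1, 1)
RANGE_END = date(1996, 9, 6)


def documents_per_year(records, start=RANGE_START, end=RANGE_END):
    """Count records per calendar year, keeping only dates within [start, end]."""
    counts = Counter()
    for record in records:
        raw = str(record.get("date"))[:10]  # assumed ISO prefix, e.g. "1947-10-30"
        try:
            parsed = date.fromisoformat(raw)
        except ValueError:
            continue  # missing or differently formatted dates are skipped
        if start <= parsed <= end:
            counts[parsed.year] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample values, not taken from the dataset.
SAMPLE_DATES = [
    {"date": "1947-10-30"},
    {"date": "1947-11-01"},
    {"date": "1994-04-15T00:00:00"},
    {"date": "unknown"},
    {"date": "2001-01-01"},
    {},
]

if __name__ == "__main__":
    print(documents_per_year(SAMPLE_DATES))
```

The same grouping pattern extends to other fields, for example counting by `type_of_document` instead of by year.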

Provider:
maas
Created:
2025-06-19