five

yunruoyi/ptb_text_only

收藏
Hugging Face2026-04-14 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/yunruoyi/ptb_text_only
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - expert-generated language_creators: - found language: - en license: - other license_details: LDC User Agreement for Non-Members multilinguality: - monolingual size_categories: - 10K<n<100K source_datasets: - original task_categories: - text-generation - fill-mask task_ids: - language-modeling - masked-language-modeling paperswithcode_id: null pretty_name: Penn Treebank dataset_info: features: - name: sentence dtype: string config_name: penn_treebank splits: - name: train num_bytes: 5143706 num_examples: 42068 - name: test num_bytes: 453710 num_examples: 3761 - name: validation num_bytes: 403156 num_examples: 3370 download_size: 5951345 dataset_size: 6000572 --- # Dataset Card for Penn Treebank ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://catalog.ldc.upenn.edu/LDC99T42 - **Repository:** 'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt', 'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt', 'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt' - **Paper:** https://www.aclweb.org/anthology/J93-2004.pdf - **Leaderboard:** [Needs More Information] - **Point of Contact:** [Needs More Information] ### Dataset Summary This is the Penn Treebank Project: Release 2 CDROM, featuring a million words of 1989 Wall Street Journal material. The rare words in this version are already replaced with <unk> token. The numbers are replaced with <N> token. ### Supported Tasks and Leaderboards Language Modelling ### Languages The text in the dataset is in American English ## Dataset Structure ### Data Instances [Needs More Information] ### Data Fields [Needs More Information] ### Data Splits [Needs More Information] ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information [Needs More Information] ## Considerations for Using the Data ### Social Impact of Dataset [Needs More Information] ### Discussion of Biases [Needs More Information] ### Other Known Limitations [Needs More Information] ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information Dataset provided for research purposes only. Please check dataset license for additional information. ### Citation Information @article{marcus-etal-1993-building, title = "Building a Large Annotated Corpus of {E}nglish: The {P}enn {T}reebank", author = "Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann", journal = "Computational Linguistics", volume = "19", number = "2", year = "1993", url = "https://www.aclweb.org/anthology/J93-2004", pages = "313--330", } ### Contributions Thanks to [@harshalmittal4](https://github.com/harshalmittal4) for adding this dataset.
提供机构:
yunruoyi
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作