
Hong Kong Parallel Text

DataCite Commons, updated 2021-07-01, indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2004T08
Resource description:
<h3>Introduction</h3><br>
<p>Hong Kong Parallel Text was developed by the Linguistic Data Consortium (LDC) and contains data from three sub-corpora: Hong Kong Hansards Parallel Text, Hong Kong Laws Parallel Text and Hong Kong News Parallel Text.</p><br>
<p>Hong Kong Hansards Parallel Text contains excerpts from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR). Hong Kong Laws Parallel Text contains law codes acquired from the Department of Justice of the HKSAR. Hong Kong News Parallel Text contains press releases from the Information Services Department of the HKSAR.</p><br>
<p><a href="http://catalog.ldc.upenn.edu/LDC2000T50" rel="nofollow">Hong Kong Hansards Parallel Text</a>, <a href="http://catalog.ldc.upenn.edu/LDC2000T47" rel="nofollow">Hong Kong Laws Parallel Text</a> and <a href="http://catalog.ldc.upenn.edu/LDC2000T46" rel="nofollow">Hong Kong News Parallel Text</a> were published in 2000. The 2000 versions of Hong Kong Hansards Parallel Text and Hong Kong News Parallel Text are aligned at the document level, while the 2004 versions are aligned at the sentence level. The 2000 and 2004 versions of Hong Kong News Parallel Text were aligned using different sentence alignment algorithms. As a result, the 2004 version has better sentence alignment, and it also has slightly more data than the 2000 version. Chinese text is presented in the traditional script and encoded as BIG5.</p><br>
<h3>Data</h3><br>
<p><strong>Hong Kong Hansards</strong></p><br>
<p>Hong Kong Hansards contains excerpts from the Official Record of Proceedings (hansards) of the Legislative Council of the HKSAR from October 1985 to April 2003. LDC downloaded the hansards, which were in PDF format, from the official website of the HKSAR. A total of 1,428 files (714 in Chinese, 714 in English) were downloaded. One-to-one correspondence between the English hansards and the Chinese hansards is indicated by the file names. 
LDC converted the PDF files into plain text files using automatic conversion software and segmented the files at sentence boundaries. Efforts were made to remove tables from all files.</p><br>
<p><strong>Hong Kong Laws</strong></p><br>
<p>Hong Kong Laws contains statute laws of Hong Kong, downloaded in 2000 from the Bilingual Laws Information System (BLIS, <a href="http://www.justice.gov.hk" rel="nofollow">http://www.justice.gov.hk/</a>), a searchable electronic database of the statute laws of Hong Kong established and updated by the Department of Justice of the HKSAR.</p><br>
<p>The original BLIS database contains statute laws of Hong Kong in English and Chinese, constitutional instruments, national laws and other relevant instruments, collections of terms and expressions used in the laws of Hong Kong, and subject indices of Ordinances. This corpus contains only statute laws of Hong Kong in English and Chinese, constitutional instruments, national laws and other relevant instruments published up to the year 2000.</p><br>
<p>The original files were in HTML format, and document-level alignment was indicated by file names. LDC converted the HTML files into plain text files using automatic conversion software and segmented the files at sentence boundaries. Efforts were made to remove tables from all files.</p><br>
<p><strong>Hong Kong News</strong></p><br>
<p>Hong Kong News contains press releases issued by the government of the HKSAR from July 1997 to October 2003. The HKSAR publishes press releases in both Chinese and English on a daily basis. Most press releases are available in both languages; some were translated from English to Chinese, and others from Chinese to English.</p><br>
<p>The original files were in HTML format. LDC converted the HTML files into plain text files using automatic conversion software. Efforts were made to remove tables from all files. The original files do not indicate document-level alignment in any way. 
The document-level alignment was performed at LDC using an automatic document aligner. The document-aligned files were then segmented at sentence boundaries.</p><br>
<p>Sentence alignment was performed on all data using Champollion, a parallel text sentence alignment tool developed at LDC. See <a href="http://champollion.sourceforge.net" rel="nofollow">http://champollion.sourceforge.net</a> for more information about Champollion.</p><br>
<p><strong>Final Data Format and Validation</strong></p><br>
<p>The Chinese data contains approximately 49M words. The English translation contains approximately 59M words in total and 466K unique words.</p><br>
<p>The following table shows the number of documents, paragraphs, segments, words and characters for each source.</p><br>
<table width="100%">
<tbody>
<tr>
<td width="16%">Source</td>
<td width="16%">#Documents</td>
<td width="16%">#Paragraphs (English/Chinese)</td>
<td width="16%">#Segments (English/Chinese)</td>
<td width="16%">#English Words</td>
<td width="16%">#Chinese Characters</td>
</tr>
<tr>
<td width="16%">Hong Kong Hansards</td>
<td width="16%">714</td>
<td width="16%">642,008/632,173</td>
<td width="16%">1,688,278/1,414,573</td>
<td width="16%">36,140,737</td>
<td width="16%">56,618,181</td>
</tr>
<tr>
<td width="16%">Hong Kong Laws</td>
<td width="16%">42,255</td>
<td width="16%">423,192/462,283</td>
<td width="16%">451,884/491,719</td>
<td width="16%">8,396,243</td>
<td width="16%">14,868,621</td>
</tr>
<tr>
<td width="16%">Hong Kong News</td>
<td width="16%">44,621</td>
<td width="16%">605,183/603,118</td>
<td width="16%">811,638/775,019</td>
<td width="16%">14,798,671</td>
<td width="16%">26,677,514</td>
</tr>
<tr>
<td width="16%">Total</td>
<td width="16%">87,590</td>
<td width="16%">1,670,383/1,697,574</td>
<td width="16%">2,951,800/2,681,311</td>
<td width="16%">59,335,651</td>
<td width="16%">98,164,316</td>
</tr>
</tbody>
</table><br>
<h3>Samples</h3><br>
<p>Please view the following samples:</p><br>
<ul>
<li><a href="desc/addenda/LDC2004T08.cmn.txt">Chinese</a></li>
<li><a href="desc/addenda/LDC2004T08.eng.txt">English</a></li>
<li><a href="desc/addenda/LDC2004T08.ali.txt">Alignment</a></li>
</ul><br>
<h3>Updates</h3><br>
<p>There are no updates available at this time.</p><br>
<h3>Copying and Distribution</h3><br>
<p>Permission is granted to the Linguistic Data Consortium to make and distribute copies of the laws, press releases and news of Hong Kong Special Administrative Region provided this copyright notice and permission notices are distributed with all copies.</p><br>
<p>Permission has been given to the Linguistic Data Consortium to reproduce the laws, press releases, and/or news articles from the Hong Kong Special Administrative Region Government website for research, education, and technology development.</p><br>
<h3>Additional Licensing Instructions</h3><br>
<p>This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact&nbsp;<a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>&nbsp;for information about becoming a member.</p><br>
Portions © 1985-2003 The Government of the Hong Kong Special Administrative Region, © 2004 Trustees of the University of Pennsylvania
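<p>Because the Chinese text is encoded as BIG5, it generally has to be decoded explicitly before it can be used in UTF-8 toolchains. The following is a minimal Python sketch and is not part of the LDC distribution. The function names <code>read_big5</code> and <code>convert_to_utf8</code> and any file paths passed to them are hypothetical, and the sketch assumes the files decode with Python's standard <code>big5</code> codec.</p>

```python
from pathlib import Path


def read_big5(path, encoding="big5", errors="replace"):
    """Return the contents of a BIG5-encoded file as a str.

    With the default errors="replace", bytes that cannot be decoded
    become U+FFFD instead of raising UnicodeDecodeError.
    """
    return Path(path).read_bytes().decode(encoding, errors=errors)


def convert_to_utf8(src, dst, encoding="big5"):
    """Decode src (assumed BIG5) and write it to dst as UTF-8.

    Returns the decoded text so the caller can inspect it.
    """
    text = read_big5(src, encoding=encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

<p>If many characters come out as replacement characters, passing <code>encoding="big5hkscs"</code> (Python's codec for BIG5 with the Hong Kong Supplementary Character Set) may be worth trying. The description does not say whether the corpus uses those additional characters, so this is only a suggestion.</p>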
Provider:
Linguistic Data Consortium
Created:
2020-11-30