
2016 ImageCLEF WEBUPV Collection

Indexed from NIAID Data Ecosystem on 2026-03-11
Download link: https://zenodo.org/record/1038553
Resource description:
===============================================================================
Changelog:
2018-05-05  Test data ground truth released.
2016-04-24  Test data for main subtask 3 is released.
2016-03-17  Test data for teaser tasks is released.
2016-03-16  A bug was found in the visual feature files; please redownload
            them if you use them.
2016-02-17  Fixed the missing xml files in
            scaleconcept16_data_textual.webpages.tar.gz
2016-02-15  Added input document files for the development set of the teaser
            tasks.
            * DevData/TeaserTasks/scaleconcept16.teaser_dev_input_documents.tar.gz
2016-02-15  Fixed some newline formatting issues in
            * Features/scaleconcept16.teaser.TrainTestSplit.v20160215.tar.gz
            Please download the latest version.
2016-02-12  Teaser development set: There were 2 duplicate webpages and 2
            near-duplicate images in the previous release. Based on user
            feedback, we have updated the dataset -- which now only includes
            3337 image-webpage pairs. Please download the latest versions of
            the following to reflect these minor updates.
            * DevData/TeaserTasks/scaleconcept16.teaser_dev_id.v20160212.tar.gz
            * DevData/TeaserTasks/scaleconcept16.teaser_dev_data_textual.scofeat.v20160212.gz
            * DevData/TeaserTasks/scaleconcept16.teaser2_dev_groundtruth.v20160212.txt
===============================================================================

This document describes the ScaleConcept dataset compiled for the ImageCLEF
2016 Scalable Concept Image Annotation challenge. The data mentioned here
indicates what is ready for download. However, upon request or depending on
feedback from the participants, additional data may be released.

The following is the directory structure of the collection, and below there
is a brief description of what each compressed file contains.

Directory structure
-------------------

.
|
|--- README.txt
|--- scaleconcept16.agreement.txt.tar.gz
|--- scaleconcept16.concepts.tar.gz
|
|--- Features/
|      |
|      |--- scaleconcept16_ImgID.txt_Mod.tar.gz
|      |--- scaleconcept16_ImgToTextID.tar.gz
|      |--- scaleconcept16_TextID.txt_Mod.tar.gz
|      |--- scaleconcept16.teaser.TrainTestSplit.*.tar.gz
|      |
|      |--- Textual/
|      |       |
|      |       |--- scaleconcept16_data_textual.scofeat.tar.gz
|      |       |--- scaleconcept16_data_textual.webpages.zip
|      |
|      |--- Visual/
|              |
|              |--- scaleconcept16_data_visual_gist.dfeat.gz
|              |--- scaleconcept16_data_visual_sift_1000.sfeat.gz
|              |--- scaleconcept16_data_visual_rgbsift_1000.sfeat.gz
|              |--- scaleconcept16_data_visual_opponentsift_1000.sfeat.gz
|              |--- scaleconcept16_data_visual_colorhist.sfeat.gz
|              |--- scaleconcept16_data_visual_getlf.sfeat.gz
|              |--- scaleconcept16_data_visual_vgg16-relu7.dfeat.gz
|              |--- scaleconcept16_images.zip
|
|--- DevData/
|       |
|       |--- MainSubTasks/
|       |       |
|       |       |--- scaleconcept16.dev.visual.bbox.*.tar.gz
|       |       |--- scaleconcept16.dev.textdesc.*.tar.gz
|       |       |--- scaleconcept16.subtask3.dev.input_bbox.*.gz
|       |       |--- scaleconcept16.subtask3.dev.textdesc.*.gz
|       |
|       |--- TeaserTasks/
|       |       |
|       |       |--- scaleconcept16.teaser_dev_data_textual.scofeat.*.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_colorhist.sfeat.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_csift_1000.sfeat.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_getlf.sfeat.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_gist.sfeat.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_opponentsift_1000.sfeat.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_rgbsift_1000.sfeat.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_sift_1000.sfeat.gz
|       |       |--- scaleconcept16.teaser_dev_data_visual_vgg16-relu7.dfeat.gz
|       |       |--- scaleconcept16.teaser_dev_id.*.tar.gz
|       |       |--- scaleconcept16.teaser_dev_images.zip
|       |       |--- scaleconcept16.teaser_dev_pages.zip
|       |       |--- scaleconcept16.teaser2_dev_groundtruth.txt
|
|--- TestData/
|      |
|      |--- concepts.lst
|      |--- scaleconcept16_subtask1_test.lst
|      |--- scaleconcept16_subtask2_test.lst
|      |--- scaleconcept16_subtask3_test.lst
|      |--- scaleconcept16_subtask3_test.input_bbox.txt
|      |--- scaleconcept16_teaser1_test_image_collection.lst
|      |--- scaleconcept16_teaser1_test.lst
|      |--- scaleconcept16_teaser2_test.lst
|      |--- scaleconcept16_teaser_test_input_documents.tar.gz

Contents of files
-----------------

* scaleconcept16.concepts.tar.gz
  -> scaleconcept16.concepts.txt
     List of 251 concepts for the 2016 challenge. File format:
     wordnet-offset \t category-word.pos.## \t list,of,synonyms,separated,by,commas \t definition
  -> scaleconcept16.concepts_hierarchy.txt
     The hierarchy structure of the 'general level' categories. File format:
     *category \t *parent-category \t definition.
     For example, *mammal is the child of *animal. '#' represents the root node.
  -> scaleconcept16.concepts_to_parents.txt
     List of 'general level' category parent(s) for each of the 251 concepts.
     A concept may have multiple parents (separated by commas). File format:
     category \t *parent1,*parent2

* Features/scaleconcept16_ImgID.txt_Mod.tar.gz
  IDs of the images in the dataset.

* Features/scaleconcept16_TextID.txt_Mod.tar.gz
  IDs of the webpages in the dataset.

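For reference, the tab-separated layouts given above for
scaleconcept16.concepts.txt and scaleconcept16.concepts_to_parents.txt could
be read along the following lines. This is only a minimal sketch: the names
Concept, parse_concept_line and parse_parents_line are hypothetical helpers,
not part of the collection, and it assumes every line carries exactly the
fields listed in the format descriptions.

```python
from typing import List, NamedTuple, Tuple


class Concept(NamedTuple):
    """One line of scaleconcept16.concepts.txt (hypothetical container)."""
    offset: str           # wordnet-offset field
    name: str             # category-word.pos.## field
    synonyms: List[str]   # comma-separated synonyms, split into a list
    definition: str


def parse_concept_line(line: str) -> Concept:
    # Assumed layout: offset \t name \t syn1,syn2,... \t definition
    offset, name, synonyms, definition = line.rstrip("\n").split("\t", 3)
    return Concept(offset, name, [s for s in synonyms.split(",") if s], definition)


def parse_parents_line(line: str) -> Tuple[str, List[str]]:
    # Assumed layout: category \t *parent1,*parent2
    category, parents = line.rstrip("\n").split("\t", 1)
    return category, [p for p in parents.split(",") if p]
```

Both functions work on a single line, so they can be applied while iterating
over the extracted text files without loading them fully into memory.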
* Features/scaleconcept16_ImgToTextID.tar.gz
  IDs of images that appear on corresponding web pages.

* Features/scaleconcept16.teaser.TrainTestSplit.*.tar.gz
  For Teasers 1 and 2: IDs of images and webpages, split into approximately
  300K for training and exactly 200K for testing.
  The 200K test data must not be explored during training in either teaser
  task.

* Features/Textual/scaleconcept16_data_textual.scofeat.tar.gz
  The processed text extracted from the webpages near where the images
  appeared. Each line corresponds to one image, having the same order
  as the data_iids.txt list. The lines start with the image ID,
  followed by the number of extracted unique words and the
  corresponding word-score pairs. The scores were derived taking into
  account 1) the term frequency (TF), 2) the document object model
  (DOM) attributes, and 3) the word distance to the image. The scores
  are all integers and for each image the sum of scores is always
  <=100000 (i.e. it is normalized).

* Features/Textual/scaleconcept16_data_textual.webpages.tar.gz
  Contains all of the webpages which referenced the images in the
  dataset after being converted to valid xml. In total there are
  525766 files, since each image can appear in more than one page, and
  there can be several versions of the same page which differ by the
  method of conversion to xml. To avoid having too many files in a
  single directory (which is an issue for some types of partitions),
  the files are found in subdirectories named using the first two
  characters of the RID, thus the paths of the files after extraction
  are of the form:
    ./scaleconcept16_data_textual.webpages/{RID:0:2}/{RID}.{CONVM}.xml.gz

* Features/Visual/scaleconcept16_images.zip
  Contains thumbnails (maximum 640 pixels of either width or height)
  of the images in jpeg format.
  To avoid having too many files in a
  single directory (which is an issue for some types of partitions),
  the files are found in subdirectories named using the first two
  characters of the image ID, thus the paths of the files after extraction
  are of the form:
    ./scaleconcepts16_images/{IID:0:2}/{IID}.jpg

* Features/Visual/scaleconcept16_*.{s|d}feat.gz
  The visual features in a simple ASCII text format, either sparse
  (*.sfeat.gz files) or dense (*.dfeat.gz files). The first
  line of the file indicates the number of vectors (N) and the
  dimensionality (DIMS). Then each line corresponds to one vector.
  For the dense features each line has exactly DIMS values separated
  by spaces, i.e., the format is:
    N DIMS
    Val(1,1) Val(1,2) ... Val(1,DIMS)
    Val(2,1) Val(2,2) ... Val(2,DIMS)
    ...
    Val(N,1) Val(N,2) ... Val(N,DIMS)
  For the sparse features, each line starts with the number of non-zero
  elements and is followed by dimension-value pairs, with dimensions
  numbered from 0, i.e., the format is:
    N DIMS
    nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)
    nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)
    ...
    nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)
  The order of the features is the same as in the list data_iids.txt.

  The procedure to extract the SIFT based features in this
  subdirectory was conducted as follows. Using the ImageMagick
  software, the images were first rescaled to a maximum of 240
  pixels in both width and height, while preserving the original
  aspect ratio, employing the command:
    convert {IMGIN}.jpg -resize '240>x240>' {IMGOUT}.jpg
  Then the SIFT features were extracted using the ColorDescriptor
  software from Koen van de Sande
  (http://koen.me/research/colordescriptors). As configuration we
  used the 'densesampling' detector with default parameters, and a hard
  assignment codebook using a spatial pyramid as
  'pyramid-1x1-2x2'.
  The number in the file name indicates the size of
  the codebook. All of the vectors of the spatial pyramid are given in
  the same line, thus keeping only the first 1/5th of the dimensions
  would be like not using the spatial pyramid. The codebook was
  generated using 1.25 million randomly selected features and the
  k-means algorithm. The GIST features were extracted using the
  LabelMe Toolbox. The images were first resized to 256x256, ignoring
  the original aspect ratio, using 5 scales, 6 orientations and 4
  blocks. The other features, colorhist and getlf, are both color
  histogram based and were extracted using our own implementation.

* Features/Visual/scaleconcept16_data_visual_vgg16-relu7.dfeat.gz
  Contains the 4096 dimensional activations of the relu7 layer of Oxford
  VGG's 16-layer CNN model, extracted using the Berkeley Caffe library.
  More details can be found at https://github.com/BVLC/caffe/wiki/Model-Zoo.

* DevData/MainSubTasks/scaleconcept16.dev.visual.bbox.*.tar.gz
  Development set ground truth localised annotations for Subtask 1.
  The format for the development set of annotated bounding boxes of
  the concepts is

  The development set contains 1,979 images. The bounding boxes may enclose
  single instances (e.g. a single tree) or grouped instances (e.g. a group of
  trees), depending on the context. The annotations are not exhaustive: the
  emphasis is on concepts that are interesting enough to be described in the
  image, although background objects are also optionally annotated by our
  annotators in many cases. Also note that a person might not be annotated if
  the annotator could not decide whether the person is a man/woman/boy/girl.

* DevData/MainSubTasks/scaleconcept16.dev.textdesc.*.tar.gz
  Development set ground truth textual description annotations of images for
  Subtask 2.
  The format is:
  \t \t
  The development set contains 2,000 images with 5 to 51 textual descriptions
  per image (mean: 9.492, median: 8).
  Please note that the sentences contain a
  mix of both American and British English spelling variants (e.g. color vs
  colour) -- we have decided to retain this variation in the annotations to
  reflect the challenge of real-world English spelling variants. Basic
  spell-correction has been performed on the textual descriptions, but we
  cannot guarantee that they are completely free from spelling or grammatical
  error.

* DevData/MainSubTasks/scaleconcept16.subtask3.dev.input_bbox.*.gz
  Input bounding boxes for Subtask 3. This is a selected subset of 500
  development images from scaleconcept16.dev.visual.bbox above (please refer
  to above for file format).

* DevData/MainSubTasks/scaleconcept16.subtask3.dev.textdesc.*.gz
  Annotated textual descriptions for 500 development images, to be used to
  evaluate the content selection ability of the text generation system in the
  clean track of Subtask 3. The format is the same as the original
  scaleconcept16.dev.textdesc file, except that we further annotated textual
  terms with their corresponding input bounding boxes, for example
  [[[dogs|0,4]]] in a textual description refers to the two instances of dogs
  with the bounding box id 0 and 4 in scaleconcept16.subtask3.dev.input_bbox.
  Note that not all descriptions from the original scaleconcept16.dev.textdesc
  are used in this version, and as such the sequence numbers of the
  descriptions may not necessarily be contiguous as we retained the sequence
  numbers from the original file for consistency.

* DevData/TeaserTasks/scaleconcept16.teaser_dev_id.*.tar.gz
  -> scaleconcept16.teaser_dev.ImgID.txt
     IDs of 3339 images for the development set of both teaser tasks. Note
     that 2 images are near-duplicates and will thus not be used in this
     dataset. We have left them intact to avoid having participants
     re-download the visual features.
  -> scaleconcept16.teaser_dev.TextID.txt
     IDs of 3337 webpage documents for the development set of both teaser tasks.
  -> scaleconcept16.teaser_dev.ImgToTextID.txt
     IDs of 3337 images that appear on corresponding web pages.

* DevData/TeaserTasks/scaleconcept16.teaser_dev_input_documents.tar.gz
  -> scaleconcept16.teaser_dev.docID.txt
     IDs of 3337 input text documents for the development set of both teaser tasks.
  -> docs/{docID}
     The text for each input document. These should be used as input for both
     teaser tasks.

* DevData/TeaserTasks/scaleconcept16.teaser_dev_images.zip
  Contains 3339 images for the development set of both teaser tasks in jpeg
  format.

* DevData/TeaserTasks/scaleconcept16.teaser_dev_pages.zip
  Contains 3337 webpages for the development set of both teaser tasks after
  converting to valid xml format, each compressed as a gzip file.

* DevData/TeaserTasks/scaleconcept16.teaser_dev_data_textual.scofeat.*.gz
  The processed text extracted from 3337 webpages near where the images
  appeared. Please refer to
  Features/Textual/scaleconcept16_data_textual.scofeat.tar.gz
  above for more details.

* DevData/TeaserTasks/scaleconcept16_teaser_dev_data_visual_*.{s|d}feat.gz
  The visual features for the 3339 images for the development set of both
  teaser tasks. Please refer to Features/Visual/scaleconcept16_*.{s|d}feat.gz
  above for more details.

* DevData/TeaserTasks/scaleconcept16.teaser2_dev_groundtruth.*.txt
  The GPS coordinates for 3337 documents from the development set for Teaser
  Task 2 (Geolocation).
  File format:
  WebpageID latitude longitude

* TestData/concepts.lst
  List of 251 concepts for the main subtasks 1, 2, and 3.
  File format: wordnet-offset \t category-word.pos.##

* TestData/scaleconcept16_subtask{1|2}_test.lst
  List of 510,123 images to annotate for main subtasks 1 and 2. Both files
  are identical.

* TestData/scaleconcept16_subtask3_test.lst
  List of 450 images to annotate for main subtask 3.

* TestData/scaleconcept16_subtask3_test.input_bbox.txt
  List of bounding boxes for 450 test images, to be used as input for main
  subtask 3. The format is the same as the development set.

* TestData/scaleconcept16_teaser1_test_image_collection.lst
  The collection of 200,000 test images to be used for Teaser Task 1 (text
  illustration).

* TestData/scaleconcept16_teaser1_test.lst
  The list of IDs for 180,000 text documents to be used as input for Teaser
  Task 1. Please note that the IDs are *not* the same as the webpage IDs
  provided in the 500K corpus.
  The task is to provide a ranked list of the top 100 images (from the
  200,000 test image collection above) for each input text document.

* TestData/scaleconcept16_teaser2_test.lst
  The list of IDs for 180,000 text documents to be used as input for Teaser
  Task 2 (identical to teaser task 1).
  The task is to provide the latitude and longitude for each input text
  document.

* TestData/scaleconcept16_teaser_test_input_documents.tar.gz
  The input text documents to be used for Teaser Tasks 1 and 2.
  These should be used as input for both teaser tasks.

* TestData/scaleconcept16_groundtruth.zip
  Ground truth for the test set.

Contact
-------

For further questions, please contact:
  Andrew Gilbert
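
As an illustration of the dense (*.dfeat.gz) and sparse (*.sfeat.gz) feature
layouts and the two-character subdirectory scheme described in the file
contents above, a reader might look roughly like the sketch below. It is not
an official loader: read_dense, read_sparse, open_feature_file and image_path
are hypothetical names, and the sketch assumes whitespace-separated ASCII
values exactly as in the format description.

```python
import gzip
import io
from typing import Dict, List, TextIO, Tuple


def open_feature_file(path: str) -> TextIO:
    """Open a gzip-compressed feature file as ASCII text."""
    return gzip.open(path, "rt", encoding="ascii")


def read_dense(stream: TextIO) -> Tuple[int, int, List[List[float]]]:
    """Header 'N DIMS', then N lines of exactly DIMS values each."""
    n, dims = map(int, stream.readline().split())
    rows = []
    for _ in range(n):
        values = [float(v) for v in stream.readline().split()]
        if len(values) != dims:
            raise ValueError(f"expected {dims} values, got {len(values)}")
        rows.append(values)
    return n, dims, rows


def read_sparse(stream: TextIO) -> Tuple[int, int, List[Dict[int, float]]]:
    """Header 'N DIMS', then N lines 'nz dim val dim val ...' (dims from 0)."""
    n, dims = map(int, stream.readline().split())
    rows = []
    for _ in range(n):
        tokens = stream.readline().split()
        if not tokens:
            raise ValueError("unexpected end of file or empty line")
        nz, pairs = int(tokens[0]), tokens[1:]
        if len(pairs) != 2 * nz:
            raise ValueError(f"expected {nz} dimension-value pairs")
        rows.append({int(pairs[i]): float(pairs[i + 1])
                     for i in range(0, 2 * nz, 2)})
    return n, dims, rows


def image_path(iid: str) -> str:
    """Path of a thumbnail after extraction: first two ID characters, then ID."""
    return f"./scaleconcepts16_images/{iid[:2]}/{iid}.jpg"


if __name__ == "__main__":
    # Tiny made-up inputs, not taken from the collection.
    print(read_dense(io.StringIO("2 3\n0.5 0 1\n1 2 3\n")))
    print(read_sparse(io.StringIO("2 5\n2 0 1.5 4 2\n0\n")))
```

Sparse vectors are kept as dictionaries here so that the full dimensionality
is never materialised; converting them to an array type is left to the caller.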
Created: 2020-01-24