Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/125749
Title: Aligning Textual and Visual Data towards Scalable Multimedia Retrieval
Researcher: Kompalli Pramod Sankar
Guide(s): Prof Jawahar C.V.
Keywords: Document Recognition and Retrieval
Multimedia Retrieval
Video Annotation
University: International Institute of Information Technology, Hyderabad
Completed Date: 13/05/2015
Abstract: The search and retrieval of images and videos from large multimedia repositories is acknowledged as a hard challenge. With existing solutions, one cannot obtain a detailed, semantic description of a given multimedia document. Towards addressing this challenge, we observe that several multimedia collections have similar information available in a parallel form. For example, the content of a news broadcast is also available in the form of newspaper articles. If a correspondence could be obtained between the videos and such parallel information, one could access one medium using the other. Different ⟨Multimedia, Parallel Information⟩ pairs require different alignment techniques, depending on the granularity at which entities can be matched across them. We choose four pairs of multimedia, along with parallel information available in the text domain. The framework we propose begins with the assumption that the multimedia and the text can be segmented into meaningful entities that correspond to each other. The problem, then, is to identify features and learn to match a text entity to a multimedia segment (and vice versa). Such a matching scheme can be refined using additional constraints, such as temporal ordering and occurrence statistics. We build algorithms that align i) movies with scripts, and ii) document images with a lexicon. Further, we relax the above assumption, so that a segmentation of the multimedia is not available a priori. The problem now is to perform a joint inference of segmentation and annotation: a large number of putative segmentations are matched against the information extracted from the parallel text, with the joint inference achieved through dynamic programming. This approach was successfully demonstrated on i) cricket videos with commentaries, and ii) word images with the text equivalent of the word. As a consequence of the approaches proposed in this thesis, we were able to demonstrate text-based retrieval systems over large multimedia collections.
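
The joint segmentation-and-annotation step mentioned in the abstract can be pictured with a small dynamic-programming sketch. The snippet below is illustrative only and not taken from the thesis: it assumes a hypothetical per-frame affinity matrix (frame_scores) between an ordered list of text entities (e.g. commentary events) and video frames, and recovers the ordered segment boundaries that maximise the total affinity. All names, inputs, and scoring choices are assumptions made for illustration.

import numpy as np

def align_segments(frame_scores):
    """Jointly infer a segmentation of a video and its annotation with an
    ordered list of text entities, in the spirit of the dynamic-programming
    alignment described in the abstract (illustrative sketch only).

    frame_scores: (K, T) array; frame_scores[k, t] is a hypothetical
    affinity between text entity k and video frame t.
    Returns K (start, end) frame intervals covering [0, T) in order.
    """
    K, T = frame_scores.shape
    # Prefix sums so a candidate segment's score is O(1) to evaluate.
    prefix = np.concatenate(
        [np.zeros((K, 1)), np.cumsum(frame_scores, axis=1)], axis=1)
    seg_score = lambda k, i, j: prefix[k, j] - prefix[k, i]

    best = np.full((K + 1, T + 1), -np.inf)   # best[k, j]: first k entities end at frame j
    back = np.zeros((K + 1, T + 1), dtype=int)
    best[0, 0] = 0.0

    for k in range(1, K + 1):
        for j in range(k, T + 1):             # entity k ends at frame j
            for i in range(k - 1, j):         # and starts at frame i
                cand = best[k - 1, i] + seg_score(k - 1, i, j)
                if cand > best[k, j]:
                    best[k, j] = cand
                    back[k, j] = i

    # Recover the boundaries by walking the backpointers from (K, T).
    bounds, j = [], T
    for k in range(K, 0, -1):
        i = back[k, j]
        bounds.append((i, j))
        j = i
    return list(reversed(bounds))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random((3, 20))              # 3 commentary events, 20 frames
    print(align_segments(scores))

The cubic loop is kept for clarity; in practice one would restrict candidate boundaries (for example, to shot changes) to keep the search tractable, which is in line with matching a large but finite set of putative segmentations against the parallel text.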
Pagination: xiv, 120
URI: http://hdl.handle.net/10603/125749
Appears in Departments: Computer Science and Engineering

Files in This Item:
File                             Size        Format
01_title.pdf                     70.21 kB    Adobe PDF
02_certificate.pdf               39.86 kB    Adobe PDF
03_acknowledgements.pdf          63.65 kB    Adobe PDF
04_abstract.pdf                  118.16 kB   Adobe PDF
05_contents.pdf                  168.15 kB   Adobe PDF
06_list_of_tables_figures.pdf    650.61 kB   Adobe PDF
07_chapter1.pdf                  2.48 MB     Adobe PDF
08_chapter2.pdf                  8.11 MB     Adobe PDF
09_chapter3.pdf                  21.72 MB    Adobe PDF
10_chapter4.pdf                  21.4 MB     Adobe PDF
11_chapter5.pdf                  6.45 MB     Adobe PDF
12_chapter6.pdf                  2.99 MB     Adobe PDF
13_chapter7.pdf                  152.99 kB   Adobe PDF
14_references.pdf                619.3 kB    Adobe PDF


Items in Shodhganga are protected by copyright, with all rights reserved, unless otherwise indicated.