Portable Document Format (PDF) is well suited to electronic books because it preserves the appearance of the original document. PDF documents contain various components such as text, images, charts, and tables. Considerable research has been carried out on recognizing text, images, charts, and tables in PDF files. However, no research has addressed placing the recognized components back at the same position in the transmitted PDF file, although the format has gained attention in the past few decades. Hence, this paper proposes a technique that places the recognized text, images, charts, and tables at their original positions in the transmitted PDF file. The graphic region of a PDF file is separated from the text using single-pass connected components. Within the graphic region, the chart and non-chart parts are separated based on Connected Component Labels (CCL), and the images in the chart part are recognized based on connected components. The text in the PDF file is recognized using a whitespace analysis approach based on connected components. Finally, the T-Recs table recognition system is used to recognize the tables in the PDF files. The locations of the recognized text, images, charts, and tables are maintained in a matrix, and each component is placed back at its original position based on the matrix values. Experiments are conducted on a number of PDF files to demonstrate the effectiveness of the proposed method in terms of accuracy, precision, recall, and F-measure.
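The idea of labelling connected components and keeping their positions in a matrix can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a page already binarized into a grid of 0s and 1s, uses a simple breadth-first flood fill with 4-connectivity rather than a single-pass algorithm, and the names `label_components` and `restore` are hypothetical.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected foreground regions of a binary grid.

    Returns (labels, boxes). `labels` is a matrix of the same shape as
    `grid`, holding 0 for background and a component id otherwise. `boxes`
    maps each id to its inclusive bounding box (top, left, bottom, right).
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    boxes = {}
    next_id = 1
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or labels[r][c] != 0:
                continue  # background, or already part of a component
            labels[r][c] = next_id
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] and labels[ny][nx] == 0:
                        labels[ny][nx] = next_id
                        queue.append((ny, nx))
            boxes[next_id] = (top, left, bottom, right)
            next_id += 1
    return labels, boxes


def restore(labels):
    """Rebuild the binary page from the label matrix (assumed location store)."""
    return [[1 if value else 0 for value in row] for row in labels]


# Toy page: a 2x2 block, a vertical bar, and a single pixel.
page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
labels, boxes = label_components(page)
print(boxes)
```

Because the label matrix records which component occupies each cell, the components can be processed separately and still be written back to the cells they came from, which is the role the abstract assigns to the location matrix.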
Recognizing Location Of Graphical And Text For Restoration After Separation And Compression In PDF Documents
Research Article
DOI:
http://dx.doi.org/10.24327/ijrsr.2017.0806.0388
Subject:
science
KeyWords:
Portable Document Format, Recognition, Connected Component Labels, T-Recs table recognition system.
Abstract: