Is there a way to extract pictures from image files?

LAPIII · 01-26-2012, 08:00 PM

I want to extract text from image files that also have pictures. I'd rather not manually remove the pictures using an image editor; I want to automate this.

MS3FGX · 01-26-2012, 09:59 PM

An image file that has a picture? Isn't that the same thing? I'm not sure I follow you here.

Dark_Helmet · 01-26-2012, 10:51 PM

I'm with MS3FGX...

What file format are you trying to work with?

When you say text in the file, do you mean metadata?

Multiple images in one file with text... sounds like a PDF or a web page... maybe a tiff? All of those have different approaches to extract the components.

So, let us know what file format you're using, and we can probably point you in the right direction.

LAPIII · 01-27-2012, 11:13 AM

Quote:

Originally Posted by Dark_Helmet

I'm with MS3FGX...
What file format are you trying to work with?

TIFFs and PDFs.

Quote:

Originally Posted by Dark_Helmet

When you say text in the file, do you mean metadata?

No

I can convert both TIFFs and PDFs to text with Tesseract OCR. I just want to know if there's an easier way to extract the pictures from the images, so that Tesseract can do its job better.

-EDIT-

I'm reading, from Understanding the PDF File format – images, that:

Quote:

Images are not stored inside a PDF file as Tiff or PNG or JPG images. They are stored as the binary pixel data along with the Colorspace used by that data.
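To make that concrete, here is a minimal sketch of turning such raw pixel data back into a viewable file. It assumes the image stream has already been decompressed and holds 8-bit RGB samples; real PDFs usually compress the stream and may use other colorspaces. The function name and the sample pixels are made up for illustration.

```python
def raw_rgb_to_ppm(samples: bytes, width: int, height: int) -> bytes:
    """Wrap raw 8-bit RGB samples in a binary PPM (P6) header."""
    expected = width * height * 3  # 3 bytes per pixel: R, G, B
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + samples


# Hypothetical 2x1 image: one red pixel followed by one blue pixel.
pixels = bytes([255, 0, 0, 0, 0, 255])
ppm = raw_rgb_to_ppm(pixels, width=2, height=1)
```

Writing `ppm` to a file with a `.ppm` extension should give something most image viewers can open.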

I've also read that ImageMagick has PDF handling that will extract images but, like most PDF image extractors, it probably writes them out as BMP, PNG, JPEG, etc.

There is also a command-line utility called pdfimages and, as with ImageMagick, I'd guess it works the same way.
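For anyone who wants to script pdfimages, here is a rough sketch of calling it from Python. It assumes the Poppler version of pdfimages is on the PATH; the `-png` option comes from Poppler and may be missing in older builds, in which case the tool writes PPM/PBM files instead. The helper names and the `img` output prefix are my own choices, not part of the tool. Note that this only pulls the embedded images out into separate files; it does not remove them from the page that goes to OCR.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, out_prefix, as_png=True):
    """Build the argument list for a pdfimages call (nothing is executed here)."""
    cmd = ["pdfimages"]
    if as_png:
        cmd.append("-png")  # Poppler option; without it the output is PPM/PBM
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def extract_images(pdf_path, out_dir):
    """Run pdfimages and return the files it wrote into out_dir."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "img"  # output files are named img-000.png, img-001.png, ...
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
    return sorted(out_dir.glob("img-*"))
```

Usage would be something like `extract_images("scan.pdf", "extracted")`, where both names are placeholders.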