Predictive Hacks

How to Extract Text from PDF Files

In many NLP tasks, we deal with PDF files that need to be converted to plain text (.txt) files. For this task I prefer to work with Apache Tika. Note that Tika also works with other formats, such as .ppt and .doc files. The full list of supported formats can be found in the Apache Tika documentation.

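We can install tika-python by running pip install tika in the terminal. Note that tika-python starts a Tika server in the background, so a Java runtime needs to be installed on the machine. Let’s look at a quick example of how we can extract text from a PDF.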


from tika import parser

# Sample PDF taken from:
# http://www.africau.edu/images/default/sample.pdf
file_data = parser.from_file("sample.pdf")

# get the extracted text of the pdf file (a Python str)
text = file_data['content']

# save it as UTF-8 text to a file called output.txt
with open('output.txt', 'w', encoding='utf-8') as the_file:
    the_file.write(text)

# encode to UTF-8 bytes only to display the raw content
output = text.encode('utf-8', errors='ignore')
output

Below is what we get from the PDF file. The b prefix marks a bytes object, and each \n is a newline character.

b'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n A Simple PDF File \n This is a small demonstration .pdf file - \n\n just for use in the Virtual Mechanics tutorials. More text. And more \n text. And more text. And more text. And more text. \n\n And more text. And more text. And more text. And more text. And more \n text. And more text. Boring, zzzzz. And more text. And more text. And \n more text. And more text. And more text. And more text. And more text. \n And more text. And more text. \n\n And more text. And more text. And more text. And more text. And more \n text. And more text. And more text. Even more. Continued on page 2 ...\n\n\n\n Simple PDF File 2 \n ...continued from page 1. Yet more text. And more text. And more text. \n And more text. And more text. And more text. And more text. And more \n text. Oh, how boring typing this stuff. But not as boring as watching \n paint dry. And more text. And more text. And more text. And more text. \n Boring.  More, a little more text. The end, and just as well. \n\n\n'
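The extracted text usually carries a lot of blank lines and stray spaces, as the output above shows. Here is a minimal sketch of one way to tidy it up before using it in an NLP pipeline. The helper name clean_text and the rule of keeping at most one blank line between paragraphs are my own assumptions for illustration, not part of Tika.

```python
import re


def clean_text(raw):
    """Tidy text extracted from a PDF (hypothetical helper, not a Tika function)."""
    # Remove leading and trailing whitespace from every line
    lines = [line.strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop the blank lines at the very start and end
    return text.strip()


# A shortened piece of the sample output, as a str rather than bytes
sample = (
    "\n\n\n\n A Simple PDF File \n This is a small demonstration .pdf file - \n\n"
    " just for use in the Virtual Mechanics tutorials. \n\n\n"
)
print(clean_text(sample))
```

The function expects a str, so pass it the value of file_data['content'] directly rather than the encoded bytes.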
