Predictive Hacks

# How to Extract Text from PDF Files

In many NLP tasks, we deal with PDF files that need to be converted to plain-text (.txt) files. For this task, I prefer to work with Apache Tika. Note that Tika also works with other formats, such as .ppt and .doc files. The full list of supported formats can be found in the Apache Tika documentation.

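We can install tika-python by typing `pip install tika` in the terminal. Note that tika-python runs a Tika server in the background, so a Java runtime must be installed on the machine. Let’s look at a quick example of how we can extract text from a PDF.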

from tika import parser

# I took the sample pdf from the link here
# http://www.africau.edu/images/default/sample.pdf

file_data = parser.from_file("sample.pdf")

# get the extracted text of the pdf file
output = file_data['content']
# encode it as UTF-8 bytes, dropping any characters that cannot be encoded
output = output.encode('utf-8', errors='ignore')

# save it to a file called output.txt
# (binary mode, because output is now a bytes object)
with open('output.txt', 'wb') as the_file:
    the_file.write(output)

output


Below is what we get from the PDF file. The leading b indicates a bytes object, and each \n is a newline character.

b'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n A Simple PDF File \n This is a small demonstration .pdf file - \n\n just for use in the Virtual Mechanics tutorials. More text. And more \n text. And more text. And more text. And more text. \n\n And more text. And more text. And more text. And more text. And more \n text. And more text. Boring, zzzzz. And more text. And more text. And \n more text. And more text. And more text. And more text. And more text. \n And more text. And more text. \n\n And more text. And more text. And more text. And more text. And more \n text. And more text. And more text. Even more. Continued on page 2 ...\n\n\n\n Simple PDF File 2 \n ...continued from page 1. Yet more text. And more text. And more text. \n And more text. And more text. And more text. And more text. And more \n text. Oh, how boring typing this stuff. But not as boring as watching \n paint dry. And more text. And more text. And more text. And more text. \n Boring.  More, a little more text. The end, and just as well. \n\n\n'
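The raw output contains many leading newlines and hard line breaks carried over from the PDF layout, so some cleanup is usually needed before further NLP work. Below is a minimal sketch of one way to do it, assuming that blank lines mark paragraph boundaries. The helper name `clean_extracted_text` is hypothetical and not part of Tika, and the `sample` value is a shortened piece of the output above.

```python
import re


def clean_extracted_text(raw):
    """Normalize whitespace in extracted text (hypothetical helper, not part of Tika)."""
    # Accept either the encoded bytes or the original str
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="ignore")
    # Assumption: one or more blank lines separate paragraphs
    paragraphs = re.split(r"\n\s*\n", raw)
    # Collapse line breaks and repeated spaces inside each paragraph
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and rejoin with a single blank line
    return "\n\n".join(p for p in cleaned if p)


# A shortened piece of the output shown above
sample = (
    b"\n\n\n A Simple PDF File \n This is a small demonstration .pdf file - \n\n"
    b" just for use in the Virtual Mechanics tutorials. More text. \n\n\n"
)
print(clean_extracted_text(sample))
```

Since blank lines in the extracted text come from the PDF layout, a paragraph break can land in the middle of a sentence, as it does in this sample. Depending on the documents, a different splitting rule may work better.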
