Many NLP tasks start with PDF files that need to be converted to plain text. For this I prefer to work with Apache Tika, which also handles other formats such as .ppt and .doc files. The full list of supported formats can be found in the Apache Tika documentation.
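We can install tika-python by typing pip install tika in the terminal. Note that tika-python runs a Tika server in the background, so a Java runtime must be available on the machine. Let's look at a quick example of how to extract text from a PDF.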
from tika import parser

# The sample PDF comes from
# http://www.africau.edu/images/default/sample.pdf
file_data = parser.from_file("sample.pdf")

# get the content of the pdf file
output = file_data['content']

# convert it to utf-8 encoded bytes
output = output.encode('utf-8', errors='ignore')

# save it to a file called output.txt
# (binary mode, because output is now a bytes object)
with open('output.txt', 'wb') as the_file:
    the_file.write(output)

output
Below is what we get from the PDF file. Note that \n denotes a newline and the leading b indicates a bytes object.
b'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n A Simple PDF File \n This is a small demonstration .pdf file - \n\n just for use in the Virtual Mechanics tutorials. More text. And more \n text. And more text. And more text. And more text. \n\n And more text. And more text. And more text. And more text. And more \n text. And more text. Boring, zzzzz. And more text. And more text. And \n more text. And more text. And more text. And more text. And more text. \n And more text. And more text. \n\n And more text. And more text. And more text. And more text. And more \n text. And more text. And more text. Even more. Continued on page 2 ...\n\n\n\n Simple PDF File 2 \n ...continued from page 1. Yet more text. And more text. And more text. \n And more text. And more text. And more text. And more text. And more \n text. Oh, how boring typing this stuff. But not as boring as watching \n paint dry. And more text. And more text. And more text. And more text. \n Boring. More, a little more text. The end, and just as well. \n\n\n'
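The raw output starts with a long run of blank lines and has stray spaces around each line, so it usually needs some cleanup before any NLP work. Below is a minimal sketch of one way to do that. The helper name clean_extracted_text is hypothetical and not part of tika-python, and the None check assumes that the 'content' field can be empty when a file has no extractable text (for example, a scanned PDF).

```python
import re
from typing import Optional, Union


def clean_extracted_text(content: Optional[Union[str, bytes]]) -> str:
    """Normalize whitespace in text extracted from a document.

    Hypothetical helper for illustration; not part of tika-python.
    """
    if content is None:
        # Assumption: no extractable text was found in the file
        return ""
    if isinstance(content, bytes):
        # Accept the utf-8 bytes produced by the example above
        content = content.decode("utf-8", errors="ignore")
    # Remove leading and trailing spaces on every line
    lines = [line.strip() for line in content.splitlines()]
    text = "\n".join(lines)
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop blank lines at the very start and end
    return text.strip()


# A shortened stand-in for the raw output shown above
sample = b"\n\n\n\n A Simple PDF File \n This is a small demonstration .pdf file - \n\n just for use. \n\n\n"
print(clean_extracted_text(sample))
```

This keeps paragraph breaks (a single blank line) while removing the padding that comes from the PDF layout. Whether lines inside a paragraph should also be joined depends on the downstream task, so the sketch leaves them as they are.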