Quick blog post on how to extract text from an image, PDF, Word doc, Excel spreadsheet, and more using Python and the textract library.
If you haven’t already, jump into the various other blog posts on Python on the blog here to get a better understanding of the language.
For this solution, let’s start by creating a new directory and adding our overall system and project dependencies.
    mkdir pythonextract
    cd pythonextract
    brew install cask
    brew install xquartz --cask
    brew install poppler antiword unrtf tesseract swig
    python -m venv .pythonextract
    source ./.pythonextract/bin/activate
    pip install PyPDF2
    pip install textract
Next, create your Python script (extracttext.py) and copy and paste the code below. This example is tailored for PDF text extraction: it tries the PyPDF2 library first and falls back to textract when PyPDF2 finds no text. It’s slim and simple, yet efficient at extracting text from PDFs. If you need to extract from other file formats, just use the one-liner text = textract.process(filePath, method='tesseract', encoding='utf-8')
    import PyPDF2
    import textract

    def extractPdfText(filePath=''):
        print(f"Opening file at {filePath}")
        # PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
        pdfFileReader = PyPDF2.PdfReader(filePath)
        totalPages = len(pdfFileReader.pages)
        print(f"This pdf contains {totalPages} pages.")
        text = ''
        for pdfPage in pdfFileReader.pages:
            text = text + pdfPage.extract_text()
        # fall back to OCR when PyPDF2 finds no text (e.g. a scanned PDF)
        if text.strip() == '':
            # textract.process returns bytes, so decode to always return a str
            text = textract.process(filePath, method='tesseract', encoding='utf-8').decode('utf-8')
        return text
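For the other file formats mentioned earlier, the one-liner can be wrapped in a small helper. This is only a sketch: the name extractAnyText, the OCR_EXTENSIONS set, and the optional process parameter are my own assumptions for illustration, not part of textract. It passes method='tesseract' only for PDFs and images and leaves the other formats on textract's defaults.

```python
import os

# Extensions where the OCR method is relevant (an assumed list, adjust to taste)
OCR_EXTENSIONS = {'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.tif', '.tiff'}

def extractAnyText(filePath, process=None):
    """Return the text of filePath as a str, using textract.process by default."""
    if process is None:
        # imported lazily so the helper can be exercised with a stand-in function
        import textract
        process = textract.process
    extension = os.path.splitext(filePath)[1].lower()
    kwargs = {'encoding': 'utf-8'}
    if extension in OCR_EXTENSIONS:
        kwargs['method'] = 'tesseract'
    rawText = process(filePath, **kwargs)
    # textract.process returns bytes; decode so callers always get a str
    return rawText.decode('utf-8') if isinstance(rawText, bytes) else rawText
```

Calling extractAnyText('report.docx') or extractAnyText('scan.png') should then work the same way, as long as the system dependencies installed earlier are present.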
The textract library is extremely powerful, but running batch operations over a lot of data can become labor- or infrastructure-intensive. Amazon recognized the opportunity to provide horizontal scale for customers who need this kind of AI tool, so they released Amazon Textract (a separate managed AWS service, not a hosted version of the Python library). The service also combines several other AI services to provide a robust text extraction and comprehension solution for natural language processing of documents, pictures, and more.
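Before reaching for a managed service, a small local batch run is often enough. Here is a rough sketch using only the standard library. The name extractFolder and its parameters are hypothetical, and the extract argument is whatever extraction function you already have (for example extractPdfText).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extractFolder(folderPath, extract, pattern='*.pdf', maxWorkers=4):
    """Run extract(path) over matching files; return {filename: text or None}."""
    paths = sorted(Path(folderPath).glob(pattern))

    def safeExtract(path):
        try:
            return extract(str(path))
        except Exception as error:
            # one unreadable file should not stop the whole batch
            print(f"Skipping {path.name}: {error}")
            return None

    with ThreadPoolExecutor(max_workers=maxWorkers) as pool:
        texts = list(pool.map(safeExtract, paths))
    return {path.name: text for path, text in zip(paths, texts)}
```

Threads are a reasonable fit when most of the time is spent waiting on external tools such as tesseract. For pure-Python parsing, a process pool may scale better, and that trade-off is part of what makes larger batches infrastructure-intensive.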
To finish off the PDF example above: now that we have a module with a function called extractPdfText, we can call it with a hardcoded path or pass the path to the script as a command line argument.
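    # in example.py
    from extracttext import extractPdfText
    import os

    text = extractPdfText(os.path.join(os.getcwd(), 'example.pdf'))
    # decode only if the extractor handed back bytes (textract.process returns bytes)
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # write out the text that was extracted
    with open('output.txt', 'w', encoding='utf-8') as outputFile:
        outputFile.write(text)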
My previous blog post on Python Command Line Arguments shows how you can call the script and pass in a filename instead of hardcoding it into your application, as the example above does with 'example.pdf'.
    # python example.py /path/to/file.pdf
    import sys
    from extracttext import extractPdfText

    text = extractPdfText(sys.argv[1])
    # sys.argv contains the list ['example.py', '/path/to/file.pdf']
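Reading sys.argv[1] directly raises an IndexError when the script is run without a filename, so a small guard can help. This is a minimal sketch, and the helper name pdfPathFromArgs is made up for this post.

```python
import os
import sys

def pdfPathFromArgs(argv):
    """Return the file path from argv, or exit with a short message."""
    if len(argv) != 2:
        sys.exit(f"Usage: python {argv[0]} /path/to/file.pdf")
    filePath = argv[1]
    if not os.path.isfile(filePath):
        sys.exit(f"File not found: {filePath}")
    return filePath
```

In example.py the call would then become extractPdfText(pdfPathFromArgs(sys.argv)).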