Extract Text from Images, PDFs, Docs, XLSs, and more in Python

August 27, 2022

A quick post on how to extract text from an image, PDF, Word document, Excel spreadsheet, and more using Python and the textract library.

If you haven’t already, read the other Python posts on this blog to get a better understanding of the language.

For this solution, let’s start by creating a new directory and installing the system and project dependencies. The commands below assume macOS with Homebrew.

mkdir pythonextract
cd pythonextract
# Homebrew Cask is built into brew, so it needs no separate install
brew install --cask xquartz
brew install poppler antiword unrtf tesseract swig
python -m venv .pythonextract
source ./.pythonextract/bin/activate
pip install PyPDF2
pip install textract
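
Since textract shells out to the system tools installed above, a quick check that they are on your PATH can save some debugging later. This is a minimal sketch, not part of textract itself: the helper name missing_tools is hypothetical, and the executable names (for example pdftotext, which ships with poppler) are assumptions you may need to adjust for your platform.

```python
import shutil

# Executable names assumed to be provided by the Homebrew packages above
# (pdftotext ships with poppler); adjust the list for your platform.
REQUIRED_TOOLS = ["pdftotext", "antiword", "unrtf", "tesseract", "swig"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All system dependencies found.")
```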

Next, create your Python script (extracttext.py) and copy and paste the code below. This example is tailored for PDF text extraction: it uses the PyPDF2 library first and falls back to textract when no text is found. It’s slim and simple, yet effective at extracting text from PDFs. If you need to extract from other file formats, just use the one-liner text = textract.process(filePath, method='tesseract', encoding='utf-8')

import PyPDF2
import textract

def extractPdfText(filePath=''):
  print(f"Opening file at {filePath}")
  # PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed
  pdfReader = PyPDF2.PdfReader(filePath)
  totalPages = len(pdfReader.pages)
  print(f"This pdf contains {totalPages} pages.")

  # concatenate the embedded text layer of every page
  text = ''
  for pdfPage in pdfReader.pages:
    text += pdfPage.extract_text()

  # scanned PDFs have no text layer, so fall back to OCR via textract
  if text.strip() == '':
    # textract.process returns bytes; decode so this function always returns a str
    text = textract.process(filePath, method='tesseract', encoding='utf-8').decode('utf-8')

  return text
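
For formats other than PDF, the one-line textract call mentioned earlier can be wrapped in a small helper. The sketch below is one possible way to do it; the names textract_options and extract_any are hypothetical, and it assumes the method='tesseract' hint is only needed for PDFs while other formats are fine with textract's defaults.

```python
from pathlib import Path

def textract_options(file_path):
    """Build keyword arguments for textract.process based on the file extension."""
    options = {"encoding": "utf-8"}
    # Assumption: the OCR hint is only needed for PDFs;
    # other formats rely on textract's default parser.
    if Path(file_path).suffix.lower() == ".pdf":
        options["method"] = "tesseract"
    return options

def extract_any(file_path):
    """Extract text from a textract-supported file and return it as a str."""
    # Imported lazily so textract_options can be used without textract installed.
    import textract
    raw = textract.process(str(file_path), **textract_options(file_path))
    return raw.decode("utf-8")  # textract.process returns bytes
```

With this in place, extract_any('report.docx') or extract_any('photo.png') would go through the same code path as a PDF.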

This library is extremely powerful, but running batch operations over large volumes of data can become labor or infrastructure intensive. For workloads that need horizontal scale, Amazon offers Amazon Textract, a managed AWS service that is separate from the textract library despite the similar name. It can be combined with other AWS AI services to provide a robust text extraction and comprehension solution for Natural Language Processing on documents, pictures, and more.
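
Before reaching for a managed service, a small local batch run is often enough. The following is a rough sketch under a few assumptions: the function name batch_extract and its parameters are hypothetical, the extractor is any callable that takes a file path and returns str or bytes, and one failing document should not stop the rest of the run.

```python
from pathlib import Path

def batch_extract(input_dir, output_dir, extractor, pattern="*.pdf"):
    """Run `extractor` on every matching file and write one .txt file per input."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    results = {}
    for source in sorted(Path(input_dir).glob(pattern)):
        try:
            text = extractor(str(source))
        except Exception as error:  # keep going if one document fails
            results[source.name] = f"failed: {error}"
            continue
        if isinstance(text, bytes):  # textract.process returns bytes
            text = text.decode("utf-8")
        (output_path / f"{source.stem}.txt").write_text(text, encoding="utf-8")
        results[source.name] = "ok"
    return results
```

For example, batch_extract('pdfs', 'output', extractPdfText) would process every PDF in the pdfs directory and report which files succeeded.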

To finish the example: now that we have a module with a function called “extractPdfText”, we can call it with a hardcoded path or with a path passed to the script as a command line argument.

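# in example.py
from extracttext import extractPdfText
import os

text = extractPdfText(os.path.join(os.getcwd(), 'example.pdf'))

# textract.process returns bytes, so decode only if the text is not already a str
if isinstance(text, bytes):
  text = text.decode('utf-8')

# write out the text that was extracted
with open('output.txt', 'w', encoding='utf-8') as outputFile:
  outputFile.write(text)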

My previous blog post on Python Command Line Arguments shows how you can call the script and pass in a filename instead of hard-coding it into your application, as the example above does with ‘example.pdf’.

# python example.py /path/to/file.pdf
import sys

# sys.argv contains the list ['example.py', '/path/to/file.pdf']
text = extractPdfText(sys.argv[1])
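
If you want the script to work both with and without an argument, one option is to fall back to a default path when none is given. This is a small sketch of that idea; the helper name input_path_from_argv and the default of 'example.pdf' are illustrative choices, not something the script requires.

```python
import sys

def input_path_from_argv(argv, default="example.pdf"):
    """Return the first command line argument, or `default` when none was given."""
    # argv[0] is the script name, so the file path (if any) is argv[1]
    return argv[1] if len(argv) > 1 else default

if __name__ == "__main__":
    print(f"Would extract text from {input_path_from_argv(sys.argv)}")
```

In example.py you could then call extractPdfText(input_path_from_argv(sys.argv)) instead of indexing sys.argv directly, which avoids an IndexError when the script is run without arguments.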