How to create your first OCR project in Python with PyTesseract

Category
Tutorial
Reading
5 mins
Posting
19 Jan 2024

As applications evolve, users expect an increasingly diverse set of features, and the rapid growth of AI is pushing developers to keep up. One such feature is OCR, or Optical Character Recognition. OCR is a method of recognizing characters in visual material such as images and scanned documents. It extracts the characters contained in an image or a document such as a PDF, so you can retrieve the whole text quickly instead of typing it out by hand. Imagine retyping a document of more than 50 pages manually; this is where OCR is very helpful.

In this article, we walk through how to create a simple OCR project using the PyTesseract library and OpenCV. We assume you already understand the basics of Python, because this tutorial uses Python throughout. So, what do we need to start this OCR project?

 


 

 

1. Installing Tesseract on Windows

PyTesseract is only a wrapper, so it cannot work without the program it wraps, namely Tesseract. Tesseract is an open-source OCR engine. You can download the Windows installer from GitHub: https://github.com/UB-Mannheim/tesseract/wiki. Run the installer as usual; once it finishes, you can continue to the next step.
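
Before moving on, it can help to confirm where the Tesseract executable ended up. The following is a minimal sketch using only the standard library. The helper name find_tesseract and the candidate paths are assumptions for illustration (typical default install folders), so adjust them if you chose a different folder during setup.

```python
import os
import shutil

# Assumed default install locations on Windows; edit these if you
# picked a different folder in the installer.
DEFAULT_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return the path to the tesseract executable, or None if not found."""
    # First check whether tesseract is already available on the PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the candidate locations.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_tesseract() or "Tesseract was not found; check the installation.")
```

If the helper prints a path, that is the value you will need later when pointing pytesseract at the executable.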

 

2. Preparing and installing PyTesseract and OpenCV

Once Tesseract is installed on your computer, open your favorite code editor, create a new project, and create a Python file named app.py. Next, we need two libraries to support this project, namely pytesseract and opencv-python. Open your terminal and install both dependencies with:

pip install pytesseract opencv-python

We need OpenCV to read and process images. For the best results, we convert the image to grayscale and then to pure black and white (and, if the text is light on a dark background, invert it so the text ends up dark on a light background). This makes the text in the image easier for Tesseract to recognize during extraction. If you don't want to use OpenCV, you can use an alternative such as the Pillow library.

 

3. Creating an OCR project

After pytesseract and opencv-python are installed, open the app.py file and start adding code to it. The first step is to import the required libraries:

import pytesseract
import cv2

Next, tell pytesseract where to find the Tesseract executable that we installed earlier by adding the following line:

pytesseract.pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

Because we are using 64-bit Windows, the Tesseract-OCR folder is inside the Program Files folder. If you are using 32-bit Windows or installed Tesseract somewhere else, adjust the path accordingly.

Next, we will create a function named extract_text() that takes two parameters, image and lang. The image parameter is a string containing the path to the image, while the lang parameter is the language of the text to be recognized. Here we use English (eng); Tesseract supports many other languages, and you can list the ones installed on your machine with:

languages = pytesseract.get_languages()
print(languages)
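
Knowing which languages are installed lets you fail early when a requested language is missing. Below is a small sketch of such a check. The helper name missing_languages is hypothetical, and the installed list is sample data standing in for the result of pytesseract.get_languages(). Tesseract also accepts several language codes joined with "+", which the helper takes into account.

```python
def missing_languages(lang, available):
    """Return the language codes in `lang` that are not in `available`.

    `lang` may combine several codes with "+", for example "eng+ind".
    """
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in available]


# Sample data only; in app.py you would use pytesseract.get_languages() instead.
installed = ["eng", "osd"]

print(missing_languages("eng", installed))      # []
print(missing_languages("eng+ind", installed))  # ['ind']
```

In the real project, you could call this helper before extract_text() and show a clear error message if the returned list is not empty.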

Now start writing extract_text(). Before extracting any text, read the image and convert it to grayscale:

def extract_text(image, lang):
    img = cv2.imread(image)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Next, binarize the grayscale image using Otsu's thresholding method, which chooses the threshold automatically. Pixels with intensity above the threshold are set to white (255), and the rest are set to black (0). The threshold value of 0 passed to cv2.threshold() is ignored when the THRESH_OTSU flag is set.

threshold_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
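
To see what Otsu's method does conceptually, here is a minimal NumPy sketch (NumPy is installed alongside opencv-python). It is an illustration of the idea, not OpenCV's actual implementation, and the function names otsu_threshold and binarize are hypothetical. The method tries every possible threshold and keeps the one that best separates dark pixels from bright pixels, measured by the between-class variance.

```python
import numpy as np


def otsu_threshold(gray):
    """Pick a threshold (0-255) that maximizes the between-class variance."""
    # Histogram of pixel intensities, normalized to probabilities.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    weight_dark = np.cumsum(prob)        # share of pixels <= t
    weight_bright = 1.0 - weight_dark    # share of pixels > t
    mean_cum = np.cumsum(prob * levels)  # cumulative intensity mean up to t
    mean_total = mean_cum[-1]

    # Only evaluate thresholds that leave pixels in both classes.
    valid = (weight_dark > 1e-12) & (weight_bright > 1e-12)
    variance = np.zeros(256)
    variance[valid] = (
        (mean_total * weight_dark[valid] - mean_cum[valid]) ** 2
        / (weight_dark[valid] * weight_bright[valid])
    )
    return int(np.argmax(variance))


def binarize(gray):
    """Set pixels above the Otsu threshold to 255 and the rest to 0."""
    threshold = otsu_threshold(gray)
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


# Tiny sample "image" with dark (50) and bright (200) pixels.
sample = np.array([[50, 50, 200], [200, 50, 200]], dtype=np.uint8)
print(otsu_threshold(sample))
print(binarize(sample))
```

In the project itself, cv2.threshold() with THRESH_OTSU is the better choice, since it is faster and well tested; the sketch is only meant to show how the threshold is chosen.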

Finally, extract the text from the processed image with pytesseract's image_to_string() function:

return pytesseract.image_to_string(threshold_img, lang=lang)

And the full code looks like this:

import pytesseract
import cv2
pytesseract.pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

def extract_text(image, lang):
    img = cv2.imread(image)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    threshold_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    return pytesseract.image_to_string(threshold_img, lang=lang)

extract = extract_text('test.jpg', 'eng')
print(extract)

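When everything is ready, test the project by running this command in the terminal: python app.py. If the printed text matches what is in the image, you have successfully implemented OCR. Make sure an image named test.jpg exists in the same folder as app.py (or change the path), because cv2.imread() returns None when it cannot read the file, and the grayscale conversion will then fail.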


Congratulations! You have successfully created your first OCR project. You can extend this code further with other libraries, such as NumPy.

 

4. Using requests and NumPy to read images

You can also use the requests library in your project. requests lets you fetch remote resources over HTTP, so you can load an image from a URL instead of only reading image files locally as before. How do we do it? First install and then import the required dependencies, namely requests and NumPy.

NumPy is used to convert the downloaded bytes into an array. OpenCV represents images as NumPy arrays, so working with this data structure makes it simple and efficient to pass image data around and manipulate it.

Install the required dependencies with:

pip install requests numpy

Wait until the installation process is complete, then import the dependencies into your project.

import requests
import numpy as np

Find this line:

img = cv2.imread(image)

And replace it with:

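response = requests.get(image, timeout=10)
response.raise_for_status()  # stop early if the download failed
# IMREAD_COLOR always yields a 3-channel BGR image, like cv2.imread() does by default
img = cv2.imdecode(np.frombuffer(response.content, np.uint8), cv2.IMREAD_COLOR)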

Then pass the URL of the image, instead of a local path, as the first argument to extract_text() and see the result. We need NumPy here because np.frombuffer() turns the bytes in response.content into a one-dimensional array of unsigned 8-bit integers (uint8). cv2.imdecode() then decodes that buffer into the image array format that OpenCV works with.
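
The byte-to-array step can be tried on its own, without downloading anything. This short sketch uses four hard-coded bytes (the start of a PNG file signature) as stand-in data; in the project, the bytes come from response.content.

```python
import numpy as np

# Stand-in data: the first four bytes of a PNG file signature.
raw = bytes([137, 80, 78, 71])

# View the bytes as a one-dimensional array of unsigned 8-bit integers.
buffer = np.frombuffer(raw, dtype=np.uint8)

print(buffer.dtype, buffer.shape)  # uint8 (4,)
print(buffer.tolist())             # [137, 80, 78, 71]
```

Note that this array still holds the encoded file contents, not pixels. Only after cv2.imdecode() processes it do you get an array of pixel values.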

Alternatively, you can make extract_text() accept both kinds of input by checking whether the string it receives is a URL or a path to a local image. The startswith() method can test whether the string begins with an HTTP scheme; with this condition in place, the line img = cv2.imread(image) is still used for local files.

if image.startswith(('http://', 'https://')):
    response = requests.get(image)
    img = cv2.imdecode(np.frombuffer(response.content, np.uint8), cv2.IMREAD_COLOR)
else:
    img = cv2.imread(image)
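
As a slightly stricter alternative to a prefix check, the standard library's urllib.parse module can split the string and confirm that it has an http or https scheme and a host name. This is only a sketch, and the helper name is_http_url is hypothetical.

```python
from urllib.parse import urlparse


def is_http_url(value):
    """Return True if `value` looks like an http(s) URL with a host name."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://example.com/test.jpg"))  # True
print(is_http_url("test.jpg"))                      # False
```

You could then write the condition as if is_http_url(image): and keep the two branches unchanged.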

 

Conclusion

OCR, or optical character recognition, is not only used for digitizing documents; it is also applied to other needs such as image search, vehicle license plate recognition, and many more. As technology develops, OCR opens the door to many innovative features in a platform or application, with the aim of making things easier for your users and saving them time.
