Extracting information from images is a common task in many fields, including software development and business. The process of converting a digital image of text into machine-encoded text is called OCR, which stands for Optical Character Recognition. OCR draws on research in artificial intelligence, machine learning, pattern recognition, and computer vision. The image is processed by machine learning algorithms on the assumption that it contains some textual information to be recognized.
To convert an image to text in Python, we need two libraries, described below:
Tesseract is an open-source tool for extracting text from images. To use Tesseract from Python we need pytesseract, which is one of the Python wrappers for Tesseract. pytesseract only calls the Tesseract executable, so Tesseract itself must also be installed on the system. Because we are working with images, we also need the Pillow library, which is used to open and manipulate images in Python.
The commands for installing pytesseract and Pillow are:
pip install pytesseract
pip install pillow
We need to pay close attention to the file paths, because the rest of the code depends on them. The path to the Tesseract executable depends on where it was installed; the example code in this article sets it explicitly.
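Because the install location differs between machines, it can help to look the executable up rather than hard-code it. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` and its `candidates` parameter are illustrative assumptions and are not part of pytesseract.

```python
import os
import shutil


def find_tesseract(name="tesseract", candidates=()):
    """Return a path to the Tesseract executable, or None if not found.

    `name` is looked up on the system PATH first. `candidates` is an
    optional sequence of explicit locations to try afterwards
    (hypothetical values supplied by the caller).
    """
    # shutil.which searches the PATH the same way the shell would
    found = shutil.which(name)
    if found:
        return found
    # Fall back to the explicit locations supplied by the caller
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

If the function returns a path, that value can be assigned to the tesseract_cmd variable in place of a hard-coded string; if it returns None, the path still has to be set by hand.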
Process to be Followed:
- First, import the Image module from the PIL (Pillow) library, which is used to open and view the image.
- Then import the pytesseract module, which we use to recognize and extract the text from the image.
- After importing the libraries, define the path to the Tesseract executable installed earlier (this depends on where it was installed) and the path to the image file.
- Assign the Tesseract path to the tesseract_cmd variable so that pytesseract can find the executable it uses to extract the text.
- With the paths defined, open the image and pass the image object to the image_to_string function. It takes an image object as input and returns the text found in it.
- Finally, display the text extracted from the image.
The following code reads a sample image and extracts its text:
from PIL import Image
from pytesseract import pytesseract

# Paths to the Tesseract executable and to the image (adjust to your system)
path_to_tesseract = r"C:\Program Files\Tesseract\tesseract.exe"
image_path = r"csv\sample_texting.png"

# Open the image and keep the image object
img = Image.open(image_path)

# Tell pytesseract where the Tesseract executable is located
pytesseract.tesseract_cmd = path_to_tesseract

# Pass the image object to image_to_string() to extract the text
text = pytesseract.image_to_string(img)

# Display the output, dropping the final character of the returned string
print(text[:-1])
Output:
Life is beautiful
The example above reads a single image and prints the text extracted from it.
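The final print statement in the single-image example slices off exactly one trailing character. Depending on the Tesseract version, the returned string may end with more than one whitespace character (for example a newline followed by a form feed), so stripping all trailing whitespace can be more robust. The sketch below uses a hand-written string in place of real OCR output; the sample value is an assumption for illustration only.

```python
# Hypothetical stand-in for the string returned by image_to_string
raw_text = "Life is beautiful\n\x0c"

# Slicing removes only the last character, so the newline is still there
sliced = raw_text[:-1]

# rstrip() removes all trailing whitespace, including newlines and form feeds
cleaned = raw_text.rstrip()

print(repr(sliced))   # 'Life is beautiful\n'
print(repr(cleaned))  # 'Life is beautiful'
```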
Extracting the Text from Multiple Images at Once:
To do this we import one more library, os. This module is used for interacting with the operating system and provides many built-in functions for working with the file system. We use it here to find the image files from which we want to extract text.
The code for this example is saved as main.py.
Here we iterate over the images and apply the same method to each one as we did for the single image.
from PIL import Image
from pytesseract import pytesseract
import os

# Paths to the Tesseract executable and to the image folder (adjust to your system)
path_to_tesseract = r"C:\Program Files\Tesseract\tesseract.exe"
path_to_image = r"images/"

# Tell pytesseract where the Tesseract executable is located
pytesseract.tesseract_cmd = path_to_tesseract

# Walk through the folder and any subfolders
for root, dirs, file_names in os.walk(path_to_image):
    # Iterate over each file in the current folder
    for file_name in file_names:
        # Open the image with PIL
        img = Image.open(os.path.join(root, file_name))
        # Extract the text from the image
        text = pytesseract.image_to_string(img)
        print(text)
Output:
Sample Text 1
Sample Text 2
Sample Text 3
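The loop above hands every file it finds to Image.open, which raises an error for files that are not images. One possible refinement, sketched here with the standard library only, is to filter by file extension first. The function name `iter_image_paths` and the list of extensions are assumptions made for illustration.

```python
import os

# Extensions treated as images in this sketch (an assumption; adjust as needed)
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp")


def iter_image_paths(folder, extensions=IMAGE_EXTENSIONS):
    """Yield the full path of every image file under `folder`."""
    for root, _dirs, file_names in os.walk(folder):
        for file_name in sorted(file_names):
            # Compare case-insensitively so "SCAN.PNG" is also matched
            if file_name.lower().endswith(extensions):
                yield os.path.join(root, file_name)
```

Each yielded path can then be opened with Image.open and passed to image_to_string in the same way as in the loop above.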
Tesseract performs well when the image has a high resolution, contains little noise, and is scaled appropriately.
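When an input image falls short of these conditions, some light preprocessing with Pillow before calling Tesseract may help. The sketch below converts the image to grayscale, upscales it, applies a small median filter to reduce speckle noise, and stretches the contrast. The function name `preprocess_for_ocr`, the default scale factor, and the filter size are assumptions chosen for illustration, not recommended settings; whether any of these steps improves the result depends on the image.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(img, scale=2, median_size=3):
    """Return a grayscale, upscaled, lightly denoised copy of `img`."""
    # Grayscale removes colour information that OCR does not need
    gray = ImageOps.grayscale(img)
    # Upscale small images; LANCZOS resampling keeps edges reasonably sharp
    new_size = (gray.width * scale, gray.height * scale)
    resized = gray.resize(new_size, resample=Image.LANCZOS)
    # A median filter suppresses isolated speckles
    denoised = resized.filter(ImageFilter.MedianFilter(size=median_size))
    # Stretch the histogram so dark text contrasts with a light background
    return ImageOps.autocontrast(denoised)
```

The returned image can be passed to image_to_string in place of the original image object.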
Tesseract 4.0 introduced a neural-network (LSTM) based recognition engine and is very accurate; newer releases have followed since, and Tesseract remains a good tool for scanning documents.