Converting PDF Image to CSV Using Multimodal Google Gemini Pro

usmanmalik57

In this article, you will learn to use Google Gemini Pro, a state-of-the-art multimodal generative model, to extract information from PDF files and convert it to CSV. You will use a simple text prompt to tell Google Gemini Pro which information you want to extract. This is a valuable skill for data analysis, reporting, and automation.

You will use Python to call the Google Vertex AI API and extract information from PDF pages that have been converted to JPEG images.

So, let's begin without further ado.

Importing and Installing Required Libraries

I ran my code on Google Colab, where I only needed to install the Google Cloud AI Platform library. You can install it with the following command.

pip install --upgrade google-cloud-aiplatform

Note: You must create an account with Google Cloud Vertex AI and get your API keys before running the scripts in this tutorial. When you sign up for Google Cloud Platform, you will get free credits worth $300.

The following script imports the required libraries into our application.


import base64
import glob
import csv
import os
import re
from vertexai.preview.generative_models import GenerativeModel, Part

Defining Helper Functions for Image Reading

Before using Google Gemini Pro to extract information from PDF tables, you must convert your PDF files to image formats, e.g. JPG, PNG, etc. Google Gemini Pro can only accept images as input, not PDF files. You can use any tool that can convert PDF files to JPG images, such as PDFtoJPG.
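If you would rather stay in Python for the conversion step, the sketch below shows one possible approach. It assumes the third-party pdf2image package (and the Poppler utilities it depends on) is installed; neither is used in the original tutorial, and the names page_image_path() and pdf_to_jpgs() are hypothetical helpers introduced here for illustration.

```python
import os


def page_image_path(pdf_path, page_number, output_dir):
    """Build an output path such as <output_dir>/<pdf name>_page1.jpg."""
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_dir, f"{base_name}_page{page_number}.jpg")


def pdf_to_jpgs(pdf_path, output_dir, dpi=200):
    """Render every page of a PDF to a JPG file and return the file paths.

    Assumption: pdf2image and Poppler are installed. The import is kept
    inside the function so this module loads even when they are missing.
    """
    from pdf2image import convert_from_path  # third-party, not in the tutorial

    os.makedirs(output_dir, exist_ok=True)
    saved_paths = []
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    for number, page in enumerate(pages, start=1):
        out_path = page_image_path(pdf_path, number, output_dir)
        page.save(out_path, "JPEG")
        saved_paths.append(out_path)
    return saved_paths
```

The resulting files use the .jpg extension, so the rest of the tutorial can pick them up unchanged.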

Once you have converted your PDF files to JPG images, you need to read them as bytes and wrap them in Part objects. The code in this tutorial encodes the bytes as base64 strings before passing them to Part.from_data(). You also need to specify the MIME type of the images, which is image/jpeg since we will process JPEG images.

To simplify these tasks, you can define two helper functions: get_jpg_file_paths() and read_image().

The get_jpg_file_paths() function takes a directory as an argument and returns a list of absolute paths to all the JPG files in that directory and its subdirectories.

The read_image() function takes a list of image paths as an argument and returns a list of Part objects. Part is a helper class provided by the vertexai.preview.generative_models module. Each Part object contains the base64-encoded string and the MIME type of an image.


def get_jpg_file_paths(directory):

    jpg_file_paths = glob.glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    return [os.path.abspath(path) for path in jpg_file_paths]

def read_image(img_paths):

    imgs_b64 = []
    for img in img_paths:
        with open(img, "rb") as f: # open the image file in binary mode
            img_data = f.read() # read the image data as bytes
            img_b64 = base64.b64encode(img_data) # encode the bytes as base64
            img_b64 = img_b64.decode() # convert the base64 bytes to a string
            img_b64 = Part.from_data(data=img_b64, mime_type="image/jpeg")

            imgs_b64.append(img_b64)

    return imgs_b64
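The read_image() helper above hard-codes image/jpeg. If your folder also holds PNG files, one option is to derive the MIME type from the file extension with Python's standard mimetypes module. The helper name guess_image_mime_type() and its fallback value are assumptions made for this sketch, not part of the original code.

```python
import mimetypes


def guess_image_mime_type(img_path, default="image/jpeg"):
    """Return the MIME type implied by the file extension.

    Falls back to `default` when the extension is unknown or does not
    map to an image type.
    """
    mime_type, _ = mimetypes.guess_type(img_path)
    if mime_type is None or not mime_type.startswith("image/"):
        return default
    return mime_type
```

You could then pass mime_type=guess_image_mime_type(img) to Part.from_data() instead of the fixed string. Note that get_jpg_file_paths() only searches for *.jpg, so its glob pattern would also need extending to find other image types.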

Extracting Information from PDF Using Google Gemini Pro

Now that you know how to convert your PDF files to JPG images and encode them as base64 strings, you can use Google Gemini Pro to extract information from them.

To use Google Gemini Pro, you must create a GenerativeModel object and pass it the name of the model you want to use.
In this tutorial, you will use Google's latest generative model named gemini-pro-vision, a multimodal LLM capable of processing images and text.

You will also use a specific generation config, which is a set of parameters that control the behavior of the generative model.

But before the above steps, you need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON key file for your Vertex AI service account.

The following script sets the environment variable, creates the model object, and defines the configuration settings.


os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"PATH_TO_VERTEX_AI_SERVICE_ACCOUNT JSON FILE"

model = GenerativeModel("gemini-pro-vision")
config={
    "max_output_tokens": 2048,
    "temperature": 0,
    "top_p": 1,
    "top_k": 32
}
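Setting the environment variable does not check that the file is really there, so a mistyped path typically only surfaces later, when the first API call fails. A small guard can catch this earlier. The sketch below is one way to do it; set_credentials_path() is a hypothetical helper, not part of the Vertex AI SDK or the original script.

```python
import os


def set_credentials_path(json_path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at json_path if the file exists.

    Raises FileNotFoundError for a missing file instead of letting the
    problem appear later during authentication.
    """
    if not os.path.isfile(json_path):
        raise FileNotFoundError(f"Service account key not found: {json_path}")
    absolute_path = os.path.abspath(json_path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = absolute_path
    return absolute_path
```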

Finally, to generate a response from the Google Gemini Pro model, you need to call the generate_content() method of the GenerativeModel object. You will pass three arguments to this method:

contents: The first positional argument, a list holding the model input. You can mix Part objects (for images) and plain text strings (for the prompt) in this list.

generation_config: A dictionary containing the generation parameters you set earlier.

stream: A boolean value that indicates whether you want to receive the response as a stream or as a single object.

You can use the following code to define the generate() function that generates a response from Google Gemini Pro, given an image or a list of images, and a text prompt:


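def generate(img, prompt):

    # Combine the image parts and the text prompt into a single request
    contents = img + [prompt]

    responses = model.generate_content(
        contents,
        generation_config=config,
        stream=True,
    )

    # Concatenate the streamed chunks into one string
    full_response = ""
    for response in responses:
        full_response += response.text

    return full_response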

As an example, we will convert the contents of this receipt into a CSV file. The receipt is in French and contains information about the date of purchase, number of tickets, tax information, etc. The receipt is not in tabular format, yet you will see that we can still convert its information to a CSV file.

receipt1.jpg

For demonstration purposes, I will use two copies of the same receipt to show you how you can extract information from multiple images.

The following script calls the get_jpg_file_paths() and read_image() functions that we defined earlier to read all the images in my input directory and convert them into Part objects that the Google Gemini Pro model expects.


directory_path = r'D:\Receipts'
image_paths = get_jpg_file_paths(directory_path)
imgs_b64 = read_image(image_paths)

Next, we define our text prompt to extract information from the receipt image. Your prompt engineering skills will shine here. A good prompt can make the LLM's task much easier. We will use the following prompt to extract information.


prompt = """I have the above receipts. Return a response that contains information from the receipts in a comma-separated file format where row fields are table columns,
whereas row values are column values. The output should contain (header + number of receipt rows).
The first row should contain all column headers, and the remaining rows should contain all column values from two receipts, one in each row.
Must use all field values in the receipt. """

Finally, we will pass the input images and the text prompt to the generate() function, which returns the model response.

full_response = generate(imgs_b64, prompt)

print(full_response)

Output:

**Numéro de session,Date,Heure,Pass Easy n°,Fin de validité,Type,Quantité,Prix Unitaire,TVA,Montant total HT,Montant total TTC**
1,16/01/2024,09:32:32,3307837143,30/09/2023,Carnet de Ticket t+,10,17,35 €,10,00 %,15,77 €,17,35 €
1,16/01/2024,09:32:32,3307837143,30/09/2023,Carnet de Ticket t+,10,17,35 €,10,00 %,15,77 €,17,35 €

The above output shows that Google Gemini Pro has extracted the information we need as a CSV-formatted string.
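Note that the header row in this output came back wrapped in Markdown bold markers (**). Model responses can also arrive inside Markdown code fences. A small clean-up step before parsing is one way to deal with this. The sketch below is an assumption about what such a step could look like; clean_response_lines() is a hypothetical helper, not part of the original script.

```python
FENCE = "`" * 3  # the Markdown code-fence marker


def clean_response_lines(response_text):
    """Split a model response into lines, dropping Markdown decoration.

    Skips blank lines and code-fence lines, and removes leading or
    trailing asterisks (bold markers) from the remaining lines.
    """
    cleaned = []
    for line in response_text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith(FENCE):
            continue
        cleaned.append(line.strip("*").strip())
    return cleaned
```

Be aware that this also strips asterisks that legitimately start or end a data row, so it is only suitable when your values do not use them.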

The last step is to convert this string into a CSV file.

Converting Google Gemini Pro Response to a CSV File

To convert the response to a CSV file, we first need to split the response into lines using the string object's strip() and split() methods. This will create a list of strings, where each string is a line in the response.

Next, we will define the process_line() function that handles the unique patterns in the response, such as the currency symbols and the decimal separators.


lines = full_response.strip().split('\n')


def process_line(line):

    special_patterns = re.compile(r'\d+,\d+\s[€%]')

    temp_replacement = "TEMP_CURRENCY"

    currency_matches = special_patterns.findall(line)

    for match in currency_matches:
        line = line.replace(match, temp_replacement, 1)

    parts = line.split(',')

    for i, part in enumerate(parts):
        if temp_replacement in part:
            parts[i] = currency_matches.pop(0)

    return parts
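To see what process_line() does, here is a worked example on the data row from the output above. The function body is repeated so the snippet runs on its own, and the sample string is copied from the model output shown earlier.

```python
import re


def process_line(line):
    # Same logic as the function above, repeated so this snippet is standalone.
    special_patterns = re.compile(r'\d+,\d+\s[€%]')
    temp_replacement = "TEMP_CURRENCY"
    currency_matches = special_patterns.findall(line)
    for match in currency_matches:
        line = line.replace(match, temp_replacement, 1)
    parts = line.split(',')
    for i, part in enumerate(parts):
        if temp_replacement in part:
            parts[i] = currency_matches.pop(0)
    return parts


sample = ("1,16/01/2024,09:32:32,3307837143,30/09/2023,"
          "Carnet de Ticket t+,10,17,35 €,10,00 %,15,77 €,17,35 €")

naive_fields = sample.split(',')
fields = process_line(sample)

print(len(naive_fields))  # 15: decimal commas are wrongly treated as separators
print(len(fields))        # 11, matching the 11 column headers
print(fields[6:])         # ['10', '17,35 €', '10,00 %', '15,77 €', '17,35 €']
```

The pattern only protects numbers that are followed by whitespace and a € or % sign. Any other value containing a comma would still be split, so treat this as a fix tailored to this receipt rather than a general CSV parser.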

The rest of the process is straightforward.

We will open a CSV file for writing using the open function with the mode argument set to w, the newline argument set to '', and the encoding argument set to utf-8. This will create a file object that you can use to write the CSV data.

Next, we will create a csv.writer object that you can use to write the rows to the CSV file.

We will loop through all the items (CSV rows) in the lines list and write them to our CSV file.


csv_file_path = r'D:\Receipts\receipts.csv'

# Open the CSV file for writing
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)

    # Process each line in the data list
    for line in lines:
        processed_line = process_line(line)
        writer.writerow(processed_line)

Once you execute the above script, you will see the following CSV file in your destination path, containing information from your input receipt image.

output_csv.png
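As a quick check, you can read the file back with csv.reader. Because values such as 17,35 € contain commas, csv.writer wraps them in double quotes on disk, and csv.reader restores them as single fields. The sketch below uses a temporary file and a hand-written row rather than the real receipts.csv, so the file name and sample data are illustrative only.

```python
import csv
import os
import tempfile

rows = [
    ["Quantité", "Prix Unitaire", "TVA"],
    ["10", "17,35 €", "10,00 %"],
]

# Write to a temporary directory so the example does not touch real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_receipts.csv")

    with open(path, mode="w", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file).writerows(rows)

    # Raw text as stored on disk
    with open(path, newline="", encoding="utf-8") as csv_file:
        raw_text = csv_file.read()

    # Parsed rows as returned by csv.reader
    with open(path, newline="", encoding="utf-8") as csv_file:
        read_back = list(csv.reader(csv_file))

print(raw_text)           # fields that contain commas are quoted on disk
print(read_back == rows)  # True: the round trip preserves every field
```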

Conclusion

Extracting information from PDFs and images is a crucial task for data analysts. In this tutorial, you saw how to use Google Gemini Pro, a state-of-the-art multimodal large language model, to extract information from a receipt image. You can use the same technique to extract other types of information simply by changing the text prompt.

Feel free to leave your feedback and suggestions!
