PDF Image Table Extractor Web App with Google Gemini Pro and Streamlit

By usmanmalik57, 2 months ago

In my previous article, I explained how to convert a PDF image to CSV using the multimodal Google Gemini Pro model. To do so, I wrote a Python script that passes a text command to Google Gemini Pro to extract tables from PDF images and store them in a CSV file.

In this article, I will build upon that script and develop a web application that allows users to upload images and submit text queries via a web browser to extract tables from PDF images. We will use the Python Streamlit library to develop the web application.

So, let's begin without further ado.

Installing Required Libraries

You must install the google-cloud-aiplatform library to access the Google Gemini Pro model. For the Streamlit data application, you will need to install the streamlit library. The following pip commands install these libraries:

pip install google-cloud-aiplatform
pip install streamlit

Creating Google Gemini Pro Connector

I will divide the code into two Python files: geminiconnector.py and main.py. The geminiconnector.py file will contain the logic to connect to the Google Gemini Pro model and make API calls.

Code for geminiconnector.py

import os
from vertexai.preview.generative_models import GenerativeModel, Part
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"PATH_TO_JSON_API_FILE"

model = GenerativeModel("gemini-pro-vision")
config={
    "max_output_tokens": 2048,
    "temperature": 0,
    "top_p": 1,
    "top_k": 32
}


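def generate(img, prompt):

    # Combine the image parts and the text prompt into one list of contents.
    # (Named "contents" to avoid shadowing the built-in input().)
    contents = img + [prompt]

    # stream=True returns the answer in chunks as they are generated.
    responses = model.generate_content(
        contents,
        generation_config=config,
        stream=True,
    )

    # Concatenate the streamed chunks into a single string.
    full_response = ""
    for response in responses:
        full_response += response.text

    return full_response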

I have already explained the details of the above code in my previous article, so I will not go through it again here.
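
To try the streaming loop in generate() without Google Cloud credentials, a stand-in model can be useful. The following is a minimal sketch, not part of the original script: FakeModel and its canned chunks are hypothetical, and generate_with() is a variant of generate() that takes the model as a parameter so that the stand-in can be swapped in.

```python
from types import SimpleNamespace


class FakeModel:
    """Hypothetical stand-in that mimics a streamed response."""

    def generate_content(self, contents, generation_config=None, stream=False):
        # Yield canned chunks; each one exposes a .text attribute like the
        # response objects consumed in generate().
        for piece in ("Item,Price\n", "Milk,1,29 €\n"):
            yield SimpleNamespace(text=piece)


def generate_with(model, img, prompt):
    # Same flow as generate(): images first, then the text prompt.
    contents = img + [prompt]
    responses = model.generate_content(
        contents,
        generation_config={"temperature": 0},
        stream=True,
    )
    # Join the streamed chunks into one string.
    return "".join(response.text for response in responses)


full_response = generate_with(FakeModel(), [], "Extract the table as CSV.")
print(full_response)
```

Passing the model as an argument is only a convenience for experimenting; the article's generate() uses the module-level model object instead.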

Creating Web GUI for the PDF Image Table Extractor

We will develop a GUI that allows users to select images from their local drive. The user must then click the Upload Images button to save the selected images to a temporary directory. Next, the user enters a text query in a text field and presses the "Submit Query" button. The response is stored in a Pandas dataframe and displayed on the page.

The code for this GUI is implemented in the main.py file. I will break down the code into multiple code snippets for improved readability.

Import Required Libraries

The following script imports required libraries for the main.py file.

from geminiconnector import generate
from vertexai.preview.generative_models import Part
import streamlit as st
import pandas as pd
import os
import base64
import glob
import re
import csv

Creating an Image Uploader

The first step is to create an image uploader. You can use the st.file_uploader() method, as shown in the following script.


st.write("# Image Table Extractor")
uploaded_files = st.file_uploader("Choose images", accept_multiple_files=True, type=['jpg', 'png'])

Next, we will define the save_uploaded_files() function, which accepts the target directory and the uploaded image files as parameters and writes each file to that directory. The following script also defines the path for storing images.

def save_uploaded_files(directory, uploaded_files):

    if not os.path.exists(directory):
        os.makedirs(directory)  

    for uploaded_file in uploaded_files:
        file_path = os.path.join(directory, uploaded_file.name)

        with open(file_path, "wb") as f:
            f.write(uploaded_file.getbuffer())

local_dir = "tempdir"
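
Streamlit is not needed to try save_uploaded_files() on its own. The sketch below is a slight variation that also returns the saved paths. FakeUpload is a hypothetical stand-in that provides only the two members the function relies on, a name attribute and getbuffer(), and the file names and bytes are made up.

```python
import io
import os
import tempfile


class FakeUpload(io.BytesIO):
    """Hypothetical stand-in for an uploaded file: bytes plus a name."""

    def __init__(self, name, data):
        super().__init__(data)
        self.name = name


def save_uploaded_files(directory, uploaded_files):
    # exist_ok=True creates the directory only if it is missing.
    os.makedirs(directory, exist_ok=True)

    saved_paths = []
    for uploaded_file in uploaded_files:
        # basename() drops any directory part from the submitted name.
        file_name = os.path.basename(uploaded_file.name)
        file_path = os.path.join(directory, file_name)
        with open(file_path, "wb") as f:
            f.write(uploaded_file.getbuffer())
        saved_paths.append(file_path)
    return saved_paths


with tempfile.TemporaryDirectory() as tmp:
    uploads = [FakeUpload("receipt1.jpg", b"abc"), FakeUpload("receipt2.jpg", b"de")]
    saved = save_uploaded_files(os.path.join(tmp, "tempdir"), uploads)
    saved_names = [os.path.basename(p) for p in saved]
    saved_sizes = [os.path.getsize(p) for p in saved]

print(saved_names, saved_sizes)
```

Returning the paths is optional; it simply makes it easier to report which files were written.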

Next, we will define the Upload Images button using the st.button() method, which, when clicked, uploads images to the local directory.

if st.button('Upload Images'):
    if uploaded_files:

        save_uploaded_files(local_dir, uploaded_files)
        st.success('Images have been uploaded.')

    else:
        st.error('Please upload at least one image.')

Defining Image Preprocessing Functions

As in the previous article, we will define two image processing functions: get_jpg_file_paths() and read_image(). The former recursively collects the paths of all .jpg files in a directory, while the latter converts the images to a format that Google Gemini Pro accepts. Note that PNG files accepted by the uploader are not picked up by get_jpg_file_paths().


def get_jpg_file_paths(directory):

    jpg_file_paths = glob.glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    return [os.path.abspath(path) for path in jpg_file_paths]


def read_image(img_paths):

    imgs_b64 = []
    for img in img_paths:
        with open(img, "rb") as f: # open the image file in binary mode
            img_data = f.read() # read the image data as bytes
            img_b64 = base64.b64encode(img_data) # encode the bytes as base64
            img_b64 = img_b64.decode() # convert the base64 bytes to a string
            img_b64 = Part.from_data(data=img_b64, mime_type="image/jpeg")

            imgs_b64.append(img_b64)

    return imgs_b64
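
The uploader accepts both JPG and PNG files, while get_jpg_file_paths() collects only .jpg files and read_image() labels every image as image/jpeg. One possible way to cover both formats, sketched here under the assumption that the MIME type can be inferred from the file extension, is to return each path together with its guessed type. The function name get_image_paths_with_mime() and the sample file names are hypothetical.

```python
import glob
import mimetypes
import os
import tempfile


def get_image_paths_with_mime(directory, extensions=(".jpg", ".jpeg", ".png")):
    """Return (absolute_path, mime_type) pairs for supported images."""
    results = []
    pattern = os.path.join(directory, "**", "*")
    for path in sorted(glob.glob(pattern, recursive=True)):
        extension = os.path.splitext(path)[1].lower()
        if extension in extensions:
            # guess_type() infers the MIME type from the file extension.
            mime_type, _ = mimetypes.guess_type(path)
            results.append((os.path.abspath(path), mime_type))
    return results


with tempfile.TemporaryDirectory() as tmp:
    # Create placeholder files; only the names matter for this demo.
    for name in ("a.jpg", "b.png", "notes.txt"):
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"placeholder")
    found = [(os.path.basename(p), m) for p, m in get_image_paths_with_mime(tmp)]

print(found)
```

The guessed type could then be passed as the mime_type argument of Part.from_data() in read_image() instead of the hard-coded "image/jpeg".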

Creating Query Submitter and Result Generator

To capture user queries, we will define a text area using the st.text_area() method as shown below:


st.write("## Enter your query.")
user_input = st.text_area("query",
                          height=100,
                          label_visibility = "hidden")

Before generating a response from the Google Gemini Pro model, we will define the process_line() function that handles the unique patterns in the response, such as the currency symbols and the decimal separators.


def process_line(line):

    # Amounts such as "1,29 €" or "7,00 %" contain a comma that is a
    # decimal separator, not a field separator.
    special_patterns = re.compile(r'\d+,\d+\s[€%]')

    temp_replacement = "TEMP_CURRENCY"

    currency_matches = special_patterns.findall(line)

    # Hide each amount behind a placeholder before splitting on commas.
    for match in currency_matches:
        line = line.replace(match, temp_replacement, 1)

    parts = line.split(',')

    # Put the original amounts back, in the order they were found.
    for i, part in enumerate(parts):
        if temp_replacement in part:
            parts[i] = currency_matches.pop(0)

    return parts
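
To see what process_line() does with decimal commas, here is a small worked example. The function is repeated so that the snippet runs on its own, and the sample lines are made up for illustration.

```python
import re


def process_line(line):
    # Amounts such as "1,29 €" or "7,00 %" contain a decimal comma.
    special_patterns = re.compile(r'\d+,\d+\s[€%]')
    temp_replacement = "TEMP_CURRENCY"

    matches = special_patterns.findall(line)
    for match in matches:
        line = line.replace(match, temp_replacement, 1)

    parts = line.split(',')
    for i, part in enumerate(parts):
        if temp_replacement in part:
            parts[i] = matches.pop(0)
    return parts


# Amounts followed by a space and a € or % sign stay in one field.
protected = process_line("Milk,1,29 €,7,00 %")

# An amount without such a suffix is split at its decimal comma.
unprotected = process_line("Total,12,50")

print(protected)
print(unprotected)
```

The first call keeps both amounts intact, giving three fields. The second call splits 12,50 into two fields, because the pattern only protects amounts that are followed by whitespace and a € or % sign. If your receipts use other currency symbols, the character class in the pattern would need to be extended.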

Finally, we will create a Submit Query button, which, when clicked, passes the user input prompt and the input images to the generate() function from the geminiconnector.py file.

The response is split into multiple lines. Each line is formatted using the process_line() function and appended to the data list. The pd.DataFrame constructor converts the data list to a Pandas dataframe, which is displayed on the web page using the st.write() method.
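
The listing for this step is not reproduced here, so the following is only a minimal sketch of how the described flow might be wired up, not the original code. The names rows_from_response() and submit_query_section() are hypothetical, and the helper functions are passed in as arguments so that the parsing part can be tried without Streamlit or Google Cloud credentials.

```python
def rows_from_response(full_response, parse_line):
    """Turn the model's CSV-style reply into a list of rows."""
    lines = full_response.strip().split("\n")
    # Skip blank lines so they do not become empty rows in the dataframe.
    return [parse_line(line) for line in lines if line.strip()]


def submit_query_section(prompt, load_image_parts, generate, parse_line):
    """Hypothetical Streamlit wiring for the Submit Query button."""
    # Imported lazily so rows_from_response() can be tried without Streamlit.
    import pandas as pd
    import streamlit as st

    if st.button("Submit Query"):
        # Images are read only once the button has been clicked.
        full_response = generate(load_image_parts(), prompt)
        data = rows_from_response(full_response, parse_line)
        # Convert the list of rows to a dataframe and display it.
        st.write(pd.DataFrame(data))


# Quick check of the parsing step with a plain comma splitter.
rows = rows_from_response("Item,Price\n\nMilk,2\n", lambda line: line.split(","))
print(rows)
```

In main.py, the section could then be rendered with a call such as submit_query_section(user_input, lambda: read_image(get_jpg_file_paths(local_dir)), generate, process_line). This sketch leaves the column names at their pandas defaults, so the header row returned by the model appears as the first data row.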

And this is it! You have successfully developed your PDF image table extractor. I used the following prompt to extract tables from the PDF image. I intentionally made spelling mistakes to see if the model returned the desired results.


"""I have the above receipts. Return a response that contains information from the receipts in a comma-separated file format where row fields are table columns,
whereas row values are column values. The output should contain (header + number of recept rows).
The first row should contain all column headers, and the remaining rows should contain all column values from two recepts one in each row.  
Must use all field values in the receipt. """

You can modify the above prompt to get different information from your PDF file.

Conclusion

In this article, you saw how to create a PDF image table extractor using the multimodal Google Gemini Pro model and the Python Streamlit library. Calling Google Gemini Pro takes only a few lines of code, as the geminiconnector.py file shows. I encourage you to develop your own Streamlit web applications using Google Gemini Pro or other multimodal large language models, which can handle complex tasks that require both image and text inputs.
