Extract Tabular Data from PDF Images using Hugging Face Table Transformer


In a previous article, I explained how to extract tabular data from PDF image documents using the multimodal Google Gemini Pro model. However, Google Gemini Pro has a couple of disadvantages: it is not free, and it requires complex prompt engineering to retrieve table, column, and row pixel coordinates.

To solve these problems, this article shows you how to extract tables from PDF image documents using Microsoft's Table Transformer from the Hugging Face library. You will see how to detect tables, detect the rows and columns within a table, extract cell values using OCR, and save the table as a CSV file. So, let's begin without further ado.

Installing and Importing Required Libraries

The first step is to install various libraries you will need to run scripts in this article.

!pip install transformers
!sudo apt install tesseract-ocr
!pip install pytesseract
!pip install easyocr
!sudo apt-get install -y poppler-utils
!pip install pdf2image
!wget "https://fonts.google.com/download?family=Roboto" -O roboto.zip
!unzip roboto.zip -d ./roboto

The following script imports the required libraries into your application.


from transformers import AutoImageProcessor, TableTransformerForObjectDetection
import torch
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import csv
import numpy as np
import pandas as pd
from pdf2image import convert_from_path
from tqdm.auto import tqdm
import pytesseract
import easyocr

Table Detection with Table Transformer

The Table Transformer has two sub-models: table-transformer-detection and table-structure-recognition-v1.1-all. As a first step, we will detect tables within a PDF document using the table-transformer-detection model.

Importing and Converting PDF to Image

The following script defines the pdf_to_img() function, which converts the first page of a PDF document into a PIL image. This step is mandatory since the Table Transformer expects documents in image format.

# convert the first page of a PDF to an image
def pdf_to_img(pdf_path):

  image = convert_from_path(pdf_path)[0].convert("RGB")
  return image

pdf_path = '/content/sample_input_ieee-10.pdf'
image = pdf_to_img(pdf_path)
image

Output:

image1.png

The above output shows the input image. We will detect tables inside this image.
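Note that pdf_to_img() only converts the first page of the PDF. If your document spans multiple pages, you can convert every page and run table detection on each one. Below is a minimal sketch of that loop; the pdf_to_images() helper is my own addition, not part of the original pipeline.

# convert every page of the PDF to a PIL image (sketch for multi-page documents)
def pdf_to_images(pdf_path):
    return [page.convert("RGB") for page in convert_from_path(pdf_path)]

pages = pdf_to_images(pdf_path)
print(f"Converted {len(pages)} page(s)")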

Detecting Tables

The following script imports the preprocessor and model objects for the table-transformer-detection model. The preprocessor converts the input image to a format the table-transformer-detection model can process.


model_name = "microsoft/table-transformer-detection"
# define image preprocessor for table transformer
image_processor = AutoImageProcessor.from_pretrained(model_name)

# import table transformer model for table detection
model = TableTransformerForObjectDetection.from_pretrained(model_name,
                                                           revision="no_timm")

Next, we define the detect_table() function that accepts the input image as a parameter. The function preprocesses the image and then passes it to the table-transformer-detection model.

The preprocessor's post_process_object_detection() method processes the output from the table-transformer-detection model. The final processed output consists of the label, bounding box coordinates, and prediction confidence score for each detected table. The detect_table() function returns this final output.


def detect_table(image_doc):

  # preprocess the image document
  inputs = image_processor(images = image_doc, return_tensors="pt")

  # detect tables
  outputs = model(**inputs)

  # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
  target_sizes = torch.tensor([image_doc.size[::-1]])
  results = image_processor.post_process_object_detection(outputs,
                                                          threshold=0.9,
                                                          target_sizes=target_sizes)[0]

  return results


results = detect_table(image)
results

Output:


{'scores': tensor([0.9993, 0.9996], grad_fn=<IndexBackward0>),
 'labels': tensor([0, 0]),
 'boxes': tensor([[ 111.4175,  232.4397, 1481.5710,  606.8784],
         [ 110.4231,  738.1602, 1471.6283,  916.2267]],
        grad_fn=<IndexBackward0>)}

The above output shows the confidence score, labels (0 for table), and bounding box coordinates for the two detected tables.

Next, we define the get_table_bbox() function, which prints the labels, confidence scores, and bounding box coordinates for the detected tables. The function also returns the detected bounding box coordinates for all the tables.


def get_table_bbox(results):

  tables_coordinates = []

  # iterate through all the detected table data
  for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]

    # store bbox coordinates in Pascal VOC format for later use
    table_dict = {"xmin" : box[0],
                  "ymin" : box[1],
                  "xmax" : box[2],
                  "ymax" : box[3]}

    tables_coordinates.append(table_dict)

    # print prediction label, prediction confidence score, and bbox values
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
        )

  return tables_coordinates

table_bbox = get_table_bbox(results)

Output:


Detected table with confidence 0.999 at location [69.43, 344.96, 660.61, 488.47]
Detected table with confidence 0.989 at location [68.7, 549.5, 657.53, 838.82]

Display Tables

Finally, the script below plots the original image and draws red rectangles around the detected tables using their bounding box coordinates.


def highlight_tables(image, table_bbox, padding):
    # Create a drawing context for doc image
    doc_image = image.copy()
    draw = ImageDraw.Draw(doc_image)

    # Iterate over each table in the list
    for table in table_bbox:
        # Define the coordinates for the rectangle with padding for each table
        rectangle_coords = (table["xmin"] - padding,
                            table["ymin"] - padding,
                            table["xmax"] + padding,
                            table["ymax"] + padding)

        # Draw a red rectangle around the detected table
        draw.rectangle(rectangle_coords, outline="red", width=2)

    return doc_image

padding = 10
table_detected_image = highlight_tables(image, table_bbox, padding)
table_detected_image

Output:

image2.png

You can see the detected tables in the above image.

Subsequently, we define the get_cropped_image() function that accepts the original image, the bounding box coordinates of a table, and the padding value as parameters. The get_cropped_image() function returns the cropped table, which you can use to extract rows and columns.

def get_cropped_image(image, table, padding):
  # Create a new image object with the cropped area
  cropped_image = image.copy().crop((table["xmin"] - padding,
                             table["ymin"] - padding,
                             table["xmax"] + padding,
                             table["ymax"] + padding
                             ))

  return cropped_image

cropped_image = get_cropped_image(image, table_bbox[1], padding)
cropped_image

Output:

image3.png

Extract Table Data

Now that we have cropped a table, we can extract rows and columns.

Extract Table Features

You can extract table rows and columns using the table-structure-recognition-v1.1-all model. The following script imports this model.


# import model for detecting table features e.g. rows, columns, etc
structure_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-structure-recognition-v1.1-all")

We define the get_table_features() function that accepts the cropped table image as a parameter and returns the labels, confidence scores, and the bounding box coordinates for the detected rows and columns. The function also prints these values.


def get_table_features(cropped_image):

  # preprocess image input for table transformer
  inputs = image_processor(images = cropped_image, return_tensors="pt")

  # make prediction using table transformer
  outputs = structure_model(**inputs)

  # post-process output to Pascal VOC bbox format
  target_sizes = torch.tensor([cropped_image.size[::-1]])
  results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]

  # define a list to store detected features
  features = []

  # iterate through all the detected features and store feature label, confidence score, and bbox values to cells list
  for i, (score, label, box) in enumerate(zip(results["scores"], results["labels"], results["boxes"])):
      box = [round(i, 2) for i in box.tolist()]
      score = score.item()
      label = structure_model.config.id2label[label.item()]

      cell_dict = {"label":label,
                  "score":score,
                  "bbox":box
                  }


      # print table features
      features.append(cell_dict)
      print(
          f"Detected {label} with confidence "
          f"{round(score, 3)} at location {box}"
      )

  return features


features = get_table_features(cropped_image)

Output:

image4.png

Display Detected Features

Next, we define the display_detected_features() function that draws rectangles around detected rows and columns.


def display_detected_features(cropped_image, features):

  cropped_table_visualized = cropped_image.copy()
  draw = ImageDraw.Draw(cropped_table_visualized)

  # increase font size for text labels
  font = ImageFont.truetype("/content/roboto/Roboto-Bold.ttf", 15)

  # iterate through all features and display bounding box with text labels
  for feature in features:
      draw.rectangle(feature["bbox"], outline="red")

      text_position = (feature["bbox"][0], feature["bbox"][1] - 3)

      draw.text(text_position, feature["label"], fill="blue", font = font)

  # return cropped image with bounding box
  return cropped_table_visualized

display_detected_features(cropped_image, features)

Output:

image5.png
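Before filtering features by label in the next step, it helps to know which label names the structure model can produce. A quick way to check, shown below, is to print the model's id2label mapping; depending on the checkpoint, this typically includes rows, columns, column headers, and spanning cells in addition to the table itself.

# list the feature labels the structure recognition model can detect
print(structure_model.config.id2label)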

Extract Cell Text Using OCR and Convert to CSV

In the final step, we will detect cell text and convert the detected table to CSV format.

Extract Cell Coordinates

We define the get_cell_coordinates_by_row() function that iterates through the detected rows and, for each row, computes the bounding box of the cell at every column. The function returns a list of rows where each row contains the cell coordinates for all the columns.

def get_cell_coordinates_by_row(table_data):
    # Extract rows and columns
    rows = [entry for entry in table_data if entry['label'] == 'table row']
    columns = [entry for entry in table_data if entry['label'] == 'table column']

    # Sort rows and columns by their Y and X coordinates, respectively
    rows.sort(key=lambda x: x['bbox'][1])
    columns.sort(key=lambda x: x['bbox'][0])

    # Function to find cell coordinates
    def find_cell_coordinates(row, column):
        cell_bbox = [column['bbox'][0], row['bbox'][1], column['bbox'][2], row['bbox'][3]]
        return cell_bbox

    # Generate cell coordinates and count cells in each row
    cell_coordinates = []

    for row in rows:
        row_cells = []
        for column in columns:
            cell_bbox = find_cell_coordinates(row, column)
            row_cells.append({'cell': cell_bbox})

        # Append row information to cell_coordinates
        cell_coordinates.append({'cells': row_cells, 'cell_count': len(row_cells)})


    return cell_coordinates

cell_coordinates = get_cell_coordinates_by_row(features)
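Before running OCR, a quick sanity check can confirm the detected structure; the short snippet below (my own addition) simply prints how many rows were found and how many cells each row contains.

# sanity check: number of detected rows and cells per row
print(f"Detected {len(cell_coordinates)} table rows")
for row_index, row in enumerate(cell_coordinates):
    print(f"Row {row_index}: {row['cell_count']} cells")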

Extract Text from Cell Coordinates Using OCR

Finally, we define the apply_ocr() function that iterates through all the rows and applies PyTesseract OCR to extract the cell values for each column in a row (the --psm 6 option tells Tesseract to treat each cell crop as a single uniform block of text). The function returns a dictionary where each value is a list of cell values for the corresponding row of the input table, as you can see in the output of the following script.


def apply_ocr(cell_coordinates, cropped_image):
    # let's OCR row by row
    data = dict()
    max_num_columns = 0
    for idx, row in enumerate(tqdm(cell_coordinates)):
        row_text = []
        for cell in row["cells"]:
            # crop cell out of image
            cell_image = np.array(cropped_image.crop(cell["cell"]))

            # apply OCR using PyTesseract
            text = pytesseract.image_to_string(cell_image, lang='eng', config='--psm 6').strip()
            if text:
                row_text.append(text)


        if len(row_text) > max_num_columns:
            max_num_columns = len(row_text)

        data[idx] = row_text

    print("Max number of columns:", max_num_columns)

    # pad rows which don't have max_num_columns elements
    for row, row_data in data.copy().items():
        if len(row_data) != max_num_columns:
            row_data = row_data + ["" for _ in range(max_num_columns - len(row_data))]
        data[row] = row_data
        print(row_data)

    return data

data = apply_ocr(cell_coordinates, cropped_image)

Output:

image6.png
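We also installed and imported easyocr at the start, although the pipeline above uses PyTesseract. If Tesseract struggles with your document, the following sketch shows one way to read a single cell with EasyOCR instead; the read_cell_with_easyocr() helper is an assumption of mine and would need to replace the pytesseract call inside apply_ocr().

# alternative cell reader based on EasyOCR (sketch, not used in the pipeline above)
reader = easyocr.Reader(['en'])  # initialize once, outside any loop

def read_cell_with_easyocr(cell_image):
    # detail=0 returns plain strings instead of (bbox, text, confidence) tuples
    results = reader.readtext(np.array(cell_image), detail=0)
    return " ".join(results).strip()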

As a last step, we iterate through the row data dictionary and write the row values line by line to a CSV file using a csv.writer() object.


def write_csv(data):

  with open('output.csv', 'w', newline='') as result_file:
      wr = csv.writer(result_file, dialect='excel')
      for row, row_text in data.items():

        wr.writerow(row_text)

write_csv(data)

df = pd.read_csv("output.csv")
df

Output:

image7.png

The above output shows the Pandas dataframe containing the data from the generated CSV file.
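Note that the walkthrough above extracted only the second detected table (table_bbox[1]). If you want every table on the page, a small driver like the sketch below ties the earlier functions together and writes each table to its own CSV file; the extract_all_tables() helper and the table_0.csv, table_1.csv naming scheme are my own assumptions.

# extract every detected table on the page and save each one to its own CSV (sketch)
def extract_all_tables(image, table_bbox, padding=10):
    for table_index, table in enumerate(table_bbox):
        cropped = get_cropped_image(image, table, padding)
        table_features = get_table_features(cropped)
        coordinates = get_cell_coordinates_by_row(table_features)
        table_data = apply_ocr(coordinates, cropped)

        # hypothetical output file names: table_0.csv, table_1.csv, ...
        with open(f"table_{table_index}.csv", "w", newline="") as result_file:
            writer = csv.writer(result_file, dialect="excel")
            for _, row_text in table_data.items():
                writer.writerow(row_text)

extract_all_tables(image, table_bbox, padding)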

I hope you liked this tutorial. Feel free to leave your feedback or comments.
