OpenAI o3 vs Anthropic Claude 4 for Text Classification & Summarization


OpenAI and Anthropic are two AI giants delivering state-of-the-art large language models for a variety of tasks. In a previous article, I compared the OpenAI GPT-4o and Anthropic Claude 3.5 Sonnet models on text classification tasks. That article was published almost a year ago, and since then both OpenAI and Anthropic have released newer flagship models: o3 and Claude 4 Opus, respectively.

In this article, I compare the performance of OpenAI o3 and Claude 4 Opus on zero-shot text classification and summarization tasks.

So, let's begin without further ado.

Installing and Importing Required Libraries

The following script installs the OpenAI and Anthropic Python libraries, along with the other modules required to run the code in this article.

!pip install anthropic
!pip install openai
!pip install rouge-score
!pip install --upgrade openpyxl
!pip install pandas openpyxl

The script below imports the required libraries.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import combinations
from collections import Counter
from sklearn.metrics import hamming_loss, accuracy_score
from rouge_score import rouge_scorer
import anthropic
from openai import OpenAI

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
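
If you are not working in Google Colab, you can read the keys from environment variables instead; a minimal sketch, assuming both variables are exported in your shell:

import os

# Fallback for non-Colab environments: read the keys from environment
# variables (assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are exported).
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")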

Text Classification Comparison

Let's first compare o3 and Claude 4 Opus on text classification. We will predict the sentiment of tweets from the Twitter US Airline Sentiment dataset on Kaggle.

The following script imports the dataset into a Pandas dataframe.

## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"/content/Tweets.csv")
print(dataset.shape)
dataset.head()

Output:

[Output: the first five rows of the Twitter US Airline Sentiment dataset]

The text column contains a tweet's text, while the airline_sentiment column contains the corresponding sentiment value, which can be positive, negative, or neutral.

We will analyze the sentiments of 100 tweets, split almost evenly across the three classes: 33 positive, 33 negative, and 34 neutral.


# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

Output:

airline_sentiment
neutral     34
positive    33
negative    33
Name: count, dtype: int64

Next, we define the make_prediction() function, which accepts an OpenAI or Anthropic client, the model ID, the content (the input to the model), and the maximum number of output tokens.

Depending on the client and the model ID, the make_prediction() function calls either the o3 or the Claude 4 Opus model to generate a response.

def make_prediction(client, model, content, max_tokens):

    if model == "o3":
        # o3 is a reasoning model: it rejects custom temperature values and
        # expects max_completion_tokens instead of max_tokens. Its hidden
        # reasoning tokens also count toward this budget.
        response = client.chat.completions.create(
            model=model,
            max_completion_tokens=max_tokens,
            messages=[
                {"role": "user", "content": content}
            ]
        )
        response_value = response.choices[0].message.content

    elif model == "claude-opus-4-0":
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            temperature=0.0,
            messages=[
                {"role": "user", "content": content}
            ]
        )
        response_value = response.content[0].text

    else:
        raise ValueError(f"Unsupported model: {model}")

    return response_value
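
Before running the full benchmark, you can sanity-check make_prediction() with a single prompt; a minimal sketch, assuming your OPENAI_API_KEY is valid:

# Quick sanity check with one prompt (test_client is just a throwaway name).
test_client = OpenAI(api_key=OPENAI_API_KEY)
print(make_prediction(test_client, "o3",
                      "Classify the sentiment of: 'Great flight, friendly crew!'",
                      max_tokens=1000))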

Next, we define the classify_tweets() function, which accepts the client and model name, along with the dataset and the maximum number of output tokens. The function iterates through all the tweets in the dataset, embeds each one in a prompt, and passes it to the make_prediction() function for classification.

The predictions for all the tweets are stored in the all_sentiments list, which is returned to the caller.



def classify_tweets(client, model, dataset, max_tokens):

    all_sentiments = []

    tweets_list = dataset["text"].tolist()

    for tweet in tweets_list:

        content = """What is the sentiment expressed in the following tweet about an airline?
        Select a sentiment value from positive, negative, or neutral.
        Return only the sentiment value in lowercase, e.g. positive, negative, or neutral.
        Here is the tweet: {}""".format(tweet)

        sentiment_value = make_prediction(client, model, content, max_tokens)

        print(sentiment_value)
        all_sentiments.append(sentiment_value)

    return all_sentiments
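
Model responses occasionally include stray whitespace or capitalization, which would count as mismatches against the dataset labels. A small normalization step you can optionally apply before scoring; a minimal sketch:

# Normalize raw model outputs so formatting differences (whitespace,
# capitalization) do not count as misclassifications.
def normalize_labels(labels):
    return [str(label).strip().lower() for label in labels]

# Example usage before scoring:
# accuracy = accuracy_score(dataset["airline_sentiment"], normalize_labels(predictions))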

Text Classification with OpenAI o3

Let's first predict tweet sentiments using the OpenAI o3 model.

%%time

client = OpenAI(api_key=OPENAI_API_KEY)
model = "o3"
max_tokens = 1000  # generous budget: o3's hidden reasoning tokens count toward this limit

predictions = classify_tweets(client, model, dataset, max_tokens)
accuracy = accuracy_score(dataset["airline_sentiment"], predictions)
print("Accuracy:", accuracy)

Output:

Accuracy: 0.87
CPU times: user 897 ms, sys: 99.6 ms, total: 996 ms
Wall time: 1min 4s

The above output shows that we achieved an accuracy of 87%. In addition, the 100 tweets were processed in 1 minute and 4 seconds.

Text Classification with Claude 4 Opus

Next, we will classify tweets using the Claude 4 Opus model.


%%time

client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
model = "claude-opus-4-0"

predictions = classify_tweets(client, model, dataset, max_tokens)
accuracy = accuracy_score(dataset["airline_sentiment"], predictions)
print("Accuracy:", accuracy)

Output:

Accuracy: 0.79
CPU times: user 2.02 s, sys: 279 ms, total: 2.3 s
Wall time: 5min 6s

We achieved an accuracy of 79%, with a processing time of 5 minutes and 6 seconds. The OpenAI o3 model is the clear winner here, on both accuracy and latency.

Text Summarization Comparison

To compare text summarization performance, we will summarize articles from the News Article dataset available on GitHub.

# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx

dataset = pd.read_excel(r"/content/summary_datasets.xlsx")
print(dataset.shape)
dataset.head()

Output:

[Output: the first five rows of the news article dataset]

The content and human_summary columns contain article texts and corresponding human-generated summaries, respectively.
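
As with the classification dataset, it is worth dropping any records with missing text before summarizing; a minimal sketch mirroring the earlier cleanup:

# Drop rows where the article text or human summary is missing,
# mirroring the cleanup applied to the classification dataset.
dataset = dataset.dropna(subset=['content', 'human_summary'])
print(dataset.shape)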

To evaluate summarization performance, we will use ROUGE scores: ROUGE-1 and ROUGE-2 measure unigram and bigram overlap between the generated and reference summaries, while ROUGE-L measures their longest common subsequence.

The following script defines the calculate_rouge() function, which computes the ROUGE F1 scores for a generated summary against its reference summary.

def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}
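
Here is a quick illustrative call with two toy summaries; each returned value is an F1 score between 0 and 1:

# Illustrative example with toy summaries (values are F1 scores).
example_scores = calculate_rouge("the cat sat on the mat", "a cat sat on a mat")
print(example_scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ...}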

Finally, we define the summarize_articles() function, which takes the client object, the model ID, the dataset, and the maximum number of output tokens. The function iterates through the first 10 records in the dataset and calls the make_prediction() function to generate summaries using the o3 and Claude 4 Opus models.

def summarize_articles(client, model, dataset, max_tokens):

    results = []
    for i, (_, row) in enumerate(dataset[:10].iterrows(), start=1):
        article = row['content']
        human_summary = row['human_summary']

        print(f"Summarizing article {i}.")

        content = f"Summarize the following article in 1150 characters. The summary should read as if written by a human:\n\n{article}\n\nSummary:"

        generated_summary = make_prediction(client, model, content, max_tokens)

        rouge_scores = calculate_rouge(human_summary, generated_summary)

        results.append({
            'article_id': row['id'],
            'generated_summary': generated_summary,
            'rouge1': rouge_scores['rouge1'],
            'rouge2': rouge_scores['rouge2'],
            'rougeL': rouge_scores['rougeL']
        })

    return results

Text Summarization with OpenAI o3

We first call the summarize_articles() function using the OpenAI client and the o3 model.


%%time

client = OpenAI(api_key=OPENAI_API_KEY)
model = "o3"
max_tokens = 1150  # for o3, hidden reasoning tokens share this budget; raise it if summaries come back truncated
results = summarize_articles(client, model, dataset, max_tokens)
results_df = pd.DataFrame(results)
mean_values = results_df[["rouge1", "rouge2", "rougeL"]].mean()
print(mean_values)

Output:

rouge1    0.351287
rouge2    0.115566
rougeL    0.179441
dtype: float64
CPU times: user 668 ms, sys: 59.4 ms, total: 727 ms
Wall time: 1min 21s

Text Summarization with Claude 4 Opus

Next, we call the summarize_articles() function using the Claude 4 Opus model.

%%time

client = anthropic.Anthropic(api_key = ANTHROPIC_API_KEY)
model = "claude-opus-4-0"

results = summarize_articles(client, model, dataset, max_tokens)
results_df = pd.DataFrame(results)
mean_values = results_df[["rouge1", "rouge2", "rougeL"]].mean()
print(mean_values)

Output:

rouge1    0.341579
rouge2    0.067680
rougeL    0.141958
dtype: float64
CPU times: user 974 ms, sys: 115 ms, total: 1.09 s
Wall time: 2min 12s

The summarization results show that the o3 model outperforms Claude 4 Opus on all three ROUGE metrics, and it processes the records in considerably less time.
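
Since matplotlib was imported at the start, a quick bar chart makes this comparison easier to read. The sketch below assumes you saved the two mean_values Series as o3_means and claude_means (hypothetical names) after each summarization run:

# Plot mean ROUGE scores side by side (assumes o3_means and claude_means
# hold the mean_values Series from the two summarization runs).
comparison_df = pd.DataFrame({"o3": o3_means, "claude-opus-4-0": claude_means})
comparison_df.plot(kind="bar", figsize=(8, 4), rot=0)
plt.ylabel("Mean F1 score")
plt.title("ROUGE comparison: o3 vs Claude 4 Opus")
plt.show()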

Conclusion

The results in this article suggest that OpenAI o3 outperforms Claude 4 Opus on both zero-shot text classification and summarization, at least on these small benchmarks. Furthermore, the OpenAI API showed noticeably lower latency than the Anthropic API in these runs.

Let me know what you think of these results, and feel free to share if you have any benchmarks on the performance of these models.
