Sentiment Analysis with Data Augmentation Using ChatGPT

By usmanmalik57

Sentiment analysis, a subfield of Natural Language Processing (NLP), aims to discern and classify the underlying sentiment or emotion expressed in textual data. Whether it is understanding customers' opinions about a product, analyzing social media posts, or gauging public sentiment towards a political event, sentiment analysis plays a vital role in unlocking valuable insights from vast amounts of textual data.

However, training an accurate sentiment classification model often demands a substantial volume of annotated data, which may not be readily available and can be time-consuming to acquire. This limitation has led researchers and practitioners to explore techniques such as data augmentation, which generates synthetic data to enlarge the training set.

In this article, we will delve into the world of data augmentation, specifically using ChatGPT, a powerful language model developed by OpenAI, to generate additional training samples and bolster the performance of sentiment classification models. By leveraging the capabilities of ChatGPT, we can efficiently create diverse and realistic data, opening new possibilities for sentiment analysis in scenarios where limited annotated data would otherwise be an obstacle.

Sentiment Classification without Data Augmentation

To train the sentiment classification model, we will use the IMDB dataset, which contains movie reviews labeled with sentiments. We'll then train a Random Forest model using TF-IDF (Term Frequency-Inverse Document Frequency) features, which allow us to represent the text data numerically. By dividing the dataset into training and testing sets, we can evaluate the model's performance on unseen data. The accuracy score will be used to measure how well the model predicts sentiments.
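
To make the TF-IDF idea concrete, here is a minimal hand-rolled sketch on a made-up three-review corpus. It assumes plain whitespace tokenization and the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the default scheme documented for scikit-learn's TfidfVectorizer. The function name tfidf_vectors and the sample reviews are illustrative only, and the real vectorizer can give different numbers because it tokenizes differently.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalized TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Made-up sample reviews for illustration
docs = ["great movie great cast", "terrible movie", "great fun"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second review, "terrible" receives a higher weight than "movie" because it appears in fewer documents.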

Now, let's proceed with the code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

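# Load the IMDB reviews (adjust the path to your local copy)
dataset = pd.read_csv(r"D:\Datasets\IMDB Dataset.csv")

# Keep only the first 600 reviews
dataset = dataset.head(600)

# Split into training and test sets.
# The 'sentiment' label column, test_size, and random_state are assumed values.
X_train, X_test, y_train, y_test = train_test_split(
    dataset['review'], dataset['sentiment'], test_size=0.2, random_state=42
)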

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=500)
rf_model.fit(X_train_tfidf, y_train)

# Transform the test data using the same vectorizer
X_test_tfidf = vectorizer.transform(X_test)

# Predict the sentiment on the test data
y_pred = rf_model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Without data augmentation, I achieved an accuracy of 0.6916.

Data Augmentation with ChatGPT

Let’s now augment our data using ChatGPT. We will generate 100 more reviews. So, let’s begin without further ado.

import os
import openai

api_key = os.getenv('OPENAI_KEY2')
openai.api_key = api_key

In the above code snippet, we are setting up the OpenAI API for accessing their language model, specifically the "GPT-3.5" model. This is the language model used by ChatGPT behind the scenes.

First, we import the required libraries. Then, we set the OpenAI API key by retrieving it from the environment variable OPENAI_KEY2. This key is necessary for making API calls to OpenAI services.

Next, we configure the openai library with the obtained API key by assigning it to openai.api_key.
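
Since os.getenv returns None when the variable is not set, a missing key would only surface later as an authentication error. A small guard, sketched here with a hypothetical helper named require_env, fails early instead. OPENAI_KEY2 is the variable name used above.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage (commented out so the sketch runs without a real key):
# openai.api_key = require_env("OPENAI_KEY2")
```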

The next step is to define a function that generates movie reviews.

import time

def generate_reviews(review):

    content = """Generate one short IMDB movie review using the following review as an example. The review sentiment should be positive, negative, or neutral.
    Review = {}. Also mention the sentiment of the review in one word (positive, negative, neutral) in the beginning of your reply.""".format(review)

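    # Legacy openai<1.0 interface; the model name follows the GPT-3.5 Turbo model named in the text
    generated_review = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[
            {"role": "user", "content": content}
        ]
    )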

    return generated_review["choices"][0]["message"]["content"]

The code above defines a function called generate_reviews(review) that uses the OpenAI GPT-3.5 Turbo language model to generate a movie review based on an example review. The function takes the input review as the example and prompts the model to create a new movie review with a positive, negative, or neutral sentiment, stated in one word at the start of the reply. The temperature parameter controls how random the sampling is: higher values produce more varied text, while lower values produce more predictable text.

The function then returns the generated movie review. This approach allows us to easily generate diverse movie reviews with different sentiments, leveraging the power of the GPT-3.5 Turbo language model from OpenAI.

Next, we iterate through the first 100 movie reviews in the training set (X_train) and pass each one to generate_reviews(review) as the example. The generated reviews are stored in the generated_reviews list. Because every call uses a different example, the generated reviews vary in content.

generated_reviews = []

X_train_list = X_train.tolist()

for i in range(100):

    review = X_train_list[i]
    generated_review = generate_reviews(review)
    # Store the reply so it can be parsed later
    generated_reviews.append(generated_review)
    print(i + 1, generated_review)
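
API calls in a loop like this can hit rate limits or transient network errors. The sketch below shows one way to retry with a growing pause, which is a plausible use for the time module imported earlier. The helper name call_with_retries and its parameters are hypothetical, and it catches the generic Exception only to stay library-agnostic; in practice you would narrow that to the error types your client raises.

```python
import time

def call_with_retries(func, *args, retries=3, base_delay=1.0):
    """Call func(*args), retrying on failure with exponentially growing pauses."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage inside the loop above (assumes generate_reviews is defined):
# generated_review = call_with_retries(generate_reviews, review)
```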

Training with Augmented Data

We will now train our machine learning model using original plus augmented data.

Let’s first convert the ChatGPT-generated reviews into a Pandas dataframe containing review and sentiment columns. The following script iterates through each generated review and passes it to a helper function, which splits the reply into the sentiment (its first word) and the review text (the rest) and returns both values. The text and sentiment of each generated review are stored in a dictionary, which is appended to a list; the list is then converted to a Pandas dataframe.

import re
def split_string_into_two(input_string):

    # Split the string into words
    words = input_string.split()

    # Create the first string with the first word
    first_string = words[0].lower()
    first_string = re.sub(r'[^a-zA-Z0-9]', '', first_string)

    # Create the second string from the remaining words
    second_string = ' '.join(words[1:]) if len(words) > 1 else ''

    # Remove new line characters
    second_string = second_string.replace('\n', ' ')

    return first_string, second_string

generated_reviews_dicts = []
for review in generated_reviews:
    first, second = split_string_into_two(review)
    review_dict = {'sentiment':first, 'review':second}
    generated_reviews_dicts.append(review_dict)

df = pd.DataFrame(generated_reviews_dicts)
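
As a quick illustration of the parsing step, here is a condensed copy of the same function applied to a made-up reply (the sample text is hypothetical, not actual ChatGPT output). The first word is lowercased and stripped of punctuation to become the label; everything after it is the review.

```python
import re

def split_string_into_two(input_string):
    words = input_string.split()
    first_string = re.sub(r'[^a-zA-Z0-9]', '', words[0].lower())
    second_string = ' '.join(words[1:]) if len(words) > 1 else ''
    return first_string, second_string

# Hypothetical model reply
reply = "Positive: A charming film\nwith a strong cast."
sentiment, review = split_string_into_two(reply)
print(sentiment)  # positive
print(review)     # A charming film with a strong cast.
```

If a reply starts with something other than the label, the first word still becomes the label, so the labels are worth checking afterwards.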

The loop requested 100 reviews, and 99 of them came back with a usable label: 48 positive, 46 negative, and 5 neutral. The remaining reply had the text "review" as its sentiment, which is not a valid label. I removed that record, retaining only reviews labeled positive, negative, or neutral.
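
The filtering step itself is not shown in the article. One way to do it, sketched here on a list of dictionaries in the same shape as generated_reviews_dicts, is to keep only the three expected labels. The sample records and the names VALID_SENTIMENTS and keep_valid are illustrative.

```python
from collections import Counter

VALID_SENTIMENTS = {"positive", "negative", "neutral"}

def keep_valid(records):
    """Keep only records whose 'sentiment' is one of the expected labels."""
    return [r for r in records if r["sentiment"] in VALID_SENTIMENTS]

# Made-up records in the same shape as generated_reviews_dicts
records = [
    {"sentiment": "positive", "review": "A charming film."},
    {"sentiment": "review", "review": "Negative. Dull and overlong."},
    {"sentiment": "neutral", "review": "Watchable but forgettable."},
]
clean = keep_valid(records)
print(Counter(r["sentiment"] for r in clean))

# Equivalent filter on the dataframe:
# df = df[df["sentiment"].isin(VALID_SENTIMENTS)]
```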

Next, I will add the generated reviews to the reviews in the original training set:

# Series.append was removed in pandas 2.0, so pd.concat is used instead
X_train_aug = df["review"]
X_train_new = pd.concat([X_train, X_train_aug])

y_train_aug = df["sentiment"]
y_train_new = pd.concat([y_train, y_train_aug])

The rest of the process is the same: we use TF-IDF to convert the text to vectors, train a Random Forest model, and make predictions on the test set.

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train_new)

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=500)
rf_model.fit(X_train_tfidf, y_train_new)

# Transform the test data using the same vectorizer
X_test_tfidf = vectorizer.transform(X_test)

# Predict the sentiment on the test data
y_pred = rf_model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

In the output, I achieved a classification accuracy of 0.75, roughly 6 percentage points above the original 0.6916. That is an encouraging gain from only about 100 generated records, and it suggests that ChatGPT can be a useful tool for data augmentation.
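
The size of the gain depends on how it is expressed. A quick calculation with the two reported accuracies shows the absolute and relative differences:

```python
baseline = 0.6916   # accuracy without augmentation
augmented = 0.75    # accuracy with augmentation

absolute_gain = augmented - baseline        # difference in accuracy
relative_gain = absolute_gain / baseline    # fraction of the baseline

print(f"Absolute gain: {absolute_gain * 100:.2f} percentage points")
print(f"Relative gain: {relative_gain * 100:.2f}%")
```

The absolute gain is about 5.8 percentage points, which corresponds to a relative improvement of roughly 8.4% over the baseline.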

I hope you enjoyed this tutorial. Feel free to share your thoughts on how I can further improve these results.

AndreRet 526 Senior Poster

Great effort and informative tutorial covering the basics. Maybe add some more information on the training module.

You have my upvote!

commented: I will write on fine-tuning soon. Keep checking :) +0
Jane_11 0 Newbie Poster

Your diligent work on this informative tutorial, covering the fundamentals, is commendable. It might be beneficial to consider expanding upon the training module with additional insights. Your commitment to enhancing the learning experience is truly appreciated. Thank you for your valuable contribution!

Alex_171 0 Newbie Poster

To be honest, I generally don’t read. But, this article caught my attention.

anthonybell897 0 Newbie Poster

ChatGPT is a powerful tool for data augmentation, but it is important to be aware of its limitations. The data generated by ChatGPT may not be entirely accurate or representative of the real world, so it is important to use it in conjunction with other data augmentation techniques.

Juliana_3 0 Newbie Poster

could this have been done outside of python? [:O]

usmanmalik57 commented: You can directly use OpenAI playground I suppose. +0
Abdul_116 0 Newbie Poster

Fascinating to see sentiment analysis being applied to understand Pakistani consumers on High Street Pakistan! As online shopping thrives, it'd be interesting to compare brand opinions on both platforms - how do traditional High Street stores fare against online giants in terms of sentiment? Could data augmentation help bridge the data gap for local businesses lacking online reviews?
