Al_628 0 Newbie Poster

I have the following dataframe:

import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)

I want to identify similar names in the name column when those names belong to the same cluster number, and create a unique id for each group of similar names. For example, South Beach and Beach belong to cluster number 1 and their similarity score is fairly high, so we assign them a shared unique id, say 1. The next cluster is number 2, and five entities from the name column belong to it: Dog, Big Dog, Cat, Fish and Dry Fish. Dog and Big Dog have a high similarity score, so their unique id will be, say, 2. For Cat the unique id will be, say, 3. Finally, for Fish and Dry Fish the unique id will be, say, 4. And so on.

I wrote the following code for the logic above:

# pip install thefuzz
from thefuzz import fuzz

df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0  # 0 means "no id assigned yet"

i = 1
is_i_used = False
for index, row in df_test.iterrows():
    index_ = index

    # Scan forward while the rows are in the same cluster and still unassigned
    while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0:
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1

    # Move on to the next id only if the current one was assigned to some row
    if is_i_used:
        i += 1
        is_i_used = False

The code generates the expected result:

    name         cluster_number  id
0   Beach               1        1
1   South Beach         1        1
2   Big Dog             2        2
3   Cat                 2        3
4   Dog                 2        2
5   Dry Fish            2        4
6   Fish                2        4
7   Ant                 3        5
8   Bird                3        6
9   Dear                4        7

The computation runs for 210 seconds on a dataframe with 1 million rows, where on average each cluster has about 10 rows and the maximum cluster size is about 200 rows. I am trying to understand how to vectorize the code.

The thefuzz module also has a process function that can score one string against a whole collection at once:

from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))

But I don't see how it can help speed up the code.
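For reference, here is a minimal sketch of the same greedy grouping written over plain Python lists instead of per-cell df_test.loc lookups, which may be one source of the slowdown. It is not a vectorized solution. The names assign_ids and difflib_ratio are hypothetical, and the default scorer is a standard-library stand-in whose scores are not identical to fuzz.ratio (fuzz.ratio can be passed in as scorer instead). Unlike the loop above, it skips names that already have an id rather than stopping the scan at the first one.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    """Stand-in scorer on a 0-100 scale; NOT identical to thefuzz's fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def assign_ids(names, clusters, scorer=difflib_ratio, threshold=50):
    """Greedy grouping: within each cluster, the first unassigned name opens a
    new id and pulls in every later unassigned name scoring above threshold."""
    # Collect (name, original position) pairs per cluster
    by_cluster = defaultdict(list)
    for pos, (name, cluster) in enumerate(zip(names, clusters)):
        by_cluster[cluster].append((name, pos))

    ids = [0] * len(names)  # 0 means "no id assigned yet"
    next_id = 1
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster])  # sort by name, as in the post
        for k, (name, pos) in enumerate(members):
            if ids[pos]:
                continue  # already grouped with an earlier name
            ids[pos] = next_id
            for other, other_pos in members[k + 1:]:
                if not ids[other_pos] and scorer(name, other) > threshold:
                    ids[other_pos] = next_id
            next_id += 1
    return ids


names = ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog',
         'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish']
clusters = [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
print(assign_ids(names, clusters))
```

The returned list is in the original row order, so it could be attached with something like df_test['id'] = assign_ids(df_test['name'].tolist(), df_test['cluster_number'].tolist()) before any sorting. Whether this is actually faster on 1 million rows would need to be measured.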
