I want to develope an ETL application for Data Ware House Project (Decision Support System) for university,
i am extracting data from, Excel, Access and CSV format files, i have done the extraction, but now the problem is the Transformation of Data, i have to clean it and standardize it, so is there any algorithm that can find the occurence or probability of a name and make it a standard and for other name with different spellings they will be replaced by a standard.

for example
Malik, Maliq, Malique, Mallick, Malick, Malicq, Malik

now the occurence of Malik is more than other names

i want to find probabiltity of Malik and make it a standard and replace other same name but different spellings with MALIK which is a standard

So, please help me how to do it

I am working on C++ using .NET 2008

Recommended Answers

All 9 Replies

I don't really get what you're trying to describe. What do you mean by "the probability of Malik"? The probability of what?

From what you're describing it seems to me that you just want to count how many times every name occurs, pick the one with the highest occurrence (question: What would you say is desired if two names appear equally as much?) and replace all others with that one.

More importantly, how is your data stored? I assume your data set contains more than just "Malik" and variations of it. If so, how are you going to determine which names are similar to eachother? To illustrate this with an example similar to yours, what if your population looked like:

Malik, Maliq, Malique, Mallick, Malick, Malicq, Malik, ChuckNorris, ChuckNorris, ChuckNorris

ChuckNorris has the highest occurrence here, but given your description you don't want that one becoming the standard for "the Maliks", right? So please provide more information about your data set and assumptions that may be made about it.

Yes exactly i want to calculate the how many times the word is occuring, actually data sets that our instructor gave us are excel sheets, access and text files which contains data of four different campuses of our university.

All i want to do is develop an ETL application for DSS(Desicion Support System) i have done the data extraction and now i am stuck at transformation, we have to it manually, we are not allowed to use any BI tool

I got that much. But how is your data stored now? Can we assume that the input for this problem always contain only names that are similar to eachother, or do we first have to group them somehow? If the latter is the case I wouldn't quite sure what your rules are for this.

yes we can assume that it contains only names for example

Zain Malik, Ali Malek, Zain Malick, Taimoor Malique, Ali Malik, Taimoor Malik, Alina Malik, Alia Malick,

i just have to standardize the data, i have to implement a universal logic for occurence

I don't quite understand it still, could you explain why something like the following is wrong?

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

using namespace std;

int main()
{
    // Using an array for initialization, not needed in C++0x.
    string names[] = {"Malik", "Maliq", "Malique", "Mallick", "Malick", "Malicq", "Malik"};
    vector<string> data (names, names + sizeof(names) / sizeof(string));

    // Sort the data first in an attempt to make things more efficient. I'm assuming the alghoritm one
    // is efficient, but you could replace this with your own version.
    sort(data.begin(), data.end());

    // Count the highest occurrence.
    string standardName, previousName;
    int highestOccurrence = 0, currentOccurrence = 1;

    for (vector<string>::iterator it = data.begin(); it < data.end(); it++)
    {
        if ((*it) == previousName)
        {
            currentOccurrence++;
        }
        else
        {
            if (currentOccurrence > highestOccurrence)
            {
                highestOccurrence = currentOccurrence;
                standardName      = previousName;
            }

            currentOccurrence = 1;
            previousName      = (*it);
        }
    }

    cout << "The highest occurrence is: " << standardName << " (" << highestOccurrence << ").\n";

    // Replace all entries in the vector with the new standard.
    fill (data.begin(), data.end(), standardName);

    return 0;
}

thanks for your help really appreciate it, but can you tell me about th vector that you used, i actually i never used vector, what is it's functionality?

It sounds to me like you may want to set up a bank of words/names that represent the set that are capable of mispellings. Once you have that, you might check each entry using Levenshtein distance to determine if a change is necessary. This is not straightforward. Consider the names Jesse and Jessie: they are distinct and should remain so (one is traditionally a female version the other typically a male counterpart). However, you may want Jessy or Jessse to be mapped to one of the previous names. Dealing with cases like this will be the hardest part of the assignment, in my opinion.

Is there any reason you can not just leave the data in it's original format? It might not conform to the requirements of your project but if I were writing and ETL process I would only want to ensure that the structure of the content was managed; I would expect that the data would remain untouched.

Actually i want to find a probabilty of a name spelling wise e.g

Malik, Malique, Malick, Malik, Malicq, Malique, Malik, Malike, Maliqk, Maliq, Malik, Malick, Malik, Malique, Malik

now in above example the probability of "Malik" is high spelling wise

so is there any algorithm or logic for how can i find a probabilty of a name spelling wise

You first need a way to bin all of those names to determine which occurs more frequently. You need to address that (harder) problem before trying to find out how often particular names are used.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.