I think the OP is talking about plagiarism detection software. This is software that takes a given document (text, code, video, etc.) along with any available metadata (title, assignment problem statement, etc.), and then scans the world (i.e., the internet) for any document that is suspiciously similar to it. This is very common in scientific publication systems, for example: if you submit a scientific article or dissertation, it will automatically be compared against a database of existing articles and dissertations to determine whether it is likely to be plagiarism or a re-publication of the same work, verbatim or rehashed. It is also very common for school assignments, especially in computer science, to verify that students did not cheat by grabbing code off the web. YouTube also has similar scanning algorithms to automatically detect copyrighted material (video or music) in uploaded videos.
Needless to say, this kind of software is quite complicated, because it has to avoid taking forever to do its work. It has to employ all sorts of strategies to cut down on the amount of work, such as compressing data into checksums, doing associative searches, and using lots of fancy data-mining and indexing techniques.
But for a simple example, like mining the internet for any code that might be too similar to code submitted as an assignment, the solution could be rather straightforward. Just pick out keywords from the problem statement, try out many permutations, and for each one, do a Google search, pick the first 100 resulting pages, scan them for a nearly verbatim version of the code that you are testing against, and report any significantly similar hits. This is not the easiest task, but it's not terribly hard either.
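The "scan and report" step could be sketched roughly like this in C++. It assumes the candidate pages have already been downloaded into strings (the search and download part is left out entirely), and it uses one simple notion of "nearly verbatim": the longest run of characters shared with the submission, ignoring whitespace. The function names and the threshold parameter are made up for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Drop all whitespace so that re-indented copies still match.
std::string strip_whitespace(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (!std::isspace(c)) out.push_back(static_cast<char>(c));
    }
    return out;
}

// Length of the longest run of characters that appears in both strings
// (dynamic programming, O(|a| * |b|) time, O(|b|) memory).
std::size_t longest_common_run(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            curr[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}

// Indices of the pages whose longest copied run covers at least `threshold`
// (a fraction between 0 and 1, chosen by the caller) of the submission.
std::vector<std::size_t> suspicious_pages(const std::string& submission,
                                          const std::vector<std::string>& pages,
                                          double threshold) {
    std::vector<std::size_t> hits;
    const std::string target = strip_whitespace(submission);
    if (target.empty()) return hits;
    for (std::size_t p = 0; p < pages.size(); ++p) {
        const std::size_t run =
            longest_common_run(target, strip_whitespace(pages[p]));
        if (static_cast<double>(run) / static_cast<double>(target.size()) >= threshold) {
            hits.push_back(p);
        }
    }
    return hits;
}
```

A call like suspicious_pages(code, pages, 0.8) would return the indices of the pages that contain at least 80% of the submission as one contiguous copy. The quadratic comparison is fine for a hundred pages per query, but it would be too slow for anything at real scale.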
Mike, you are right, that is exactly what I want to do, but I am a novice and this was given to me as my project work by my supervisor. What language can I use? Please help me with some code and an explanation. These are the programming languages I have some idea of: VB.NET, C#, Visual C++ and PHP.
This could be done in any language. However, certain languages are not really appropriate for this problem. This is a fairly low-level, heavy-lifting application, meaning that high-level, light-work languages are not going to be very appropriate. By that criterion, you can rule out VB.NET and C# right off the bat, because these languages are geared towards the rapid development of user-facing applications, not towards heavy lifting. PHP might be useful for its "network oriented" qualities, maybe as a front-end to your application. But for the most part, you'll have to write it in a heavy-lifting language like C or C++. This is really the kind of application that C++ is geared towards.
You'll have to define more precisely what your task is. You said that what I described was exactly what you meant, but I described things in very vague and general terms, not really something that can be "exact". You need to provide a more specific definition of your problem.
It seems that what you need is just a utility like diff, which you can easily reproduce. This is a classic utility used extensively for figuring out what changed (especially in source code) between different versions (i.e., to make "patches").
The algorithm is actually quite simple: it is just basic string comparisons. You traverse the two strings concurrently and find their matching sections and differing sections. There are a few interesting challenges, such as doing concurrent look-aheads and things like that. You also might want to do things like ignoring comments and white space, but that's easy too.
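To make that a bit more concrete, here is a minimal line-by-line sketch in C++. Note that it swaps the hand-rolled look-ahead for a longest-common-subsequence table, which is a standard textbook way of deciding where the two texts re-synchronize; this is not necessarily how the real diff utility is implemented. The names split_lines and line_diff are made up for this example, and it does not skip comments or white space.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a text into its lines.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}

// Produce a diff-like listing: "  " marks a line common to both texts,
// "- " a line only in old_text, "+ " a line only in new_text.
std::vector<std::string> line_diff(const std::string& old_text,
                                   const std::string& new_text) {
    const std::vector<std::string> a = split_lines(old_text);
    const std::vector<std::string> b = split_lines(new_text);
    const std::size_t n = a.size(), m = b.size();

    // lcs[i][j] = length of the longest common subsequence of the
    // lines a[i..] and b[j..], filled in from the end backwards.
    std::vector<std::vector<std::size_t>> lcs(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t j = m; j-- > 0;) {
            lcs[i][j] = (a[i] == b[j]) ? lcs[i + 1][j + 1] + 1
                                       : std::max(lcs[i + 1][j], lcs[i][j + 1]);
        }
    }

    // Traverse both texts together; the table plays the role of the look-ahead.
    std::vector<std::string> out;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) {
            out.push_back("  " + a[i]);
            ++i;
            ++j;
        } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            out.push_back("- " + a[i]);
            ++i;
        } else {
            out.push_back("+ " + b[j]);
            ++j;
        }
    }
    for (; i < n; ++i) out.push_back("- " + a[i]);
    for (; j < m; ++j) out.push_back("+ " + b[j]);
    return out;
}
```

For example, line_diff("a\nb\nc\n", "a\nx\nc\n") yields the four entries "  a", "- b", "+ x", "  c". The table costs memory proportional to the product of the two line counts, which is acceptable for assignment-sized source files but not for very large inputs.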
Is there anything you are having trouble with in particular?