My company is planning to develop a web search engine with a crawler, and we're considering three languages: C++, Java and Python. We're not sure which language is best suited to features like web crawling, keyword extraction and indexing, ranking of indexed pages, and searching.

We're aware that some programming languages are better suited to certain tasks than others, and we want to make the right choices. Someone suggested we use C++ for features that require absolute speed and Python for glue code that isn't all that time-critical. But we're not sure which features actually require that kind of speed.

Now my questions are:

  1. Which language (C++ or Java) is most suitable for developing a web crawler and why?
  2. Which language is best suited for developing a search ranking algorithm - C++ or Java?
  3. Which features of the search engine should C++ be used for?
  4. Which features should Java be used for?
  5. Where should Python come in? Which features should it be used for?
  6. Do these three languages make a good combination when developing a search application?

Some guidance on these issues will let us get down to work, and your suggestions will be much appreciated.

My strongest recommendation: write the whole thing in Jython.

(Jython is the implementation of Python that runs on the Java Virtual Machine, or JVM.)

If you are successful enough to have performance problems, then...

  1. Optimize the Jython/Python code.
  2. Consider upgrading to Java 7.
  3. Convert only a few key performance critical routines to Java.

(Java 7 may be an attractive option in a year or two because Jython and other similar JVM languages might be able to take advantage of the invokedynamic bytecode instruction that was added for that purpose in Java 7.)
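
Before converting anything to Java, it helps to measure where the time actually goes. Here is a minimal sketch of that step, assuming you profile under CPython with the standard cProfile module (under Jython you would more likely attach a JVM profiler instead). The functions tokenize and index_pages are hypothetical stand-ins for your own routines, and profile_report is just an illustrative helper.

```python
import cProfile
import io
import pstats


def tokenize(text):
    """Hypothetical hot routine: split a page into lowercase words."""
    return [word.lower() for word in text.split()]


def index_pages(pages):
    """Hypothetical pipeline step that calls tokenize() once per page."""
    return [tokenize(page) for page in pages]


def profile_report(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists up to `limit` functions, sorted by cumulative time.
    """
    limit = kwargs.pop("limit", 10)
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buffer.getvalue()


if __name__ == "__main__":
    pages = ["Some example page text to index"] * 1000
    _, report = profile_report(index_pages, pages)
    print(report)
```

Only the routines that dominate a report like this, on realistic data, would be candidates for a Java rewrite. Everything else can stay in Python.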

Why write it all in Jython/Python?

  • All else being equal, you'll get the system working most quickly in Jython/Python.
  • Using only a single language gives you a lot of flexibility in terms of staffing, work assignments and performance optimizations.
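
To give a feel for how little code a first working version needs, here is a rough sketch of keyword extraction, indexing, ranking and searching in plain Python. It is a toy, not a design: the class name TinyIndex, the TF-IDF style score and the example.com URLs are all my own assumptions for illustration, and a real engine would need stemming, persistent storage and a much better ranking function.

```python
import math
import re
from collections import Counter, defaultdict


def extract_keywords(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex(object):
    """Toy in-memory inverted index: word -> {url: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add_page(self, url, text):
        """Index one page (assumes each URL is added only once)."""
        self.doc_count += 1
        for word, count in Counter(extract_keywords(text)).items():
            self.postings[word][url] = count

    def search(self, query):
        """Return matching URLs ranked by a simple TF-IDF score, best first."""
        scores = defaultdict(float)
        for word in extract_keywords(query):
            matches = self.postings.get(word, {})
            if not matches:
                continue
            # Rarer words get a higher weight than common ones.
            idf = math.log(1.0 + float(self.doc_count) / len(matches))
            for url, count in matches.items():
                scores[url] += count * idf
        # Sort by descending score, breaking ties by URL for stable output.
        return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    index = TinyIndex()
    index.add_page("http://example.com/1", "Python crawler tutorial: writing a crawler in Python")
    index.add_page("http://example.com/2", "Java performance tuning")
    index.add_page("http://example.com/3", "A crawler written in Java")
    print(index.search("python crawler"))
```

With the three sample pages, searching for "python crawler" ranks page 1 ahead of page 3 and leaves page 2 out entirely. The crawler would simply feed add_page with whatever it fetches.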

It is highly unlikely that C++ will ever be useful on your project. Yes, Java has a slow startup and takes lots of RAM. But on longer runs, Java generally outperforms C++ because the JVM's runtime (JIT) optimizer knows more about the actual CPU and your actual code execution patterns than a C++ compiler could know ahead of time.

Hi Jeff,

Thanks for your recommendation - I will discuss that with my programming team.

Ask your programming team and use whatever they tell you to use. You shouldn't take advice from random noobs on a forum with an axe to grind.

But if you want my answer, I'd say that Jython is a pretty good suggestion. Note that Java, and thus Jython, can always call out to C++ code if the CPU ever becomes a bottleneck. It won't though -- L2 cache and total RAM will be your bottleneck.

Edit: I would recommend Scala more strongly than Jython as your choice of JVM language. If you gave my coworkers the choice, we would surely go with Scala. See http://blog.redfin.com/devblog/2010/05/how_and_why_twitter_uses_scala.html for more evangelism in that direction.

Thanks Rashakil for your contribution. I will look into Scala to see what it really offers.

Thanks pyTony for your time and contribution.

I wholeheartedly agree: Scala would be a good choice. And if you go that direction, I'd recommend making the system 100% Scala.
