I've been learning a lot about machine learning (mostly clustering and regressions), but when I've looked into programming libraries/environments, I've found "production ready" ones to be few and far between.

What I mean by "production ready" is that it could be incorporated into a product, not just used as a tool to fool around with in order to find some relationships. For instance, Pandora is probably a good example of a company that uses some algorithm (I would guess logistic regression, but I don't really know) to compute values from code.

In Java, things like Weka and RapidMiner seem to include the ability to be called from code as an afterthought, and it's always awkward to do. I can't imagine either of them being used in production. Python's too slow to use for anything very big (it's supposed to be "Big Data" after all), and R can't scale at all.

It's possible to write everything in pure Java/C++/C code, but why aren't there any good libraries designed for programming environments? I feel like I'm missing something. What are these big companies using?


Python's too slow to use for anything very big

Stop right there. Yes, implementing the algorithms in pure Python is not a good idea, but that's not how the scientific community uses Python. For scientific and numerical computation, Python is really more of a 'glue' language, and the underlying components are written in a more optimal way (compiled C, C++, or Fortran).

I would suggest you take a look at Anaconda to see how they do scientific computing. In particular I think you might find the scikit-learn package interesting.
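To give a sense of what that looks like in practice, here is a minimal sketch of the kind of thing scikit-learn gives you out of the box, covering the clustering and regression you mentioned. The data below is synthetic and only there to show the shape of the API:

    # Minimal sketch: clustering and classification with scikit-learn.
    # The heavy lifting (distance computations, solvers) happens in
    # compiled code underneath -- Python is just the glue.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))           # synthetic feature matrix
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels

    # Unsupervised: k-means clustering
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

    # Supervised: logistic regression (what you guessed Pandora might use)
    model = LogisticRegression().fit(X, y)
    print(model.predict(X[:5]))  # callable from any Python service

Nothing here is awkward to call from code, which is exactly the point: the library is designed to be embedded in a program, not just driven from a GUI.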

R can't scale at all

Sure it can! And its scientific packages are good, I hear. I never got around to using it much myself, though.

I think you'll find a lot of places use Matlab, Python and R, to be perfectly honest, though I don't know too many people in the industry. They probably implement a chunk of it themselves in C or Fortran.

Other things you might expect to see a bit of are Mathematica and Maple.

It is possible some larger companies wrote it all in-house, but I wouldn't expect it unless they were planning to redistribute it to end users.

Testing to see how accurate I might be:
I went onto the Pandora jobs page and navigated to a "computational advertising" position. As you can see, they use Python, Matlab and R.

It seems you are making some assumptions that you have not verified yourself. The ones Hiroshe points out are good examples:

Python, if used correctly, fits very nicely into an analyst's toolbox regardless of scale. Do you have a specific example where this is not the case?

R has some issues on a 32-bit architecture due to its internal implementation (needing to maintain contiguous memory, for example), but there are libraries aimed at addressing this. The real answer, however, is that these issues mostly disappear on a 64-bit architecture.

As far as your big data comment goes - I claim you don't have an amount of data that will present an issue to either Python or R. "Big data" has become something of a buzzword that means little anymore. True big data is the kind of content that companies like Google, Facebook, and AT&T have to deal with. Even then, these companies regularly use tools like Python and R to process their data.

I find that the process for analyzing data usually follows a path similar to:

  1. Run experiments and/or collect unformatted data
  2. Process data (using Ruby, Python, Perl, ...) into a meaningful and usable format
  3. Experiment with samples of the formatted data in a real-time, interactive environment.
  4. Build a binary version of #3 that can be used on much larger sets of data.

R is very good for doing #3: it is a DSL for statistical analysis coupled with a rich visualization toolset and a vast package ecosystem through CRAN. However, R has a steep learning curve, and many people feel more comfortable using a Python-based toolset for this: numpy, scipy, and pandas. There is a growing debate over which is better (if one can measure such a thing), and you can learn a lot by searching for "R vs Python" or something similar.
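As a rough sketch of what steps 2 and 3 might look like with the pandas side of that toolset (the file name and column names here are hypothetical, just for illustration):

    # Sketch of steps 2-3: reshape a raw export into a usable table,
    # then explore a sample interactively. Names are hypothetical.
    import pandas as pd

    raw = pd.read_csv("events.csv")                    # step 2: raw, messy export
    clean = (raw.dropna(subset=["user_id", "value"])   # drop incomplete rows
                .assign(day=lambda d: pd.to_datetime(d["timestamp"]).dt.date))

    sample = clean.sample(frac=0.1, random_state=42)   # step 3: work on a sample
    print(sample.groupby("day")["value"].describe())   # quick summary statistics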

Most times, I claim, you never really need #4. If you do, you have the choice of using the C interface to either Python or R to build a binary version of your application and integrate it into your previous framework without changing the original API at all. However, at that point you've moved a little beyond data analysis and more into tool production.
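If you ever do hit #4, one common route is to compile the hot path into a shared library and call it from the existing Python code, so the outer API stays untouched. A minimal sketch using ctypes, assuming a hypothetical libmodel.so that exposes a score() function taking an array of doubles:

    # Sketch: calling a compiled C routine from Python via ctypes, so the
    # Python-facing API stays the same while the inner loop runs natively.
    # "libmodel.so" and "score" are made-up names for this example.
    import ctypes

    lib = ctypes.CDLL("./libmodel.so")
    lib.score.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
    lib.score.restype = ctypes.c_double

    def score(values):
        arr = (ctypes.c_double * len(values))(*values)  # copy into a C array
        return lib.score(arr, len(values))

    print(score([0.5, 1.2, -0.3]))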

things like Weka and RapidMiner

These seem (at first glance) like two very typical examples of the kinds of "tools" you often find for specific purposes (e.g., analytics).

Weka is an academic project, probably written by students to try out specific algorithms; it's written in an academic language (Java), has a terrible by-the-book API with poor documentation (probably due to inexperience), and is, of course, half-finished. These problems are very common in libraries coming out of academia (including my own, in all honesty).

RapidMiner is a commercial tool, which operates by a whole different set of rules: the profit motive. Generally, these commercial tools don't want you to be able to "break out" of them, because that hurts their bottom line. They want you to adopt the tool, use it exclusively, and keep renewing your licenses and support plans. If they made an easy-to-use redistributable library with a nice API, you could just write your own code that calls it for the specific tasks you need it for, and then be done with it. You would probably not need to change that again for a long while, and you wouldn't need much support or many upgrades. That's not a good revenue model for these companies.

Instead, they give you a "full-feature", "graphical user interface", "no need to program anything" piece of software that they advertise as the perfect "all-in-one" solution for your data analytics needs. You get that software, you start creating stuff in it, and you think it's awesome and super productive, until a few years later you realize that you have all this work tied up in that software suite, that there are lots of limitations you've had to work your way around over the years, and that your wallet feels kind of light. Now, that's a good revenue model for those companies!

Once in a while you find something in that sweet spot: an open-source library developed by experienced experts in the field who were just tired of having to deal with one or the other of the options I just described. One example of that (in another domain) is Code_Aster, a finite-element analysis software suite and library (all open-source) that was developed by engineers and FEA experts who were tired of the extremely costly licensing fees and limitations of commercial packages, and that open-source suite is now far superior to its commercial competitors. But this example is still, sadly, the exception rather than the norm.

So, what do real companies use? Usually a mix of in-house software and off-the-shelf tools built by other companies. I don't know the languages and tools specific to your application domain, but in general terms I would assume it's similar to other fields. A typical situation is that a company might use some tools (off-the-shelf software and DSLs) for testing out their algorithms and tweaking them for optimal results on test data sets, and once they are happy with an algorithm, they might implement it in an in-house library in native code (C/C++) or something close to it. Domain-specific tools are, I find, best used as platforms for testing and tweaking, because they are productive at that, but you typically want a more native solution for the final production code (which is typically easy to write once you've gone through the design process and have just one final algorithm that you are happy with).

Also, many domain-specific tools have code generators that can turn whatever you have into some C code, or even compile it completely. However, code generators are still not completely trusted, in my experience, because they tend to be quirky, inefficient (lots of useless overhead in the generated code), and cannot, in general, be formally tested (for correctness, fault-tolerance, reliability, etc.), and therefore are not, strictly speaking, "production-ready" code.

Is this discussion still on? What about Apache Mahout? I'd appreciate any thoughts on this...
