I don't have much experience dealing with databases that have more than a few hundred records. But the other day I got to wondering how it is that, with a database of more than 700 million records, when I enter my username and password into Facebook I am greeted by the homepage in less than a second, with all my other information (news feeds and so on) displayed on that next page.

How exactly can the information I've entered be verified against so many records in so little time? I doubt Facebook releases information like this, so it's most likely going to be speculation on how they actually do it, but any ideas, anyone?

Well, you can be very sure that Facebook is not running on a single desktop, but on several servers with loads of CPU power and RAM. Also, the way you search your elements makes a big difference.

For instance, you could keep all your elements in an unsorted array and search through them one by one, or you could use a hashmap and look them up by hash code; that'll speed things up (using the hashmap, that is :) )
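To make the difference concrete, here's a toy sketch in Python (the data and names are made up; Facebook's real login path obviously looks nothing like this). A linear scan touches every record in the worst case, while a hash lookup jumps straight to the right bucket:

```python
# Toy user store: (username, password_hash) pairs. Illustrative only.
users = [("alice", "hash1"), ("bob", "hash2"), ("carol", "hash3")]

def find_linear(username):
    # O(n): checks every record until a match is found
    for name, pw_hash in users:
        if name == username:
            return pw_hash
    return None

# O(1) on average: the hash of the key points straight at the bucket
user_index = {name: pw_hash for name, pw_hash in users}

print(find_linear("carol"))      # walks the whole list first
print(user_index["carol"])       # one hash computation, one lookup
```

With 700 million records, that's the difference between hundreds of millions of comparisons and (on average) a handful, which is why databases index the columns you search on.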

Facebook uses a NoSQL database system rather than SQL for its query language (as do Amazon and Google for example) and these are designed specifically to handle 'Big Data' by scaling across a huge number of machines. You can probably find out more by doing a Google search on Facebook and Cassandra, the NoSQL implementation it developed in-house as a cross between the Google BigTable database and the Amazon Dynamo system.

700 million records is not a lot for one machine. (Several million users trying to retrieve information from 700 million records would be a lot, though.) Facebook has many many more records than that, because there are many postings and photos per user.

NoSQL databases are really good for packaging documents. For example, in a relational database, I will have a posts table, and a threads table, and a members table. Pulling this forum thread is going to pull the relevant posts from the posts table, the thread record from the threads table, and all of the members from the members table. In a NoSQL database, each forum thread is going to have its own document that includes everything needed to pull up that record in one place.
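Here's a rough sketch of what that forum-thread example might look like both ways (all the table and field names are hypothetical, just to show the shape of the data):

```python
# Relational style: three normalized tables, joined together at read time.
threads = {1: {"id": 1, "title": "How does Facebook log in so fast?"}}
posts = [
    {"id": 10, "thread_id": 1, "member_id": 100, "body": "Good question"},
    {"id": 11, "thread_id": 1, "member_id": 101, "body": "Hash indexes"},
]
members = {100: {"name": "davey"}, 101: {"name": "dani"}}

def load_thread_relational(thread_id):
    # Three separate lookups (in SQL this would be a JOIN across tables)
    thread = dict(threads[thread_id])
    thread["posts"] = [
        {"author": members[p["member_id"]]["name"], "body": p["body"]}
        for p in posts if p["thread_id"] == thread_id
    ]
    return thread

# Document style: everything needed to render the page lives in one record.
thread_document = {
    "id": 1,
    "title": "How does Facebook log in so fast?",
    "posts": [
        {"author": "davey", "body": "Good question"},
        {"author": "dani", "body": "Hash indexes"},
    ],
}
```

The trade-off is the usual one: the document version is one read instead of three, but if davey changes his username you now have to update every document he ever posted in.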

When it comes to doing searches though ... a lot of times Sphinx and Lucene are involved. I don't have any experience with Lucene, but essentially what Sphinx does is create an index of potential columns that are going to be searched on. You run the search against a cluster of search servers and it returns the relevant record IDs. You can then go pull up the complete records from MySQL, Cassandra, MongoDB, etc.
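The two-step pattern described above can be sketched like this (the in-memory dicts stand in for the search cluster and the primary datastore; none of this is Sphinx's or Lucene's actual API):

```python
# Stand-in for the search cluster's index: term -> matching record IDs
fake_search_index = {
    "cassandra": [3, 7],
    "mysql": [7],
}
# Stand-in for the primary datastore (MySQL, Cassandra, MongoDB, ...)
fake_primary_store = {
    3: {"id": 3, "title": "Cassandra at Facebook"},
    7: {"id": 7, "title": "MySQL vs Cassandra"},
}

def search(term):
    # Step 1: ask the search servers for matching IDs (index-only, cheap)
    ids = fake_search_index.get(term, [])
    # Step 2: pull the complete records from the primary datastore by ID
    return [fake_primary_store[i] for i in ids]

print(search("mysql"))
```

Keeping the full-text index separate from the primary store is what lets you scale the two independently: the search cluster only ever hands back IDs, and ID lookups are the cheapest query a database can serve.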

Davey, while Facebook developed Cassandra, a NoSQL database, they only use it for certain tasks. They still primarily rely on MySQL. I went to a MySQL Performance conference last summer and had a realllllllllly long sit-down with one of Facebook's database admins, who gave me a big jump start on designing the new DaniWeb system ... back when the system was just a twinkle in my eye.

That's interesting, Dani, as much has been made of Facebook's adoption of (and development hand in) NoSQL over the last few years. NoSQL bandwagon marketing hype more than practical reality then?

NoSQL bandwagon marketing hype more than practical reality then?

No. Well, I'm not measuring hype, and I'm not sure if there even is hype. They're just tools. But they're useful. For example, a lot of Yahoo is powered by an internal proprietary NoSQL product.

Many NoSQL products out there nowadays are quite poorly written pieces of software. Cassandra was, and possibly still is, one of them, according to an HFT developer I talked to about a year ago, who claimed, as if it were common knowledge, that it and others would silently lose data. I've also heard that it's gotten better recently. MongoDB, for example, has a global write lock on each database, making it quite mediocre for writes. Almost all have bad performance when it comes to how they write to disk, and some others are shoddily glued-together monstrosities of Java and MySQL. They're also usually bad at managing the notions of datacenters and of machines between datacenters.

Another reason NoSQL isn't that useful is that computer RAM capacities keep getting larger, and we now have SSDs too, which means single-machine SQL databases (or primitive multi-machine failover setups) can keep handling larger and larger datasets. Single-machine network capacity keeps growing as well. And of course the upgrade path for an SQL database is "sharding", and more often just "buying a newer machine".
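For what it's worth, the simplest form of that sharding upgrade path looks something like this (a toy hash-partitioning sketch; the shard count and the modulo scheme are illustrative, not any particular product's):

```python
# Toy hash-based sharding: route each user's rows to one of N servers.
SHARD_COUNT = 4  # hypothetical number of database servers

def shard_for(user_id):
    # Every query for this user goes to the same shard, so each
    # server only holds (and scans) 1/SHARD_COUNT of the data.
    return user_id % SHARD_COUNT

print(shard_for(700_000_123))  # which of the 4 shards holds this user
```

The catch, and the reason "buy a newer machine" often wins, is that any query spanning users on different shards now has to fan out and merge results in the application layer.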

But if you have craploads of data, especially a crapload of large blobs like images and such, terabytes or petabytes of data, you need some integrated system for managing such information on a cluster, and SQL based databases ain't it. And other circumstances that require super-fast stuff might want a more specific architecture that NoSQL can provide. The SQL databases have years and years of engineering behind them that just makes them pretty damn good, though.

What's really interesting is what'll happen if/when memristors come around, because that completely changes some data storage math, and the software generally isn't equipped to take advantage of that. But that's orthogonal to SQL/NoSQL questions.