Web Application Architecture

Posted by pat on November 02, 2011

For the web apps among the applications we are investigating, we’ll pick an architecture that is as turnkey as possible and try to explain the reasoning.

Components:

  • Web App: We’ll use Ruby on Rails 3, just because it’s fun. I also believe it will scale well enough for the needs it faces: we will only ask RoR to present pages and read from a DB of pre-calculated values. There is no heavy lifting here, so fun counts.
  • DB: NoSQL, naturally. In past projects we have used HBase and MongoDB. HBase is an open-source clone of Bigtable, Google’s key-value store, and when we used it several years ago it was pretty green. It does not seem to be the NoSQL store that has gotten the most cred or community support (correct me if I’m wrong). MongoDB is the new kid on the block and is also a little green, but I like it because it represents the next generation of NoSQL DBs that are easy to set up and use. The other possibilities are numerous, but the only one I’ll look at for now is Cassandra. It seems to be getting a lot of design wins and is pretty mature. A review of these will probably be another post.
  • Number Cruncher Services: Here we’ll use Java and Hadoop with its kick-ass machine learning framework, Mahout. This will give us all sorts of benefits, including scalability, great high-level libraries, a distributed, scalable file system (HDFS), a MapReduce framework, and all the stuff that comes with Java. See a quick overview here.
  • Data Gathering and Crawling: There are several different data gathering needs. In some cases there will be occasional downloads, as with Wikipedia, and at other times there will be a need to crawl part of the web. For the latter we’ll probably use Nutch from Apache, but I’m not 100% sure on that yet.
  • Tools: IntelliJ IDEA and Eclipse for IDEs. I name both because, in my limited experience with IDEA, it is really nice but lacks some important plugins, which may make it worth using Eclipse for some things. For source control, Git. I use it for other projects and already have a paid GitHub account, so it’s an easy decision.
  • Resources: Currently a MacBook Pro and two Ubuntu 11.10 servers that are fairly old but are all 64-bit with fair amounts of RAM. This hardware will not speed things up very much, but it will at least prove out the parallel architecture. The servers are in my home and one doubles as a media center. All double as space heaters.

Architecture:

The RoR app will serve up the pages and data from the DB. The services will crawl and process data using MapReduce and stuff the results into the DB. As best practices indicate for web-scale services, we will try to pre-calculate everything, and where we can’t, we’ll create a service in Java that does the work fairly quickly. That way the RoR app stays pretty simple.
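
Here is a minimal sketch of that pre-calculate-then-serve split, just to make the idea concrete. It is not the real implementation: an in-memory map stands in for the NoSQL DB, a simple word count stands in for the number crunching, and the class and method names (PrecomputeSketch, runBatchJob, lookup) are made up for illustration. A real job would run on Hadoop and write to whichever store gets picked.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: batch side pre-calculates, serving side only reads.
class PrecomputeSketch {
    // Assumption: stand-in for the NoSQL DB (key -> pre-calculated value).
    static final Map<String, Integer> store = new HashMap<>();

    // Batch side: count how often each term occurs across all documents,
    // then write the finished counts into the store in one go.
    static void runBatchJob(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // "Map" step: split each document into lowercase tokens.
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    // "Reduce" step: sum the occurrences per token.
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        store.putAll(counts);
    }

    // Serving side: a plain key lookup, no computation at request time.
    // Terms the batch job never saw are reported as 0.
    static int lookup(String term) {
        return store.getOrDefault(term.toLowerCase(), 0);
    }

    public static void main(String[] args) {
        runBatchJob(Arrays.asList(
                "Hadoop runs MapReduce jobs",
                "MapReduce jobs write to the DB"));
        System.out.println(lookup("mapreduce")); // prints 2
    }
}
```

The point of the split is that lookup does nothing but read a key, which is all we want the RoR app to do; anything expensive happens in runBatchJob before a request ever arrives.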

 

Philosophy:

I should have put this first, but it is kind of boring, so here it is where you can skip it. We’ll try to stick to best practices and design for scalability, redundancy, and reliability, but we have a very small team (1, unless you want to join, hint hint), so we’ll feel free to take shortcuts where there is a payoff. Also, in the spirit of iterative development, we may knowingly hack something together first to see if it’s worth spending more time to do it right.