Today I found this absolutely superb article on distributed systems and the central role of the log: The Log: What every software engineer should know about real-time data’s unifying abstraction
It’s a technical-but-readable exploration of the role of the humble log in databases and distributed systems. The article lives up to its title and I think every software engineer dealing with scale and distributed data should read it. The author, Jay Kreps, has accurately and insightfully framed what’s currently going on in the world of distributed systems:
You can’t fully understand databases, NoSQL stores, key value stores, replication, paxos, hadoop, version control, or almost any software system without understanding logs; and yet, most software engineers are not familiar with them. I’d like to change that. In this post, I’ll walk you through everything you need to know about logs, including what is log and how to use logs for data integration, real time processing, and system building.
This is such an exciting time for applied distributed-systems engineering because more and more companies are reaching global scale (hundreds of millions of users, billions of daily events, etc.).
We were certainly grappling with these issues at Bazaarvoice (see my colleague Victor Trac’s recent post at High Scalability: Evolution Of Bazaarvoice’s Architecture To 500M Unique Users Per Month). It’s interesting that so many companies seem to have settled on a databus architecture: a variety of sources publish to the bus, and a variety of consumers each watch it to build their own view (or index) using domain-appropriate technology (e.g. ElasticSearch for text search, Cassandra for aggregation, etc.). As Jay Kreps writes:
Maybe if you squint a bit, you can see the whole of your organization’s systems and data flows as a single distributed database. You can view all the individual query-oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data. You can view the stream processing systems like Storm or Samza as just a very well-developed trigger and view materialization mechanism. Classical database people, I have noticed, like this view very much because it finally explains to them what on earth people are doing with all these different data systems—they are just different index types!
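To make the "different index types" idea concrete, here is a toy sketch in Python. It is not how Kafka, Bazaarvoice, or any real databus works; the names `Log`, `Consumer`, `TextIndex`, and `RatingTotals` and the record fields are all made up for illustration. The only point is that two consumers reading the same append-only log, each tracking its own offset, end up with two different views of the same data.

```python
from collections import defaultdict


class Log:
    """Append-only, ordered sequence of records (a stand-in for the bus)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read_from(self, offset):
        return self._records[offset:]  # everything at or after `offset`


class Consumer:
    """Remembers how far it has read and folds new records into a view."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        for record in self.log.read_from(self.offset):
            self.apply(record)
            self.offset += 1

    def apply(self, record):
        raise NotImplementedError


class TextIndex(Consumer):
    """View 1: inverted index from word to record ids (toy text search)."""

    def __init__(self, log):
        super().__init__(log)
        self.postings = defaultdict(set)

    def apply(self, record):
        for word in record["text"].lower().split():
            self.postings[word].add(record["id"])


class RatingTotals(Consumer):
    """View 2: running sum of ratings per product (toy aggregation)."""

    def __init__(self, log):
        super().__init__(log)
        self.totals = defaultdict(int)

    def apply(self, record):
        self.totals[record["product"]] += record["rating"]


log = Log()
log.append({"id": 1, "product": "kettle", "rating": 5, "text": "Great kettle"})
log.append({"id": 2, "product": "kettle", "rating": 3, "text": "Leaky kettle lid"})

text_index = TextIndex(log)
rating_totals = RatingTotals(log)
text_index.poll()
rating_totals.poll()
```

Because each consumer keeps its own offset, one view can lag behind another, or be thrown away and rebuilt from offset 0, without the sources or the other consumers noticing.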
A big, ambitious piece of writing – props to Mr Kreps and the LinkedIn engineering org for sharing.