23 August 2010

The Eternal Tao: Push vs. Pull

The Tao is the concept of opposites, articulated historically by Laozi (the Lao Tzu of my youth).   The two opposites are referred to as yin and yang.

I am a Taoist in that I can see large patterns driven by opposites that seem diametrically opposed, and seem to always manifest, one with the other.

Right now on the Internet, another major paradigm shift appears to be under way: a shift from push to pull. In this model, push means moving information to wherever the application that wants to use it lives, whereas pull means the application goes and gets the information when it needs it.

Pull is enabled by cheap, fast, global communications, and standard ways to represent metadata - the data about data. (Ouch! It always makes my head hurt a little bit to say things like that, but it's true.) I am reading Pull: The Power of the Semantic Web to Transform Your Business by David Siegel. This is a terrific book, the best I've read in a while, and when I am done, I will be writing a review in this blog. Pull and metadata are the whole topic of Siegel's book.

Yin and yang seem to lob reality back and forth between them like a cosmic tennis game.   It has been compared to a pendulum, but I think of it more as a spiral - the classic Hegelian dialectic: thesis, antithesis, and then synthesis.   The two opposites usually seem to be tied to some third concept, at right angles to both, and that is what produces the spiral.

There is an old saying about remote operations, attributed to Don Box. If the cost of a local function call is 1, then the cost of a call that crosses local process boundaries is 1,000 and the cost to cross machine boundaries is 1,000,000. A big part of this is that every communication transaction has two costs: channel seizure and then data transfer. For short transactions, the channel seizure cost tends to dominate. That's why we open a file once and then read and write it many times, or open a TCP/IP connection, set up the SSL, and then use it for a while. Crossing process boundaries takes a major context switch. Crossing machine boundaries requires remoting, using stubs and ties at many levels.
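To make the arithmetic concrete, here is a small sketch in Python. The function name and the numbers are my own illustrative assumptions, in made-up relative units, not measurements of any real system. It simply spreads one channel seizure over however many requests share the channel.

```python
def cost_per_request(setup_cost, transfer_cost, requests_per_channel):
    """Average cost of one request when a single channel seizure
    (open, handshake, etc.) is shared by several requests."""
    if requests_per_channel < 1:
        raise ValueError("need at least one request per channel")
    return setup_cost / requests_per_channel + transfer_cost


# Hypothetical relative units: seizing the channel costs 1,000,
# moving one short message costs 1.
SETUP = 1000.0
TRANSFER = 1.0

one_shot = cost_per_request(SETUP, TRANSFER, 1)    # new channel every time
reused = cost_per_request(SETUP, TRANSFER, 100)    # one channel, 100 requests

print(one_shot, reused)
```

With these assumed numbers, the one-shot request is almost all setup, while the reused channel brings the average down to roughly the same order as the transfer itself. That is the whole argument for opening once and using many times.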

Caching can go a long way to hide this cost. That is why modern computing systems keep pools of pre-built, expensive objects, such as connections to files, databases, and remote machines, as well as pools of recently retrieved data from remote processes and machines. The hard problem here is cache coherency, which means keeping the local copy in the cache in sync with changes to the real data. Fortunately lots of work has been done on this, so we have a lot of tools in our toolbox. A meatspace example of the caching problem might be as simple as finding out that a relation has a new child that you didn't know about, or as complex as figuring out which is the final last will & testament of a deceased person.
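As a rough illustration of caching pulled data, here is a minimal Python sketch. The class name PullCache, the fetch callback, and the age limit are all hypothetical choices of mine. Expiring entries by age is only one simple way to limit staleness; it is not a full answer to the coherency problem, since a change made within the age limit goes unnoticed unless someone calls invalidate.

```python
import time


class PullCache:
    """Keeps recently pulled values and re-pulls them once they get too old."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch              # the expensive remote call (assumed)
        self._max_age = max_age_seconds  # how stale a copy we will tolerate
        self._clock = clock              # injectable so it can be tested
        self._entries = {}               # key -> (value, time it was stored)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._max_age:
            return entry[0]              # fresh enough: no remote call
        value = self._fetch(key)         # missing or stale: go pull it
        self._entries[key] = (value, now)
        return value

    def invalidate(self, key):
        """Drop the local copy when we learn the real data has changed."""
        self._entries.pop(key, None)


# Stand-in for a remote source; a real one would cross a machine boundary.
remote_data = {"greeting": "hello"}
cache = PullCache(remote_data.__getitem__, max_age_seconds=30)

first = cache.get("greeting")    # pulled from the "remote" source
second = cache.get("greeting")   # served from the local copy
```

The trade-off is visible in the one parameter: a longer age limit saves more remote calls but lets the local copy drift further from the real data.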

Still, cheap, fast, global communications, and standard ways to represent metadata are all reducing the channel seizure cost of going to get the data when you need it. Thus new application features are becoming possible and the Internet is becoming a more lively, integrated environment. I find it really exciting and fun!
