I'm starting to see common ways to design complex systems, so I figured I might as well start a series of blog posts which will be posted whenever I find something new.
The first design pattern, though it might be fairly obvious, addresses this problem: "Well, we can't have this system fail -- ever. Never-ever-ever FAIL, or you lose your job." That problem is fairly common. Your OS is a good example of a never-die system. Web servers are another one.
The way to solve the problem is to add a layer of indirection between the core and the services. The bottom layer will be the core system that will, to the best of your ability, NEVER fail. The top layer will be any mashup of services. Why does this design work so well? Because you can make changes to your extended services very easily while guaranteeing that any service failure will never affect the core system. What's the other way to do this? Combining the services with the core system. That's just a recipe for disaster.
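To make the two layers concrete, here is a minimal sketch of the idea in Python. Everything in it (the `Core` class, `register`, `dispatch`, the two toy services) is a name I made up for illustration, not code from any real system. It also only shows the weakest form of isolation: catching exceptions inside one process. A service that hangs or segfaults would still take the core down, so a real never-fail system would more likely put each service in its own process.

```python
from typing import Callable, Dict, List


class Core:
    """Hypothetical core layer: hands events to services and survives their failures."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[[str], str]] = {}
        self.failures: List[str] = []

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        # The layer of indirection: services are only reachable through this table.
        self._services[name] = handler

    def dispatch(self, event: str) -> Dict[str, str]:
        results: Dict[str, str] = {}
        for name, handler in list(self._services.items()):
            try:
                results[name] = handler(event)
            except Exception as exc:
                # A broken service is recorded and skipped; the core keeps running.
                self.failures.append(f"{name}: {exc!r}")
        return results


def greeter(event: str) -> str:
    return f"hello, {event}"


def broken(event: str) -> str:
    raise RuntimeError("service bug")


core = Core()
core.register("greeter", greeter)
core.register("broken", broken)
results = core.dispatch("world")
```

After the call, `results` holds only the reply from `greeter`, and `core.failures` has one entry for `broken`. The core never needed to know what either service does, which is the whole point of keeping the two layers apart.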
I'm amazed at how hard it is to see the design. At the bughouse server, we have 30,000 LOC of C, and everybody was still trying to add features to the server! The simplest design, staring them right in the face, was to farm the work out to automated bots. The additional power of such a design was that you could write a bot in any language you wanted. That's when you know you have a real winning design -- when the interface is so damn clear, you don't even need to specify the language the program has to be written in.
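Here is a rough sketch of what "farm the work out to bots" could look like when the interface is nothing more than lines of text on stdin and stdout. This is not the bughouse server's actual protocol; the `ask_bot` helper, the one-line request/reply format, and the toy upper-casing bot are all assumptions made up for the example. The bot here happens to be Python only so the sketch runs on its own; any executable that reads a line and prints a line would fit the same interface.

```python
import subprocess
import sys
from typing import List, Optional

# Stand-in bot: reads request lines from stdin, writes one reply line per request.
BOT_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)


def ask_bot(command: List[str], request: str, timeout: float = 5.0) -> Optional[str]:
    """Send one request line to a bot process; return its reply, or None if it misbehaves."""
    try:
        done = subprocess.run(
            command,
            input=request + "\n",
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The bot could not be started or it hung: the caller just gets no reply.
        return None
    if done.returncode != 0:
        # The bot crashed: again, the caller is unaffected.
        return None
    return done.stdout.strip()


reply = ask_bot([sys.executable, "-c", BOT_SOURCE], "tell alice hi")
crashed = ask_bot([sys.executable, "-c", "raise SystemExit(1)"], "tell alice hi")
```

Because the bot lives in its own process, a crash or a hang shows up on the server side as a missing reply rather than as a dead server, and nothing in `ask_bot` cares what language the bot was written in.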
We had a similar problem at Nutanix regarding an automated test system that needed to, ahem, "NEVER FAIL." I won't talk too much about it, but the last time we discussed the system, we were still talking about extending the whole thing with yet more Python. It's just so clear now -- you need to define an interface for easily adding services!
Anyway, this concludes my rant for today. That was a very enlightening write-up. I think I might do more of these as the design patterns come in.