
The Importance of Reinventing the Wheel

“Don’t reinvent the wheel” seems like pretty common sense advice, but is it good advice?

What kind of wheel do you use for your car?  My wheels have aluminum alloy rims with spokes, which cut down the amount of material that has to rotate and thereby reduce rotational inertia.  The outer part, the tire, is vulcanized rubber wrapped around steel mesh; the mesh gives the rubber extra structure and durability, and the vulcanization lets the rubber stay firm and hold together even as it heats up.

The treads in the tire are numerous, and were designed to let water flow in and around the tire as it rolls into contact with the ground and back out of contact again.  This maintains the maximum amount of friction with the road, so that my car can get traction and propel itself forward or backward.

When did this wheel stop being reinvented?  As far as I can tell, it never has.  According to my materials engineer friends, and as amateur practical attempts have proven, tire technology still lags behind what motors and drive shafts can do: tires cannot yet keep cars firmly on the road at the speeds cars can be propelled.  Wheels today are vastly different from the wooden spoked wheels of past centuries, or the rock wheels of past millennia.

Does the tire industry use the phrase “don’t reinvent the wheel”?  I doubt it.  Yet the phrase is used all the time in technology companies, even though software and hardware are known to be among the most volatile technologies in terms of their pace of change.  Why is this?

One reason is that it’s a simplification of “don’t overly complicate things”.  If you are performing a simple job, use the tools that exist and get it done, so you can move on to the next job.  If software exists that properly does a job, use it and move on.

What about when software does not properly do a job?  What if it does the job, but poorly, demanding significant maintenance and making every future change something to be feared and avoided because it might break the running system?

I believe this is when “don’t reinvent the wheel” is least likely to be useful advice.  All progress requires reinventing “wheels” all the time, or things would not be progressing.  The real question is “is this worth our time and money?”  That question is always appropriate and can always supersede a “let’s not reinvent the wheel” simplification.

When is it time to reinvent the wheel?

  1. When you want your wheels to move faster.
  2. When you want your wheels to last longer.
  3. When you want your wheels to provide better traction, especially moving fast and taking corners.
  4. When you have the capability to make a better wheel, and the time and money to do so.
  5. When the cost-benefit ratio is worth it.

Applying this thinking to your business provides a useful metaphor for when it is time to reinvent a wheel, and when it is time to use the wheels that already exist.  The metaphor has many built-in direct comparisons: speed can be exchanged for volume or turn-around time; lasting longer maps well to maintenance cycles; taking corners at speed maps to setting new business goals and having your organization and software change to meet them.

When I go to buy actual wheels for my car, I don’t develop my own tires, grow my own rubber trees, or mine and refine my own metals.  I am not skilled enough at any of these things to improve on the wheels sold by existing commercial organizations, and I could not do it for anywhere near the cost of buying a new wheel and tire.  I would have to buy ore and build a factory, or set one up at home (probably illegally), and it could take me years to create a usable wheel and tire combination, which would almost certainly be of far lower quality than the worst tires and wheels I could purchase.  That is clearly a poor option, and purchasing a wheel provides many benefits.

In technology things work similarly, and hardware is the closest analog, with a very similar creation process.  You can now outsource your fabrication, but physical electronic development is extremely difficult and error prone.  Even getting working hardware out of your outsourced fabrication plants and into your customers’ hands can be so tricky that businesses with working hardware specifications go under while trying to get their devices manufactured and delivered so they can be sold.

Software has the enormous benefit of being totally virtual.  Software merely has to attach properly to the environment it runs in (OS, drivers, libraries), fit within its physical resource constraints (storage, network and memory), and be internally consistent in order to provide the desired functionality.

Software provides one of the most obvious places to reinvent the wheel, because software is a series of commands to do what you want, and what you want is often different under different circumstances.  The same software wheel cannot provide you all the different results you want without being reinvented to update its internal logic and data to match your desires.

Many pieces of software, say the Apache HTTP server, are so generic and customizable that they become ubiquitous in internet-based software environments.  The original purpose of Apache was to deliver static content, in the form of HTML-formatted text and image files, and later to allow running executable programs whose results would be returned instead of the static content.

Over the years, our desires for what software will give us have changed dramatically, and Apache has changed dramatically too, yet it still does essentially the same job.  Apache was once at the heart of what a web server was; now it is merely a window that keeps requesters on one side and producers on the other while remaining mostly transparent, just routing information from requester to producer and back again, with some access and redirection rules.

Some organizations have done away with Apache, or use it only to deliver static text and images quickly, sending all other requests elsewhere.  The wheel of serving web requests has been rewritten, but has it been rewritten for the last time?

That is unlikely.  All that is needed before the next time you find yourself reinventing the wheel is a goal that current technology cannot meet in a satisfactory manner.

Time and money permitting.  :)

1 Part Corps of Engineers, 1 Part Secret Service

I’ve been thinking a lot lately about the role of System, Network and Database Administrators, and I’ve found it’s useful to think of us, collectively, as the “Operations” arm of an organization.  Whether I think of it as “Systems Operations”, “Network Operations” or some other variant doesn’t really help me understand our role any better, but it could be useful to differentiate it from some other kind of operations department.

What I have found helpful is to think about our responsibilities and the functions we perform.

The model I have at the moment is that we are 1 part US Army Corps of Engineers, and 1 part US Secret Service.

We are like the Corps of Engineers because we build infrastructure.  We survey problem sites.  We design solutions for the environments that we support.  This requires us to plan ahead, so that we are not short on resources and our work is done in coordination with other efforts, avoiding costly delays.  We need to do requirements planning, hold peer reviews to catch oversights in our plans that would lead to defects, and inspect the implementations for defects as things are built.

If defects are found, we have to determine how much they impact the project, and how much it will take to repair them.  If a defect will lead to a catastrophic failure, it is our duty to report it and raise awareness about the seriousness of the issue, so that the catastrophic failure does not occur.

When these steps are not followed, we get natural disasters whose exact timing could not be predicted, but whose eventuality could be predicted from a long-term understanding of the environment, its use, and its rate of decay.

This feels like a very useful analogy to me, and I think I can use it as a value system for making judgments.  “Should I do this?” and “What should I do?” are difficult questions on their own about any topic, but when I imagine what I would expect from the Corps of Engineers, just for my own and others’ safety, I have a clearer roadmap: I can see the similarities and come to decisions about how best to help my organization.

Operations also serves another purpose, though, which I think of as similar to the Secret Service in many interesting ways.  The Secret Service has many different jobs, but primarily they consist of protecting the President, protecting the emergency military response gear, protecting areas of Executive national security, and dealing with counterfeiting.

The Secret Service has the final say in matters of security for the President and Executive interests.  They are supposed to veto any plan that puts the President in unnecessary danger, and while they may be overridden at times, the domain of maintaining proper security belongs to them.

The Secret Service plans for failures, and drills constantly to ensure they are ready for unexpected events, stay calm, and respond according to a plan that has evolved from many previous experiences.

Finally, they have the responsibility to physically manhandle the President if an emergency situation breaks out.

I see a lot of parallels between Operations staff and the Secret Service.  We are responsible for ensuring that no one enters our systems without authorization.  We are responsible for ensuring that our data, customer data, and financial data are secure from being accessed by intruders.

We have to use certificate authorities, and secret handshakes built from publicly and privately known information, to allow our agents to communicate.
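As a concrete illustration, here is a minimal sketch (my own example, not from any particular system; the hostname is only a placeholder) of one of those secret handshakes in practice: a TLS connection in which a trusted certificate authority vouches for the identity of the server we are talking to.

    import socket
    import ssl

    # Minimal sketch: open a TLS connection and let the system's bundle of
    # trusted certificate authorities verify the server's identity.
    # "example.com" is only a placeholder host.
    host = "example.com"
    context = ssl.create_default_context()  # loads the trusted CA certificates
    with socket.create_connection((host, 443)) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            # If we get here, the handshake succeeded and the peer's
            # certificate chains up to a known certificate authority.
            print(tls_sock.version())
            print(tls_sock.getpeercert()["subject"])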

We are the ones who will first encounter problems with code on websites, or failures in hardware, and we must plan and carry out an evacuation of any data that might be lost, or redirect traffic so that our customers aren’t sent to a broken service.

If attackers manage to enter our systems, we must isolate and quarantine them quickly, thoroughly understand how they got in so we can stop that route from being used again, and find any similar vectors of attack.  We must then go through all the systems that can’t immediately be burned and resurrected and ensure they are not infected, or plan a rapid migration so we can clean each system and restore the safe perimeter of our organization.

Finally, I don’t think we need to tackle any of our management, but we do need to report our findings to them and convey the accurate level of importance and urgency of each one, according to its impact on the business, preferably with detail on how it is likely to affect the business and hard facts and data to back that up, so they can adjust their plans based on how well the previous plans are working.

What I like about these models is that they allow me to more easily imagine what I should be doing, and how I should be acting, not in a cop-drama way, but as an illuminated path towards the responsibility my organization could use from me.

The relationship between defects and volume

Imagine listening to an old musical recording on a set of speakers.  At low volume, pops and cracks are noticeable because old recordings were done on records which had physical scratches that created the extra noise.

If the volume is turned up so that the old recording plays louder, the pops and cracks become more noticeable and may start to take up more of your attention.  As the volume becomes very loud, the degraded experience of listening to the music through the pops and cracks may cause you to stop listening altogether, because the defects have made it unenjoyable.

Now imagine listening to a very clear new musical recording on the same set of speakers.  At low volume, the sound is clear and enjoyable.  At a reasonably high volume the sound is clear and enjoyable.  As you approach the maximum volume, pops and cracks start to be audible, as the defects in the speakers themselves are now being displayed.

Volume causes defects to become noticeable and important.

This is why a design for a web application that is quickly thrown together may work for an initial group of users, but once the site becomes popular, the application can crash or fail to keep up with the load.  Operating the application may require many more servers to be purchased, or worse, it may not scale onto more servers at all and require purchasing an ever larger and more powerful single machine.

Horizontal growth can be purchased at slightly more than linear cost, with a higher staffing or automation requirement; vertical growth becomes exponentially more expensive, until it ceases to be possible to grow vertically at all and some horizontal scaling must be added (typically by scaling out slowly with similarly expensive vertical solutions).
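To make the shape of that cost difference concrete, here is a rough sketch with entirely invented prices and growth factors (they illustrate the curves, not any real vendor’s pricing):

    import math

    # Illustrative cost model only; unit prices, overheads and growth
    # factors are invented to show the shape of the curves.

    def horizontal_cost(capacity_units, unit_price=2000, mgmt_overhead=0.15):
        # Slightly more than linear: identical machines, plus a staffing or
        # automation overhead that grows with the size of the fleet.
        return capacity_units * unit_price * (1 + mgmt_overhead)

    def vertical_cost(capacity_units, base_price=2000, step_multiplier=2.5):
        # One ever-bigger machine: assume each doubling of capacity costs
        # 2.5x the previous machine, so cost climbs much faster than capacity.
        doublings = math.log2(capacity_units)
        return base_price * (step_multiplier ** doublings)

    for units in (1, 2, 4, 8, 16, 32):
        print(units, round(horizontal_cost(units)), round(vertical_cost(units)))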

The volume of use that anything gets corresponds directly to the likelihood that its defects will cause a failure.
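A back-of-the-envelope way to picture that relationship, with invented numbers: even a defect that bites only one request in a million becomes a near-certainty once you serve millions of requests.

    # If a defect affects each request independently with probability p, the
    # chance that at least one request hits it is 1 - (1 - p)^n.
    # The numbers below are purely illustrative.
    p = 1e-6  # chance that a single request trips the defect
    for n in (1_000, 100_000, 10_000_000):
        at_least_one = 1 - (1 - p) ** n
        print(f"{n:>12,} requests -> {at_least_one:.1%} chance of hitting the defect")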

Once this is understood, and used to assist in decision making, a solid plan can be put together for how long a defect can be allowed to exist on the path to increasing volume.

Solutions Design: Fast, Cheap, Good. Choose 3.

In the past, the truism was: Fast, cheap, good.  Choose 2.

Whether talking about software design or systems operations design, we are now living in the future.  Why has this changed?  Because the economies of scale for completing software and operations projects have changed.

The kernel of truth is still there: each of these elements, speed, cost and quality, can press upon the others and unbalance a solution.  If a project is done fast, it can be done sloppily, without dotting every “i” and crossing every “t”.  To create a solution faster, you could hire more people, which makes the solution fail at cheapness.  If you want quality, you could spend a long time building it and pay for the best in the business.

There is a new way of looking at all of this, however, because like technology, other things have progressed with the times as well.  The relationship between fast, good and cheap has not changed, but changes of scale have made it possible to work faster, better and cheaper, so when best-of-breed hybrid techniques are applied, none of the three areas needs to suffer disproportionately.

Our ability to understand project management has drastically improved over the past 30 years of mainstream computing solution design and development.  Our understanding of how to use technology, with approaches like object orientation, 3-tier server systems and Agile development methodologies, to name only three well-known improvements, gives us a better way to approach solution creation.  There are thousands of improvements out there, in public, available for anyone to learn and begin using.

Using this knowledge and these skills leads to the ability to work faster and create higher quality work.  Additionally, with the broad spectrum of commercial and open source solutions to leverage, there are many applications, services and libraries that turn projects into mostly integration work, where the majority of the original thinking goes into managing the host of existing solutions.

“Fast, cheap, good: choose 2” will remain a fun joke for those feeling the pressure to complete projects, but with all the information out there at your fingertips, all the work already completed and made available for you to use by others, and having personally seen proof to the contrary, I think it is time to retire this saying as a truism.

The bar has been raised; if you’re only getting 2 these days, you’re doing it wrong.

Emergency is just another word for incompetency

In the system administration world, unexpected events need to be expected.

Your hard drives failed?  The Mean Time Between Failures (MTBF) gave you a statistical prediction that this was going to happen; you should have planned for it.
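Here is a quick illustration of why that failure should be a planning input rather than a surprise; the MTBF figure and fleet size are invented for the example, and it assumes failures are independent:

    # Rough expectation only: takes the vendor's MTBF at face value and
    # assumes drive failures are independent.  Numbers are illustrative.
    mtbf_hours = 1_000_000      # quoted mean time between failures per drive
    drive_count = 500           # drives in the fleet
    hours_per_year = 24 * 365

    expected_failures_per_year = drive_count * hours_per_year / mtbf_hours
    print(f"Expect roughly {expected_failures_per_year:.1f} drive failures per year")
    # About 4.4 failures per year for this fleet: not an emergency, a schedule.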

A network partition occurred?  All networks can be built with multiple paths, and a cost-benefit calculation lets you decide how much redundancy to create to plan for every level of internal and vendor failure.  The CAP theorem (Consistency, Availability, Partition tolerance) assists us further, telling us that in any networked environment that must tolerate partitions, we either have to focus on consistency, with periods of unavailability, or on availability, with eventual consistency.  In either case, if properly planned and implemented, a solution can provide an acceptable result and not create an emergency.
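A toy sketch of that trade-off (the structure and names here are mine, not any real system): a value kept on several replicas can be read either by demanding agreement from a majority, which is consistent but refuses to answer during a partition, or from whichever replica is reachable, which always answers but may return stale data.

    # Toy model of the CAP trade-off; "replicas" are plain dicts rather than
    # real servers, and a partition is modeled as a replica being unreachable.
    replicas = [
        {"value": "v2", "reachable": True},
        {"value": "v2", "reachable": False},  # cut off by a partition
        {"value": "v1", "reachable": False},  # lagging and cut off
    ]

    def consistent_read(replicas):
        # Require a majority to answer and agree; otherwise refuse.
        up = [r["value"] for r in replicas if r["reachable"]]
        if len(up) <= len(replicas) // 2 or len(set(up)) != 1:
            raise RuntimeError("not enough agreeing replicas; refusing to answer")
        return up[0]

    def available_read(replicas):
        # Answer from any reachable replica, accepting possible staleness.
        for r in replicas:
            if r["reachable"]:
                return r["value"]
        raise RuntimeError("no replica reachable at all")

    print(available_read(replicas))           # always answers if anything is up
    try:
        print(consistent_read(replicas))
    except RuntimeError as err:
        print("consistency sacrificed availability:", err)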

If fire-drill work erupts frequently enough that your team cannot replace legacy infrastructure fast enough, and is primarily tasked with responding to fires, then there is a systemic problem in place.

The solution to all of these problems is organization, planning, applying expertise to problem domains, having designs and work reviewed by qualified peers, and the all-encompassing requirement to care about the quality of work performed, levels of robustness, and comprehensive failure plans, both through automation and human processes.

When you are experiencing frequent emergencies, it is time to look inward towards your processes and approaches to your work, because some things are being done poorly enough to cause these emergencies, and failure to reflect and change will not lead to anything but more emergencies.

Utility Computing vs Cloud Computing

I have spent some time thinking about the functional differences between the terms Utility Computing and Cloud Computing, both as I think they are used today, and as they could be used to differentiate a different class of service.

I see Utility Computing as a service provider that sells computing instances, computing time slices, networked and “local” storage, computing services (MapReduce, key stores, message queues), the network bandwidth needed for all of this, and ways to reliably direct traffic for your site to a single machine or to multiple machines (a floating IP address or a load balancer).

The way Utility Computing service providers deliver these things gives you details about the instances, the volumes, and the descriptor names for their network services, but the important point is that you are given a label for a real VM instance on real hardware.  You are tracking something that is essentially a fixed resource; an EC2 instance gives us its instance ID (i-12345678), and with that we can reference only that one particular assignment of physical hardware and Xen VM instance.

To contrast this, a Cloud Computing provider would give you an idealized system, and the actual VM instance or real hardware behind it would be forever abstracted away.  You would know simply that you have a MySQL database with two 200GB network-attached volumes in a RAID 1 configuration, 32GB of RAM and 20 CPU units, and the Cloud Computing provider would give you a label for this stored goal, which might or might not have an actual instance behind it at any given moment.  Whether it does depends on its current configuration state, which could change at any time.  The Cloud Computing provider would ensure that a new instance of the correct specification is brought back under the goal; that in a pool of 20 machines, each can have several volumes assigned to their respective device paths; and that when a replacement instance launches, all volumes are re-attached in their proper places.  The machine’s configuration process is initiated with the knowledge that it is part of a pool, and may load only a certain data set (sharding).
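A minimal sketch of what such a goal label might look like; every name here (“Goal”, “launch_instance”, the volume IDs) is hypothetical and stands in for no real provider API.  The goal records the desired specification and volume attachments, and a reconciliation step launches a replacement instance and re-attaches the volumes whenever the instance behind the label disappears.

    import uuid
    from dataclasses import dataclass
    from typing import Optional

    # Hypothetical sketch only: none of these names correspond to a real API.

    @dataclass
    class Goal:
        label: str                          # the stable name users reference
        spec: dict                          # desired RAM, CPU, software role
        volumes: dict                       # device path -> volume id
        instance_id: Optional[str] = None   # the current, replaceable instance

    def launch_instance(spec):
        # Stand-in for a provider call; returns a fake instance id.
        return "i-" + uuid.uuid4().hex[:8]

    def reconcile(goal):
        # If the instance behind the goal is gone, bring up a new one of the
        # correct specification and re-attach every volume where it belongs.
        if goal.instance_id is None:
            goal.instance_id = launch_instance(goal.spec)
            for device, volume_id in goal.volumes.items():
                print(f"attach {volume_id} to {goal.instance_id} at {device}")
        return goal.instance_id

    db = Goal(label="primary-mysql",
              spec={"ram_gb": 32, "cpu_units": 20, "role": "mysql"},
              volumes={"/dev/sdf": "vol-aaaa1111", "/dev/sdg": "vol-bbbb2222"})
    print(reconcile(db))    # launches an instance and attaches the volumes
    db.instance_id = None   # the hardware behind the label fails
    print(reconcile(db))    # a different instance now serves the same goal

The label stays constant while the instance ID behind it changes, which is exactly the inversion from the Utility Computing model above.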

I see the difference as having a label for a machine instance versus having a label for the goal of what you want any instance behind that label to perform, and I believe this is the useful distinction between Utility Computing and Cloud Computing.  It underlines my feeling that, at present, Amazon’s EC2 service is a Utility Computing service, and is only starting to become a Cloud Computing service with the new RDS (Relational Database Service), which allows you to specify a goal for a database system, with its own backup and restore automation, though I haven’t launched one yet to see whether the offering still leans more towards Utility or delivers the Cloud abstraction and management.

Presently, if you want Cloud Computing, you have to implement it yourself, or pay someone to help implement it for you, so that your computing goals remain functioning even as the underlying hardware fails, is replaced (perhaps with hardware in a different data center or region) and is re-configured, allowing the goal to be picked up by a new set of hardware and still serve the same function.  I believe that is why Cloud Computing creates so much interest, and why it appears to be becoming a foundational pillar of the next wave of computing.