Dear students – Call this "The Architect as Detective." It's a real tale from the trenches, an example of the kind of architecture case history, told from the perspective of one participant, that would be interesting to use for your class case study. Note: this particular one is too short, based on my experience! But if you can find something similar with a bit more to it, that would be great!
----- Original Message [edited to avoid any identification] -----
From: <unnamed software architect>
To: <other architects he knows, including your professor>
Sent: Monday, March 28, 2005 1:23 PM
Subject: I am NOT making this up!
I just have to vent. The more things change, the more they remain the same.

Technical Tragicomedy:

You folks might be dismayed or amused by this, but I just had to put this one down for posterity.
Chapter 1 - Performance/Capacity
Sometime last year I was (yet again) called in to help out with and oversee a task force which had been started to pull a project out of trouble. The original symptoms looked like a performance problem. They were about to ship Release 2 (R2) to the customer. The customer had only given them capacity/throughput requirements the month before. (Sound familiar?) The project was at 17 transactions per second (TPS), but the requirement was 70 TPS.
Note that the software was being written by a third-party vendor (3PV), and there was the usual "It's the platform!" "No, it's the application!" type of finger-pointing.
I started rounding up the usual suspects: database abuse, multi-threaded code written by folks who had never written such code before (ill-formed mutexes), malloc/free abuse, etc.
Yep, it was all there: 25 database calls (on average) per transaction; some mutexes (spin locks!) held for the duration of a transaction, effectively single-threading the application on a 4-CPU machine; and objects allocated on the fly. Not only the necessary objects, either: every object for every possible transaction type was allocated at the start of ANY transaction!
The unanimous technical recommendation was that the application (and probably some platform libraries as well) needed a major overhaul. Little did we know that we had just scratched the surface!
In the meantime, the marketers had, without the benefit of consulting the 3PV or anyone else, for that matter, promised delivery of R3 by the end of the year, with lots of new features. They vetoed the re-architecture work, saying that we would deliver R3 on the R2 base and then start the re-architecture work to increase capacity. (If you believe that, there's this lovely bridge in Brooklyn I'd like to sell you.)
Chapter 2 - Stability
During the course of the initial investigation I had a discussion with the 3PV's performance analyst, and he provided me with the test results and graphs he had used to come up with the 17 TPS figure. All of the graphs stopped after about an hour.
Me: "Where are the results from the 72-hour stability run?"

Him: "Well, I haven't been able to keep the application up for more than an hour, and I'm making do with what I have."
(Red flags and alarm bells immediately go off in my head.)
Me: "Did you ever have a stable run?"

Him: "Well, yes, R1 was stable when we delivered it to the customer. We've been on such a tight schedule with the features that we haven't had any stability runs on R2."
(More red flags, more alarm bells!)
During the conference call the next day, I told the folks that we needed to re-create the last stable version and then investigate exactly what had changed between then and now. One of the participants claimed that it would take too long to do that, since they were on Checkpoint (CP) 27 of R2 at the present time.
Me: "You mean to say that there hasn't even been a 12-hour stability run during that whole time?"

They: "Yes. Even now, it would just take too long to do a stability run on every checkpoint. There just isn't enough time."
Me: "Do a binary search. We know that CP1 is stable (since CP1 is a copy of the R1 code) and we know that CP27 is unstable. Start with CP14."
The instability was found somewhere between CP23 and CP25. After much pulling of teeth, it was determined that the major change between CP23 and CP25 was the addition of "lots" of timers to the code, e.g. transaction timeout, response timeout, etc. (In addition, someone had put in a couple of suspicious compiler flags, e.g. "-no-align". Fixing the compiler flags was a piece of cake in comparison to what happened next.)
Some of the added timers were on the order of hours. "No problem," I thought, "when the transaction is done, these should be cancelled."
Timer Code Developer (TCD): "Well no, actually, when a timer is cancelled, we just mark it as cancelled and then we clear it when the original expiration time is reached. We thought it would create a performance problem to clear the timer when it was cancelled, because we would need to put a mutex around the 'cancel' code since it would be deleting entries from the middle of the linked list. We have a thread which is scheduled every 100 ms. It looks at the first timer in the list and, if it has expired, removes it from the beginning of the list."
Me: (Getting suspicious at the wording.) "Does this thread remove all the timers that have expired?"

TCD: "No, we just look at the top timer."
Me: "So the best you can do is expire 10 timers per second?"

TCD: "Yes, but I don't see why we need so many timers."
(What I wanted to say: "You idiot, the requirement is for 70 transactions per second and there are 3 timers per transaction!")

(What I really did say:)
Me: "You realize that you have created a memory leak, don't you? At 17 TPS, we're allocating about 50 timers per second and you're only freeing 10 timers per second."
TCD: "Oh, that's not a problem. We allocate a static data structure of 100,000 timers and we just re-use that. If the structure is full, we don't set the new timer."
Me: "How do you know that the structure is full?"
TCD: "I traverse the list and if there is a free element, I insert it into the right place. If there are no free elements, I don't insert it."
(More red flags and alarm bells! This smells like an O(N) algorithm!)
Me: (Directing the question at the application developers) "What does the application do when a timer can't be set?"
AD: "We can't tell when that happens. The method to allocate a timer is declared as void. (No return value.)"
(More red flags and alarm bells!)
Me: (To the performance analyst) "How far into the run do you get before it goes unstable?"

Him: "Usually about 40-45 minutes."
Me: "Let's see: at a rate of 17 TPS, we need to allocate roughly 180,000 timers during the first hour. We're clearing 36,000 timers during that same hour. That means that by the end of the hour, we would need an allocation of about 144,000 timers. We have a static allocation of 100,000. When I do the arithmetic, this shows me that about 42 minutes into the run we're out of timers. Does anyone other than me see a pattern here?"
TCD: "But this code has been running for two years."

Me: "Has it ever been stressed to this extent by any other application?"

TCD: "I don't know."
After much discussion, the following course of (in)action was decided upon:
The timer (library) code would be fixed in two stages. First, the linked list (which was already sorted by expiration time) would be traversed to delete all expired timers during every pass. This would at least create a situation where the outflow would equal the inflow. Simple "Little's Law" computations were done to determine the size of the structure needed to minimally satisfy customer requirements for a trial. The results indicated a timer pool of 600,000. And that was with the largest timer value set at about one eighth of the customer requirement/desirement.
As a second pass, the algorithm for cancelling a timer would be modified to actually *delete* the timer when it was cancelled, rather than wait for the expiration time. It was claimed that this effort would take longer; that's why the two-pass approach. Once this scheme is implemented, the size of the pool necessary would be *only* around 50,000.
In addition, the method which allocated timers would return a Boolean. But the application developers said that, given the short time frames and the fact that most of the developers were working on R3 features, the only code change they could make was to log this as an error. In the "future" they would do something intelligent when the error occurred. (Can you spell "ticking time bomb"?)
Computations were done with a spreadsheet to determine a set of timeout values and static allocations (that's how we arrived at the 600,000 mentioned above) which would get us to a steady state for the pass-1 fixes, and we survived a 90-hour stability run. During the last three hours of the run, no new transactions were being fed into the system, but at least we had a ballpark number for the efficiency of the timer-clear code: it was still taking up 3-5% of the CPU even with no new transactions being entered. A little further analysis of the graph indicated that it was linear at the tail end, i.e. it seemed to be O(N).
<aside>
A colleague who looked at the timer code asked why they had re-invented timer code when there was timer code already available which operated in O(log(N)) time and had been around for at least 8 years.
The answer was that this particular code had to run not only on Solaris, but also on Linux and Windows, so they had decided to develop their own. Now, folks, I ask you: why do we keep re-inventing octagonal wheels? Wouldn't it have been much smarter to take the C code for the existing timer library and port it to the various flavors of Linux and Windows rather than roll your own? Even if we had to #ifdef the spit out of the code to get the system calls right? Aaaaaargh!
</aside>
So, we got a stable run for about 72 hours with the simplest transaction type. But it wasn't a "real world" test. It was decided to pass parameters to the transaction rather than let it assume defaults. This is accomplished with a call to "set_payload" and a reference to the object. The run was aborted after a day or so when we started paging ourselves to death. The memory usage graph was monotonically increasing.
Next conference call -

Me: "OK, what happens in 'set_payload'?"

Library Developer (LD): "We set the payload to whatever is specified."

Me: "What happens if there's already a payload there?"

LD: "We allocate a new payload object."

Me: "What happens to the old payload object?"

LD: "Nothing."

Me: "You don't deallocate it?"

LD: "No, that's up to the application. We have a 'delete_payload' API."

Application Developer (AD): "Well, I guess I could grep for all the places in the code where we do a 'set_payload' and insert a 'delete_payload' just before that call. There must be a couple of dozen."

Me: "What would happen if we were to 'delete_payload' and there was no payload there?"

LD: "We throw an exception."
<commentary>
Now, this is prima facie evidence that there are people in the world with [certain deficiencies]! He doesn't report an error on dual allocation, but throws an exception when someone is trying to "free" something which has not been allocated! Aaaaaaargh!
</commentary>
I had to cut the shouting match short, because the LD was claiming it was up to the application to do things right, while the AD was claiming that, at this late stage in the game, it would take too long to go through all the code and change it, especially since it was NOT safe to just arbitrarily do a "delete_payload" call prior to every "set_payload" call.
Result: "set_payload" was modified to deallocate the prior payload if it existed. The LD complained to his manager that I had been "heavy-handed" and had not listened to his arguments. That particular manager was a long-term colleague. After a brief five-minute phone call with me, he told the developer: "Just shut up and do it!"
The subsequent stability run indicated a memory leak of 75 MB per year (extrapolated from the 72-hour run). I can live with this, since there is scheduled down-time for maintenance every 90 days.
We finally delivered a stable R2 sometime in early February. (Remember that *R3* was supposed to be delivered in December!) It was running fine at 25 TPS, with no instabilities. In the meantime, we had negotiated with the customer and found out that they didn't really need the 70 TPS until their trial was over and they cut over to live traffic. They only needed about 15 TPS for the trial. Some management person felt that with a hardware upgrade to a 2.5X-faster machine and a bit of tweaking we would get there eventually. But that involves going from 32-bit to 64-bit, and who knows what other roaches may crawl out of the woodwork. (Like the spin locks?)
Some folks breathed a sigh of relief. I was not one of them since, while these fixes were being done, R3 features were being added to the unstable code base. We needed to go through this whole exercise again for R3, because the marketing people had now promised the customer that, for sure this time, R3 would be delivered by the end of March. Did you ever feel like Sisyphus, forever rolling a rock uphill?
Chapter 3 - R3
A work still in progress. The first runs core-dumped almost immediately. It turned out that the fixes to the libraries for R2 were NOT carried forward to the R3 libraries, and there were yet more new compiler flags which were incompatible. (Yes, the issue of multiple load-lines rears its ugly head again, as does the issue of developer self-discipline.)
Once that was fixed, the CPU usage went up by 10% (to ~67%) for the same transaction load. Symptoms indicate that it core-dumps when the CPU usage goes above 70% or so; thus, when someone logs in and tries to check the health of the system during the stability run, it goes unstable. One symptom is a null-pointer de-reference. Other symptoms indicate that certain variables which a thread expects to be set are not.
The astute reader will note that in a multi-threaded environment, these kinds of problems can be (and often are) caused by race conditions between processing threads, e.g. someone NOT putting a mutex around a section of code which does resource allocation.
One piece of new code we found:

    ptr->foo = bar;

    /* some lines down */

    if (ptr == NULL) {
        /* log error, return failure */
    }
The core dump occurred at the first statement, and no one here should be surprised.
Next conference call -

Me: "This should have been caught during a code inspection."

They: "With such a tight schedule, we've had to cut some corners."

Aaaaaargh! Here we go again!
Meanwhile, while all these tests were going on, we were limited by lab resources. Other projects needed the machines to do their own testing, sometimes on a different load-line, so the set-up time to do a re-install ate up a significant portion of the lab time. Spit!
P.S. - In case you haven't noticed, we haven't even addressed the "too many/few mutexes" issue, or the database abuse issue, or the malloc/free abuse issue yet in terms of capacity/throughput. This one is going to be a long journey.
My latest soapbox: "The only thing worse than not delivering a product on time is delivering a product on time which doesn't work."
Thank you all for reading/listening to this ranting.
I hope you all had a Happy Easter with your families.
[author sig]