Dear students – Call this "The Architect as Detective." It's a real tale from the trenches, an example of the kind of architecture case history, told from the perspective of one participant, that would be interesting to use for your class case study. Note: this particular one is too short, based on my experience! But if you can find something similar with a bit more to it, that would be great!
----- Original Message [edited to avoid any identification] -----
From: <unnamed software architect>
To: <other architects he knows, including your professor>
Sent: Monday, March 28, 2005 1:23 PM
Subject: I am NOT making this up!
I just have to vent. The more things change, the more they remain the same.

Technical Tragicomedy:

You folks might be dismayed or amused by this, but I just had to put this one down for posterity.
Chapter 1 - Performance/Capacity
Sometime last year I was (yet again) called in to help out with and oversee a task force which had been started to pull a project out of trouble. The original symptoms looked like a performance problem. They were about to ship Release 2 (R2) to the customer. The customer had only given them capacity/throughput requirements the month before. (Sound familiar?) The project was at 17 transactions per second (TPS), but the requirement was 70 TPS.
Note that the software was being written by a third-party vendor (3PV), and there was the usual "It's the platform!" "No, it's the application!" type of finger-pointing.
I started rounding up the usual suspects: database abuse, multi-threaded code written by folks who had never written such code before (ill-formed mutexes), malloc/free abuse, etc.
Yep, it was all there: 25 database calls (on average) per transaction; some mutexes (spin locks!) held for the duration of a transaction, effectively single-threading the application on a 4-CPU machine; and objects allocated on the fly. Not only the necessary objects, either: every object for every possible transaction type was allocated at the start of ANY transaction!
The unanimous technical recommendation was that the application (and probably some platform libraries as well) needed a major overhaul. Little did we know that we had just scratched the surface!
In the meantime, the marketers had, without the benefit of consulting the 3PV or anyone else, for that matter, promised delivery of R3 by the end of the year, with lots of new features. They vetoed the re-architecture work, saying that we would deliver R3 on the R2 base and then start the re-architecture work to increase capacity. (If you believe that, there's this lovely bridge in Brooklyn I'd like to sell you.)
Chapter 2 - Stability
During the course of the initial investigation I had a discussion with the 3PV's performance analyst, and he provided me with the test results and graphs he had used to come up with the 17 TPS figure. All of the graphs stopped after about an hour.
Me: "Where are the results from the 72-hour stability run?"

Him: "Well, I haven't been able to keep the application up for more than an hour, and I'm making do with what I have."
(Red flags and alarm bells immediately go off in my head.)
Me: "Did you ever have a stable run?"

Him: "Well, yes, R1 was stable when we delivered it to the customer. We've been on such a tight schedule with the features that we haven't had any stability runs on R2."
(More red flags, more alarm bells!)
During the conference call the next day, I told the folks that we needed to re-create the last stable version and then investigate exactly what had changed between then and now. One of the participants claimed that it would take too long to do that, since they were on Checkpoint (CP) 27 of R2 at the present time.
Me: "You mean to say that there hasn't even been a 12-hour stability run during that whole time?"

They: "Yes. Even now, it would just take too long to do a stability run on every checkpoint. There just isn't enough time."
Me: "Do a binary search. We know that CP1 is stable (since CP1 is a copy of the R1 code) and we know that CP27 is unstable. Start with CP14."
The instability was found somewhere between CP23 and CP25. After much pulling of teeth, it was determined that the major change between CP23 and CP25 was the addition of "lots" of timers to the code, e.g. transaction timeout, response timeout, etc. (In addition, someone had put in a couple of suspicious compiler flags, e.g. "-no-align". Fixing the compiler flags was a piece of cake in comparison to what happened next.)
Some of the added timers were on the order of hours. "No problem," I thought, "when the transaction is done, these should be cancelled."
Timer Code Developer (TCD): "Well no, actually, when a timer is cancelled, we just mark it as cancelled and then we clear it when the original expiration time is reached. We thought it would create a performance problem to clear the timer when it was cancelled, because we would need to put a mutex around the 'cancel' code since it would be deleting entries from the middle of the linked list. We have a thread which is scheduled every 100 ms. It looks at the first timer in the list and, if it has expired, removes it from the beginning of the list."
Me: (Getting suspicious at the wording.) "Does this thread remove all the timers that have expired?"

TCD: "No, we just look at the top timer."
Me: "So the best you can do is expire 10 timers per second?"

TCD: "Yes, but I don't see why we need so many timers."
(What I wanted to say: "You idiot, the requirement is for 70 transactions per second and there are 3 timers per transaction!")

(What I really did say:)
Me: "You realize that you have created a memory leak, don't you? At 17 TPS, we're allocating about 50 timers per second and you're only freeing 10 timers per second."
TCD: "Oh, that's not a problem. We allocate a static data structure of 100,000 timers and we just re-use that. If the structure is full, we don't set the new timer."
Me: "How do you know that the structure is full?"
TCD: "I traverse the list and if there is a free element, I insert it into the right place. If there are no free elements, I don't insert it."
(More red flags and alarm bells! This smells like an O(N) algorithm!)
Me: (Directing the question at the application developers) "What does the application do when a timer can't be set?"
AD: "We can't tell when that happens. The method to allocate a timer is declared as void. (No return value.)"
(More red flags and alarm bells!)
Me: (To the performance analyst) "How far into the run do you get before it goes unstable?"

Him: "Usually about 40-45 minutes."
Me: "Let's see: at a rate of 17 TPS, we need to allocate roughly 180,000 timers during the first hour. We're clearing 36,000 timers during that same hour. That means that by the end of the hour, we would need an allocation of about 144,000 timers. We have a static allocation of 100,000. When I do the arithmetic, this shows me that about 42 minutes into the run we're out of timers. Does anyone other than me see a pattern here?"
TCD: "But this code has been running for two years."

Me: "Has it ever been stressed to this extent by any other application?"

TCD: "I don't know."
After much discussion, the following course of (in)action was decided upon:
The timer (library) code would be fixed in two stages. First, the linked list (which was already sorted by expiration time) would be traversed to delete all expired timers during every pass. This would at least create a situation where the outflow would equal the inflow. Simple "Little's Law" computations were done to determine the size of the structure needed to minimally satisfy customer requirements for a trial. The results indicated a timer pool of 600,000. And that was with the largest timer value set at about one eighth of the customer requirement/desirement.
As a second pass, the algorithm for cancelling a timer would be modified to actually *delete* the timer when it was cancelled, rather than wait for the expiration time. It was claimed that this effort would take longer; that's why the two-pass approach. Once this scheme is implemented, the size of the pool necessary would be *only* around 50,000.
In addition, the method which allocated timers would return a Boolean. But the application developers said that, given the short time frames and the fact that most of the developers were working on R3 features, the only code change they could make was to log this as an error. In the "future" they would do something intelligent when the error occurred. (Can you spell "ticking time bomb"?)
Computations were done with a spreadsheet to determine a set of timeout values and static allocations (that's how we arrived at the 600,000 mentioned above) which would get us to a steady state for the pass-1 fixes, and we survived a 90-hour stability run. During the last three hours of the run, no new transactions were being fed into the system, but at least we had a ballpark number for the efficiency of the timer-clear code: it was still taking up 3-5% of the CPU even with no new transactions being entered. A little further analysis of the graph indicated that it was linear at the tail end, i.e. it seemed to be O(N).
<aside>
A colleague who looked at the timer code asked why they had re-invented timer code when there was timer code already available which operated in O(log(N)) time and had been around for at least 8 years.
The answer was that this particular code had to run not only on Solaris, but also on Linux and Windows, so they had decided to develop their own. Now, folks, I ask you: why do we keep re-inventing octagonal wheels? Wouldn't it have been much smarter to take the C code for the existing timer library and port it to the various flavors of Linux and Windows rather than roll your own? Even if we had to #ifdef the spit out of the code to get the system calls right? Aaaaaargh!
</aside>
So, we got a stable run for about 72 hours with the simplest transaction type. But it wasn't a "real world" test. It was decided to pass parameters to the transaction rather than let it assume defaults. This is accomplished with a call to "set_payload" and a reference to the object. The run was aborted after a day or so when we started paging ourselves to death. The memory usage graph was monotonically increasing.
Next conference call -

Me: "OK, what happens in 'set_payload'?"

Library Developer (LD): "We set the payload to whatever is specified."

Me: "What happens if there's already a payload there?"

LD: "We allocate a new payload object."

Me: "What happens to the old payload object?"

LD: "Nothing."

Me: "You don't deallocate it?"

LD: "No, that's up to the application. We have a 'delete_payload' API."

Application Developer (AD): "Well, I guess I could grep for all the places in the code where we do a 'set_payload' and insert a 'delete_payload' just before that call. There must be a couple of dozen."

Me: "What would happen if we were to 'delete_payload' and there was no payload there?"

LD: "We throw an exception."
<commentary>
Now, this is prima facie evidence that there are people in the world with [certain deficiencies]! He doesn't report an error on dual allocation, but throws an exception when someone is trying to "free" something which has not been allocated! Aaaaaaargh!
</commentary>
I had to cut the shouting match short, because the LD was claiming it was up to the application to do things right, while the AD was claiming that, at this late stage in the game, it would take too long to go through all the code and change it, especially since it was NOT safe to just arbitrarily do a "delete_payload" call prior to every "set_payload" call.
Result: "set_payload" was modified to deallocate the prior payload if it existed. The LD complained to his manager that I had been "heavy-handed" and had not listened to his arguments. That particular manager was a long-term colleague. After a brief five-minute phone call with me, he told the developer: "Just shut up and do it!"
The subsequent stability run indicated a memory leak of 75 MB per year (extrapolated from the 72-hour run). I can live with this, since there is scheduled down-time for maintenance every 90 days.
We finally delivered a stable R2 sometime in early February. (Remember that *R3* was supposed to be delivered in December!) It was running fine at 25 TPS, with no instabilities. In the meantime, we had negotiated with the customer and found out that they didn't really need the 70 TPS until their trial was over and they cut over to live traffic. They only needed about 15 TPS for the trial. Some management person felt that with a hardware upgrade to a 2.5X-faster machine and a bit of tweaking we would get there eventually. But that involves going from 32-bit to 64-bit, and who knows what other roaches may crawl out of the woodwork. (Like the spin locks?)
Some folks breathed a sigh of relief. I was not one of them since, while these fixes were being done, R3 features were being added to the unstable code base. We needed to go through this whole exercise again for R3, because the marketing people had now promised the customer that, for sure this time, R3 would be delivered by the end of March. Did you ever feel like Sisyphus, forever rolling a rock uphill?
Chapter 3 - R3
A work still in progress. The first runs core-dumped almost immediately. It turned out that the fixes to the libraries for R2 were NOT carried forward to the R3 libraries, and there were yet more new compiler flags which were incompatible. (Yes, the issue of multiple load-lines rears its ugly head again, as does the issue of developer self-discipline.)
Once that was fixed, the CPU usage went up by 10% (to ~67%) for the same transaction load. Symptoms indicate that it core-dumps when the CPU usage goes above 70% or so; thus, when someone logs in and tries to check the health of the system during the stability run, it goes unstable. One symptom is a null-pointer de-reference. Other symptoms indicate that certain variables which a thread expects to be set are not.
The astute reader will note that in a multi-threaded environment, these kinds of problems can be (and often are) caused by race conditions between processing threads, e.g. someone NOT putting a mutex around a section of code which does resource allocation.
One piece of new code we found:

    ptr->foo = bar;

    /* some lines down */

    if (ptr == NULL) {
        /* log error, return failure */
    }
The core dump occurred at the first statement, and no one here should be surprised.
Next conference call -

Me: "This should have been caught during a code inspection."

They: "With such a tight schedule, we've had to cut some corners."

Aaaaaargh! Here we go again!
Meanwhile, while all these tests were going on, we were limited by lab resources. Other projects needed the machines to do their own testing, sometimes on a different load-line, so the set-up time to do a re-install ate up a significant portion of the lab time. Spit!
P.S. - In case you haven't noticed, we haven't even addressed the "too many/few mutexes" issue, or the database abuse issue, or the malloc/free abuse issue yet in terms of capacity/throughput. This one is going to be a long journey.
My latest soapbox: "The only thing worse than not delivering a product on time is delivering a product on time which doesn't work."
Thank you all for reading/listening to this ranting.
I hope you all had a Happy Easter with your families.
[author sig]