Re: Idea for Tinderbox Management & Performance Testing


Re: Idea for Tinderbox Management & Performance Testing

Dave Liebreich
[crossposting to mozilla.dev.quality]

L. David Baron wrote:

> On Wednesday 2006-04-26 10:54 -0700, Ben Goodger wrote:
>> Make all of the tinderbox server + client scripts installable through a
>> convenient rpm or msi installer that any developer can easily install on
>> their development box. They can then run their code through the same
>> tests the servers do.
>
> Robert Helmer suggested to me that the tests that tinderbox runs should
> be part of the source tree rather than the tinderbox scripts.  This
> could be done in a way that makes it easier for developers to run the
> tests, and I think it would solve the problem raised here.  It would
> also make it easier to maintain the tinderbox scripts, and make it
> easier to run the tests in automation frameworks other than tinderbox.
>
> -David
>

Please keep mozilla.dev.quality in the loop.  This will help achieve
critical mass of conversation about automated testing in that group.

I'm working on a simple structure for tests written in JavaScript that
should be portable across various test harnesses.  I was thinking of
writing an interface for tinderbox's FileBasedTest, but I'm also open to
changing the way tinderbox runs tests.
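
As a purely illustrative sketch (the property names and reporter shape
below are invented for this example; they are not the actual structure
being described, nor tinderbox's FileBasedTest interface), a
harness-portable test might look something like this:

  // Hypothetical sketch of a harness-portable test; none of these names
  // come from tinderbox or from any existing Mozilla test harness.
  var test = {
    name: "example-arithmetic-test",

    // The harness calls run() and passes in a reporter object it owns,
    // so the test itself has no dependency on any particular harness.
    run: function (reporter) {
      var expected = 4;
      var actual = 2 + 2;
      if (actual === expected) {
        reporter.pass(this.name);
      } else {
        reporter.fail(this.name, "expected " + expected + ", got " + actual);
      }
    }
  };

  // A trivial reporter that a command-line harness might supply:
  var consoleReporter = {
    pass: function (name) { dump("PASS " + name + "\n"); },
    fail: function (name, msg) { dump("FAIL " + name + ": " + msg + "\n"); }
  };

  test.run(consoleReporter);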

Thanks

--
Dave Liebreich
Test Architect, Mozilla Corporation

Re: Idea for Tinderbox Management & Performance Testing

Dave Liebreich
[crossposting to mozilla.dev.quality]

Ben Goodger wrote:
> I understand no one ever has time to implement these things, but I
> thought this was a good idea...

Please keep mozilla.dev.quality in the loop.  This will help achieve
critical mass of conversation about automated testing in that group.

I'm working on a simple structure for tests written in JavaScript that
should be portable across various test harnesses.

> Make all of the tinderbox server + client scripts installable through a convenient rpm or msi installer that any developer can easily install on their development box. They can then run their code through the same tests the servers do.

I would like to see tests that can be run on a dev box - that way they
can be run against a private build before the changes are checked in to
the repository.  This is why I added mozilla/tools/test-harness to the
list of directories checked out for most projects.

Please let me know if you can help work on this by posting in
mozilla.dev.quality or sending mail to dev-quality.

Thanks
--
Dave Liebreich
Test Architect, Mozilla Corporation

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Dave Liebreich
[cross-posting to all three groups where we've been having this conversation,
but maybe we should pick one and set followups to it?]

Mike Shaver wrote:
> To be honest, it's not clear to me that our current performance tests
> measure what we care about most, or perhaps at all (c.f. our
> ground-breaking results in the field of "effects of pathnames on
> ancient Linux filesystem performance").  Figuring out what we want to
> measure  and then looking at how to test those things would seem like
> a wise course, given that we've had basically the same test model for
> the last 5 years.  For me, they would likely focus more on perceived
> responsiveness than on the raw-throughput numbers that we seem to be
> gathering with Ts and Tp today.

I think we should probably measure both, especially for something like Tdhtml,
where the people who really care are the DHTML script writers, who _are_
generally complaining about raw throughput numbers in addition to perceived
performance.
Measuring raw throughput is also more or less the idea of things like Trender
and Tgfx where we're trying to measure the performance of a subsystem.  But yes,
generally for performance it's more useful to include whole-system tests, and if
we can somehow include a user or a reasonable approximation of one in our system
that would be great.

I note that there's not been much pageload benchmark chest-banging from major
browser vendors after Safari initially shipped.  Not sure whether that means
that we've all outgrown fudgeable benchmark numbers or something else.  ;)

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Dave Liebreich
Mike Shaver wrote:
> So I wonder why that's the case for us, when word on the street is
> that the WebKit guys have extremely consistent (1-2ms) performance
> results for their work.

Good question!

Looking at the page you cite, the 24Fun test involves things like opening new
windows; I've never been able to get that to better than about 25% consistency
on Linux, no matter what the WM or WM settings...  In fact, I'm lucky if I get
to within one-second accuracy on some of the 24Fun tests in two consecutive
runs.  I've tried to set up i-bench in the past and failed miserably; its being
unmaintained (to the point where the opendarwin link to it is broken) doesn't
help.  So I can't speak to how reproducible its numbers are...

That said, if we assume that my system is not loaded and that variation in the
tests is just random, then the right solution is probably a lot of test runs,
with averaging.  We can't do that on tinderbox because it would slow down cycles
too much, as far as I can tell.  Maybe if we ran once-daily perf tests on
dedicated machines we could do more of that sort of thing.  In fact, the idea of
running perf tests on dedicated machines has come up before -- have a fast
tinderbox that ships its nightlies to a slower test machine (where we get better
granularity on performance).  If we can do something like that, we really
should, imo.

Even without major tinderbox setup changes, it might be possible to increase our
precision in some simple ways.  Here's some typical data from our pageload test
(argo box on the firefox tinderbox, three different tinderbox cycles for the
same page, raw data, not much happening with the tree):

0 home.netscape.com  382    306    307    291    306
0 home.netscape.com  370    300    279    286    286
0 home.netscape.com  378    309    291    287    311

Note that the first run is slower than the others; that makes sense given clean
profiles and cache.  Also note that the remaining numbers have about 10%
variation in them.

The way we currently deal with that problem is by taking the median of the 5
times for each page and averaging them to get the total Tp number.  In this
case, the three medians are 306, 286, and 309.  Those three numbers have an
average of 300.3 and a standard deviation of 12.5 or so.

Now say we took the average of all but the first run in all three cases, instead
of the median.  That would give us 302.5, 287.8, 299.5, which have an average of
296.6 and a standard deviation of 8.  Somewhat better....

It might be worth looking at more pageload data, but I suspect that in general
taking all but the first run and averaging them will give us more stable numbers
than just taking the median.  This is especially true if we increase the number
of runs from the current 5, since averaging over more runs should give us better
random-noise smoothing than taking the median of more runs, I think.

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
In reply to this post by Boris Zbarsky
On 4/26/06, Boris Zbarsky <[hidden email]> wrote:
> [cross-posting to all three groups where we've been having this conversation,
> but maybe we should pick one and set followups to it?]

I think we should pick .quality, but I don't think I'm able to set
followups usefully through the email interface.  I'll follow your lead
when you reply, though!

(I gotta tithe to the bug about the news<->mail gateway really not
handling crossposts.)

> I think we should probably measure both, esp. for something like Tdhtml where
> the people who really care are the DHTML script writers and who _are_ generally
> complaining about raw throughput numbers in addition to perceived performance.

I don't think I agree with that, but since we're spitballing, let's
say they care about both equally.  (For example, I think they would
prefer 5 frames of 11ms each to 4 frames of 5ms and one of 30ms.)

> Measuring raw throughput is also more or less the idea of things like Trender
> and Tgfx where we're trying to measure the performance of a subsystem.

For performance of subsystems, I still think that we care a lot about
responsiveness.  We're just a lot better at measuring throughput. :)

For throughput measurement that's less randomized, I wonder if it'd be
interesting to count cycles or something, in addition to wall-clock
time.

>  But yes,
> generally for performance it's more useful to include whole-system tests, and if
> we can somehow include a user or a reasonable approximation of one in our system
> that would be great.

Yeah, session-replay and event synthesis would be great.

Mike

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
In reply to this post by Boris Zbarsky
On 4/26/06, Boris Zbarsky <[hidden email]> wrote:
> dedicated machines we could do more of that sort of thing.  In fact, the idea of
> running perf tests on dedicated machines has come up before -- have a fast
> tinderbox that ships its nightlies to a slower test machine (where we get better
> granularity on performance).  If we can do something like that, we really
> should, imo.

People talked at one point about underclocking machines to do this,
but it seems to me that we might meet success with something like the
VMWare ESX resource limits (which are controllable through an API, so
you could build at full-available-speed, and then lock down the guest
for a more sensitive performance-testing environment).

Mike

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Boris Zbarsky
Mike Shaver wrote:
> People talked at one point about underclocking machines to do this,
> but it seems to me that we might meet success with something like the
> VMWare ESX resource limits

Yeah, that would work.

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Boris Zbarsky
[followups to .quality]

Mike Shaver wrote:
> I don't think I agree with that, but since we're spitballing, let's
> say they care about both equally.  (For example, I think they would
> prefer 5 frames of 11ms each to 4 frames of 5ms and one of 30ms.)

I'll buy that, sure.

> For throughput measurement that's less randomized, I wonder if it'd be
> interesting to count cycles or something, in addition to wall-clock
> time.

graydon was mentioning something about running under valgrind to do just that,
at some point.  Note that we use wall-clock times in our app in perf-sensitive
ways (e.g. interruption of the parser happens via a poll-and-timeout setup), so
we should watch out for that if our cycle-counting mechanism has the sort of lag
valgrind does.

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Axel Hecht-2
In reply to this post by Boris Zbarsky
Boris Zbarsky wrote:

> Mike Shaver wrote:
>> So I wonder why that's the case for us, when word on the street is
>> that the WebKit guys have extremely consistent (1-2ms) performance
>> results for their work.
>
> Good question!
>
> Looking at the page you cite, the 24Fun test involves things like
> opening new windows; I've never been able to get that to better than
> about 25% consistency on Linux, no matter what the WM or WM settings...
> In fact, I'm lucky if I get to within one-second accuracy on some of the
> 24Fun tests in two consecutive runs.  I've tried to set up i-bench in
> the past and failed miserably; its being unmaintained (to the point where
> the opendarwin link to it is broken) doesn't help.  So I can't speak to
> how reproducible its numbers are...
>
> That said, if we assume that my system is not loaded and that variation
> in the tests is just random, then the right solution is probably a lot
> of test runs, with averaging.  We can't do that on tinderbox because it
> would slow down cycles too much, as far as I can tell.  Maybe if we ran
> once-daily perf tests on dedicated machines we could do more of that
> sort of thing.  In fact, the idea of running perf tests on dedicated
> machines has come up before -- have a fast tinderbox that ships its
> nightlies to a slower test machine (where we get better granularity on
> performance).  If we can do something like that, we really should, imo.
>
> Even without major tinderbox setup changes, it might be possible to
> increase our precision in some simple ways.  Here's some typical data
> from our pageload test (argo box on the firefox tinderbox, three
> different tinderbox cycles for the same page, raw data, not much
> happening with the tree):
>
> 0 home.netscape.com  382    306    307    291    306
> 0 home.netscape.com  370    300    279    286    286
> 0 home.netscape.com  378    309    291    287    311
>
> Note that the first run is slower than the others; that makes sense
> given clean profiles and cache.  Also note that the remaining numbers
> have about 10% variation in them.
>
> The way we currently deal with that problem is by taking the median of
> the 5 times for each page and averaging them to get the total Tp
> number.  In this case, the three medians are 306, 286, and 309.  Those
> three numbers have an average of 300.3 and a standard deviation of 12.5
> or so.
>
> Now say we took the average of all but the first run in all three cases,
> instead of the median.  That would give us 302.5, 287.8, 299.5, which
> have an average of 296.6 and a standard deviation of 8.  Somewhat
> better....
>
> It might be worth looking at more pageload data, but I suspect that in
> general taking all but the first run and averaging them will give us
> more stable numbers than just taking the median.  This is especially
> true if we increase the number of runs from the current 5, since
> averaging over more runs should give us better random-noise smoothing
> than taking the median of more runs, I think.

I don't think we should really fight noise with statistics, or by picking
values.  We're doing computing, not medicine.  Noise is valuable data, too,
and it should help us decide when data is actually regressing and when it's
just noise.  Arbitrarily reducing the noise with statistics may trick us.

Comparing the noise against ping times may be one way to characterize it.  Or
just time HEAD requests, so that you get a measure of the server noise.
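
(As an illustrative sketch only: the URL and sample count below are
arbitrary assumptions, and this is not code from any existing harness.)

  // Estimate server/network noise by timing bare HEAD requests.
  function timeHeadRequest(url) {
    var xhr = new XMLHttpRequest();
    var start = new Date().getTime();
    xhr.open("HEAD", url, false);  // synchronous, so the timing brackets the request
    xhr.send(null);
    return new Date().getTime() - start;
  }

  var samples = [];
  for (var i = 0; i < 10; i++) {
    samples.push(timeHeadRequest("http://example.org/"));
  }
  // The spread of these samples approximates the noise that the page-load
  // numbers inherit from the network and the server, independent of any
  // rendering work.
  dump("HEAD timings (ms): " + samples.join(", ") + "\n");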

So, on top of finding out what we want to measure, we should analyze how
we measure.

On the tests themselves, I recall some discussion during the onsite that we
use badly outdated tests which can only be run locally due to licensing
problems.  Is that true?

I wonder if we can get some tests with a better licensing scheme.  Snapshots
of Wikipedia pages come to mind; slashdot seems to be unworkable; digg.com
sounds less troublesome, since http://digg.com/tos says that all submitted
content is CC public domain.

Maybe hish and his folks can help with some AJAX perf tests?

Axel

Re: Idea for Tinderbox Management & Performance Testing

Axel Hecht-2
In reply to this post by Dave Liebreich
Ben Goodger wrote:

> I understand no one ever has time to implement these things, but I
> thought this was a good idea...
>
> We often have large changes that we want comprehensive testing run on
> for regressions in Ts, Tp, Txul, leaks, etc. It's usually much easier to
> crash land code and see how the tinderbox responds than to test for
> these regressions beforehand. This is bad, because it costs other
> developers productivity with tree closures etc.
>
> Tinderboxes should be configurable to an extent from the web, at least
> by anyone who has a cvs account.
>
> There should be a pool of tinderboxes that are available for developer
> testing. These machines are set up to run a variety of different tests
> on different platforms. There should be many of these.
>
> Developers would be able to request a tinderbox and then configure it
> through a web app - specifying what branch to build, .mozconfig etc.
> They then get the machine for up to a certain amount of time (since
> sometimes it can take a few cycles for numbers to stabilize, and you
> probably want to get pre-and-post change numbers too)... they can
> release the machine before that but they should not be able to hog it
> for days without a special reason.
>
> Alternatively...
>
> Make all of the tinderbox server + client scripts installable through a
> convenient rpm or msi installer that any developer can easily install on
> their development box. They can then run their code through the same
> tests the servers do.

As most of our tinderboxen are in VMs now, with configuration in CVS,
it should be possible to clone compile environments on demand.  At least,
that's what I assume preed and rhelmer are doing this for.

Tinderboxen should be more dynamic (or become build bots), and could, for
example, put the generated builds in people's upload areas.

Of course, those make crappy test environments.

The test environment is much more like a well-shielded batch-operating
scenario, which may want to pull builds from people's upload or staging areas.

I envision the build farm being more of a puddle of mud.  The test
environment should be more like
http://www.mpq.mpg.de/atomlaser/html/optischer_tisch.html

(half a ton of table, on air suspension)

Axel

Re: Idea for Tinderbox Management & Performance Testing

Robert Kaiser
In reply to this post by Boris Zbarsky
Boris Zbarsky schrieb:
> It might be worth looking at more pageload data, but I suspect that in
> general taking all but the first run and averaging them will give us
> more stable numbers than just taking the median.  This is especially
> true if we increase the number of runs from the current 5, since
> averaging over more runs should give us better random-noise smoothing
> than taking the median of more runs, I think.

The other problem is that first-run data is valuable, as the vast majority
of page loads by users are first-run loads, I guess...

Robert Kaiser

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
Robert Kaiser wrote:
> The other problem is that first-run data is valuable, as the vast number
> of page loads by users are first-run loads, I guess...

True.  Perhaps we should report both numbers from Tp.  That would make the
first-run number fluctuate a lot, unless we did a bunch of tests with the
profile cleared in between (which is another thought, of course).

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
On 4/27/06, Boris Zbarsky <[hidden email]> wrote:
> Robert Kaiser wrote:
> > The other problem is that first-run data is valuable, as the vast number
> > of page loads by users are first-run loads, I guess...
>
> True.  Perhaps we should report both numbers from Tp.  That would make first-run
> fluctuate a lot, unless we did a bunch of tests with clearing the profile in
> between (which is another thought, of course).

Is that really true?  The vast majority of page loads by users might
be the first time they load a *given* page -- not sure I believe that
either -- but I think that it's a vanishing fraction of the loads that
are the first-load-per-run.  If we don't want to conflate those
things, we could load a different page at first to "warm up" the
browser, and then start doing Tp repetitions.

Mike

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Boris Zbarsky
Mike Shaver wrote:
> If we don't want to conflate those
> things, we could load a different page at first to "warm up" the
> browser, and then start doing Tp repetitions.

Doesn't Tp already have just such a "not part of test" first page?

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
On 4/27/06, Boris Zbarsky <[hidden email]> wrote:
> Doesn't Tp already have just such a "not part of test" first page?

Curses, foiled again!

Mike

Re: Idea for Tinderbox Management & Performance Testing

Robert Helmer-5
In reply to this post by Axel Hecht-2
Since I've been invoked in this thread, I figured I'd give an update on
where things are. :)  I'd really love for the build system to be in a
state where we can act quickly on these kinds of ideas.

We're pushing hard to get the rest of the tinderboxes over to VMs and
the configs under revision control over the next few weeks (it was going
to happen this week, but 1503 comes first).  preed has of course put a
ton of work into this while getting releases out the door at the same
time; it just needs that final focused push.

However, we have most of the important tinderboxes in VMs now; the real
blockers to being able to clone an environment in the ways discussed
are 1) the lack of a good reference build platform (OS/toolchain versions,
etc.), 2) getting the configs into public CVS, and 3) scripting out
tinderbox installs to the point where they can be installed on the
reference platform with no trouble (maybe this means having packages;
this has been brought up recently).

Once these blockers are gone (they are being worked on now), creating
new tinderboxes becomes a lot less painful.  We are also thinking about
how to manage these instances over time as they multiply and drift away
from the base config.

Most of the tinderbox-related work we've been doing lately (the ~1
month since I've been actively working on it anyway) is around getting
the current environment under control. Since we're doing releases from
the same environment, and it's not in a 100% consistent state, it's
risky to make changes, and babysitting builds through the release
process takes up a lot of time that could be used for more productive
work. The sooner we break out of this cycle the better.

On tinderbox feature development: there has been a lot of discussion
lately over whether it's worth continuing tinderbox development or
moving to a different system. Buildbot has a lot going for it, although
to address the needs in this thread (and to have parity with tinderbox)
it'd need some work (which hopefully would be adoptable upstream).
Going either way is going to be difficult and disruptive; until the
direction is decided I think there is going to be a lot of resistance
to adding the features we need to make the build system more useful for
everyone.

--
Rob Helmer
