Idea for Tinderbox Management & Performance Testing

19 messages

Idea for Tinderbox Management & Performance Testing

Ben Goodger
I understand no one ever has time to implement these things, but I
thought this was a good idea...

We often have large changes that we want comprehensively tested for
regressions in Ts, Tp, Txul, leaks, etc. It's usually much easier to
crash-land code and see how the tinderbox responds than to test for
these regressions beforehand. This is bad, because it costs other
developers productivity through tree closures and the like.

Tinderboxes should be configurable to an extent from the web, at least
by anyone who has a cvs account.

There should be a pool of tinderboxes that are available for developer
testing. These machines are set up to run a variety of different tests
on different platforms. There should be many of these.

Developers would be able to request a tinderbox and then configure it
through a web app - specifying what branch to build, .mozconfig etc.
They would then get the machine for up to a certain amount of time
(since it can sometimes take a few cycles for numbers to stabilize, and
you probably want pre- and post-change numbers too)... they could
release the machine earlier, but they should not be able to hog it for
days without a special reason.

Alternatively...

Make all of the tinderbox server + client scripts installable through a
convenient rpm or msi installer that any developer can easily install on
their development box. They can then run their code through the same
tests the servers do.

-Ben
_______________________________________________
dev-builds mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-builds

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
Ben Goodger wrote:
> Make all of the tinderbox server + client scripts installable through a
> convenient rpm or msi installer that any developer can easily install on
> their development box. They can then run their code through the same
> tests the servers do.

It's already pretty easy to set up a tinderbox.  The hard part is getting
low-noise data out of it.  Doing that at least requires turning off all sorts of
daemons, etc, that are typically running on modern operating systems.  On Unix,
we used to not run GNOME on tinderboxen because it made the noise unbearable;
not sure if that's the case now.

So the short of it is, to usefully run a performance tinderbox on your
development box you'd need a separate account and separate runlevel for it (in
Linux speak).  And do absolutely nothing else with the machine while it's
running, of course.

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
In reply to this post by Ben Goodger
On 4/26/06, Ben Goodger <[hidden email]> wrote:
> There should be a pool of tinderboxes that are available for developer
> testing. These machines are set up to run a variety of different tests
> on different platforms. There should be many of these.

preed has a plan somewhere for doing this sort of thing, but for
performance tests we still need to make sure that we get reliable
results in a strictly-rationed vmware guest.

(Otherwise, the maintenance of "many" of these machines is not a
burden we can bear these days, to say nothing of the hardware and
operational costs.)

I think buildbot might be a better base for this, since it already has
the infrastructure for kicking off builds on demand, but that's not
especially material.

The idea certainly has merit, though I wonder if we shouldn't be
trying to get performance tests that are less fragile and can be run
in most cases without setting up a server, and more self-contained
leak tests.

To be honest, it's not clear to me that our current performance tests
measure what we care about most, or perhaps at all (c.f. our
ground-breaking results in the field of "effects of pathnames on
ancient Linux filesystem performance").  Figuring out what we want to
measure and then looking at how to test those things would seem like a
wise course, given that we've had basically the same test model for the
last 5 years.  For me, they would likely focus more on perceived
responsiveness than on the raw-throughput numbers that we seem to be
gathering with Ts and Tp today.

I don't know if there are plans afoot in the QA community for such
things, but I haven't really looked to find out.  People may already
be collecting use cases for test stuff somewhere.

Mike

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
In reply to this post by Boris Zbarsky
On 4/26/06, Boris Zbarsky <[hidden email]> wrote:

> It's already pretty easy to set up a tinderbox.  The hard part is getting
> low-noise data out of it.  Doing that at least requires turning off all sorts of
> daemons, etc, that are typically running on modern operating systems.  On Unix,
> we used to not run GNOME on tinderboxen because it made the noise unbearable;
> not sure if that's the case now.
>
> So the short of it is, to usefully run a performance tinderbox on your
> development box you'd need a separate account and separate runlevel for it (in
> Linux speak).  And do absolutely nothing else with the machine while it's
> running, of course.

So I wonder why that's the case for us, when word on the street is
that the WebKit guys have extremely consistent (1-2ms) performance
results for their work.

http://webkit.opendarwin.org/projects/performance/index.html has their
tests, though not a lot of information on their result sets or trends.

Mike

Re: Idea for Tinderbox Management & Performance Testing

Dave Liebreich
In reply to this post by Ben Goodger
[crossposting to mozilla.dev.quality]

Ben Goodger wrote:
> I understand no one ever has time to implement these things, but I
> thought this was a good idea...

Please keep mozilla.dev.quality in the loop.  This will help achieve
critical mass of conversation about automated testing in that group.

I'm working on a simple structure for tests written in JavaScript that
should be portable across various test harnesses.

> Make all of the tinderbox server + client scripts installable through a convenient rpm or msi installer that any developer can easily install on their development box. They can then run their code through the same tests the servers do.

I would like to see tests that can be run on a dev box - that way they
can be run against a private build before the changes are checked in to
the repository.  This is why I added mozilla/tools/test-harness to the
list of directories checked out for most projects.

Please let me know if you can help work on this by posting in
mozilla.dev.quality or sending mail to dev-quality.

Thanks
--
Dave Liebreich
Test Architect, Mozilla Corporation

Re: Idea for Tinderbox Management & Performance Testing

Philip Chee
In reply to this post by Boris Zbarsky
On Wed, 26 Apr 2006 13:04:37 -0500, Boris Zbarsky wrote:

> So the short of it is, to usefully run a performance tinderbox on your
> development box you'd need a separate account and separate runlevel for it (in
> Linux speak).  And do absolutely nothing else with the machine while it's
> running, of course.

Will running a tinderbox in a VM (VMWare/XEN/etc) throw off the
statistics? I'm thinking of a barebones linux+tinderbox in XEN.

Phil
--
Philip Chee <[hidden email]>, <[hidden email]>
http://flashblock.mozdev.org/ http://xsidebar.mozdev.org
Guard us from the she-wolf and the wolf, and guard us from the thief,
oh Night, and so be good for us to pass.
[ ]I wouldn't touch the Metric System with a 3.048m pole!
* TagZilla 0.059

Re: Idea for Tinderbox Management & Performance Testing

Andrew Schultz
Philip Chee wrote:
> On Wed, 26 Apr 2006 13:04:37 -0500, Boris Zbarsky wrote:
>
>> So the short of it is, to usefully run a performance tinderbox on your
>> development box you'd need a separate account and separate runlevel for it (in
>> Linux speak).  And do absolutely nothing else with the machine while it's
>> running, of course.
>
> Will running a tinderbox in a VM (VMWare/XEN/etc) throw off the
> statistics? I'm thinking of a barebones linux+tinderbox in XEN.

Performance statistics, yes.  You could still collect leak statistics.
It's quite difficult to get a sufficiently stable environment to collect
solid performance timings (see tinderbox.mozilla.org for evidence of
this :)).

--
Andrew Schultz
[hidden email]
http://www.sens.buffalo.edu/~ajs42/

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Ben Goodger
[cross-posting to all three groups where we've been having this conversation,
but maybe we should pick one and set followups to it?]

Mike Shaver wrote:
> To be honest, it's not clear to me that our current performance tests
> measure what we care about most, or perhaps at all (c.f. our
> ground-breaking results in the field of "effects of pathnames on
> ancient Linux filesystem performance").  Figuring out what we want to
> measure  and then looking at how to test those things would seem like
> a wise course, given that we've had basically the same test model for
> the last 5 years.  For me, they would likely focus more on perceived
> responsiveness than on the raw-throughput numbers that we seem to be
> gathering with Ts and Tp today.

I think we should probably measure both, esp. for something like Tdhtml where
the people who really care are the DHTML script writers, who _are_ generally
complaining about raw throughput numbers in addition to perceived performance.
Measuring raw throughput is also more or less the idea of things like Trender
and Tgfx where we're trying to measure the performance of a subsystem.  But yes,
generally for performance it's more useful to include whole-system tests, and if
we can somehow include a user or a reasonable approximation of one in our system
that would be great.

I note that there's not been much pageload benchmark chest-banging from major
browser vendors after Safari initially shipped.  Not sure whether that means
that we've all outgrown fudgeable benchmark numbers or something else.  ;)

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Boris Zbarsky
Mike Shaver wrote:
> So I wonder why that's the case for us, when word on the street is
> that the WebKit guys have extremely consistent (1-2ms) performance
> results for their work.

Good question!

Looking at the page you cite, the 24Fun test involves things like opening new
windows; I've never been able to get that to better than about 25% consistency
on Linux, no matter what the WM or WM settings...  In fact, I'm lucky to get
within one second of accuracy on some of the 24Fun tests in two consecutive
runs.  I've tried to set up i-bench in the past and failed miserably; its
being unmaintained (to the point where the opendarwin link to it is broken)
doesn't help.  So I can't speak to how reproducible its numbers are...

That said, if we assume that my system is not loaded and that variation in the
tests is just random, then the right solution is probably a lot of test runs,
with averaging.  We can't do that on tinderbox because it would slow down cycles
too much, as far as I can tell.  Maybe if we ran once-daily perf tests on
dedicated machines we could do more of that sort of thing.  In fact, the idea of
running perf tests on dedicated machines has come up before -- have a fast
tinderbox that ships its nightlies to a slower test machine (where we get better
granularity on performance).  If we can do something like that, we really
should, imo.

Even without major tinderbox setup changes, it might be possible to increase our
precision in some simple ways.  Here's some typical data from our pageload test
(argo box on the firefox tinderbox, three different tinderbox cycles for the
same page, raw data, not much happening with the tree):

0 home.netscape.com  382    306    307    291    306
0 home.netscape.com  370    300    279    286    286
0 home.netscape.com  378    309    291    287    311

Note that the first run is slower than the others; that makes sense given clean
profiles and cache.  Also note that the remaining numbers have about 10%
variation in them.

The way we currently deal with that problem is by taking the median of the 5
times for each page and averaging them to get the total Tp number.  In this
case, the three medians are 306, 286, and 309.  Those three numbers have an
average of 300.3 and a standard deviation of 12.5 or so.

Now say we took the average of all but the first run in all three cases, instead
of the median.  That would give us 302.5, 287.8, 299.5, which have an average of
296.6 and a standard deviation of 8.  Somewhat better....

It might be worth looking at more pageload data, but I suspect that in general
taking all but the first run and averaging them will give us more stable numbers
than just taking the median.  This is especially true if we increase the number
of runs from the current 5, since averaging over more runs should give us better
random-noise smoothing than taking the median of more runs, I think.

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
In reply to this post by Andrew Schultz
On 4/26/06, Andrew Schultz <[hidden email]> wrote:
> Philip Chee wrote:
> > Will running a tinderbox in a VM (VMWare/XEN/etc) throw off the
> > statistics? I'm thinking of a barebones linux+tinderbox in XEN.
>
> Performance statistics, yes.  You could still collect leak statistics.
> It's quite difficult to get a sufficiently stable environment to collect
> solid performance timings (see tinderbox.mozilla.org for evidence of
> this :)).

VMWare ESX's resource limit settings are supposed to give very
reliable performance characteristics, in terms of CPU, RAM, and I/O,
actually.  I wouldn't be surprised to discover that we could get
steadier numbers out of those than out of the current setup, if we
held a single virtual machine fixed at say 70%.

http://www.vmware.com/support/esx25/doc/admin/esx25admin_res.html#1040218

Mike

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
In reply to this post by Boris Zbarsky
On 4/26/06, Boris Zbarsky <[hidden email]> wrote:
> [cross-posting to all three groups where we've been having this conversation,
> but maybe we should pick one and set followups to it?]

I think we should pick .quality, but I don't think I'm able to set
followups usefully through the email interface.  I'll follow your lead
when you reply, though!

(I gotta tithe to the bug about the news<->mail gateway really not
handling crossposts.)

> I think we should probably measure both, esp. for something like Tdhtml where
> the people who really care are the DHTML script writers and who _are_ generally
> complaining about raw throughput numbers in addition to perceived performance.

I don't think I agree with that, but since we're spitballing, let's
say they care about both equally.  (For example, I think they would
prefer 5 frames of 11ms each to 4 frames of 5ms and one of 30ms.)
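To make the distinction concrete, here is a tiny Python sketch of those two
hypothetical frame sequences: a pure throughput metric (total time) actually
favors the jerky run, while a responsiveness metric (worst single frame)
favors the steady one:

```python
# Two hypothetical frame-time sequences (ms) from the example above
steady = [11, 11, 11, 11, 11]   # 5 frames of 11 ms each
jerky = [5, 5, 5, 5, 30]        # 4 frames of 5 ms and one of 30 ms

# Throughput view: total wall-clock time -- the jerky run looks better
print(sum(steady), sum(jerky))  # 55 50

# Responsiveness view: worst single frame -- the steady run looks better
print(max(steady), max(jerky))  # 11 30
```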

> Measuring raw throughput is also more or less the idea of things like Trender
> and Tgfx where we're trying to measure the performance of a subsystem.

For performance of subsystems, I still think that we care a lot about
responsiveness.  We're just a lot better at measuring throughput. :)

For throughput measurement that's less randomized, I wonder if it'd be
interesting to count cycles or something, in addition to wall-clock
time.

>  But yes,
> generally for performance it's more useful to include whole-system tests, and if
> we can somehow include a user or a reasonable approximation of one in our system
> that would be great.

Yeah, session-replay and event synthesis would be great.

Mike

Re: Idea for Tinderbox Management & Performance Testing

Mike Shaver
In reply to this post by Boris Zbarsky
On 4/26/06, Boris Zbarsky <[hidden email]> wrote:
> dedicated machines we could do more of that sort of thing.  In fact, the idea of
> running perf tests on dedicated machines has come up before -- have a fast
> tinderbox that ships its nightlies to a slower test machine (where we get better
> granularity on performance).  If we can do something like that, we really
> should, imo.

People talked at one point about underclocking machines to do this,
but it seems to me that we might meet success with something like the
VMWare ESX resource limits (which are controllable through an API, so
you could build at full-available-speed, and then lock down the guest
for a more sensitive performance-testing environment).

Mike

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Boris Zbarsky
Mike Shaver wrote:
> People talked at one point about underclocking machines to do this,
> but it seems to me that we might meet success with something like the
> VMWare ESX resource limits

Yeah, that would work.

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
In reply to this post by Boris Zbarsky
[followups to .quality]

Mike Shaver wrote:
> I don't think I agree with that, but since we're spitballing, let's
> say they care about both equally.  (For example, I think they would
> prefer 5 frames of 11ms each to 4 frames of 5ms and one of 30ms.)

I'll buy that, sure.

> For throughput measurement that's less randomized, I wonder if it'd be
> interesting to count cycles or something, in addition to wall-clock
> time.

graydon was mentioning something about running under valgrind to do just that,
at some point.  Note that we use wall-clock times in our app in perf-sensitive
ways (e.g. interruption of the parser happens via a poll-and-timeout setup), so
we should watch out for that if our cycle-counting mechanism has the sort of lag
valgrind does.
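As an aside, one cheap approximation of the cycle-counting idea is to record
CPU time alongside wall-clock time, so scheduler and I/O noise shows up as a
gap between the two. This Python sketch only illustrates the distinction; it
is not a substitute for valgrind-style instruction counting:

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call to fn."""
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    w1, c1 = time.perf_counter(), time.process_time()
    return w1 - w0, c1 - c0

# CPU-bound work: wall-clock and CPU time should roughly agree
wall, cpu = measure(lambda: sum(i * i for i in range(200_000)))

# Sleeping burns wall-clock time but almost no CPU time; the gap is
# exactly the kind of noise that wall-clock-only timing absorbs
wall2, cpu2 = measure(lambda: time.sleep(0.05))
print(wall2 - cpu2 > 0.04)  # the sleep shows up only in wall time
```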

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Axel Hecht
In reply to this post by Andrew Schultz
Mike Shaver wrote:

> On 4/26/06, Andrew Schultz <[hidden email]> wrote:
>> Philip Chee wrote:
>>> Will running a tinderbox in a VM (VMWare/XEN/etc) throw off the
>>> statistics? I'm thinking of a barebones linux+tinderbox in XEN.
>> Performance statistics, yes.  You could still collect leak statistics.
>> It's quite difficult to get a sufficiently stable environment to collect
>> solid performance timings (see tinderbox.mozilla.org for evidence of
>> this :)).
>
> VMWare ESX's resource limit settings are supposed to give very
> reliable performance characteristics, in terms of CPU, RAM, and I/O,
> actually.  I wouldn't be surprised to discover that we could get
> steadier numbers out of those than out of the current setup, if we
> held a single virtual machine fixed at say 70%.
>
> http://www.vmware.com/support/esx25/doc/admin/esx25admin_res.html#1040218
>
> Mike

How does running perf tests in a VM impact things like cairo being able
to talk to GL-accelerated drivers, and display in general?

CPU throttling sounds easy, but I'm unsure about the impact on I/O in
general, and display in particular.

Axel

Re: Idea for Tinderbox Management & Performance Testing

Axel Hecht
In reply to this post by Ben Goodger
Ben Goodger wrote:

> I understand no one ever has time to implement these things, but I
> thought this was a good idea...
>
> We often have large changes that we want comprehensive testing run on
> for regressions in Ts, Tp, Txul, leaks, etc. It's usually much easier to
> crash land code and see how the tinderbox responds than to test for
> these regressions beforehand. This is bad, because it costs other
> developers productivity with tree closures etc.
>
> Tinderboxes should be configurable to an extent from the web, at least
> by anyone who has a cvs account.
>
> There should be a pool of tinderboxes that are available for developer
> testing. These machines are set up to run a variety of different tests
> on different platforms. There should be many of these.
>
> Developers would be able to request a tinderbox and then configure it
> through a web app - specifying what branch to build, .mozconfig etc.
> They then get the machine for up to a certain amount of time (since
> sometimes it can take a few cycles for numbers to stabilize, and you
> probably want to get pre-and-post change numbers too)... they can
> release the machine before that but they should not be able to hog it
> for days without a special reason.
>
> Alternatively...
>
> Make all of the tinderbox server + client scripts installable through a
> convenient rpm or msi installer that any developer can easily install on
> their development box. They can then run their code through the same
> tests the servers do.

As most of our tinderboxen are in VMs now, with configuration in CVS,
it should be possible to clone compile environments on demand. At
least, that's what I assume preed and rhelmer are doing this for.

Tinderboxen should be more dynamic (or use build bots), and could, for
example, put the generated builds in people's upload areas.

Of course, those make crappy test environments.

The test environment is much more like a well-shielded batch-operation
scenario, which may want to pull builds from people's upload or staging
areas.

I envision the build farm as more of a puddle of mud. The test
environment should be more like
http://www.mpq.mpg.de/atomlaser/html/optischer_tisch.html

(half a ton of table, on air suspension)

Axel

Re: Idea for Tinderbox Management & Performance Testing

Robert Kaiser
In reply to this post by Boris Zbarsky
Boris Zbarsky schrieb:
> It might be worth looking at more pageload data, but I suspect that in
> general taking all but the first run and averaging them will give us
> more stable numbers than just taking the median.  This is especially
> true if we increase the number of runs from the current 5, since
> averaging over more runs should give us better random-noise smoothing
> than taking the median of more runs, I think.

The other problem is that first-run data is valuable, since the vast
majority of page loads by users are first-run loads, I guess...

Robert Kaiser

Re: Idea for Tinderbox Management & Performance Testing

Boris Zbarsky
Robert Kaiser wrote:
> The other problem is that first-run data is valuable, as the vast number
> of page loads by users are first-run loads, I guess...

True.  Perhaps we should report both numbers from Tp.  That would make the
first-run number fluctuate a lot, though, unless we did a bunch of runs with
the profile cleared in between (which is another thought, of course).

-Boris

Re: Idea for Tinderbox Management & Performance Testing

Robert Helmer
In reply to this post by Axel Hecht
Since I've been invoked in this thread, I figured I'd give an update on
where things are :) I'd really love for the build system to be in a
state where we can act quickly on these kinds of ideas.

We're pushing hard to get the rest of the tinderboxes over to VMs and
the configs under revision control over the next few weeks (this was
going to happen this week, but 1503 comes first). preed has of course
put a ton of work into this while getting releases out the door at the
same time; it just needs that final focused push.

However, while we have most of the important tinderboxes in VMs now, the
real blockers to being able to clone an environment in the ways
discussed are 1) the lack of a good reference build platform
(OS/toolchain versions, etc.), 2) getting the configs into public CVS,
and 3) scripting tinderbox installs to the point where they can be
installed on the reference platform with no trouble (maybe this means
having packages; this has been brought up recently).

Once these blockers are gone (they are being worked on now), creating
new tinderboxes becomes a lot less painful. We are also thinking about
how to manage these instances over time as they multiply and drift away
from the base config.

Most of the tinderbox-related work we've been doing lately (in the ~1
month I've been actively working on it, anyway) is around getting the
current environment under control. Since we're doing releases from the
same environment, and it's not in a 100% consistent state, it's risky
to make changes, and babysitting builds through the release process
takes up a lot of time that could be used for more productive work. The
sooner we break out of this cycle, the better.

On tinderbox feature development: there has been a lot of discussion
lately over whether it's worth continuing tinderbox development or
moving to a different system. Buildbot has a lot going for it, although
to address the needs in this thread (and to have parity with tinderbox)
it'd need some work (which hopefully would be adoptable upstream).
Going either way is going to be difficult and disruptive; until the
direction is decided I think there is going to be a lot of resistance
to adding the features we need to make the build system more useful for
everyone.

--
Rob Helmer
