Search.pm SQL queries

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Search.pm SQL queries

lemma (Bugzilla)
Hi,

I've recently been scouting around Search.pm, trying to do something about
https://bugzilla.mozilla.org/show_bug.cgi?id=476722 (no promises something
will come out).

I stumbled upon some queries where I'm not entirely sure why they are
implemented like that, eg. _commenter:
---------------
    $$f = "login_name";
    $$ff = "profiles.login_name";
    $$funcsbykey{",$$t"}($self, %func_args);
    push(@$supptables, "LEFT JOIN longdescs AS $table " .
                       "ON $table.bug_id = bugs.bug_id $extra " .
                       "AND $table.who IN" .
                       "(SELECT userid FROM profiles WHERE $$term)"
                       );
    $$term = "$table.who IS NOT NULL";
---------------

I don't really understand why there's a LEFT JOIN followed by a "who IS NOT
NULL" which seems to be the same as an INNER JOIN. Not being a database guru
I'm also not sure if the subquery is supposed to perform faster than another
join.

I think this could be rewritten as:
INNER JOIN longdescs AS ld ON ld.bug_id = bugs.bug_id INNER JOIN profiles AS
pf ON pf.userid = ld.who AND $$term

or even as:
INNER JOIN longdescs AS ld ON ld.bug_id = bugs.bug_id INNER JOIN profiles AS
pf ON pf.userid = ld.who WHERE $$term
(which would provide the best separation and be easiest to work with).

It seems that the second and third query perform better (on MySQL 5.1).

As I said I'm not an expert on databases so I'd like to get some ideas from
people who know more than I do.

Regards,
Michael
-
To view or change your list settings, click here:
<http://bugzilla.org/cgi-bin/mj_wwwusr?user=lists@...>
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

Joel Peshkin

Michael,

  Boolean charts surround entire WHERE clauses with operators including
NOT( ).  Add the debug option to some queries where you look for a bug
assigned to someone who has never commented and look at the resulting SQL.

  That said, the logic goes back to MySQL 3 days.   There may be
opportunities to optimize more with subqueries.

-Joel Peshkin


Michael Leupold wrote:

> Hi,
>
> I've recently been scouting around Search.pm, trying to do something about
> https://bugzilla.mozilla.org/show_bug.cgi?id=476722 (no promises something
> will come out).
>
> I stumbled upon some queries where I'm not entirely sure why they are
> implemented like that, eg. _commenter:
> ---------------
>     $$f = "login_name";
>     $$ff = "profiles.login_name";
>     $$funcsbykey{",$$t"}($self, %func_args);
>     push(@$supptables, "LEFT JOIN longdescs AS $table " .
>                        "ON $table.bug_id = bugs.bug_id $extra " .
>                        "AND $table.who IN" .
>                        "(SELECT userid FROM profiles WHERE $$term)"
>                        );
>     $$term = "$table.who IS NOT NULL";
> ---------------
>
> I don't really understand why there's a LEFT JOIN followed by a "who IS NOT
> NULL" which seems to be the same as an INNER JOIN. Not being a database guru
> I'm also not sure if the subquery is supposed to perform faster than another
> join.
>
> I think this could be rewritten as:
> INNER JOIN longdescs AS ld ON ld.bug_id = bugs.bug_id INNER JOIN profiles AS
> pf ON pf.userid = ld.who AND $$term
>
> or even as:
> INNER JOIN longdescs AS ld ON ld.bug_id = bugs.bug_id INNER JOIN profiles AS
> pf ON pf.userid = ld.who WHERE $$term
> (which would provide the best separation and be easiest to work with).
>
> It seems that the second and third query perform better (on MySQL 5.1).
>
> As I said I'm not an expert on databases so I'd like to get some ideas from
> people who know more than I do.
>
> Regards,
> Michael
> -
> To view or change your list settings, click here:
> <http://bugzilla.org/cgi-bin/mj_wwwusr?user=bugreport@...>
>  

-
To view or change your list settings, click here:
<http://bugzilla.org/cgi-bin/mj_wwwusr?user=lists@...>
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

Frédéric Buclin
Le 25. 02. 09 14:28, Joel Peshkin a écrit :
> That said, the logic goes back to MySQL 3 days. There may be
> opportunities to optimize more with subqueries.

Joel, we would love if you could contribute again to optimize Search.pm. ;)

LpSolit
-
To view or change your list settings, click here:
<http://bugzilla.org/cgi-bin/mj_wwwusr?user=lists@...>
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

lemma (Bugzilla)
In reply to this post by Joel Peshkin
On Wednesday 25 February 2009 14:28:49 Joel Peshkin wrote:

> > I've recently been scouting around Search.pm, trying to do something
> > about https://bugzilla.mozilla.org/show_bug.cgi?id=476722 (no promises
> > something will come out).
> >
> > I stumbled upon some queries where I'm not entirely sure why they are
> > implemented like that, eg. _commenter:
> > ---------------
> >     $$f = "login_name";
> >     $$ff = "profiles.login_name";
> >     $$funcsbykey{",$$t"}($self, %func_args);
> >     push(@$supptables, "LEFT JOIN longdescs AS $table " .
> >                        "ON $table.bug_id = bugs.bug_id $extra " .
> >                        "AND $table.who IN" .
> >                        "(SELECT userid FROM profiles WHERE $$term)"
> >                        );
> >     $$term = "$table.who IS NOT NULL";
> > ---------------
> >
> > I don't really understand why there's a LEFT JOIN followed by a "who IS
> > NOT NULL" which seems to be the same as an INNER JOIN. Not being a
> > database guru I'm also not sure if the subquery is supposed to perform
> > faster than another join.
>   Boolean charts surround entire WHERE clauses with operators including
> NOT( ).  Add the debug option to some queries where you look for a bug
> assigned to someone who has never commented and look at the resulting SQL.
>   That said, the logic goes back to MySQL 3 days.   There may be
> opportunities to optimize more with subqueries.
Thanks for the pointer. I really didn't consider negation at all.

That said it seems the subquery is pretty slow for me. The following seems to
run considerably faster while providing the same functionality:
LEFT JOIN (longdescs
INNER JOIN profiles
ON longdescs.who = profiles.userid
AND $$term)
ON longdescs.bug_id = bugs.bug_id
WHERE longdescs.who IS NOT NULL

I only tried this on MySQL 5.1 as I don't have a postgres bugzilla database
around currently.
Unfortunately like this it's still not possible to move the actual query
condition $term into a WHERE clause which is what I tried to achieve in the
first place.

Regards,
Michael


signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

Max Kanat-Alexander
On Wed, 25 Feb 2009 23:00:05 +0100 Michael Leupold <[hidden email]>
wrote:
> Unfortunately like this it's still not possible to move the actual
> query condition $term into a WHERE clause which is what I tried to
> achieve in the first place.

        Yeah, I wouldn't worry too much about optimization at this
point. That would be a separate bug from the one you linked to, anyhow.
If we want to move things into the WHERE clause in a way that works
with NOT(), we'd have to re-work every function inside Search.pm, I
think. It might be possible, but it's something to be careful of and
that requires its own bug.

        -Max
--
http://www.everythingsolved.com/
Competent, Friendly Bugzilla and Perl Services. Everything Else, too.

signature.asc (204 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

David Marshall-5
In reply to this post by lemma (Bugzilla)
There are several areas for improvement with Search.pm.  I have fixed some
of them for Yahoo!, but because we are so far behind HEAD, I don't have
patches to submit yet.  (I am working semi-furiously on catching us up and
will expose whatever code I can as soon as possible.)

This is a lengthy discussion aimed only at Search.pm geeks.  You have been
warned!

Data: We use MySQL 4.1 and have about 2.5M rows in table bugs, about 11M
rows in table longdescs.

I anecdotally suspect that for at least MySQL 4.1, when the number of
possible query plans exceeds some number (around 24), MySQL gives up and
just goes with some arbitrary join order, usually the order as written in
the query.

This means that if Search.pm produces a query such as

SELECT (many columns)
FROM (many joined tables)
WHERE (nasty predicate)

the executed query plan will probably suck.

For some queries we now do

SELECT bugs.bug_id
FROM (minimum joined tables)
WHERE (nasty predicate) LIMIT 2X

followed by

SELECT (several columns)
FROM (many joined tables)
WHERE bugs.bug_id IN (list) LIMIT X

We don't use a subquery because if the number of potential bugs is very
large, it may cost a lot more to do it this way.

Another problem is how confusing the predicate can be - there can be lots of
AND/OR/NOT with various paretheses involved.  For the biggest improvement we
use, that's bad.  We convert the charts to a tree of AND/OR nodes,
transforming subtrees for negation as needed, combining siblings where
possible, and so on.  What results is a least complicated predicate,
including changing (for instance) "NOT foo IN (1,2,3)" to "foo NOT IN
(1,2,3)".  That may be a poor example, because the database probably sees
those as equivalents.

Finally, what has made the biggest difference is based on MySQL 4.1's
inability to use multiple indexes at once.  Our prime example was to search
for bugs where user 123 is either the owner or the QA contact:

SELECT (columns)
FROM (tables)
WHERE (blah) AND (bugs.assignee = 123 OR bugs.qacontact = 123)

Disaster!  Such queries almost always uses bug_status as an index, which
doesn't work very well for us.

We rewrite this as (essentially)

SELECT (columns)
FROM (tables)
WHERE (blah) AND bugs.assignee = 'foo'
UNION
SELECT (columns)
FROM (tables)
WHERE (blah) AND bugs.qacontact = 'foo'

Actually, we do the first query (and assuming it returns some bugs) do the
next query as

SELECT bugs.bug_id
FROM (tables)
WHERE (blah) AND bugs.qacontact = 'foo' AND bugs.bug_id NOT IN (list)

so that if (blah) is expensive we don't have to pay it for rows we already
know are in the result set.  Of course, we lock all the tables for these
queries to get a consistent view of data.

Summary and future direction:

Problems with Search.pm have less to do with the transformation of some
chart tuple to SQL and more to do with the philosophy on constructing one
big query and hoping for the best.  Making this more complicated is that for
some searches, it is better to just leave things alone and let the database
figure it out!

I used to say that Search.pm should only return a list of bug IDs that match
the search criteria, and then buglist.cgi should get the data needed for
display.  I'm wavering on that now, because as long as Search.pm has gone to
the trouble of figuring everything out, why not just get the data?

The future direction of Search.pm for us is to improve the tree mechanism to
produce faster queries.  This is important for us because our database is
growing very quickly.  This will probably involve locking all the tables
involved and doing possibly multiple queries to determine the list of bugs
followed by one grand query to get all the data for those bugs, possibly
including stuff in memcached or Sphinx, for instance.



On 2/25/09 2:00 PM, "Michael Leupold" <[hidden email]> wrote:

> On Wednesday 25 February 2009 14:28:49 Joel Peshkin wrote:
>>> I've recently been scouting around Search.pm, trying to do something
>>> about https://bugzilla.mozilla.org/show_bug.cgi?id=476722 (no promises
>>> something will come out).
>>>
>>> I stumbled upon some queries where I'm not entirely sure why they are
>>> implemented like that, eg. _commenter:
>>> ---------------
>>>     $$f = "login_name";
>>>     $$ff = "profiles.login_name";
>>>     $$funcsbykey{",$$t"}($self, %func_args);
>>>     push(@$supptables, "LEFT JOIN longdescs AS $table " .
>>>                        "ON $table.bug_id = bugs.bug_id $extra " .
>>>                        "AND $table.who IN" .
>>>                        "(SELECT userid FROM profiles WHERE $$term)"
>>>                        );
>>>     $$term = "$table.who IS NOT NULL";
>>> ---------------
>>>
>>> I don't really understand why there's a LEFT JOIN followed by a "who IS
>>> NOT NULL" which seems to be the same as an INNER JOIN. Not being a
>>> database guru I'm also not sure if the subquery is supposed to perform
>>> faster than another join.
>>   Boolean charts surround entire WHERE clauses with operators including
>> NOT( ).  Add the debug option to some queries where you look for a bug
>> assigned to someone who has never commented and look at the resulting SQL.
>>   That said, the logic goes back to MySQL 3 days.   There may be
>> opportunities to optimize more with subqueries.
>
> Thanks for the pointer. I really didn't consider negation at all.
>
> That said it seems the subquery is pretty slow for me. The following seems to
> run considerably faster while providing the same functionality:
> LEFT JOIN (longdescs
> INNER JOIN profiles
> ON longdescs.who = profiles.userid
> AND $$term)
> ON longdescs.bug_id = bugs.bug_id
> WHERE longdescs.who IS NOT NULL
>
> I only tried this on MySQL 5.1 as I don't have a postgres bugzilla database
> around currently.
> Unfortunately like this it's still not possible to move the actual query
> condition $term into a WHERE clause which is what I tried to achieve in the
> first place.
>
> Regards,
> Michael
>

-
To view or change your list settings, click here:
<http://bugzilla.org/cgi-bin/mj_wwwusr?user=lists@...>
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

Max Kanat-Alexander
On Wed, 25 Feb 2009 14:40:49 -0800 David Marshall
<[hidden email]> wrote:
> (I am working semi-furiously on
> catching us up and will expose whatever code I can as soon as
> possible.)

        Cool. We refactored Search.pm, so if you gave us a patch right
now it wouldn't apply, but the actual internal logic of all the search
operators hasn't really changed much, so it shouldn't be too hard to
port.

> I anecdotally suspect that for at least MySQL 4.1, when the number of
> possible query plans exceeds some number (around 24), MySQL gives up
> and just goes with some arbitrary join order, usually the order as
> written in the query.

        BTW, this makes me wonder--have you tried an experimental
import into PostgreSQL to see if it gives you better query plans/times?
I know that Yahoo has a big investment in MySQL and so wouldn't move
off it, but I'd be interested to know if there really is a significant
difference. I saw a fair difference in some of my testing a few years
back, but I didn't have a DB like you guys do.

> WHERE bugs.bug_id IN (list)

        Isn't that IN slow, though? I thought MySQL stopped using
indexes on IN clauses over a certain size, or something.

> Finally, what has made the biggest difference is based on MySQL 4.1's
> inability to use multiple indexes at once.

        Supposedly fixed in 5.1 or 5.0, though. I'm not sure I've seen
any query plans where it actually does that though.

> Problems with Search.pm have less to do with the transformation of
> some chart tuple to SQL and more to do with the philosophy on
> constructing one big query and hoping for the best.  Making this more
> complicated is that for some searches, it is better to just leave
> things alone and let the database figure it out!

        Yeah, although I'd like to think in an ideal world that the
database should always be the best at figuring these things out...or at
least that query optimizers will improve in the future.

> I used to say that Search.pm should only return a list of bug IDs
> that match the search criteria, and then buglist.cgi should get the
> data needed for display.  I'm wavering on that now, because as long
> as Search.pm has gone to the trouble of figuring everything out, why
> not just get the data?

        Yeah, I think eventually Search.pm will just return Bug
objects.

> The future direction of Search.pm for us is to improve the tree
> mechanism to produce faster queries.

        And perhaps to upgrade to MySQL 5.x and see if that improves
things?

>  This will probably involve locking all the tables involved

        Probably unnecessary, since 3.2 uses InnoDB? Or you could do
READ UNCOMMITTED and see if that makes a difference?

> and doing possibly multiple
> queries to determine the list of bugs followed by one grand query to
> get all the data for those bugs,

        Yeah, that might be reasonable. I'd want to see what the actual
pathological queries are, though, if we're going to adjust things to
work better upstream. I'm sure you guys have a lot of data that would
be valuable to us in terms of optimization.

        I'd be opposed to any optimization before we finish refactoring
Search.pm architecturally, though, so that should happen first.

        -Max
--
http://www.everythingsolved.com/
Competent, Friendly Bugzilla and Perl Services. Everything Else, too.
-
To view or change your list settings, click here:
<http://bugzilla.org/cgi-bin/mj_wwwusr?user=lists@...>
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

lemma (Bugzilla)
In reply to this post by Max Kanat-Alexander
On Wednesday 25 February 2009 23:06:32 Max Kanat-Alexander wrote:

> On Wed, 25 Feb 2009 23:00:05 +0100 Michael Leupold <[hidden email]>
> wrote:
> > Unfortunately like this it's still not possible to move the actual
> > query condition $term into a WHERE clause which is what I tried to
> > achieve in the first place.
> Yeah, I wouldn't worry too much about optimization at this
> point. That would be a separate bug from the one you linked to, anyhow.
> If we want to move things into the WHERE clause in a way that works
> with NOT(), we'd have to re-work every function inside Search.pm, I
> think. It might be possible, but it's something to be careful of and
> that requires its own bug.
Actually I was just considering what I put into c#1 of that bug (476722).
Separating the join and the search term would allow for more generic joins.
After looking through all of the search functions this seems fairly
unrealistic though.

Regards,
Michael

signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

Max Kanat-Alexander
On Thu, 26 Feb 2009 10:20:01 +0100 Michael Leupold <[hidden email]>
wrote:
> Actually I was just considering what I put into c#1 of that bug
> (476722). Separating the join and the search term would allow for
> more generic joins. After looking through all of the search functions
> this seems fairly unrealistic though.

        Ahh. Yeah, for now let's not do that. We just want a new way to
point to the functions we already have. Then once that's done we can
consider additional refactoring.

        -Max
--
http://www.everythingsolved.com/
Competent, Friendly Bugzilla and Perl Services. Everything Else, too.

signature.asc (204 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

David Marshall-5
In reply to this post by Max Kanat-Alexander



On 2/25/09 6:56 PM, "Max Kanat-Alexander" <[hidden email]> wrote:

> On Wed, 25 Feb 2009 14:40:49 -0800 David Marshall
> <[hidden email]> wrote:
>> (I am working semi-furiously on
>> catching us up and will expose whatever code I can as soon as
>> possible.)
>
> Cool. We refactored Search.pm, so if you gave us a patch right
> now it wouldn't apply, but the actual internal logic of all the search
> operators hasn't really changed much, so it shouldn't be too hard to
> port.

The greater likelihood for this summer is that we'll be able to throw some
code out as inspiration for others to make patches, but barring that, I'll
eventually produce patches myself.  OTOH, every time I have thought, "Cool,
I can starting pushing stuff upstream in a few weeks," some new important
thing to do shows up and upstream patches move farther back on my calendar.

>
>> I anecdotally suspect that for at least MySQL 4.1, when the number of
>> possible query plans exceeds some number (around 24), MySQL gives up
>> and just goes with some arbitrary join order, usually the order as
>> written in the query.
>
> BTW, this makes me wonder--have you tried an experimental
> import into PostgreSQL to see if it gives you better query plans/times?
> I know that Yahoo has a big investment in MySQL and so wouldn't move
> off it, but I'd be interested to know if there really is a significant
> difference. I saw a fair difference in some of my testing a few years
> back, but I didn't have a DB like you guys do.

Too much secret MySQL sauce for this to be a realistic possibility.  Oracle
would be more likely, but that's not much more realistic.

>
>> WHERE bugs.bug_id IN (list)
>
> Isn't that IN slow, though? I thought MySQL stopped using
> indexes on IN clauses over a certain size, or something.

It could be, but it's probably not going to be any slower than the
alternative.  I admit that it's probably not the best answer, but it
addresses the problem we have now with the technology we have.

>
>> Finally, what has made the biggest difference is based on MySQL 4.1's
>> inability to use multiple indexes at once.
>
> Supposedly fixed in 5.1 or 5.0, though. I'm not sure I've seen
> any query plans where it actually does that though.
>
>> Problems with Search.pm have less to do with the transformation of
>> some chart tuple to SQL and more to do with the philosophy on
>> constructing one big query and hoping for the best.  Making this more
>> complicated is that for some searches, it is better to just leave
>> things alone and let the database figure it out!
>
> Yeah, although I'd like to think in an ideal world that the
> database should always be the best at figuring these things out...or at
> least that query optimizers will improve in the future.

Agreed 100%.  The ultimate answer isn't fiddling with the SQL statement -
it's having a query optimizer that can do the same thing automagically.  I
have all kinds of clever ideas (for instance, a Bayesian SQL rewriter that
could choose the best SQL for your particular search), but some of those are
best reserved for if I ever forget why I don't want to get a PhD.

>
>> I used to say that Search.pm should only return a list of bug IDs
>> that match the search criteria, and then buglist.cgi should get the
>> data needed for display.  I'm wavering on that now, because as long
>> as Search.pm has gone to the trouble of figuring everything out, why
>> not just get the data?
>
> Yeah, I think eventually Search.pm will just return Bug
> objects.

Generally speaking, this is a good idea.  That would add a little bit of
cost, but I think it is worth it.

>
>> The future direction of Search.pm for us is to improve the tree
>> mechanism to produce faster queries.
>
> And perhaps to upgrade to MySQL 5.x and see if that improves
> things?

We're moving some of our shadow databases to MySQL 5.1 very soon to see what
happens.  My only concern is whether "improving" queries for 4.1 will
actually be counterproductive in 5.1.  I see a lot of slow-query log
browsing in my future.

>
>>  This will probably involve locking all the tables involved
>
> Probably unnecessary, since 3.2 uses InnoDB? Or you could do
> READ UNCOMMITTED and see if that makes a difference?
>
>> and doing possibly multiple
>> queries to determine the list of bugs followed by one grand query to
>> get all the data for those bugs,
>
> Yeah, that might be reasonable. I'd want to see what the actual
> pathological queries are, though, if we're going to adjust things to
> work better upstream. I'm sure you guys have a lot of data that would
> be valuable to us in terms of optimization.
>
> I'd be opposed to any optimization before we finish refactoring
> Search.pm architecturally, though, so that should happen first.

I absolutely agree but wanted to provide some information about what we've
learned about Search.pm for folks who are thinking about it now.  Effort is
better spent, in my opinion, on the refactoring to make it more easily
understood and maintained than on having it emit better SQL, at least for
those without large databases.

>
> -Max

-
To view or change your list settings, click here:
<http://bugzilla.org/cgi-bin/mj_wwwusr?user=lists@...>
Reply | Threaded
Open this post in threaded view
|

Re: Search.pm SQL queries

Max Kanat-Alexander
On Thu, 26 Feb 2009 12:24:18 -0800 David Marshall
<[hidden email]> wrote:
> The greater likelihood for this summer is that we'll be able to throw
> some code out as inspiration for others to make patches, but barring
> that, I'll eventually produce patches myself.

        Cool. Throwing out some code is at least better than
nothing. :-) Historically it doesn't have much effect, though.

> Too much secret MySQL sauce for this to be a realistic possibility.

        You mean in the queries you're generating? I was mostly just
curious if the query planner does better.

> We're moving some of our shadow databases to MySQL 5.1 very soon to
> see what happens.  My only concern is whether "improving" queries for
> 4.1 will actually be counterproductive in 5.1.  I see a lot of
> slow-query log browsing in my future.

        Cool! Yeah, that was my concern too--one of the reasons I was
curious to know what happens when you guys go to 5.x.

> I absolutely agree but wanted to provide some information about what
> we've learned about Search.pm for folks who are thinking about it
> now.  Effort is better spent, in my opinion, on the refactoring to
> make it more easily understood and maintained than on having it emit
> better SQL, at least for those without large databases.

        Awesome, I'm glad we agree. I totally appreciate the data about
what you've learned, too--it's exactly this sort of information that
helps guide our future directions, even in refactoring! :-)

        -Max
--
http://www.everythingsolved.com/
Competent, Friendly Bugzilla and Perl Services. Everything Else, too.
-
To view or change your list settings, click here:
<http://bugzilla.org/cgi-bin/mj_wwwusr?user=lists@...>