Towards faster keyword searches

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Towards faster keyword searches

Dylan Hardison
When you have a keyword on 50,000+ bugs it means you're building a sql query with a 50,000 list in the form of IN (50,000 ids).
This is gets close to the limits mysql has on queries, at least and it's pretty slow.

There are a whole class of these -- anything that is a 'multiselect' type search. I believe someone suggested embedding these queries (at least as an option)
as a sub select and in general I think that's a good idea. An even nicer idea is to just let elastic search do the searching.
One of these approaches I hope to explore with an outreachy intern.

However, for the moment I really need keyword searches to be fast, so I started looking at the search code[1].
I started at the problem backwards -- first by looking at where we build search objects (the tree-like Bugzilla::Search::Clause::* stuff).
For searches with a single keyword search, I thought it would be nice to turn that into a join.

Then I wrote that, and it appears the code I wrote actually works generally for multiple keywords (although you then start wondering what the max number of joins is.)

Anyway, it's not finished code but it does work on a test install.

It's in a github branch: https://github.com/dylanwh/bugzilla/tree/fast-keywords,
you can look at the diff here: https://github.com/dylanwh/bugzilla/commit/e88bdf7168c6723d9930bc771ea93c93c3dec6b0

I will be polishing this up for a review, but I wanted to have other people look at it first.

Here's an example of a query it builds, for single_keyword=batman AND single_keyword:frog

SELECT bugs.bug_id AS bug_id, bugs.priority AS priority, bugs.bug_severity AS bug_severity
  FROM bugs
LEFT JOIN bug_group_map AS security_map ON bugs.bug_id = security_map.bug_id
LEFT JOIN cc AS security_cc ON bugs.bug_id = security_cc.bug_id AND security_cc.who = 1
INNER JOIN priority AS map_priority ON bugs.priority = map_priority.value
INNER JOIN bug_severity AS map_bug_severity ON bugs.bug_severity = map_bug_severity.value
LEFT JOIN keywords AS keywords_1 ON bugs.bug_id = keywords_1.bug_id
LEFT JOIN keyworddefs AS keyworddefs_1 ON keywords_1.keywordid = keyworddefs_1.id
LEFT JOIN keywords AS keywords_2 ON bugs.bug_id = keywords_2.bug_id
LEFT JOIN keyworddefs AS keyworddefs_2 ON keywords_2.keywordid = keyworddefs_2.id
 WHERE bugs.creation_ts IS NOT NULL
   AND ( (security_map.group_id IS NULL OR security_map.group_id IN (1,10,11,14,12,13,9,4,8,5,6,7,3,2))
        OR (bugs.reporter_accessible = 1 AND bugs.reporter = 1)
        OR (bugs.cclist_accessible = 1 AND security_cc.who IS NOT NULL)
        OR bugs.assigned_to = 1
)
   AND  bugs.resolution IN ('')  AND keyworddefs_1.name = 'batman' AND INSTR(keyworddefs_2.name, 'frog') = 0
GROUP BY bugs.bug_id
ORDER BY map_priority.sortkey, map_priority.value, map_bug_severity.sortkey, map_bug_severity.value
LIMIT 500-
To view or change your list settings, click here:
<https://lists.bugzilla.org/cgi-bin/mj_wwwusr?user=lists+s6506n84121h51@...>
Reply | Threaded
Open this post in threaded view
|

Re: Towards faster keyword searches

Jeff Fearn
On 26/09/2016 12:46, Dylan Hardison wrote:
> When you have a keyword on 50,000+ bugs it means you're building a sql query with a 50,000 list in the form of IN (50,000 ids).
> This is gets close to the limits mysql has on queries, at least and it's pretty slow.

Not only that, when you get to 65K you hit parameter limits in DBD::Pg :(

I've just had to fix a bug for this related to viewing bugs in a group.
I solved it by replacing the use of new_from_list with new_from_where
which takes the SQL used to generate the bug id list and uses it in the
query.

https://jfearn.fedorapeople.org/65KBugsInAGroup.patch

I'm not very happy with it because it requires you knowing exactly how
the sql will be used and coupling is bad, mmmk.

A more general approach to doing sub-selects instead of ID lists would
be great.

> There are a whole class of these -- anything that is a 'multiselect' type search. I believe someone suggested embedding these queries (at least as an option)
> as a sub select and in general I think that's a good idea. An even nicer idea is to just let elastic search do the searching.
> One of these approaches I hope to explore with an outreachy intern.

The effect of using ID lists instead of sub selects is wider than just
searching, as the above patch demonstrates.

> However, for the moment I really need keyword searches to be fast, so I started looking at the search code[1].
> I started at the problem backwards -- first by looking at where we build search objects (the tree-like Bugzilla::Search::Clause::* stuff).
> For searches with a single keyword search, I thought it would be nice to turn that into a join.
>
> Then I wrote that, and it appears the code I wrote actually works generally for multiple keywords (although you then start wondering what the max number of joins is.)
>
> Anyway, it's not finished code but it does work on a test install.
>
> It's in a github branch: https://github.com/dylanwh/bugzilla/tree/fast-keywords,
> you can look at the diff here: https://github.com/dylanwh/bugzilla/commit/e88bdf7168c6723d9930bc771ea93c93c3dec6b0
>
> I will be polishing this up for a review, but I wanted to have other people look at it first.
>
> Here's an example of a query it builds, for single_keyword=batman AND single_keyword:frog
>
> SELECT bugs.bug_id AS bug_id, bugs.priority AS priority, bugs.bug_severity AS bug_severity
>   FROM bugs
> LEFT JOIN bug_group_map AS security_map ON bugs.bug_id = security_map.bug_id
> LEFT JOIN cc AS security_cc ON bugs.bug_id = security_cc.bug_id AND security_cc.who = 1
> INNER JOIN priority AS map_priority ON bugs.priority = map_priority.value
> INNER JOIN bug_severity AS map_bug_severity ON bugs.bug_severity = map_bug_severity.value
> LEFT JOIN keywords AS keywords_1 ON bugs.bug_id = keywords_1.bug_id
> LEFT JOIN keyworddefs AS keyworddefs_1 ON keywords_1.keywordid = keyworddefs_1.id
> LEFT JOIN keywords AS keywords_2 ON bugs.bug_id = keywords_2.bug_id
> LEFT JOIN keyworddefs AS keyworddefs_2 ON keywords_2.keywordid = keyworddefs_2.id
>  WHERE bugs.creation_ts IS NOT NULL
>    AND ( (security_map.group_id IS NULL OR security_map.group_id IN (1,10,11,14,12,13,9,4,8,5,6,7,3,2))
>         OR (bugs.reporter_accessible = 1 AND bugs.reporter = 1)
>         OR (bugs.cclist_accessible = 1 AND security_cc.who IS NOT NULL)
>         OR bugs.assigned_to = 1
> )
>    AND  bugs.resolution IN ('')  AND keyworddefs_1.name = 'batman' AND INSTR(keyworddefs_2.name, 'frog') = 0
> GROUP BY bugs.bug_id
> ORDER BY map_priority.sortkey, map_priority.value, map_bug_severity.sortkey, map_bug_severity.value
> LIMIT 500-
> To view or change your list settings, click here:
> <https://lists.bugzilla.org/cgi-bin/mj_wwwusr?user=jfearn@...>
>
I ran this on a copy of our DB and it seems to perform well. I did have
to chop off the order by as it's not leagl in Pg to have a group by
clause and then order by things not in the group by clause or an
aggregate function.

Cheers, Jeff.




signature.asc (484 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Towards faster keyword searches

Dylan Hardison

> On Sep 25, 2016, at 23:51, Jeff Fearn <[hidden email]> wrote:
>
>> There are a whole class of these -- anything that is a 'multiselect' type search. I believe someone suggested embedding these queries (at least as an option)
>> as a sub select and in general I think that's a good idea. An even nicer idea is to just let elastic search do the searching.
>> One of these approaches I hope to explore with an outreachy intern.
>
> The effect of using ID lists instead of sub selects is wider than just
> searching, as the above patch demonstrates.
>
Noted :-)


>
> I ran this on a copy of our DB and it seems to perform well. I did have
> to chop off the order by as it's not leagl in Pg to have a group by
> clause and then order by things not in the group by clause or an
> aggregate function.

I would hope that offending code is only generated by the (mysql) parts of Bugzilla::DB::*. I'll verify that before getting that ready for review.-
To view or change your list settings, click here:
<https://lists.bugzilla.org/cgi-bin/mj_wwwusr?user=lists+s6506n84121h51@...>
Reply | Threaded
Open this post in threaded view
|

Re: Towards faster keyword searches

Donald Lindsay
Here are two examples of IN lists, that I see in our MySQL slow-query log:

DELETE FROM `dist_abuild_rpm` WHERE `abuild_id` IN (1370624,
1315692, 1315697, 1315705, 1315708, 1315709, 1315718,
1315718,1315723, 1315735, 1315747, 1315751, 1315752,
1315768, 1315774, 1315784, 1315834, 791557, 1315865,
1355269, 1370631, 1315894, 1315922, 1315955, 1315981,
1315999, 1377469, 1316063, 1316170, 1355324, 1316209,
1316213, 1355329, 1316240, 1316248, 1355332, 1316262,
1316272, 1316281, 791998, 1316290, 792008, 1316304,
792017, 792019, 1355344, 1316331, 1316348, 792062, 792072,
1316371, 792093, 1316384, 792100, 792117, 792119, 1316413,
1316414, 792130, 792132, 792133, 1316427, 1316448, 792162,
792187, 792201, 792203, 1316512, 1316535, 1316536, 792256,
792260, 792266, 792275, 792282, 792288, 1355387, 792293,
792296, 792308, 792319, 792323, 792324, 792338, 792345,
792350, 792359, 1316649, 792366, 792375, 792377, 1316671,
792393, 792397, 792415, 1363507, 1355411, 792444, 1316738,
1355414, 792455); 


SELECT AutoTest.dut.dutspec as dutspec, propertyName,
value, startTime, endTime FROM rdam.dut
JOIN rdam.dut_property on dut_id=rdam.dut.id
JOIN AutoTest.dut ON AutoTest.dut.dutspec
LIKE CONCAT( '%', rdam.dut.name ) WHERE AutoTest.dut.dutspec
IN ( 'rdam://ck414', 'rdam://ck417', 'rdam://in343',
'rdam://ck411', 'rdam://ckp302', 'rdam://ol391',
'rdam://ckp203', 'rdam://in351', 'rdam://hs209',
'rdam://yr306', 'rdam://hs211', 'rdam://hs228',
'rdam://up486', 'rdam://hs207', 'rdam://snp104',
'rdam://hs205', 'rdam://snp105', 'rdam://ckp329',
'rdam://ckp304', 'rdam://nv443', 'rdam://ckp201',
'rdam://ckp323', 'rdam://ckp320', 'rdam://cd273',
'rdam://ckp308', 'rdam://ckp204', 'rdam://up301',
'rdam://gb305', 'rdam://nv406', 'rdam://hs106',
'rdam://fm393', 'rdam://bn102', 'rdam://fm202',
'rdam://ht103', 'rdam://in353', 'rdam://ck408',
'rdam://do411', 'rdam://do401', 'rdam://do466',
'rdam://do402', 'rdam://ol161', 'rdam://hs103',
'rdam://ckp328', 'rdam://lf327', 'rdam://ckp303',
'rdam://pts101', 'rdam://tg274', 'rdam://ckp322',
'rdam://wa401', 'rdam://hs212', 'rdam://hs214',
'rdam://lf120', 'rdam://wa444', 'rdam://ckp202',
'rdam://fm204', 'rdam://ol421', 'rdam://tg422',
'rdam://tcp106', 'rdam://yr302', 'rdam://cd343',
'rdam://cp131', 'rdam://cd369', 'rdam://psp103',
'rdam://hs112', 'rdam://psp101', 'rdam://ht238',
'rdam://lf244', 'rdam://nv210', 'rdam://in472',
'rdam://lp110', 'rdam://ht422', 'rdam://tcp105',
'rdam://in474' );      

If anyone feels like speeding them up, I'm interested.                 

On Sun, Sep 25, 2016 at 9:05 PM, Dylan Hardison <[hidden email]> wrote:

> On Sep 25, 2016, at 23:51, Jeff Fearn <[hidden email]> wrote:
>
>> There are a whole class of these -- anything that is a 'multiselect' type search. I believe someone suggested embedding these queries (at least as an option)
>> as a sub select and in general I think that's a good idea. An even nicer idea is to just let elastic search do the searching.
>> One of these approaches I hope to explore with an outreachy intern.
>
> The effect of using ID lists instead of sub selects is wider than just
> searching, as the above patch demonstrates.
>
Noted :-)


>
> I ran this on a copy of our DB and it seems to perform well. I did have
> to chop off the order by as it's not leagl in Pg to have a group by
> clause and then order by things not in the group by clause or an
> aggregate function.

I would hope that offending code is only generated by the (mysql) parts of Bugzilla::DB::*. I'll verify that before getting that ready for review.-
To view or change your list settings, click here:
<https://lists.bugzilla.org/cgi-bin/mj_wwwusr?user=lindsay@...>

Reply | Threaded
Open this post in threaded view
|

Re: Towards faster keyword searches: Followup

Dylan Hardison
In reply to this post by Dylan Hardison
I have a (still experimental) patch applied to bugzilla-dev to gauge usefulness.

https://bugzilla-dev.allizom.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&f0=OP&f1=OP&f2=keywords&f3=CP&f4=CP&f5=OP&f6=OP&f7=keywords&f8=CP&f9=CP&j1=OR&j6=OR&o2=equals&o7=equals&v2=regression&v7=intermittent-failure&list_id=725

numbers:

Total execution time: 0.087219 seconds (best)
Total execution time: 0.096627 seconds (most common)
Total execution time: 0.159978 seconds (worst)

production numbers:

Total execution time: 1.993044 seconds (best)
Total execution time: 2.047537 seconds (worst and also most common)

I think this is a keeper.

-
To view or change your list settings, click here:
<https://lists.bugzilla.org/cgi-bin/mj_wwwusr?user=lists+s6506n84121h51@...>
Reply | Threaded
Open this post in threaded view
|

Re: Towards faster keyword searches

Steve Wendt
In reply to this post by Dylan Hardison
> although you then start wondering  what the max number of joins is.

https://dev.mysql.com/doc/refman/5.5/en/joins-limits.html

_______________________________________________
dev-apps-bugzilla mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-apps-bugzilla
-
To view or change your list settings, click here:
<https://lists.bugzilla.org/cgi-bin/mj_wwwusr?user=lists+s6506n84121h51@...>