Friday, February 17, 2012

Efficiency of a ''SELECT TOP'' style GROUP BY query: FREETEXT vs. FREETEXTTABLE

Hi,

Please have a look at the following two queries, the purpose of which is to find which ten users (represented by 'Username') have created the most records which contain the term 'foo':

SELECT TOP 10 Username, COUNT(*) AS [Count] FROM Options

WHERE FREETEXT(*, 'foo')

GROUP BY Username

ORDER BY [Count] DESC

SELECT TOP 10 Username, COUNT(*) AS [Count] FROM Options

JOIN FREETEXTTABLE (Options, *, 'foo', 500) ct

ON OptionID = ct.[KEY]

GROUP BY Username

ORDER BY [Count] DESC

They both produce the same result set. However, I am wondering which is more performant. At first glance, it would seem the first one would be. It doesn't involve a JOIN and should, therefore, be more efficient.

But this depends on how the FREETEXT expression is evaluated. My concern is that internally, SQL Server would generate an entire recordset based on 'WHERE FREETEXT(*, 'foo')', which could be thousands of records, and only then restrict this to the TOP 10 by COUNT.

If this does happen, then it would be better to join to a FREETEXTTABLE, where I can at least restrict the result set using the 'top_n_by_rank' parameter (which is set as '500' in this case, as this seems a good balance of performance against the likely number of duplicates I will get in my FREETEXTTABLE results).

So... I am worrying about this unnecessarily? Should I just use the simpler first version?

Any thoughts appreciated.

Thanks

They are almost identical. Implicitly when you do freetext(*,'foo') the system will convert it to a join based on the key. That means the first query will be implicitly converted to:

Code Snippet

select top 10 username, count(*) [count]

from options o join freetexttable(options,*,'foo') ct on o.optionid=ct.key

group by username

order by [count] desc

which is obviously not as efficient as the second query. However, the result between the two will also differ. For the first query, you're doing the join, group by, and then select the top N. The second you limit the first 500 from FTS query before doing the join, group by, and then select top N. So, you've been lucky thus far because the top 500 returned by the FTS contain the top desired 10 for your second query.

|||

Hi,

Thanks very much - that is exactly the info I was looking for.

I chose '500' as the top_n_by_rank parameter based on a calculation that, having considered all the other factors in the project I am working on, I am likely to get at least 10 results in the second query in the vast majority of cases.

Thanks for your help.

No comments:

Post a Comment