I've got a table (dexIndex) with 9 million rows in it. It's got an
identity column (index_id), a column that's full-text indexed
(insureds_name), and a column that I want to filter on (agency_number).
When I query like this:
select * from dexIndex where agency_number = '0349323' and
contains(insureds_name, '"SMITH"')
The execution plan involves a remote scan (98%) and a clustered index
seek (0%), which feed into a hash match/inner join (2%). The problem is
that the remote scan is returning 51,000 rows even though only 1 row comes
out of the hash match/inner join. Thus, the query takes over a minute
on a built-to-the-hilt server.
Is there a way to have the FT engine only look through or return a
subset of the FT index, like only those rows which correspond to
agency_number = '0349323'?
I'd like to avoid adding agency_number to the FT index.
Thanks!
|||Regrettably, there is no way of doing this. One strategy for getting
around this problem is to partition your table according to the other
elements of your where clause. Another option is to build full-text indexes
on indexed views in SQL 2005; again, the view would cover the section of
your table which matches your where conditions.
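For illustration, a minimal sketch of that indexed-view approach. The view
and index names are invented, and it assumes a default full-text catalog
already exists; SQL 2005 requires a schema-bound view with a unique
clustered index before the view can be full-text indexed:
-- Sketch only: a schema-bound view restricted to the agency being filtered.
CREATE VIEW dbo.vw_dexIndex_agency0349323 WITH SCHEMABINDING
AS
SELECT index_id, insureds_name, agency_number
FROM dbo.dexIndex
WHERE agency_number = '0349323';
GO
-- The unique clustered index makes this an indexed view, and it doubles
-- as the KEY INDEX for the full-text index.
CREATE UNIQUE CLUSTERED INDEX IX_vw_dexIndex_agency
ON dbo.vw_dexIndex_agency0349323 (index_id);
GO
CREATE FULLTEXT INDEX ON dbo.vw_dexIndex_agency0349323 (insureds_name)
KEY INDEX IX_vw_dexIndex_agency;
GO
CONTAINS queries against the view would then only search that agency's rows.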
"Ryan Walberg [MCSD]" <generalpf@.nospam.yahoo.reallynospam.com> wrote in
message news:%233kbwJZJFHA.588@.TK2MSFTNGP15.phx.gbl...
> I've got a table (dexIndex) with 9 million rows in it. It's got an
> identity column (index_id) and a column that's full-text indexed
> (insureds_name) and a column that I want to filter on (agency_number).
> When I query like this:
> select * from decIndex where agency_number = '0349323' and
> contains(insureds_name,'"SMITH"')
> The execution plan involves a remote scan (98%) and a clustered index seek
> (0%) which lead into a hash match/inner join (2%). The problem is that
> the remote scan is returning 51,000 rows even though only 1 comes out of
> the hash match/inner join. Thus, the query takes over a minute on a
> built-to-the-hilt server.
> Is there a way to have the FT engine only look through or return a subset
> of the FT index, like only those rows which correspond to agency_number =
> '0349323'?
> I'd like to avoid adding agency_number to the FT index.
> Thanks!
|||Hilary Cotter wrote:
> Regrettably, there is no way of doing this. [...]
I ended up clustering the unique index that the full-text index is keyed
on, and that sped it up a great deal.
Thanks,
Ryan
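For reference, that fix amounts to something like the following sketch; it
assumes index_id is the full-text key column and that the table has no
competing clustered index:
-- Cluster the unique index that the full-text index is keyed on, so the
-- lookups joining FT results back to the table become clustered seeks.
CREATE UNIQUE CLUSTERED INDEX IX_dexIndex_index_id
ON dbo.dexIndex (index_id);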
Efficiently Inserting 1 Million records
I have an app that needs to insert 1 million records into a table. The table
is very basic thus far, with no triggers or indexes on it. The procedure
takes in excess of an hour, which I am willing to accept if I have to, but I
would like to know what tools are available to streamline this process.
I am using VB.Net code to do the work with basically 1 million loops and an
insert for each one. Are there great gains in having stored procs do the
work vs. native insert statements, or any fancy database tuning techniques?
I know that indexes add SELECT efficiency, but what can be suggested for the
insert?
Thanks in advance for any assistance.
--
Ryan
|||Ryan,
1. BULK INSERT
2. BCP (IN)
3. DTS
--See SQL Books Online for more information on each
HTH
Jerry
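To illustrate options 1 and 2, a minimal sketch; the table name, file path,
and server are made up, and it presumes the app first writes the million
rows to a flat file:
-- Load the staged file in one set-based operation instead of a million
-- singleton INSERTs.
BULK INSERT dbo.TargetTable
FROM 'C:\data\rows.txt'
WITH (FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      TABLOCK,             -- table lock helps qualify for minimal logging
      BATCHSIZE = 100000); -- commit in chunks to keep the log in check
-- Roughly the same load via bcp from the command line:
-- bcp MyDb.dbo.TargetTable in C:\data\rows.txt -c -t, -T -S MyServer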
"Ryan" <weeims@.nospam.nospam> wrote in message
news:CE654CF4-AF12-453E-A552-D80336685BEB@.microsoft.com...
>I have an app that needs to insert 1 million records into a table. The
>table
> is very basic thus far, with no triggers or indexes on it. The procedure
> takes in excess of an hour, which I am willing to accept if I have to, but
> I
> would like to know what tools are available to streamline this process.
> I am using VB.Net code to do the work with basically 1 million loops and
> an
> insert for each one. Are there great gains in terms having Stored Procs
> do
> the work vs Native insert statements or any fancy database tuning
> techniques.
> I know that indexes add Select efficiency but what can be suggested for
> the
> insert?
> Thanks in advance for any assitance.
> --
> Ryan|||Use a bulk insert, BCP
http://sqlservercode.blogspot.com/
"Ryan" wrote:
> I have an app that needs to insert 1 million records into a table. The ta
ble
> is very basic thus far, with no triggers or indexes on it. The procedure
> takes in excess of an hour, which I am willing to accept if I have to, but
I
> would like to know what tools are available to streamline this process.
> I am using VB.Net code to do the work with basically 1 million loops and a
n
> insert for each one. Are there great gains in terms having Stored Procs d
o
> the work vs Native insert statements or any fancy database tuning techniqu
es.
> I know that indexes add Select efficiency but what can be suggested for th
e
> insert?
> Thanks in advance for any assitance.
> --
> Ryan|||Inserting the rows one at time from a client application would be the
absolute slowest method of getting the work done.
You can create a DTS package to import the data. This might be the best
option if data transformations are involved and/or the process needs to be
scheduled as a job. There are also the BULK INSERT T-SQL command and the
bcp command-line executable.
Importing and Exporting Data with DTS and BCP
http://www.microsoft.com/technet/pr...s/c07ppcsq.mspx
Using the DTS Import/Export Wizard
http://www.microsoft.com/mspress/bo...p/4885c.asp#126
Using the Bulk Copy Program (Bcp) and the BULK INSERT Transact-SQL Statement
http://www.microsoft.com/mspress/bo...p/4885e.asp#150
SQL Server 2000 Incremental Bulk Load Case Study
http://www.microsoft.com/technet/pr...n/ss2kidbp.mspx
"Ryan" <weeims@.nospam.nospam> wrote in message
news:CE654CF4-AF12-453E-A552-D80336685BEB@.microsoft.com...
>I have an app that needs to insert 1 million records into a table. The
>table
> is very basic thus far, with no triggers or indexes on it. The procedure
> takes in excess of an hour, which I am willing to accept if I have to, but
> I
> would like to know what tools are available to streamline this process.
> I am using VB.Net code to do the work with basically 1 million loops and
> an
> insert for each one. Are there great gains in terms having Stored Procs
> do
> the work vs Native insert statements or any fancy database tuning
> techniques.
> I know that indexes add Select efficiency but what can be suggested for
> the
> insert?
> Thanks in advance for any assitance.
> --
> Ryan
Efficient record storage 10+ Million records
I have a database that stores survey information. I've been asked to add
information to it to test scalability. Well, it's hurting.
What I have is a 'sample' table for the people that took the survey, and
then several related tables.
Let's say the sample table looks something like:
tblData_Sample
SampleID INT IDENTITY(1,1)
FullName VARCHAR(20)
AgeGroupID INT
IncomeID INT
With a clustered index on SampleID and indexes on AgeGroupID and IncomeID.
I also have a table for storing the survey responses. Right now it's set up
like:
tblData_Grid
SampleID INT
RestaurantID INT
QuestionID INT
Response INT
There is also a table for storing completion information:
tblData_Session
SampleID INT
StartDate DATETIME
EndDate DATETIME
with a clustered index on SampleID. tblData_Grid has a clustered index on
(SampleID, RestaurantID, QuestionID) and indexes on RestaurantID and
QuestionID.
The problem I'm having is that after scaling out to only 93,000 completed
surveys, the data table is at 10,000,000 records.
Needless to say, OUCH!!
Queries of the data table are painful... I've rebuilt all of my indexes, so
I know they're up to date, and I'm kind of stuck on what to do for speed
gains at this point.
It's taking way too long to process (10 seconds on the development server;
I haven't bothered trying the live cluster yet):
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
WHERE g.SampleID IN
(
SELECT s.SampleID
FROM tblData_Session s
WHERE s.EndDate IS NOT NULL
)
So what are some things I can do to speed things up here? I've thought
about taking all the questions that are asked of each respondent and moving
them to a horizontal table.
I've also considered using partitioned views. Perhaps store data in
mini-grid tables based on the sample quarter or RestaurantID, and then use
a partitioned view to simulate the grid table...
Before I start making any sweeping changes, I was hoping that someone here
could steer me in the right direction.
Thanks,
Ben
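A sketch of the partitioned-view idea floated above, with invented quarter
tables; for the optimizer to skip member tables a query cannot touch, the
partitioning column needs a CHECK constraint and must be part of the
primary key:
CREATE TABLE dbo.tblData_Grid_2005Q1 (
    SampleID INT NOT NULL,
    RestaurantID INT NOT NULL,
    QuestionID INT NOT NULL,
    Response INT NOT NULL,
    SampleQuarter CHAR(6) NOT NULL CHECK (SampleQuarter = '2005Q1'),
    PRIMARY KEY (SampleQuarter, SampleID, RestaurantID, QuestionID)
);
CREATE TABLE dbo.tblData_Grid_2005Q2 (
    SampleID INT NOT NULL,
    RestaurantID INT NOT NULL,
    QuestionID INT NOT NULL,
    Response INT NOT NULL,
    SampleQuarter CHAR(6) NOT NULL CHECK (SampleQuarter = '2005Q2'),
    PRIMARY KEY (SampleQuarter, SampleID, RestaurantID, QuestionID)
);
GO
-- Queries against the view that filter on SampleQuarter touch only the
-- matching member table.
CREATE VIEW dbo.tblData_Grid_All AS
SELECT SampleID, RestaurantID, QuestionID, Response, SampleQuarter
FROM dbo.tblData_Grid_2005Q1
UNION ALL
SELECT SampleID, RestaurantID, QuestionID, Response, SampleQuarter
FROM dbo.tblData_Grid_2005Q2;
GO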
|||Hi Ben
Have you considered trying a simple join rather than nested queries? E.g.:
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
INNER JOIN tblData_Session s
ON (g.SampleID = s.SampleID)
WHERE (s.EndDate IS NOT NULL)
Also, there is a question as to what you are actually using the data for
and how you intend to analyse it. The design answers are very different
depending upon the uses. The query above is, I would guess, going to return
a massive amount of data. As I believe Mr William Vaughn (sir) is wont to
ask (dotNetRocks #100): how small a font do you use to display that to your
users?
If what you want is to display a count of samples for each restaurant or
some such, then test with that style of aggregate query. Also, you
presumably have a separate table mapping Restaurant IDs to their details,
plus supporting tables not mentioned.
Yours
Alasdair Russell
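For instance, the count-per-restaurant shape suggested above might be
tested as follows (a sketch using only names from the thread):
-- One aggregate pass: completed samples per restaurant, ready for a chart.
SELECT g.RestaurantID, COUNT(DISTINCT g.SampleID) AS CompletedSamples
FROM tblData_Grid g
INNER JOIN tblData_Session s
ON (s.SampleID = g.SampleID)
WHERE (s.EndDate IS NOT NULL)
GROUP BY g.RestaurantID;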
"Ben" wrote:
> I have a database that stores survey information. I've been asked to add
> information to it to test scalability. Well, its' hurting.
> What I have is a 'sample' table or the people that took the survey and the
n
> several related tables.
> Lets' say the sample table looks something like
> tblData_Sample
> SampleID INT IDENTITY(1,1)
> FullName VARCHAR(20)
> AgeGroupID INT
> IncomeID INT
> With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
> I also have a table for storing the survey responses. Right now its' setu
p
> like
> tblData_Grid
> SampleID INT
> RestaurantID INT
> QuestionID INT
> Response INT
> There is also a table for storing complete information
> tblData_Session
> SampleID INT
> StartDate DATETIME
> EndDate DATETIME
> With a clustered index on SampleID
> With a clustered index on SampleID, RestaurantID, QuestionID and Indexes o
n
> RestaurantID and QuestionID.
> The problem I'm having is that after scaling out to only 93,000 completed
> surveys, the data table is at 10,000,000 records.
> Needless to say OUCH!!
> Queries of the data table are painful... I've rebuilt all of my indexes,
so
> I know they're up-to-date and I'm kinda of stuck on what to do for speed
> gains at this point.
> It's taking way to long to process (10 seconds on the development server,
I
> haven't bothered trying the live cluster yet)
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> WHERE g.SampleID IN
> (
> SELECT s.SampleID
> FROM tblData_Session s
> WHERE s.EndDate IS NOT NULL
> )
>
> So what are things I can to do speed things up here? I've thought about
> taking all questions that are asked of each respondent and moving them to
a
> horizontal table.
> I've also considered using partitioned views. Perhaps store data in
> mini-grid tables based on the sample quarter or restaurantid and then usin
g
> a partitioned view to simulate the grid table...
> Before I start making any sweeping changes, I was hoping that someone here
> could steer me in the right direction.
> Thanks,
> Ben
>
|||Dhericean,
Thanks for the fast response. Yes, I've tried the join and, believe it or
not, it's taking slightly longer... Looking at the execution plan, there is
an extra aggregation involved with the join, and that's amounting to more
reads.
All of the supporting tables are present and exist in a similar fashion to
the ones above.
This query is not used to report data to the end user; rather, it is used
to preselect record IDs that match the user's query. Then they go through
the system picking 'measures' and 'analysis groups' that calculate
aggregated stats for the records in the user's initial query. I've found
that staging these IDs is better than running "select these records and
process them"; the approach as it stands now says, "populate selection
table, select additional parameters, merge query onto selection table".
The end result populates a data table and a chart. Typically there are
fewer than ten data points per analysis. I hope this helps clear things up.
Maybe a better way to describe it is that the initial query defines the
subset of data to be viewed, and that query is cached to reduce its impact
on subsequent requests where the data returned by the query is crunched,
effectively separating selection processing from calculation processing.
Thanks,
Ben
"Dhericean" <Dhericean@.discussions.microsoft.com> wrote in message
news:B5C02FF4-E48B-4410-9C96-F9A1233F6138@.microsoft.com...
> Hi Ben
> Have you considered trying a simple join rather than nested queries? e.g.
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> INNER JOIN tblData_Session s
> ON (g.SampleID = s.SampleID)
> WHERE (s.EndDate IS NOT NULL)
> Also there is a question as to what you are actually using the data for
> and
> how you intend to analyse it? The Design answers are very different
> depending upon the uses. The query above is, I would guess, going to
> return
> a massive amount of data. As I believe Mr William Vaughn (sir) is wont to
> ask (dotNetRocks #100) - How small a font do you use to display that to
> your
> users?
> If what you want is to display a count of samples for each restaurant or
> some such then test with that style of aggregate query. Also you
> presumably
> have a separate table to map Restaurant IDs to their details and
> supporting
> tables not mentioned.
> Yours
> Alasdair Russell
> "Ben" wrote:
|||How about:
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
WHERE EXISTS (SELECT s.SampleID
              FROM tblData_Session s
              WHERE s.SampleID = g.SampleID
                AND s.EndDate IS NOT NULL)
"Ben" wrote:
> I have a database that stores survey information. I've been asked to add
> information to it to test scalability. Well, its' hurting.
> What I have is a 'sample' table or the people that took the survey and the
n
> several related tables.
> Lets' say the sample table looks something like
> tblData_Sample
> SampleID INT IDENTITY(1,1)
> FullName VARCHAR(20)
> AgeGroupID INT
> IncomeID INT
> With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
> I also have a table for storing the survey responses. Right now its' setu
p
> like
> tblData_Grid
> SampleID INT
> RestaurantID INT
> QuestionID INT
> Response INT
> There is also a table for storing complete information
> tblData_Session
> SampleID INT
> StartDate DATETIME
> EndDate DATETIME
> With a clustered index on SampleID
> With a clustered index on SampleID, RestaurantID, QuestionID and Indexes o
n
> RestaurantID and QuestionID.
> The problem I'm having is that after scaling out to only 93,000 completed
> surveys, the data table is at 10,000,000 records.
> Needless to say OUCH!!
> Queries of the data table are painful... I've rebuilt all of my indexes,
so
> I know they're up-to-date and I'm kinda of stuck on what to do for speed
> gains at this point.
> It's taking way to long to process (10 seconds on the development server,
I
> haven't bothered trying the live cluster yet)
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> WHERE g.SampleID IN
> (
> SELECT s.SampleID
> FROM tblData_Session s
> WHERE s.EndDate IS NOT NULL
> )
>
> So what are things I can to do speed things up here? I've thought about
> taking all questions that are asked of each respondent and moving them to
a
> horizontal table.
> I've also considered using partitioned views. Perhaps store data in
> mini-grid tables based on the sample quarter or restaurantid and then usin
g
> a partitioned view to simulate the grid table...
> Before I start making any sweeping changes, I was hoping that someone here
> could steer me in the right direction.
> Thanks,
> Ben
>
>