Sql Edition Selection: Efficient record storage 10+ Million records

I have a database that stores survey information. I've been asked to add
information to it to test scalability. Well, its' hurting.
What I have is a 'sample' table or the people that took the survey and then
several related tables.
Lets' say the sample table looks something like
tblData_Sample
SampleID INT IDENTITY(1,1)
FullName VARCHAR(20)
AgeGroupID INT
IncomeID INT
With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
I also have a table for storing the survey responses. Right now its' setup
like
tblData_Grid
SampleID INT
RestaurantID INT
QuestionID INT
Response INT
There is also a table for storing complete information
tblData_Session
SampleID INT
StartDate DATETIME
EndDate DATETIME
With a clustered index on SampleID
With a clustered index on SampleID, RestaurantID, QuestionID and Indexes on
RestaurantID and QuestionID.
The problem I'm having is that after scaling out to only 93,000 completed
surveys, the data table is at 10,000,000 records.
Needless to say OUCH!!
Queries of the data table are painful... I've rebuilt all of my indexes, so
I know they're up-to-date and I'm kinda of stuck on what to do for speed
gains at this point.
It's taking way to long to process (10 seconds on the development server, I
haven't bothered trying the live cluster yet)
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
WHERE g.SampleID IN
(
SELECT s.SampleID
FROM tblData_Session s
WHERE s.EndDate IS NOT NULL
)
So what are things I can to do speed things up here? I've thought about
taking all questions that are asked of each respondent and moving them to a
horizontal table.
I've also considered using partitioned views. Perhaps store data in
mini-grid tables based on the sample quarter or restaurantid and then using
a partitioned view to simulate the grid table...
Before I start making any sweeping changes, I was hoping that someone here
could steer me in the right direction.
Thanks,
BenHi Ben
Have you considered trying a simple join rather than nested queries? e.g.
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
INNER JOIN tblData_Session s
ON (g.SampleID = s.SampleID)
WHERE (s.EndDate IS NOT NULL)
Also there is a question as to what you are actually using the data for and
how you intend to analyse it? The Design answers are very different
depending upon the uses. The query above is, I would guess, going to return
a massive amount of data. As I believe Mr William Vaughn (sir) is wont to
ask (dotNetRocks #100) - How small a font do you use to display that to your
users?
If what you want is to display a count of samples for each restaurant or
some such then test with that style of aggregate query. Also you presumably
have a separate table to map Restaurant IDs to their details and supporting
tables not mentioned.
Yours
Alasdair Russell
"Ben" wrote:

> I have a database that stores survey information. I've been asked to add
> information to it to test scalability. Well, its' hurting.
> What I have is a 'sample' table or the people that took the survey and the
n
> several related tables.
> Lets' say the sample table looks something like
> tblData_Sample
> SampleID INT IDENTITY(1,1)
> FullName VARCHAR(20)
> AgeGroupID INT
> IncomeID INT
> With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
> I also have a table for storing the survey responses. Right now its' setu
p
> like
> tblData_Grid
> SampleID INT
> RestaurantID INT
> QuestionID INT
> Response INT
> There is also a table for storing complete information
> tblData_Session
> SampleID INT
> StartDate DATETIME
> EndDate DATETIME
> With a clustered index on SampleID
> With a clustered index on SampleID, RestaurantID, QuestionID and Indexes o
n
> RestaurantID and QuestionID.
> The problem I'm having is that after scaling out to only 93,000 completed
> surveys, the data table is at 10,000,000 records.
> Needless to say OUCH!!
> Queries of the data table are painful... I've rebuilt all of my indexes,
so
> I know they're up-to-date and I'm kinda of stuck on what to do for speed
> gains at this point.
> It's taking way to long to process (10 seconds on the development server,
I
> haven't bothered trying the live cluster yet)
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> WHERE g.SampleID IN
> (
> SELECT s.SampleID
> FROM tblData_Session s
> WHERE s.EndDate IS NOT NULL
> )
>
> So what are things I can to do speed things up here? I've thought about
> taking all questions that are asked of each respondent and moving them to
a
> horizontal table.
> I've also considered using partitioned views. Perhaps store data in
> mini-grid tables based on the sample quarter or restaurantid and then usin
g
> a partitioned view to simulate the grid table...
> Before I start making any sweeping changes, I was hoping that someone here
> could steer me in the right direction.
> Thanks,
> Ben
>
>|||Dhericean,
Thanks for the fast response. Yes, I've tried the join and believe it or
not its' taking slightly longer... Looking at the execution plan there is
an extra aggregation involved with the join and that's amounting to more
reads.
All of the supporting tables are present and exist in a similar fashion to
the ones below.
This query is not used to report data to the end user rather it is used to
preselect records ID's that match the users query. Then they go through the
system picking 'measures' and 'analysis groups' that calculate aggregated
stats for the records in the users initial query. I've found that staging
these ID's is better than running "select these records and process them",
the approach as it stands now says, "populate selection table, select
additional parameters, merge query onto selection table".
The end result populates a data table and a chart. Typically there are less
than ten data points per analysis. I hope this helps clear things up.
Maybe a better way to describe it is that the initial query defines the
subset of data to be viewed, and that query is cached to reduce its' impact
on subsequent requests where the data returned by the query is crunched,
efficivly seperating selection processing away from calculation processing.
Thanks,
Ben
"Dhericean" <Dhericean@.discussions.microsoft.com> wrote in message
news:B5C02FF4-E48B-4410-9C96-F9A1233F6138@.microsoft.com...
> Hi Ben
> Have you considered trying a simple join rather than nested queries? e.g.
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> INNER JOIN tblData_Session s
> ON (g.SampleID = s.SampleID)
> WHERE (s.EndDate IS NOT NULL)
> Also there is a question as to what you are actually using the data for
> and
> how you intend to analyse it? The Design answers are very different
> depending upon the uses. The query above is, I would guess, going to
> return
> a massive amount of data. As I believe Mr William Vaughn (sir) is wont to
> ask (dotNetRocks #100) - How small a font do you use to display that to
> your
> users?
> If what you want is to display a count of samples for each restaurant or
> some such then test with that style of aggregate query. Also you
> presumably
> have a separate table to map Restaurant IDs to their details and
> supporting
> tables not mentioned.
> Yours
> Alasdair Russell
> "Ben" wrote:
>|||How about;
select distinct g.sampleid, g.RestaurantID
FROM tblData_Grid g
WHERE exists (SELECT s.SampleID
FROM tblData_Session s
WHERE s.sampleid = g.sampleid and
s.EndDate IS NOT NULL)
"Ben" wrote:

> I have a database that stores survey information. I've been asked to add
> information to it to test scalability. Well, its' hurting.
> What I have is a 'sample' table or the people that took the survey and the
n
> several related tables.
> Lets' say the sample table looks something like
> tblData_Sample
> SampleID INT IDENTITY(1,1)
> FullName VARCHAR(20)
> AgeGroupID INT
> IncomeID INT
> With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
> I also have a table for storing the survey responses. Right now its' setu
p
> like
> tblData_Grid
> SampleID INT
> RestaurantID INT
> QuestionID INT
> Response INT
> There is also a table for storing complete information
> tblData_Session
> SampleID INT
> StartDate DATETIME
> EndDate DATETIME
> With a clustered index on SampleID
> With a clustered index on SampleID, RestaurantID, QuestionID and Indexes o
n
> RestaurantID and QuestionID.
> The problem I'm having is that after scaling out to only 93,000 completed
> surveys, the data table is at 10,000,000 records.
> Needless to say OUCH!!
> Queries of the data table are painful... I've rebuilt all of my indexes,
so
> I know they're up-to-date and I'm kinda of stuck on what to do for speed
> gains at this point.
> It's taking way to long to process (10 seconds on the development server,
I
> haven't bothered trying the live cluster yet)
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> WHERE g.SampleID IN
> (
> SELECT s.SampleID
> FROM tblData_Session s
> WHERE s.EndDate IS NOT NULL
> )
>
> So what are things I can to do speed things up here? I've thought about
> taking all questions that are asked of each respondent and moving them to
a
> horizontal table.
> I've also considered using partitioned views. Perhaps store data in
> mini-grid tables based on the sample quarter or restaurantid and then usin
g
> a partitioned view to simulate the grid table...
> Before I start making any sweeping changes, I was hoping that someone here
> could steer me in the right direction.
> Thanks,
> Ben
>
>

Sunday, February 19, 2012

Efficient record storage 10+ Million records

No comments:

Post a Comment

Sql Edition Selection

Blog Archive

About Me