Showing posts with label stores. Show all posts
Showing posts with label stores. Show all posts

Sunday, February 26, 2012

eliminating redundant data

edit: this came out longer than I thought, any comments about anything
here is greatly appreciated. thank you for reading

My system stores millions of records, each with fields like firstname,
lastname, email address, city, state, zip, along with any number of user
defined fields. The application allows users to define message templates
with variables. They can then select a template, and for each variable
in the template, type in a value or select a field.

The system allows you to query for messages you've sent by specifying
criteria for the variables (not the fields).

This requirement has made it difficult to normalize my datamodel at all
for speed. What I have is this:

[fieldindex]
id int PK
name nvarchar
type datatype

[recordindex]
id int PK
...

[recordvalues]
recordid int PK
fieldid int PK
value nvarchar

whenever messages are sent, I store which fields were mapped to what
variables for that deployment. So the query with a variable criteria
looks like this:

select coalesce(vm.value, rv.value)
from sentmessages sm
inner join variablemapping vm on vm.deploymentid=sm.deploymentid
left outer join recordvalues rv on
rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
where coalesce(vm.value, rv.value) ...

this model works pretty well for searching messages with variable
criteria and looking up variable values for a particular message. the
big problem I have is that the recordvalues table is HUGE, 1 million
records with 50 fields each = 50 million recordvalues rows. The value,
two int columns plus the two indexes I have on the table make it into a
beast. Importing data takes forever. Querying the records (with a field
criteria) also takes longer than it should.

makes sense, the performance was largely IO bound.

I decided to try and cut into that IO. looking at a recordvalues table
with over 100 million rows in it, there were only about 3 million unique
values. so I split the recordvalues table into two tables:

[recordvalues]
recordid int PK
fieldid int PK
valueid int

[valueindex]
id int PK
value nvarchar (unique)

now, valueindex holds 3 million unique values and recordvalues
references them by id. to my suprise this shaved only 500mb off a 4gb
database!

importing didn't get any faster either, although it's no longer IO bound
it appears the cpu as the new bottleneck outweighed the IO bottleneck.
this is probably because I haven't optimized the queries for the new
tables (was hoping it wouldn't be so hard w/o the IO problem).

is there a better way to accomplish what I'm trying to do? (eliminate
the redundant data).. does SQL have built-in constructs to do stuff like
this? It seems like maybe I'm trying to duplicate functionality at a
high level that may already exist at a lower level.

IO is becoming a serious bottleneck.
the million record 50 field csv file is only 500mb. I would've thought
that after eliminating all the redundant first name, city, last name,
etc it would be less data and not 8x more!

-
Gordon

Posted Via Usenet.com Premium Usenet Newsgroup Services
------------------
** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
------------------
http://www.usenet.comNo database vendor, version, platform... Why are you using tables as indexes
when you can use indexes as indexes? Normalizing a data model generally
slows it down. Denormalizing generally speeds things up.

"gordy" <gordy@.dynamicsdirect.com> wrote in message
news:40be960f$1_1@.Usenet.com...
> edit: this came out longer than I thought, any comments about anything
> here is greatly appreciated. thank you for reading
> My system stores millions of records, each with fields like firstname,
> lastname, email address, city, state, zip, along with any number of user
> defined fields. The application allows users to define message templates
> with variables. They can then select a template, and for each variable
> in the template, type in a value or select a field.
> The system allows you to query for messages you've sent by specifying
> criteria for the variables (not the fields).
> This requirement has made it difficult to normalize my datamodel at all
> for speed. What I have is this:
> [fieldindex]
> id int PK
> name nvarchar
> type datatype
> [recordindex]
> id int PK
> ...
> [recordvalues]
> recordid int PK
> fieldid int PK
> value nvarchar
> whenever messages are sent, I store which fields were mapped to what
> variables for that deployment. So the query with a variable criteria
> looks like this:
> select coalesce(vm.value, rv.value)
> from sentmessages sm
> inner join variablemapping vm on vm.deploymentid=sm.deploymentid
> left outer join recordvalues rv on
> rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
> where coalesce(vm.value, rv.value) ...
> this model works pretty well for searching messages with variable
> criteria and looking up variable values for a particular message. the
> big problem I have is that the recordvalues table is HUGE, 1 million
> records with 50 fields each = 50 million recordvalues rows. The value,
> two int columns plus the two indexes I have on the table make it into a
> beast. Importing data takes forever. Querying the records (with a field
> criteria) also takes longer than it should.
> makes sense, the performance was largely IO bound.
> I decided to try and cut into that IO. looking at a recordvalues table
> with over 100 million rows in it, there were only about 3 million unique
> values. so I split the recordvalues table into two tables:
> [recordvalues]
> recordid int PK
> fieldid int PK
> valueid int
> [valueindex]
> id int PK
> value nvarchar (unique)
> now, valueindex holds 3 million unique values and recordvalues
> references them by id. to my suprise this shaved only 500mb off a 4gb
> database!
> importing didn't get any faster either, although it's no longer IO bound
> it appears the cpu as the new bottleneck outweighed the IO bottleneck.
> this is probably because I haven't optimized the queries for the new
> tables (was hoping it wouldn't be so hard w/o the IO problem).
> is there a better way to accomplish what I'm trying to do? (eliminate
> the redundant data).. does SQL have built-in constructs to do stuff like
> this? It seems like maybe I'm trying to duplicate functionality at a
> high level that may already exist at a lower level.
> IO is becoming a serious bottleneck.
> the million record 50 field csv file is only 500mb. I would've thought
> that after eliminating all the redundant first name, city, last name,
> etc it would be less data and not 8x more!
> -
> Gordon
> Posted Via Usenet.com Premium Usenet Newsgroup Services
> ------------------
> ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
> ------------------
> http://www.usenet.com|||> No database vendor, version, platform... Why are you using tables as indexes
> when you can use indexes as indexes? Normalizing a data model generally
> slows it down. Denormalizing generally speeds things up.

sorry, I'm using MS SQL2000

How can I use indexes as indexes? I mean, in the example I posted, can
you give an example?

Posted Via Usenet.com Premium Usenet Newsgroup Services
------------------
** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
------------------
http://www.usenet.com|||Well, I don't know SQL Server, but in Oracle, you create an index using the
CREATE INDEX statement. I suspect it works the same or similar in SQL
Server.

Here's an Oracle example that creates an index called
"asearch_client_id_idx" on the client_id field in a table called
"alphasearch" owned by user "alphasearch":

CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);

Now, I am about to lie a little bit for simplicity's sake, but here goes...

When a query is executed that uses client_id as a search or sort criteria,
the Oracle optimizer will decide whether or not to use the index. If it
does, it looks up the values needed in the index, and retrieves their row
ids, which in turn are essentially pointers to the location of the data in
data blocks on disc, so it goes dirctly to that location on disc and
retrieves the data out of the blocks. It does not need to go logically into
the table. Note that what I refer to as row_id in Oracle may not be the same
concept in SQL Server.

Hope you get the general idea, and you should consult your documentation
about indexes.

"gordy" <gordy@.dynamicsdirect.com> wrote in message
news:40bf5ec1$1_1@.Usenet.com...
> > No database vendor, version, platform... Why are you using tables as
indexes
> > when you can use indexes as indexes? Normalizing a data model generally
> > slows it down. Denormalizing generally speeds things up.
> sorry, I'm using MS SQL2000
> How can I use indexes as indexes? I mean, in the example I posted, can
> you give an example?
> Posted Via Usenet.com Premium Usenet Newsgroup Services
> ------------------
> ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
> ------------------
> http://www.usenet.com|||> Well, I don't know SQL Server, but in Oracle, you create an index using the
> CREATE INDEX statement. I suspect it works the same or similar in SQL
> Server.
> Here's an Oracle example that creates an index called
> "asearch_client_id_idx" on the client_id field in a table called
> "alphasearch" owned by user "alphasearch":
> CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
> ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);

wow, what a concept ;)

I appreciate the criticism.. after all that's the intent of my original
post, however, I would prefer it to be of the constructive variety.

In my system, there are 'fields' and there are 'variables'. the user
creates the relationships between them whenever they send a message. in
order to search for messages by 'variable' values, sql needs a
relationship of its own to translate between them.

this has kept me from being able to use the obvious:
[records]
id,field1,field2,field3,...

because in a query for 'variable1', depending on the message it may have
to look at 'field3' or 'field4' for the value. this requirement is why I
have the tables I have now (recordindex, fieldindex and recordvalues).

I realize this makes for very large indexes.. and like you said, the
table itself is nothing more than a big index. This is the problem I'd
like to solve. In my original post I explained how I attempted to
eliminate redundant data, but I only eliminated 500mb (of 4gb) because
the majority of volume in this db isn't the data itself, but the index size.

Posted Via Usenet.com Premium Usenet Newsgroup Services
------------------
** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
------------------
http://www.usenet.com|||"gordy" <gordy@.dynamicsdirect.com> wrote in message
news:40bf9c49$1_1@.Usenet.com...
> > Well, I don't know SQL Server, but in Oracle, you create an index using
the
> > CREATE INDEX statement. I suspect it works the same or similar in SQL
> > Server.
> > Here's an Oracle example that creates an index called
> > "asearch_client_id_idx" on the client_id field in a table called
> > "alphasearch" owned by user "alphasearch":
> > CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
> > ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);
> wow, what a concept ;)
> I appreciate the criticism.. after all that's the intent of my original
> post, however, I would prefer it to be of the constructive variety.

It wasn't criticism. I thought you really didn't know about RDBMS indexes,
being that you essentially created your own.

> In my system, there are 'fields' and there are 'variables'. the user
> creates the relationships between them whenever they send a message. in
> order to search for messages by 'variable' values, sql needs a
> relationship of its own to translate between them.
> this has kept me from being able to use the obvious:
> [records]
> id,field1,field2,field3,...
> because in a query for 'variable1', depending on the message it may have
> to look at 'field3' or 'field4' for the value. this requirement is why I
> have the tables I have now (recordindex, fieldindex and recordvalues).
> I realize this makes for very large indexes.. and like you said, the
> table itself is nothing more than a big index. This is the problem I'd
> like to solve. In my original post I explained how I attempted to
> eliminate redundant data, but I only eliminated 500mb (of 4gb) because
> the majority of volume in this db isn't the data itself, but the index
size.

The only similar situation I've seen like this (home-brew index constructs)
is with a document imaging system called FileNET. In that case, the vendor
actually created its own mini RDBMS to handle just these index/table
constructs. It was very fast, but, of course, proprietary.

It's hard to tell exactly what you are trying to do, though. Could you get
into the business requirements a bit? It would help me to understand what
you need to do. It get the feeling from the solution you came up with that
you are a programmer, not a DBA.

> Posted Via Usenet.com Premium Usenet Newsgroup Services
> ------------------
> ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
> ------------------
> http://www.usenet.com

Sunday, February 19, 2012

Efficient record storage 10+ Million records

I have a database that stores survey information. I've been asked to add
information to it to test scalability. Well, its' hurting.
What I have is a 'sample' table or the people that took the survey and then
several related tables.
Lets' say the sample table looks something like
tblData_Sample
SampleID INT IDENTITY(1,1)
FullName VARCHAR(20)
AgeGroupID INT
IncomeID INT
With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
I also have a table for storing the survey responses. Right now its' setup
like
tblData_Grid
SampleID INT
RestaurantID INT
QuestionID INT
Response INT
There is also a table for storing complete information
tblData_Session
SampleID INT
StartDate DATETIME
EndDate DATETIME
With a clustered index on SampleID
With a clustered index on SampleID, RestaurantID, QuestionID and Indexes on
RestaurantID and QuestionID.
The problem I'm having is that after scaling out to only 93,000 completed
surveys, the data table is at 10,000,000 records.
Needless to say OUCH!!
Queries of the data table are painful... I've rebuilt all of my indexes, so
I know they're up-to-date and I'm kinda of stuck on what to do for speed
gains at this point.
It's taking way to long to process (10 seconds on the development server, I
haven't bothered trying the live cluster yet)
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
WHERE g.SampleID IN
(
SELECT s.SampleID
FROM tblData_Session s
WHERE s.EndDate IS NOT NULL
)
So what are things I can to do speed things up here? I've thought about
taking all questions that are asked of each respondent and moving them to a
horizontal table.
I've also considered using partitioned views. Perhaps store data in
mini-grid tables based on the sample quarter or restaurantid and then using
a partitioned view to simulate the grid table...
Before I start making any sweeping changes, I was hoping that someone here
could steer me in the right direction.
Thanks,
BenHi Ben
Have you considered trying a simple join rather than nested queries? e.g.
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
INNER JOIN tblData_Session s
ON (g.SampleID = s.SampleID)
WHERE (s.EndDate IS NOT NULL)
Also there is a question as to what you are actually using the data for and
how you intend to analyse it? The Design answers are very different
depending upon the uses. The query above is, I would guess, going to return
a massive amount of data. As I believe Mr William Vaughn (sir) is wont to
ask (dotNetRocks #100) - How small a font do you use to display that to your
users?
If what you want is to display a count of samples for each restaurant or
some such then test with that style of aggregate query. Also you presumably
have a separate table to map Restaurant IDs to their details and supporting
tables not mentioned.
Yours
Alasdair Russell
"Ben" wrote:

> I have a database that stores survey information. I've been asked to add
> information to it to test scalability. Well, its' hurting.
> What I have is a 'sample' table or the people that took the survey and the
n
> several related tables.
> Lets' say the sample table looks something like
> tblData_Sample
> SampleID INT IDENTITY(1,1)
> FullName VARCHAR(20)
> AgeGroupID INT
> IncomeID INT
> With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
> I also have a table for storing the survey responses. Right now its' setu
p
> like
> tblData_Grid
> SampleID INT
> RestaurantID INT
> QuestionID INT
> Response INT
> There is also a table for storing complete information
> tblData_Session
> SampleID INT
> StartDate DATETIME
> EndDate DATETIME
> With a clustered index on SampleID
> With a clustered index on SampleID, RestaurantID, QuestionID and Indexes o
n
> RestaurantID and QuestionID.
> The problem I'm having is that after scaling out to only 93,000 completed
> surveys, the data table is at 10,000,000 records.
> Needless to say OUCH!!
> Queries of the data table are painful... I've rebuilt all of my indexes,
so
> I know they're up-to-date and I'm kinda of stuck on what to do for speed
> gains at this point.
> It's taking way to long to process (10 seconds on the development server,
I
> haven't bothered trying the live cluster yet)
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> WHERE g.SampleID IN
> (
> SELECT s.SampleID
> FROM tblData_Session s
> WHERE s.EndDate IS NOT NULL
> )
>
> So what are things I can to do speed things up here? I've thought about
> taking all questions that are asked of each respondent and moving them to
a
> horizontal table.
> I've also considered using partitioned views. Perhaps store data in
> mini-grid tables based on the sample quarter or restaurantid and then usin
g
> a partitioned view to simulate the grid table...
> Before I start making any sweeping changes, I was hoping that someone here
> could steer me in the right direction.
> Thanks,
> Ben
>
>|||Dhericean,
Thanks for the fast response. Yes, I've tried the join and believe it or
not its' taking slightly longer... Looking at the execution plan there is
an extra aggregation involved with the join and that's amounting to more
reads.
All of the supporting tables are present and exist in a similar fashion to
the ones below.
This query is not used to report data to the end user rather it is used to
preselect records ID's that match the users query. Then they go through the
system picking 'measures' and 'analysis groups' that calculate aggregated
stats for the records in the users initial query. I've found that staging
these ID's is better than running "select these records and process them",
the approach as it stands now says, "populate selection table, select
additional parameters, merge query onto selection table".
The end result populates a data table and a chart. Typically there are less
than ten data points per analysis. I hope this helps clear things up.
Maybe a better way to describe it is that the initial query defines the
subset of data to be viewed, and that query is cached to reduce its' impact
on subsequent requests where the data returned by the query is crunched,
efficivly seperating selection processing away from calculation processing.
Thanks,
Ben
"Dhericean" <Dhericean@.discussions.microsoft.com> wrote in message
news:B5C02FF4-E48B-4410-9C96-F9A1233F6138@.microsoft.com...
> Hi Ben
> Have you considered trying a simple join rather than nested queries? e.g.
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> INNER JOIN tblData_Session s
> ON (g.SampleID = s.SampleID)
> WHERE (s.EndDate IS NOT NULL)
> Also there is a question as to what you are actually using the data for
> and
> how you intend to analyse it? The Design answers are very different
> depending upon the uses. The query above is, I would guess, going to
> return
> a massive amount of data. As I believe Mr William Vaughn (sir) is wont to
> ask (dotNetRocks #100) - How small a font do you use to display that to
> your
> users?
> If what you want is to display a count of samples for each restaurant or
> some such then test with that style of aggregate query. Also you
> presumably
> have a separate table to map Restaurant IDs to their details and
> supporting
> tables not mentioned.
> Yours
> Alasdair Russell
> "Ben" wrote:
>|||How about;
select distinct g.sampleid, g.RestaurantID
FROM tblData_Grid g
WHERE exists (SELECT s.SampleID
FROM tblData_Session s
WHERE s.sampleid = g.sampleid and
s.EndDate IS NOT NULL)
"Ben" wrote:

> I have a database that stores survey information. I've been asked to add
> information to it to test scalability. Well, its' hurting.
> What I have is a 'sample' table or the people that took the survey and the
n
> several related tables.
> Lets' say the sample table looks something like
> tblData_Sample
> SampleID INT IDENTITY(1,1)
> FullName VARCHAR(20)
> AgeGroupID INT
> IncomeID INT
> With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
> I also have a table for storing the survey responses. Right now its' setu
p
> like
> tblData_Grid
> SampleID INT
> RestaurantID INT
> QuestionID INT
> Response INT
> There is also a table for storing complete information
> tblData_Session
> SampleID INT
> StartDate DATETIME
> EndDate DATETIME
> With a clustered index on SampleID
> With a clustered index on SampleID, RestaurantID, QuestionID and Indexes o
n
> RestaurantID and QuestionID.
> The problem I'm having is that after scaling out to only 93,000 completed
> surveys, the data table is at 10,000,000 records.
> Needless to say OUCH!!
> Queries of the data table are painful... I've rebuilt all of my indexes,
so
> I know they're up-to-date and I'm kinda of stuck on what to do for speed
> gains at this point.
> It's taking way to long to process (10 seconds on the development server,
I
> haven't bothered trying the live cluster yet)
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> WHERE g.SampleID IN
> (
> SELECT s.SampleID
> FROM tblData_Session s
> WHERE s.EndDate IS NOT NULL
> )
>
> So what are things I can to do speed things up here? I've thought about
> taking all questions that are asked of each respondent and moving them to
a
> horizontal table.
> I've also considered using partitioned views. Perhaps store data in
> mini-grid tables based on the sample quarter or restaurantid and then usin
g
> a partitioned view to simulate the grid table...
> Before I start making any sweeping changes, I was hoping that someone here
> could steer me in the right direction.
> Thanks,
> Ben
>
>

Friday, February 17, 2012

Efficiency of SmallInt vs. SmallDatetime

I have a scheduling db (sql2000) that stores a lot of 'time' data and does
a lot of operations on this data. I'm storing event dates in a smalldatetime
field. I'm storing a 'starttime' in another smalldatetime field (eg. as
'1/1/1900 9:30') and 'endtime' in a third smalldatetime field ( eg. as
'1/1/1900 11:30').
Does anyone have any information/ideas on what performance improvements I
might see if was to instead store the 2 'time' fields as smallints (storing
number of minutes since midnight) rather than as smalldatetimes? I
understand I would halve the storage required, what about performance per
se?
Thanks,
Paul.
Hi
Every row in a Table has an overhead of about 20 bytes. Saving a byte here
or there does not help much. You have to do a lot more processing to convert
your time to smallint and back (in your code).
Have you tested the difference? Functions like dateadd and datediff and not
avilable to you if you don't use DateTime datatypes.
Regards
Mike
"Paul W" wrote:

> I have a scheduling db (sql2000) that stores a lot of 'time' data and does
> a lot of operations on this data. I'm storing event dates in a smalldatetime
> field. I'm storing a 'starttime' in another smalldatetime field (eg. as
> '1/1/1900 9:30') and 'endtime' in a third smalldatetime field ( eg. as
> '1/1/1900 11:30').
> Does anyone have any information/ideas on what performance improvements I
> might see if was to instead store the 2 'time' fields as smallints (storing
> number of minutes since midnight) rather than as smalldatetimes? I
> understand I would halve the storage required, what about performance per
> se?
> Thanks,
> Paul.
>
>

Efficiency of SmallInt vs. SmallDatetime

I have a scheduling db (sql2000) that stores a lot of 'time' data and does
a lot of operations on this data. I'm storing event dates in a smalldatetime
field. I'm storing a 'starttime' in another smalldatetime field (eg. as
'1/1/1900 9:30') and 'endtime' in a third smalldatetime field ( eg. as
'1/1/1900 11:30').
Does anyone have any information/ideas on what performance improvements I
might see if was to instead store the 2 'time' fields as smallints (storing
number of minutes since midnight) rather than as smalldatetimes? I
understand I would halve the storage required, what about performance per
se?
Thanks,
Paul.Hi
Every row in a Table has an overhead of about 20 bytes. Saving a byte here
or there does not help much. You have to do a lot more processing to convert
your time to smallint and back (in your code).
Have you tested the difference? Functions like dateadd and datediff and not
avilable to you if you don't use DateTime datatypes.
Regards
Mike
"Paul W" wrote:
> I have a scheduling db (sql2000) that stores a lot of 'time' data and does
> a lot of operations on this data. I'm storing event dates in a smalldatetime
> field. I'm storing a 'starttime' in another smalldatetime field (eg. as
> '1/1/1900 9:30') and 'endtime' in a third smalldatetime field ( eg. as
> '1/1/1900 11:30').
> Does anyone have any information/ideas on what performance improvements I
> might see if was to instead store the 2 'time' fields as smallints (storing
> number of minutes since midnight) rather than as smalldatetimes? I
> understand I would halve the storage required, what about performance per
> se?
> Thanks,
> Paul.
>
>

Wednesday, February 15, 2012

Editing Varchar field in Enterprise Manager

I am using SQL 2000. In a table (tblTestTable) I have a field VARHCAR(4000).
the field stores text taken from a text area in a web page so contains new line characters also length is around 2000 characters.

The problem is that I am unable to edit the text in this field through Enterprise Manager. The text remains as if it is readonly. All I can do is select entire text and cut or delete. :confused:
Is there reason for this occurance? How can it beovercome?Don't use EM?

It really wasn't made for that purpose, but I'll have a look.