Sunday, February 26, 2012

eliminating redundant data

edit: this came out longer than I thought; any comments about anything
here are greatly appreciated. Thank you for reading.

My system stores millions of records, each with fields like firstname,
lastname, email address, city, state, zip, along with any number of
user-defined fields. The application allows users to define message
templates with variables. They can then select a template, and for each
variable in the template, type in a value or select a field.

The system allows you to query for messages you've sent by specifying
criteria for the variables (not the fields).

This requirement has made it difficult to normalize my data model at
all for speed. What I have is this:

[fieldindex]
id int PK
name nvarchar
type datatype

[recordindex]
id int PK
...

[recordvalues]
recordid int PK
fieldid int PK
value nvarchar

Whenever messages are sent, I store which fields were mapped to which
variables for that deployment. So a query with variable criteria looks
like this:

select coalesce(vm.value, rv.value)
from sentmessages sm
inner join variablemapping vm on vm.deploymentid=sm.deploymentid
left outer join recordvalues rv on
rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
where coalesce(vm.value, rv.value) ...
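
For reference, simplified sketches of the other two tables in that query
(just the columns the query touches; types are illustrative):

create table sentmessages (
    deploymentid int,
    recordid     int
    -- ... other message columns omitted
)

create table variablemapping (
    deploymentid int,
    fieldid      int,            -- null when the user typed in a literal value
    value        nvarchar(4000)  -- null when the user selected a field
    -- ... other mapping columns omitted
)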

This model works pretty well for searching messages with variable
criteria and for looking up variable values for a particular message.
The big problem I have is that the recordvalues table is HUGE: 1 million
records with 50 fields each = 50 million recordvalues rows. The nvarchar
value, the two int columns, plus the two indexes I have on the table
make it a beast. Importing data takes forever. Querying the records
(with field criteria) also takes longer than it should.

Makes sense: the performance was largely IO-bound.

I decided to try to cut into that IO. Looking at a recordvalues table
with over 100 million rows in it, I found only about 3 million unique
values, so I split the recordvalues table into two tables:

[recordvalues]
recordid int PK
fieldid int PK
valueid int

[valueindex]
id int PK
value nvarchar (unique)

Now valueindex holds 3 million unique values and recordvalues
references them by id. To my surprise, this shaved only 500MB off a 4GB
database!
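
Roughly, the split can be populated from the old single table like this
(assuming valueindex.id is an identity column; recordvalues_old is just
a placeholder name for the original table):

-- build the list of distinct values once
insert into valueindex (value)
select distinct value
from recordvalues_old

-- then rewrite the fact rows to reference those values by id
insert into recordvalues (recordid, fieldid, valueid)
select o.recordid, o.fieldid, v.id
from recordvalues_old o
inner join valueindex v on v.value = o.value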

Importing didn't get any faster either; although it's no longer
IO-bound, it appears the new CPU bottleneck outweighs the old IO
bottleneck. This is probably because I haven't optimized the queries for
the new tables (I was hoping it wouldn't be so hard without the IO
problem).
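
For example, if the import ends up resolving each value one row at a
time, a set-based version might look something like this (staging is a
hypothetical table holding the incoming rows already unpivoted to
recordid, fieldid, value):

-- add any values this batch contains that valueindex doesn't have yet
insert into valueindex (value)
select distinct s.value
from staging s
where not exists (select 1 from valueindex v where v.value = s.value)

-- then insert the batch, resolving valueid with a join
insert into recordvalues (recordid, fieldid, valueid)
select s.recordid, s.fieldid, v.id
from staging s
inner join valueindex v on v.value = s.value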

Is there a better way to accomplish what I'm trying to do (eliminate
the redundant data)? Does SQL have built-in constructs for stuff like
this? It seems like maybe I'm trying to duplicate functionality at a
high level that may already exist at a lower level.

IO is becoming a serious bottleneck. The million-record, 50-field CSV
file is only 500MB. I would've thought that after eliminating all the
redundant first names, cities, last names, etc. it would be less data,
not 8x more!

-
Gordon

No database vendor, version, platform... Why are you using tables as indexes
when you can use indexes as indexes? Normalizing a data model generally
slows it down. Denormalizing generally speeds things up.

"gordy" <gordy@.dynamicsdirect.com> wrote in message
news:40be960f$1_1@.Usenet.com...
> edit: this came out longer than I thought, any comments about anything
> here is greatly appreciated. thank you for reading
> My system stores millions of records, each with fields like firstname,
> lastname, email address, city, state, zip, along with any number of user
> defined fields. The application allows users to define message templates
> with variables. They can then select a template, and for each variable
> in the template, type in a value or select a field.
> The system allows you to query for messages you've sent by specifying
> criteria for the variables (not the fields).
> This requirement has made it difficult to normalize my datamodel at all
> for speed. What I have is this:
> [fieldindex]
> id int PK
> name nvarchar
> type datatype
> [recordindex]
> id int PK
> ...
> [recordvalues]
> recordid int PK
> fieldid int PK
> value nvarchar
> whenever messages are sent, I store which fields were mapped to what
> variables for that deployment. So the query with a variable criteria
> looks like this:
> select coalesce(vm.value, rv.value)
> from sentmessages sm
> inner join variablemapping vm on vm.deploymentid=sm.deploymentid
> left outer join recordvalues rv on
> rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
> where coalesce(vm.value, rv.value) ...
> this model works pretty well for searching messages with variable
> criteria and looking up variable values for a particular message. the
> big problem I have is that the recordvalues table is HUGE, 1 million
> records with 50 fields each = 50 million recordvalues rows. The value,
> two int columns plus the two indexes I have on the table make it into a
> beast. Importing data takes forever. Querying the records (with a field
> criteria) also takes longer than it should.
> makes sense, the performance was largely IO bound.
> I decided to try and cut into that IO. looking at a recordvalues table
> with over 100 million rows in it, there were only about 3 million unique
> values. so I split the recordvalues table into two tables:
> [recordvalues]
> recordid int PK
> fieldid int PK
> valueid int
> [valueindex]
> id int PK
> value nvarchar (unique)
> now, valueindex holds 3 million unique values and recordvalues
> references them by id. to my suprise this shaved only 500mb off a 4gb
> database!
> importing didn't get any faster either, although it's no longer IO bound
> it appears the cpu as the new bottleneck outweighed the IO bottleneck.
> this is probably because I haven't optimized the queries for the new
> tables (was hoping it wouldn't be so hard w/o the IO problem).
> is there a better way to accomplish what I'm trying to do? (eliminate
> the redundant data).. does SQL have built-in constructs to do stuff like
> this? It seems like maybe I'm trying to duplicate functionality at a
> high level that may already exist at a lower level.
> IO is becoming a serious bottleneck.
> the million record 50 field csv file is only 500mb. I would've thought
> that after eliminating all the redundant first name, city, last name,
> etc it would be less data and not 8x more!
> -
> Gordon
> Posted Via Usenet.com Premium Usenet Newsgroup Services
> ------------------
> ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
> ------------------
> No database vendor, version, platform... Why are you using tables as indexes
> when you can use indexes as indexes? Normalizing a data model generally
> slows it down. Denormalizing generally speeds things up.

Sorry, I'm using MS SQL 2000.

How can I use indexes as indexes? I mean, in the example I posted, can
you give an example?

Well, I don't know SQL Server, but in Oracle, you create an index using the
CREATE INDEX statement. I suspect it works the same or similar in SQL
Server.

Here's an Oracle example that creates an index called
"asearch_client_id_idx" on the client_id field in a table called
"alphasearch" owned by user "alphasearch":

CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);

Now, I am about to lie a little bit for simplicity's sake, but here goes...

When a query is executed that uses client_id as a search or sort
criterion, the Oracle optimizer will decide whether or not to use the
index. If it does, it looks up the values needed in the index and
retrieves their row ids, which in turn are essentially pointers to the
location of the data in data blocks on disc, so it goes directly to that
location on disc and retrieves the data out of the blocks. It does not
need to go logically into the table. Note that what I refer to as row_id
in Oracle may not be the same concept in SQL Server.

Hope you get the general idea, and you should consult your documentation
about indexes.
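
If the syntax carries over the way I suspect it does, something along
these lines might apply to the recordvalues table from your original
post (the index name here is made up):

CREATE INDEX RECORDVALUES_VALUE_IDX
ON RECORDVALUES(VALUE);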

"gordy" <gordy@.dynamicsdirect.com> wrote in message
news:40bf5ec1$1_1@.Usenet.com...
> > No database vendor, version, platform... Why are you using tables as
indexes
> > when you can use indexes as indexes? Normalizing a data model generally
> > slows it down. Denormalizing generally speeds things up.
> sorry, I'm using MS SQL2000
> How can I use indexes as indexes? I mean, in the example I posted, can
> you give an example?
> Posted Via Usenet.com Premium Usenet Newsgroup Services
> ------------------
> ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
> ------------------
> Well, I don't know SQL Server, but in Oracle, you create an index using the
> CREATE INDEX statement. I suspect it works the same or similar in SQL
> Server.
> Here's an Oracle example that creates an index called
> "asearch_client_id_idx" on the client_id field in a table called
> "alphasearch" owned by user "alphasearch":
> CREATE INDEX ALPHASEARCH.ASEARCH_CLIENT_ID_IDX
> ON ALPHASEARCH.ALPHASEARCH(CLIENT_ID);

wow, what a concept ;)

I appreciate the criticism.. after all, that's the intent of my original
post. However, I would prefer it to be of the constructive variety.

In my system, there are 'fields' and there are 'variables'. The user
creates the relationships between them whenever they send a message. In
order to search for messages by 'variable' values, SQL needs a
relationship of its own to translate between them.

This has kept me from being able to use the obvious:
[records]
id,field1,field2,field3,...

because in a query for 'variable1', depending on the message it may have
to look at 'field3' or 'field4' for the value. This requirement is why I
have the tables I have now (recordindex, fieldindex and recordvalues).

I realize this makes for very large indexes.. and like you said, the
table itself is nothing more than a big index. This is the problem I'd
like to solve. In my original post I explained how I attempted to
eliminate the redundant data, but I only eliminated 500MB (of 4GB)
because the majority of volume in this db isn't the data itself, but the
index size.
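
In case the numbers help: sp_spaceused is what shows the data vs. index
breakdown per table, e.g.

-- reports rows, reserved, data, index_size and unused for one table
exec sp_spaceused 'recordvalues'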

http://www.usenet.com|||"gordy" <gordy@.dynamicsdirect.com> wrote in message
news:40bf9c49$1_1@.Usenet.com...
> wow, what a concept ;)
> I appreciate the criticism.. after all that's the intent of my original
> post, however, I would prefer it to be of the constructive variety.

It wasn't criticism. I thought you really didn't know about RDBMS indexes,
being that you essentially created your own.

The only similar situation I've seen like this (home-brew index constructs)
is with a document imaging system called FileNET. In that case, the vendor
actually created its own mini RDBMS to handle just these index/table
constructs. It was very fast, but, of course, proprietary.

It's hard to tell exactly what you are trying to do, though. Could you
get into the business requirements a bit? It would help me understand
what you need to do. I get the feeling from the solution you came up
with that you are a programmer, not a DBA.
