Sunday, February 26, 2012

Eliminating Duplicate Records

Hi

How can i eliminate duplicate records from a single table, when i use a query that links multipla tables together?

There is a surrogate key TABLE1 and i just can't seem to wrap my head around the layout.

TABLE1

ID | memQuestionaireID | QuestscoreID | DateOfAssessment | etc.

TABLE2
memQuestionaireID | MemberID | Answer1 | Answer2 | etc.

TABLE3

MemberID | Name | Surname | IDNo | etc.

t1.memQuestionaireID = t2.memQuestionaireID

t2.MemberID = t3.MemberID

That's how the tables link, obvious enough. How can i delete duplicate records from TABLE1 where MemberID occurs more than once and the t1.DateOfAssessment = MAX(DateOfAssessment)

The ID is unique, but there are duplicate memberID's with different date of assessments, i want to take the last used date of assessment and keep only that one, and remove the rest from the table?

I'm sure the query is simple, i'm just too stupid to see how to solve it!!

Any help would be greatly appreciated.

I'm not very clued up on how to remove records or update records, Please help.

Kind Regards
Carel Greaves

See if this does what you need...

Code Snippet

DELETE t1

FROM Table1 t1

INNER JOIN Table2

on t1.memQuestionaireID = t2.memQuestionaireID

INNER JOIN

(

SELECT t2b.MemberID, max(t1b.DateOfAssessment) as DateOfAssessment

FROM table1 t1b

INNER JOIN table2 t2b

ON t1b.memQuestionaireID = t2b.memQuestionaireID

AND t2b.MemberID in

(

SELECT MemberID

FROM Table3

GROUP BY MemberID

HAVING count(*)>1

)

GROUP BY t2b.MemberID

) as list

ON t1.DateOfAssessment < list.DateOfAssessment

and t2.MemberID = list.MemberID

To view what will be delete w/o actually deleting, just change the first line from DELETE t1 to SELECT t1.*

|||

I'm a bit confused Carel,

Call me dense today, but it seems like the MemberID 'should' be in Table1 since it is the Member that has a memQuestionaireID, has a QuestScoreID and a DateOfAssessment. Then for Table2, it is the Questionaire that has a set of answers, etc. Please help me out here...

It seems like your Table1 has a one-to-many relationship with Table2. I'm assuming that a Member may have multiple instances of the same questionnaire (re-takes, etc.)

Is one questionnaire (MemQuestionnaireID) with a QuestScoreID and a DateOfAssessment directly linked to one or more set(s) of answers in Table2?

Is the QuestScoreID changed upon each re-take? (Shouldn't the score be directly related to a set of answers?)

I ask this because it seems from the design you presented that perhaps you want to also delete rows from Table2...

|||

I know the database structure is a bit stupid. Not my design!

I am only trying to remove duplicate entries in the QuestScore table.

Problem:

A member fills in a Questionaire - MemQuestionaires

The Querstionaire's Score is placed in the QuestScore Table.

There are duplicate entries in the QuestScore table with regards to a questionaireID from the memQuestionaires Table

If a duplicate exists, then the members can't log into there profiles anymore, so i need to remove the duplicates so that the members can access their profiles again.

I'm quite new to all of this, so any help would be greatly appreciated

Kind Regards

Carel Greaves.

|||

You gave us Table1, Table2, Table3...I'm guessing Table1 is QuestScore?

Did my code above work? If not, what was not quite right?

Are there multiple entries just in Table1, or is it Table3?

Please clarify and maybe post some sample data...

|||

Okay, so i'm a bito f a moron :-) he he

Here is the code i used to find the duplicates

SELECT me.MemberID, /*me.QuestionaireID,*/ q.memQuestionaireID, q.*, me.*

FROM questscore q, memQuestionaires m, Members me

WHERE q.memquestionaireID = m.memquestionaireID

AND m.MemberiD = me.memberid

AND me.ClientID = 124

ANd m.questionaireid = 85

AND q.sectionid >= 1000000

ORDER BY me.memberID

Here is my results:

MemberID memQuestionaireID QuestScoreID AnswerID AnswerValue DateCompleted 10005772 7615 273308 0 -500 50:59.7 10005789 11124 267771 0 -500 25:16.0 10005798 11129 259150 0 -500 44:41.1 10005798 11129 261032 0 -500 01:21.3 10005807 8214 237464 0 -500 59:18.1 10005819 11142 293317 0 -500 43:09.4

i have marked the duplicate values in red that i want to eliminate.

I need to eliminate only the record with the oldest date, and keep the newer one.

Hopefully when i've gone for my SQL course then i will only be half a moron :-) he he

Thanks for helping me out guys.

Kind Regards

Carel Greaves

|||

Wow, talk about a 'bolixed' up database...

I'm still confused, in your query above, you have a WHERE clause that indicates the results should be filtered down to QuestionaireID = 85, but your duplicate records are for memQuestionaireID 11129. What is the difference between QuestionaireID and memQuestionaireID? How did that happen?

Also, the DateCompleted is obviously not a date, but a number of hours. Is that the number of hours since some event? How are the hours determined?

Which row is the 'newer one'?

It would be very helpful, and keep us from wasting our time, if you would post the DDL for each of the tables, as well as some sample data for each table (in the form of INSERT statements.)

As you seen, you have folks sincerely attempting to help you, and to this point wasting our time because we just don't have the 'full' picture. You've inherited a very 'odd' database, and you need to help us so we can better help you.

|||

Here is sample data in the Tables.

Member Table

med_aid MemberID ClientID Name Surname Username Password NULL 10000000 1 NiftyName NiftySurname his test NULL 10000001 5 NiftyTest2 Surname2 nifty@.ybo.oa JBCYWO NULL 10000002 5 KeithTest WilsonTest willi willi

memQuestionaires Table

memQuestionaireID MemberID QuestionaireID LastPage Paused Complete MaxPages Sent LastSaved InitMail ReceiveMail UpdateDate 871 10000000 85 1 0 0 1 1 0 1 1 2007/02/2506:23 872 10000001 85 1 0 0 1 1 0 1 1 2007/02/2506:23 873 10000002 85 1 0 0 1 1 0 1 1 2007/02/2506:23

QuestScore Table

QuestScoreID memQuestionaireID AnswerID AnswerValue SectionID Page QuestType AnswerText AnswerShort Updatedate DefaultValue DateCompleted 4641 871 0 -9 361 1 9 NULL NULL

2007/02/0722:31

4642 872 0 -5 362 1 6 NULL NULL

2007/02/0722:31

4643 873 0 0 345 1 2 NULL NULL

2007/02/0722:31

When i executed this statement, i only used these three tables.

SELECT me.MemberID, /*me.QuestionaireID,*/ q.memQuestionaireID, q.*, me.*
FROM questscore q, memQuestionaires m, Members me
WHERE q.memquestionaireID = m.memquestionaireID
AND m.MemberiD = me.memberid
AND me.ClientID = 124
ANd m.questionaireid = 85
AND q.sectionid >= 1000000
ORDER BY me.memberID

I am only trying to remove the duplicate records for where ClientID = 124

AND QuestionaireID = 85

as for the 1000000, there it is just to narrow down the search for myself, there are a lot of duplicates.

There are a lot of members -> Members Table

Each member can fill in a number of questionaires -> QuestionaireID on memQuestionaires Table in this case 85

And all the questionaire's data is stored in the QuestScore Table which is linked to the memQuestionaires table using the memQuestionairesID.

There seems to be duplicate values in the QuestScore Table referring to the members.

The date completed is an actual DATETIME field in the database, i used excel to transfer the data to here. The newer date will be a actual date with the time in the database i.e. 2007/05/23 02:05:11

All the fields get inserted into the database via a application which i haven't seen.

I'm new to the company and am trying to figure a lot of it out for myself too.

If there is any more information you need, please don't hesitate to ask.

|||try this one,

DECLARE @.QuestScore TABLE (
QuestScoreID int
, memQuestionaireID int
, DateCompleted smalldatetime
)

DECLARE @.MemQuestionaire TABLE (
MemQuestionaireID int
, MemberID int
, QuestionaireID int
, UpdateDate smalldatetime
)

INSERT
INTO @.QuestScore
SELECT 1, 1, dateadd(day,-1, getdate()) UNION ALL
SELECT 2, 2, dateadd(day,-1, getdate()) UNION ALL
SELECT 3, 3, dateadd(day,-3, getdate())

INSERT
INTO @.MemQuestionaire
SELECT 1, 1, 4, dateadd(day,-1, getdate()) UNION ALL
SELECT 2, 1, 4, dateadd(day,-1, getdate()) UNION ALL
SELECT 3, 2, 5, dateadd(day,-3, getdate())

select *
from @.memquestionaire

select *
from @.questscore

declare @.tobedeleted table( -- create a table var for the data to be deleted
memquestionaireid int
, datecompleted smalldatetime
)

insert
into @.tobedeleted -- the memquestionaireid and datecompleted to be deleted
select MIN(qq.memquestionaireid) as memquestionaireid -- just in case if they have the same date, get the lowest id
, mq.datecompleted
from @.questscore qq inner join
(
select m.memberid
, m.questionaireid
, MIN(q.datecompleted) as datecompleted -- delete the lowest date
from @.memquestionaire m inner join
@.questscore q on m.memquestionaireid = q.memquestionaireid
group by
m.memberid
, m.questionaireid
having count(m.memberid) > 1 -- member having more than 1 quest score
) mq on qq.datecompleted = mq.datecompleted
group by
mq.datecompleted

delete @.questscore -- delete in questscore
from @.questscore q inner join
@.tobedeleted dq on q.memquestionaireid = dq.memquestionaireid
and q.datecompleted = dq.datecompleted

delete @.memquestionaire -- delete in memquestionaire
from @.memquestionaire m inner join
@.tobedeleted dq on m.memquestionaireid = dq.memquestionaireid

select *
from @.questscore

select *
from @.memquestionaire|||

Thanks, that's awesome, that's why i come to you guys for help.

Another quick question, that i think it might be easier to do, i just can't do it. (I just thought of it now)

If i added memQuestionaireID and SectionID together to create a unique field, and then filter out the duplicates by taking out the old dates within the duplicates it might be easier, but how would i go about accomplishing this?

I don't know how to compare the dates against itself to get keep the new date and get rid of the old date?

sorry for all of this i'm just trying to learn from all the things myself?

|||

Thanks, that was a start.

Here is an example of the 'best' form for DDL and sample data. In this fashion, each person that wants to try and help you can just run this code and have 'your' problem to work with. Without this, folks are turned away because it takes so much time and effort to duplicate your situation -and helps to keep us from running off on tangents that don't really help you.

Now if you would only add to the sample data so that examples of the duplicate records you want to find and delete are represented, we 'should' be able to help you.

Code Snippet


DECLARE @.Members table
( med_aid int,
MemberID int,
ClientID int,
Name varchar(20),
Surname varchar(20),
Username varchar(100),
Password varchar(20)
)


INSERT INTO @.Members VALUES ( NULL, 10000000, 1, 'NiftyName', 'NiftySurname', 'his', 'test' )
INSERT INTO @.Members VALUES ( NULL, 10000001, 5, 'NiftyTest2', 'Surname2', 'nifty@.ybo.oa', 'JBCYWO' )
INSERT INTO @.Members VALUES ( NULL, 10000002, 5, 'KeithTest', 'WilsonTest', 'willi', 'willi' )


DECLARE @.memQuestionaires table
( memQuestionaireID int,
MemberID int,
QuestionaireID int,
LastPage int,
Paused int,
Complete int,
MaxPages int,
Sent int,
LastSaved int,
InitMail int,
ReceiveMail int,
UpdateDate datetime
)


INSERT INTO @.memQuestionaires VALUES ( 871, 10000000, 85, 1, 0, 0, 1, 1, 0, 1, 1, '2007/02/25 06:23' )
INSERT INTO @.memQuestionaires VALUES ( 872, 10000001, 85, 1, 0, 0, 1, 1, 0, 1, 1, '2007/02/25 06:23' )
INSERT INTO @.memQuestionaires VALUES ( 873, 10000002, 85, 1, 0, 0, 1, 1, 0, 1, 1, '2007/02/25 06:23' )


DECLARE @.QuestScore table
( QuestScoreID int,
memQuestionaireID int,
AnswerID int,
AnswerValue int,
SectionID int,
Page int,
QuestType int,
AnswerText int,
AnswerShort int,
Updatedate int,
DefaultValue int,
DateCompleted datetime
)


INSERT INTO @.QuestScore VALUES ( 4641, 871, 0, -9, 361, 1, 9, '', '', NULL, NULL, '2007/02/07 22:31' )
INSERT INTO @.QuestScore VALUES ( 4642, 872, 0, -5, 362, 1, 6, '', '', NULL, NULL, '2007/02/07 22:31' )
INSERT INTO @.QuestScore VALUES ( 4643, 873, 0, 0, 345, 1, 2, '', '', NULL, NULL, '2007/02/07 22:31' )

|||very clean example arnie.. i wonder how your desk looks like |||

Sorry i'm sending you guys on a wild goose chase like this, and thanks a lot Arnie, i'd buy you a case of beer if i knew where you were.

I'm really learning a lot from this!

I just thought that it might be easier, if i had to add the memQuestionaireID and SectionID together in the QuestScore table to get a single unique value and remove the duplicates based on the newer unique value according to the Date, i.e. Only removing the old dates.

Sorry about this, i only saw it now.

That way i'm only using a single table.

|||

Thanks,

And my desk is under the stuff somewhere, or so it was a few months ago...

And UPS serves Portland, OR! But it would be much more enjoyable to share, so if you can wrangle it, start lobbying management to let you come to the PASS (Professional Association for SQL Server) Conference in September. (http://sqlpass.org/)

|||

Cummon guys, i'm a newbie at SQL Server in South Africa and it's late at night and my CEO's say i can't go home until i get this problem sorted out.

Please could you guys help me out quick?

No comments:

Post a Comment