Friday, February 24, 2012

Eliminate duplicate records

department table got duplicate data......

when i insert data from department to employee table, i want to eliminate duplicate and insert unique data.

department

emp_id emp_name emp_address

10 mary melville,ny

10 mary longisland,ny

11 linsy sugarland,tx

12 sam fairfax,va

12 sam dice,va

i want result in employee table....How can i get it with fastest query...i got tons of record to insert.

emp_id emp_name emp_address

10 mary melville,ny

11 linsy sugarland,tx

12 sam fairfax,va

insert into employee(emp_id,emp_name,emp_address)
select emp_id, max(emp_name), max(emp_address)
from department

where emp_id is not null

If you are using SQL Server 2005, give a look to the ROW_NUMBER() functon in books online. One technique is to create a "sequence number" for each grouping using the ROW_NUMBER() function and to select the records that have a generated "sequence number" of 1.

Also, something like this can be done:

Code Snippet

declare @.mockup table
( emp_id integer,
emp_name varchar(10),
emp_address varchar(25)
)
insert into @.mockup
select 10, 'mary', 'melville,ny' union all
select 10, 'mary', 'longisland,ny' union all
select 11, 'linsy', 'sugarland,tx' union all
select 12, 'sam', 'fairfax,va' union all
select 12, 'sam', 'dice,va'
--select * from @.mockup

select emp_id,
emp_name,
( select max(emp_address)
from @.mockup b
where a.emp_id = b.emp_id
) as emp_address
from @.mockup a
group by emp_id, emp_name

/*
emp_id emp_name emp_address
-- - -
10 mary melville,ny
11 linsy sugarland,tx
12 sam fairfax,va
*/

|||

For SS2005:

Code Snippet

create table #dept( emp_id int, emp_name varchar(50), emp_address varchar(50) )

insert into #dept

select 10, 'mary', 'melville,ny'

union all select 10, 'mary', 'longisland,ny'

union all select 11, 'linsy', 'sugarland,tx'

union all select 12, 'sam', 'fairfax,va'

union all select 12, 'sam', 'dice,va'

select emp_id, emp_name, emp_address

from

(

select *, row_number() over (partition by emp_id order by emp_id, emp_name desc, emp_address desc) as rn

from #dept

) emp

where rn = 1

For SS2000:

Code Snippet

select emp_id, max(emp_name) as emp_name, max(emp_address) as emp_address

from #dept

group by emp_id

|||

The trick here is to get a list of the records you want to keep. However, using MAX on character fields isn't the way to go (unless you don't care which record to keep). Hopefully you've got a column which gives some sort of order to the data (eg datecreated) so you can choose the most recent one, otherwise its a guessing game as to which one is "keepable"

So assuming you have one... ;-)

INSERT INTO employee (emp_id, name, emp_address)

SELECT d.emp_id, d.name, d.emp_address
FROM department d

INNER JOIN

(SELECT emp_id, MAX(datecreated) AS dt

FROM department

GROUP BY emp_id) AS dist

ON d.emp_id = dist.emp_id and d.datecreated = dist.dt

HTH!

|||

when i use max function....i get this error....

i get an error like this.....
Server: Msg 8152, Level 16, State 9, Line 1
String or binary data would be truncated.
The statement has been terminated.

|||When you try the insert? Check that your column definitions in the department and employee tables match. You may have a smaller field in the destination table...|||

The problem isn't with max(), it's a mismatch between column definitions in your source and your target.

|||

Getting this error...

Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'datecreated'.
Server: Msg 207, Level 16, State 1, Line 1
Invalid column name 'datecreated'.

|||

I guess you don't have a column matching that in your table ;-)

My example was based upon the presumption that you had a field such as that in your table. You won't be able to just copy and paste it and get results. It was based upon the following DDL

CREATE TABLE departments
(emp_id int, emp_name varchar(100), emp_address varchar(100), datecreated datetime)


Basically, if you do have a field (in my example datecreated) that tells you which record is the one you want to keep then you can use a variation on my example. If you don't have one and you're happy for the choice to be made based arbitarilly then go with Dales option. His can be used straight out of the box.

Sorry for any confusion.

|||

Is there any internal field of table record i can use?.

like in oracle, do we have system variable/ records created along with insert of data in a row in sql server?.

so can i use that date created field?.

|||

I'm afraid not. If you don't have an explicit date field you can't use my example.


If the ddl of your table is the same as in Dales example, you should go with one of his excellent suggestions.

|||

i used dale solution...max thing.

select emp_id, max(emp_name) as emp_name, max(emp_address) as emp_address

from #dept

group by emp_id

works fine today.

No comments:

Post a Comment