LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Newbie - How To ... (https://www.linuxquestions.org/questions/linux-newbie-8/newbie-how-to-4175651846/)

sjpoole12 04-10-2019 04:36 AM

Newbie - How To ...
 
I know that this is truly a newbie question, but here goes...

We have created several files that load large amounts of data into Oracle DB 12c tables. For one table, we are getting many error messages about duplicate rows during the load.

I need a method that will allow me to identify these duplicate rows and delete them in case there are other errors.

Is there a way to do this using a bash script?

Sandy

berndbausch 04-10-2019 05:00 AM

I don’t know how you perform the insertion, but if you can list all rows, a pipeline consisting of the sort and uniq commands is the easiest solution. For example, to find the duplicates:
Code:

sort FILE-WITH-ROWS | uniq -d >FILE-WITH-DUPS
To delete the duplicates, you don’t even need uniq:
Code:

sort -u FILE-WITH-ROWS > FILE-WITHOUT-DUPS
As a side-effect, FILE-WITHOUT-DUPS contains the rows sorted alphanumerically. If that is not acceptable, heavier weapons are required.
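To illustrate how those two pipelines behave, here is a quick run on a throwaway sample file (the file names and data below are placeholders, not from the actual load):

```shell
# Build a sample data file containing one duplicated row.
printf '%s\n' 'alice,100' 'bob,200' 'alice,100' > rows.txt

# List each duplicated row once (uniq -d only prints lines that repeat).
sort rows.txt | uniq -d > dups.txt
cat dups.txt

# Keep exactly one copy of every row (sort -u deduplicates while sorting).
sort -u rows.txt > deduped.txt
wc -l < deduped.txt
```

Note that `uniq` only detects adjacent duplicate lines, which is why the `sort` in front of it is required.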

scasey 04-10-2019 10:57 AM

If you're getting errors about duplicate rows, the duplicates won't be inserted into the database. But "duplicate" to the database means duplicate keys, so the data may contain new information that should be updated but isn't.

rnturn 04-10-2019 03:36 PM

Quote:

Originally Posted by sjpoole12 (Post 5983135)
I know that this is truly a newbie question, but here goes...

We have created several files that load large amounts of data into Oracle DB 12c tables. For one table, we are getting many error messages about duplicate rows during the load.

berndbausch's answer will avoid the problem if your errors are about duplicate rows---i.e., where the entire row has been duplicated. Using those tips, you merely clean the data before you try to insert.

Things get more complicated if you're inserting multiple values and only one value turns out to be duplicated in some rows. Example: you have two records comprised of a person's social security number and their name.
Code:

nnn-nn-nnnn James Smith
nnn-nn-nnnn Jim Smith

The SSN value/column would almost certainly be defined in the database as needing to be unique, and attempting to add the second record would result in an error. The previously mentioned checks at the shell level wouldn't catch these potential errors. Also, if you merely chucked everything in each file record into a single column, you wouldn't catch these potential errors either. I seem to recall that Oracle would allow you to turn off constraint checking during a load which, in the above example, would allow the loading of those rows even though you were going to wind up with duplicate values in a column that shouldn't have them. Database administrators tend to get ulcers when someone proposes this.

sjpoole12 04-11-2019 09:29 AM

Newbie - How To ...
 
Outstanding!
I took the results of the 2 sort suggestions and they provided exactly what I needed.
I'm a contractor, and even though the database would have loaded correctly, having those errors would not have looked good.
I will retain this information and suggestions on my cheat sheet for future problems.
Thank You!
Sandy


All times are GMT -5.