LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   sed or awk to remove $8 in a | delimited file if non-numerical value exists (https://www.linuxquestions.org/questions/programming-9/sed-or-awk-to-remove-%248-in-a-%7C-delimited-file-if-non-numerical-value-exists-4175466935/)

BludGeonT 06-21-2013 06:34 PM

sed or awk to remove $8 in a | delimited file if non-numerical value exists
 
Hello,

I have a question for you all, I have a text file that is delimited with | between fields. The file has roughly half a million lines in it. I am looking for the fastest way to remove the lines in this file or generate a new file if the pattern in field $8 is non-numerical.

I could write a for loop with an if statement to do this with cut, but it would take forever to run, so I'm looking for a good awk or sed one liner that could do this instead.

There are many ways to do this, either managing the file directly with a sed statement, or by awk'ing it and outputting everything to another file - but I'm asking the community on what they think the most efficient way to do this, please.


Your help is much appreciated.

dayid 06-21-2013 07:06 PM

So you want to keep the entire line if $8 is numerical or you want to remove those?

Do you need it to handle large numbers or just one or two digits?

This will be far easier if you paste an example of how the data is now and how you wish it to end up: e.g.,

"I have this:"
Code:

a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
a | b | c | d | e | f | g | h | i | j
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0

"I want this:"
Code:

a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j
a | b | c | d | e | f | g | h | i | j


BludGeonT 06-21-2013 07:13 PM

Dayid,

Actually we just figured it out - basically it was to exclude the line entirely if it contained anything other than strictly numbers (including alpha-numeric values), here is what we came up with which did the trick:

Code:

# cat oldfile.txt | awk -F \| ' $8 ~ /^[0-9]*$/ ' > newfile.txt
~400000 lines of text completed in less than 8 seconds.

Wanted to share this with everyone in the event someone else needed this to work.

I appreciate your help Dayid, and in the future I'll pose my questions with examples such as how you mentioned.

Thanks again :)

dayid 06-21-2013 08:24 PM

Quote:

Originally Posted by BludGeonT (Post 4976357)
Dayid,

Actually we just figured it out - basically it was to exclude the line entirely if it contained anything other than strictly numbers (including alpha-numeric values), here is what we came up with which did the trick:

Code:

# cat oldfile.txt | awk -F \| ' $8 ~ /^[0-9]*$/ ' > newfile.txt

Great solution, but just a FWIW, piping "cat" is a common error.

If you have to do it again skip 'cat' as a whole:
Code:

awk -F '|' ' $8 ~ /^[0-9]*$/ ' oldfile.txt > newfile.txt
Glad you got it to work.

grail 06-21-2013 10:22 PM

It may not be important, but remember that your current solution also allows for the eighth field to be empty, ie ||

BludGeonT 06-21-2013 11:57 PM

Thanks again everyone for your insight and recommendations, the Linux community never ceases to amaze me with the combined knowledge, have a good one folks.


All times are GMT -5. The time now is 02:32 AM.