LinuxQuestions.org
Review your favorite Linux distribution.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 10-14-2021, 10:43 AM   #1
czezz
Member
 
Registered: Nov 2004
Distribution: Slackware/Solaris
Posts: 899

Rep: Reputation: 41
[BASH][AWK] filtering by column; value greater than 0


Im trying to filter out every single record by column "Connection_time" that is greater than: 000000.
Here is my sample.out

Code:
cat sample.out
|Date|Time|Type|MSISDN|Connection_time|Cost|
|26.06.2021|2040|Dane||8.01MB|0|
|26.06.2021|1701|Dane||9.67MB|0|
|26.06.2021|1639|Dane||0.2MB|0|
|26.06.2021|1339|Voice|xxxx|000135|0|
|26.06.2021|1239|Other||000000|0|
|26.06.2021|1120|||000000|1|
|26.06.2021|1120|Voice|xxxx|000138|0|
|26.06.2021|0622|Other||000000|0|
|26.06.2021|0000|Other||000000|0|
|25.06.2021|2203|Voice|xxxx|000016|0|
|25.06.2021|2103|Voice|xxxx|000056|0|
|25.06.2021|2038|Voice|xxxx|000102|0|
|25.06.2021|2016|Voice|xxxx|000038|0|
|25.06.2021|1909|SMS|xxxx|000000|0|
|25.06.2021|1849|SMS|xxxx|000000|0|
|25.06.2021|1843|SMS|xxxx|000000|0|
To filter out all records equal 000000 works OK.

Code:
$ cat sample.out | awk -F"|" '$6==000000'
|26.06.2021|1239|Other||000000|0|
|26.06.2021|1120|||000000|1|
|26.06.2021|0622|Other||000000|0|
|26.06.2021|0000|Other||000000|0|
|25.06.2021|1909|SMS|xxxx|000000|0|
|25.06.2021|1849|SMS|xxxx|000000|0|
|25.06.2021|1843|SMS|xxxx|000000|0|
However, when I try to filter out records greater than 000000 I get the whole output of sample.out / the criteria is ignored.
It does not work probably because AWK does not know/recognize it is "integer".
Code:
$ cat sample.out | awk -F"|" '$6>=000000'
Question:
Is there any way to filter by column Connection_time greater than 000000?
(I know - it would be easier to import output to some DB)

Last edited by czezz; 10-14-2021 at 10:46 AM.
 
Old 10-14-2021, 10:52 AM   #2
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 17,201

Rep: Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822
if you want to compare strings use "
Code:
awk -F"|" '$6=="000000"' file.out
if you want to compare numbers you need to convert strings like 8.01MB to number, which is probably not that simple, but I guess you can work with: if $6 is not equal to 000000
Code:
awk -F"|" '$6!="000000"' file.out
 
2 members found this post helpful.
Old 10-14-2021, 10:56 AM   #3
czezz
Member
 
Registered: Nov 2004
Distribution: Slackware/Solaris
Posts: 899

Original Poster
Rep: Reputation: 41
Of course!
Thanks
 
Old 10-14-2021, 11:01 AM   #4
michaelk
Moderator
 
Registered: Aug 2002
Posts: 22,081

Rep: Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430Reputation: 4430
Field 6 is cost but then why is the one line with cost=1 in the output or what am I missing.
 
Old 10-14-2021, 11:43 AM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025Reputation: 2025
Quote:
Originally Posted by michaelk View Post
Field 6 is cost but then why is the one line with cost=1 in the output or what am I missing.
When the field separator is not the default whitespace, a delimiting character appearing at the start of a line will make the "empty" field before it count as $1. So in this case, the OP has it right, and $6 is connection time.

A similar effect occurs at the end of lines.


FYI, while it may not be applicable to this specific situation, the dateutils package includes a utility called dategrep, which is specifically designed for extracting lines that match various date and time criteria (from logfiles and such).
 
1 members found this post helpful.
Old 10-14-2021, 12:01 PM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,841

Rep: Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649
Quote:
Originally Posted by czezz View Post
Im trying to filter out every single record ...
Be careful with your wording. "Filter out" could mean "get rid of." Perhaps you meant "Select" or "Keep."

Daniel B. Martin

.
 
1 members found this post helpful.
Old 10-14-2021, 12:06 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,841

Rep: Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649
Don't overlook grep.

You could use ...
Code:
grep '|.*|.*|.*|.*|000000|' $InFile >$OutFile
If the format of your InFile is rigid you could simplify this to ...
Code:
grep '|000000|' $InFile >$OutFile
Daniel B. Martin

.
 
1 members found this post helpful.
Old 10-14-2021, 04:52 PM   #8
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,035

Rep: Reputation: Disabled
@danielbmartin. I guess what the OP wanted is grep -v '|000000|'
 
1 members found this post helpful.
Old 10-14-2021, 07:29 PM   #9
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,841

Rep: Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649
Quote:
Originally Posted by shruggy View Post
@danielbmartin. I guess what the OP wanted is grep -v '|000000|'
Post #6 in this thread pointed out the ambiguity of "filter out." I prefer your solution but showed a solution in post #7 which produced a result the OP seemed to like.

Daniel B. Martin

.
 
Old 10-15-2021, 12:33 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,852

Rep: Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111Reputation: 3111
Quote:
Originally Posted by czezz
Is there any way to filter by column Connection_time greater than 000000?
I am curious what the intention was here?
Looking at the data, and seeing you used >=, all data in your example does match this criteria.

$6 > 0 (or as many zeroes as you like as all will get ignored except the last) does correctly print all lines except those with all zeroes in the sixth field
 
1 members found this post helpful.
Old 10-15-2021, 02:20 AM   #11
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,856

Rep: Reputation: 827Reputation: 827Reputation: 827Reputation: 827Reputation: 827Reputation: 827Reputation: 827
@danielbmartin
Code:
grep '|.*|.*|.*|.*|000000|'
ensures there are 5 preceding | then searches from the 6th field onwards.
To only search the 6th field you must use the ^ anchor and non-greedy gaps.
Code:
grep '^[^|]*|[^|]*|[^|]*|[^|]*|[^|]*|000000|'
or
Code:
grep -P '^.*?|.*?|.*?|.*?|.*?|000000|'
awk is much better with fields.
EDIT: as mentioned later, the | is special in grep -P and each literal | must be escaped!

Last edited by MadeInGermany; 10-15-2021 at 06:42 PM.
 
1 members found this post helpful.
Old 10-15-2021, 04:17 AM   #12
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 17,201

Rep: Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822Reputation: 5822
Quote:
Originally Posted by MadeInGermany View Post
@danielbmartin
Code:
grep '|.*|.*|.*|.*|000000|'
ensures there are 5 preceding | then searches from the 6th field onwards.
This may give strange results in some cases....


Quote:
Originally Posted by MadeInGermany View Post
awk is much better with fields.
And I guess much faster too
 
Old 10-15-2021, 05:31 AM   #13
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,035

Rep: Reputation: Disabled
Quote:
Originally Posted by pan64 View Post
And I guess much faster too
Not necessarily so. grep -G and grep -E use a very smart and efficient GNU RE engine that avoids or minimizes backtracking in many cases. grep -P is another story though. PCRE is more powerful, but often also much slower than the GNU RE engine. In order to be efficient with it, you really have to use lazy/possessive quantifies or other tricks mentioned by MadeInGermany.

The famous Jamie Zawinski's quote was originally about Perl regexes. Although I prefer a similar, but more precise quote by Jan Goyvaerts, addressing the same problem:
Quote:
Originally Posted by Jan Goyvarts
People with little regex experience have surprising skill at coming up with exponentially complex regular expressions.

Last edited by shruggy; 10-15-2021 at 05:52 AM. Reason: More appropiate link
 
1 members found this post helpful.
Old 10-15-2021, 07:18 AM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,841

Rep: Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649Reputation: 649
Quote:
Originally Posted by MadeInGermany View Post
... To only search the 6th field you must use the ^ anchor and non-greedy gaps. ...
You are right. Your solution is robust. Mine were reliant on all InFiles adhering to the format and content of the sample. Might be true, might not.

On my machine this worked ...
Code:
echo; echo "Method #1 of LQ Senior Member MadeInGermany."
grep '^[^|]*|[^|]*|[^|]*|[^|]*|[^|]*|000000|' $InFile >$OutFile
... but this did not ...
Code:
echo; echo "Method #2 of LQ Senior Member MadeInGermany."
grep -P '^.*?|.*?|.*?|.*?|.*?|000000|' $InFile >$OutFile
Please advise.

Daniel B. Martin

.

Last edited by danielbmartin; 10-15-2021 at 07:33 AM. Reason: Clarification
 
Old 10-15-2021, 08:00 AM   #15
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,035

Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin View Post
Please advise.
Vertical bar is a special metacharacter in Perl and extended regexes (alternate). To match it literally, you have to escape it with a backslash:
Code:
grep -P '^.*?\|.*?\|.*?\|.*?\|.*?\|000000\|' $InFile >$OutFile
awk uses extended RE as well. In the OP, awk -F'|' worked only because awk treats the argument to -F literally as long as it is one character long.

Last edited by shruggy; 10-15-2021 at 08:18 AM.
 
2 members found this post helpful.
  


Reply

Tags
csv


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
AWK - filter value greater than pradeepspa Linux - Newbie 7 07-04-2018 03:11 PM
tcp keepalive - why new connections have value slightly greater than value I just set. myjess Linux - Networking 2 02-24-2017 07:16 AM
Linux text file 6th Column if value is greater than 80 then Higlight /change colour gbm4ibm@gmail.com Linux - General 4 02-25-2015 10:13 AM
Spreadsheet question find value for column b based on max value in column k davholla General 2 02-28-2014 04:50 AM
awk comparing a column value with a stored variables value bugg_deccan Programming 4 12-05-2008 08:08 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 05:07 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration