LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   script help... (https://www.linuxquestions.org/questions/linux-newbie-8/script-help-421417/)

realized 03-03-2006 09:06 PM

script help...
 
I have a script that compares fields in a CSV file and spams me the matches..

#!/bin/bash

if [ ! -r "$1" ]; then
    echo "$1 does not exist or is not readable."
    exit 1
fi

gawk -F ',' '{ print $1 FS $2 FS $3 }' "$1" | sed 's/ //g' | tr A-Z a-z | sort | uniq -d




My goal is for it to spam me the ENTIRE LINE of the MATCH, i.e. I will get at least 2 lines for every 1 match..


Any ideas?
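[Editor's note: if the eventual goal is to print every whole line whose first field appears more than once, a two-pass awk can do it in one step. A minimal sketch, untested against the real data; file.csv is a placeholder built from the sample posted later in the thread:]

```shell
# Sample input in the thread's CSV shape (first field = email address):
cat > file.csv <<'EOF'
user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
test@aol.com,3013,34,6,61
EOF

# Pass 1 (NR==FNR): count each first field. Pass 2: print any whole line
# whose first field was seen more than once. Works with gawk or plain awk.
awk -F',' 'NR==FNR { count[$1]++; next } count[$1] > 1' file.csv file.csv
```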

muha 03-04-2006 10:49 AM

I'm not entirely sure what you want to do. Maybe post some of the input and intended output :?
I think you want: egrep PATTERN
so something like:
gawk -F ',' '{ print $1 FS $2 FS $3 }' "$1" | egrep match
to get the lines which contain 'match'

realized 03-06-2006 11:13 PM

change:

the goal is now to take a CSV FILE, and look at the FIRST FIELD (before the first ,)

If the first field of any line matches the first field of another line, i want the ENTIRE LINE/ALL FIELDS of that line spammed to the screen.

sooo the first part:

gawk -F ',' '{ print $1 }' "$1" | sort | uniq -d
works fine.

so now i have the output of JUST the matches.. how do i "grep" each line of that .. i.e

grep 'match' $1 ??

how do i define the "match" ?
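[Editor's note: one way to "define the match" is to save the duplicated first fields to a pattern file and hand it to grep -f. A sketch under the same assumptions (file.csv is a placeholder; note the addresses are used as unescaped regexes, so a literal "." will also match any character, and uniq -d only reports adjacent duplicates, hence the sort):]

```shell
cat > file.csv <<'EOF'
user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
test@aol.com,3013,34,6,61
EOF

# uniq -d only reports ADJACENT duplicates, so sort first:
cut -d',' -f1 file.csv | sort | uniq -d > dups.txt

# Anchor each address at the start of the line and require the trailing
# comma, so "foo" does not also match "footest":
sed 's/^/^/; s/$/,/' dups.txt > pats.txt
grep -f pats.txt file.csv
```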

muha 03-07-2006 03:44 AM

again :D i'm not really positive on what you want since you did not give an example.
Anyway, it sounds like you want to match a PATTERN (let's say foo) before the first comma.
If it is matched, the output is the whole line. Rather than using cat, gawk, and grep, i'd use sed.
Inputfile:
Code:

$ : cat aa.txt
output,test,test,
foo,test,test,
bar2,test,test,
foo,test,test
footest,test,

the interesting part:
Code:

$ : sed -n '/^foo,/p' aa.txt
foo,test,test,
foo,test,test

Or if there is something else behind the pattern foo allowed:
Code:

$ : sed -n '/^foo.*,/p' aa.txt
foo,test,test,
foo,test,test
footest,test,

If there is something before and after the pattern allowed:
Code:

$ : sed -n '/^[^,]*foo.*,/p' aa.txt
foo,test,test,
foo,test,test
footest,test,
testfoo,test,test,

What i'm trying to do:
-n suppress automatic output. Usually used together with the p command.
p print the matching line.
^foo matches foo at the beginning of the line
.* matches any character (.) zero or more times (*)
[^,]* ^ for negation, so match a single non-comma character; * for zero or more times.

If you really need the gawk, grep, you might be able to use line-numbers to pass the line number to print from gawk to grep ..

timmeke 03-07-2006 04:25 AM

Or simply use:
Code:

grep -e '^foo' $1
That greps all lines starting with "foo".
Regular expressions similar to those in the sed example from muha are possible too.
awk is more powerful than sed and grep, but it can be trickier too.

If you insist on using awk, try something like this:
Code:

awk -F',' '/^foo/ {print;}' $1
Or to match certain columns:
Code:

awk -F',' '{if ($1=='foo') print;}' $1
In awk, the command "print;" prints the current line.
There are other possibilities too (like using the ~ or !~ operators in the if-test for matching regular expressions).

Edit: corrected small typo in awk commands.
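[Editor's note: the ~ operator mentioned above can be sketched like this, reusing muha's aa.txt sample from earlier in the thread; /^foo$/ forces an exact field match while /^foo/ also catches the footest line:]

```shell
cat > aa.txt <<'EOF'
output,test,test,
foo,test,test,
bar2,test,test,
foo,test,test
footest,test,
EOF

awk -F',' '$1 ~ /^foo$/ { print }' aa.txt   # exact match on field 1
awk -F',' '$1 ~ /^foo/  { print }' aa.txt   # prefix match: also prints the footest line
```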

muha 03-07-2006 05:56 AM

@timmeke, these two awk's don't work for me ..
This one does work for me:
Code:

awk -F ',' '/^foo/ {print}' aa.txt
It also prints the line when there is something between the pattern foo and the comma.

timmeke 03-07-2006 06:19 AM

You're right. The $1 at the end should of course be OUTSIDE of the single quotes, otherwise the shell
will say "file $1 not found" or something like that (the shell needs to interpret $1 as a variable,
not literally).

I've edited my previous post accordingly.

Quote:

It also prints the line when there is something between the pattern foo and the comma.
Indeed. The regular expression /^foo/ matches any line that starts with "foo", regardless of what follows.
But you can use any regular expression you like...

muha 03-07-2006 06:57 AM

Not trying to bitch here, but am trying to learn :)
@timmeke's first awk: since we are not separating the columns, we can ditch the -F option.
We only check whether a line starts with foo, so:
Code:

awk '/^foo/ {print;}' aa.txt
The second awk only works when i double quote it, instead of single quotes 'foo'
Code:

awk -F',' '{if ($1=="foo") print;}' aa.txt

archtoad6 03-07-2006 09:32 AM

OP: Please, please, give us a (short) sample input file, the current output, & the desired output.

At this point I don't know what more you are trying to accomplish beyond what you already have.

Perhaps this is what you want:
Code:

F=<name_of_target_file>
S=","  # the separator, for flexibility
L=1    # this is a variable to aid debugging
for X in `cut -d"$S" -f 1 $F | sort -u`
do
  if [ `grep "^$X$S" $F | wc -l` -ge $L ]
    then grep "^$X$S" $F
  fi
done

Use at your own risk & no fair flaming any dumb coding errors -- I had no test target to run it on & I ain't writing one for you.

RTFM list:
  • uniq
  • sort
  • cut
  • wc
  • test
  • (bash)

This should also work:
Code:

F=<name_of_target_file>
S=","
uniq -t"$S" -D -W1 $F

Technically, this could be a one-liner, but using the variables makes it: a) easier to read, b) more flexible. (I thought my 1st try was too good to omit, & it gives a good demo of finding a better way.)

realized 03-07-2006 01:11 PM

Example of file..


user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
homo@mofoisp.com,3035,1950,00dc,fmo
psst@hushmail.com,9315,d00,0llld,f
test@aol.com,3013,34,6,61


So that file has 2 matches... we are just comparing the first field.. the email addresses..

matches are:

test@aol.com
user@isp.com

so i want a script to parse a file like that, with any LINE where the FIRST FIELD matches ANY OTHER LINE OF THE FILE, i want it to show the entire line...

so the output would be:

user@isp.com,500,200,100
user@isp.com,3051,01001,ajksdf,dadsf

test@aol.com,5431,3015,3561
test@aol.com,3013,34,6,61

timmeke 03-08-2006 01:47 AM

@muha. Thanks for the corrections.
You're right for both awks.

@realized. To find "double" entries in the first column, I often use a little shell or Perl scripting.
awk can probably do it too.

A not-so-performant Bash example, using grep:
Code:

adr=`cut -d',' -f1 your_file | sort -u`; # retrieves all (unique) e-mail addresses from your file
for i in $adr; do
  count=`grep -c -e "^${i}," your_file`; # counts the occurrences of each address (the trailing comma avoids prefix matches)
  if (( $count > 1 )); then
      grep -e "^${i}," your_file; # prints all the matching lines
  fi;
done

A Perl script would be more efficient, but it requires that you sort your file on the e-mail addresses first. In your case, this sorting is easy, since the addresses are in the first column (start of the lines).
So do
Code:

sort your_file > sorted_file;
first.
Then: use a script like this.
You may need to change the path to the Perl interpreter and possibly add some error/input checks.
Code:

#!/usr/bin/perl -w
$prevAdr="";  # we'll store the previous address in this var
$prevLine="";
open(FILE, $ARGV[0]) || die "Cannot open file $ARGV[0]";
while(<FILE>)
{
  chomp();
  $line=$_;
  @elem=split(/,/, $line); # splits the line up on the "," field separator
  if ($elem[0] eq $prevAdr)
  {
      print "$prevLine\n";
      print "$line\n";
  }
  $prevLine=$line;
  $prevAdr=$elem[0];
}
close(FILE);

Please note that:
1. I haven't tested this script, so it can be buggy.
2. For addresses that occur more than 2 times (this isn't checked), the script will print some entries twice. So you may want to pipe the output of this script into a "sort -u" or "uniq".


timmeke 03-08-2006 02:05 AM

After reading the man page of the "uniq" command (I had never used it before), it seems that it can do the same trick I just described.
Code:

uniq -d
or
Code:

uniq -D
seem to be your friends...
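[Editor's note: a quick sketch of the -d vs -D difference on the thread's sample data (GNU uniq assumed; remember uniq only sees adjacent duplicates, hence the sort):]

```shell
cat > file.csv <<'EOF'
user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
test@aol.com,3013,34,6,61
EOF

cut -d',' -f1 file.csv | sort | uniq -d   # one line per duplicated address
cut -d',' -f1 file.csv | sort | uniq -D   # every occurrence of each duplicated address
```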

archtoad6 03-20-2006 06:57 AM

realized: Thanks for the sample.

timmeke: As I said, RT(F)M uniq

indeed uniq -D will do what you want, provided you sort the input 1st:
Code:

F=<name_of_target_file>
# S is a var. for flexibility -- -t, would also work
S=","
sort $F  | uniq -t"$S" -D -W1

It worked here on your sample.

archtoad6 03-20-2006 07:18 AM

BTW, my 1st piece of code also works, provided L=2. L=1 shows all lines.

