LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   script help... (https://www.linuxquestions.org/questions/linux-newbie-8/script-help-421417/)

realized 03-03-2006 09:06 PM

script help...
 
I have a script that compares fields in a CSV file and spams me the matches..

#!/bin/bash

if [ ! -r "$1" ]; then
    echo "$1 does not exist or is not readable."
    exit 1
fi

gawk -F ',' '{ print $1 FS $2 FS $3 }' "$1" | sed 's/ //g' | tr A-Z a-z | sort | uniq -d




My goal is for it to spam me the ENTIRE LINE of the MATCH, i.e. I will get at least 2 lines for every 1 match..


Any ideas?
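[Editor's note: if the eventual goal is to print every whole line whose first field appears more than once, a two-pass awk can do it in one step. A minimal sketch, untested against the real data; file.csv is a placeholder built from the sample posted later in the thread:]

```shell
# Sample input in the thread's CSV shape (first field = email address):
cat > file.csv <<'EOF'
user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
test@aol.com,3013,34,6,61
EOF

# Pass 1 (NR==FNR): count each first field. Pass 2: print any whole line
# whose first field was seen more than once. Works with gawk or plain awk.
awk -F',' 'NR==FNR { count[$1]++; next } count[$1] > 1' file.csv file.csv
```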

muha 03-04-2006 10:49 AM

I'm not entirely sure what you want to do. Maybe post some of the input and intended output :?
I think you want: egrep PATTERN
so something like:
gawk -F ',' '{ print $1 FS $2 FS $3 }' "$1" | egrep match
to get the lines which contain 'match'

realized 03-06-2006 11:13 PM

change:

the goal is now to take a CSV FILE, and look at the FIRST FIELD (before the first ,)

If the first field of any line matches the first field of another line, i want the ENTIRE LINE/ALL FIELDS of that line spammed to the screen.

sooo the first part:

gawk -F ',' '{ print $1 }' "$1" | sort | uniq -d
works fine.

so now i have the output of JUST the matches.. how do i "grep" each line of that .. i.e

grep 'match' $1 ??

how do i define the "match" ?
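[Editor's note: one way to "define the match" is to save the duplicated first fields to a pattern file and hand it to grep -f. A sketch under the same assumptions (file.csv is a placeholder; note the addresses are used as unescaped regexes, so a literal "." will also match any character, and uniq -d only reports adjacent duplicates, hence the sort):]

```shell
cat > file.csv <<'EOF'
user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
test@aol.com,3013,34,6,61
EOF

# uniq -d only reports ADJACENT duplicates, so sort first:
cut -d',' -f1 file.csv | sort | uniq -d > dups.txt

# Anchor each address at the start of the line and require the trailing
# comma, so "foo" does not also match "footest":
sed 's/^/^/; s/$/,/' dups.txt > pats.txt
grep -f pats.txt file.csv
```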

muha 03-07-2006 03:44 AM

again :D i'm not really positive on what you want since you did not give an example.
Anyway, it sounds like you want to match a PATTERN (let's say foo) before the first comma.
If it is matched, the output is the whole line. Rather than using cat, gawk, and grep, i'd use sed.
Inputfile:
Code:

$ : cat aa.txt
output,test,test,
foo,test,test,
bar2,test,test,
foo,test,test
footest,test,

the interesting part:
Code:

$ : sed -n '/^foo,/p' aa.txt
foo,test,test,
foo,test,test

Or if there is something else behind the pattern foo allowed:
Code:

$ : sed -n '/^foo.*,/p' aa.txt
foo,test,test,
foo,test,test
footest,test,

If there is something before and after the pattern allowed:
Code:

$ : sed -n '/^[^,]*foo.*,/p' aa.txt
foo,test,test,
foo,test,test
footest,test,
testfoo,test,test,

What i'm trying to do:
-n suppress automatic output. Usually used together with the p command.
p print the matching line.
^foo matches foo at the beginning of the line
.* matches any character (.) zero or more times (*)
[^,]* ^ for negation, so match a single non-comma character; * for zero or more times.

If you really need the gawk, grep, you might be able to use line-numbers to pass the line number to print from gawk to grep ..

timmeke 03-07-2006 04:25 AM

Or simply use:
Code:

grep -e '^foo' $1
That greps all lines starting with "foo".
Regular expressions similar to those in the sed example from muha are possible too.
awk is more powerful than sed and grep, but it can be trickier too.

If you insist on using awk, try something like this:
Code:

awk -F',' '/^foo/ {print;}' $1
Or to match certain columns:
Code:

awk -F',' '{if ($1=='foo') print;}' $1
In awk, the command "print;" prints the current line.
There are other possibilities too (like using the ~ or !~ operators in the if-test for matching regular expressions).

Edit: corrected small typo in awk commands.
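[Editor's note: the ~ operator mentioned above can be sketched like this, reusing muha's aa.txt sample from earlier in the thread; /^foo$/ forces an exact field match while /^foo/ also catches the footest line:]

```shell
cat > aa.txt <<'EOF'
output,test,test,
foo,test,test,
bar2,test,test,
foo,test,test
footest,test,
EOF

awk -F',' '$1 ~ /^foo$/ { print }' aa.txt   # exact match on field 1
awk -F',' '$1 ~ /^foo/  { print }' aa.txt   # prefix match: also prints the footest line
```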

muha 03-07-2006 05:56 AM

@timmeke, these two awk's don't work for me ..
This one does work for me:
Code:

awk -F ',' '/^foo/ {print}' aa.txt
It also prints the line when there is something between the pattern foo and the comma.

timmeke 03-07-2006 06:19 AM

You're right. The $1 at the end should of course be OUTSIDE of the single quotes, otherwise the shell
will say "file $1 not found" or something like that (the shell needs to interpret $1 as a variable,
not literally).

I've edited my previous post accordingly.

Quote:

It also prints the line when there is something between the pattern foo and the comma.
Indeed. The regular expression /^foo/ matches any line that starts with "foo", regardless of what follows.
But you can use any regular expression you like...

muha 03-07-2006 06:57 AM

Not trying to bitch here, but am trying to learn :)
@timmeke's first awk: since we are not separating the columns, we can ditch the -F option.
We only check whether a line starts with foo, so:
Code:

awk '/^foo/ {print;}' aa.txt
The second awk only works when i double quote it, instead of single quotes 'foo'
Code:

awk -F',' '{if ($1=="foo") print;}' aa.txt

archtoad6 03-07-2006 09:32 AM

OP: Please, please, give us a (short) sample input file, the current output, & the desired output.

At this point I don't know what more you are trying to accomplish beyond what you already have.

Perhaps this is what you want:
Code:

F=<name_of_target_file>
S=","  # the separator, for flexibility
L=1    # this is a variable to aid debugging
for X in `cut -d"$S" -f 1 $F | sort -u`
do
  if [ `grep "^$X$S" $F | wc -l` -ge $L ]
    then grep "^$X$S" $F
  fi
done

Use at your own risk & no fair flaming any dumb coding errors -- I had no test target to run it on & I ain't writing one for you.

RTFM list:
  • uniq
  • sort
  • cut
  • wc
  • test
  • (bash)

This should also work:
Code:

F=<name_of_target_file>
S=","
uniq -t"$S" -D -W1 $F

Technically, this could be a one-liner, but using the variables makes it: a) easier to read, b) more flexible. (I thought my 1st try was too good to omit, & it gives a good demo of finding a better way.)

realized 03-07-2006 01:11 PM

Example of file..


user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
homo@mofoisp.com,3035,1950,00dc,fmo
psst@hushmail.com,9315,d00,0llld,f
test@aol.com,3013,34,6,61


So that file has 2 matches... we are just comparing the first field.. the email addresses..

matches are:

test@aol.com
user@isp.com

so i want a script to parse a file like that, with any LINE where the FIRST FIELD matches ANY OTHER LINE OF THE FILE, i want it to show the entire line...

so the output would be:

user@isp.com,500,200,100
user@isp.com,3051,01001,ajksdf,dadsf

test@aol.com,5431,3015,3561
test@aol.com,3013,34,6,61

timmeke 03-08-2006 01:47 AM

@muha. Thanks for the corrections.
You're right for both awks.

@realized. To find "double" entries in the first column, I often use a little shell or Perl scripting.
awk can probably do it too.

A not-so-performant Bash example, using grep:
Code:

adr=`cut -d',' -f1 your_file | sort -u`; # retrieves all (unique) e-mail addresses from your file
for i in $adr; do
  count=`grep -c -e "^${i}," your_file`; # counts the occurrences of each address (the trailing comma avoids prefix matches)
  if (( $count > 1 )); then
      grep -e "^${i}," your_file; # prints all the matching lines
  fi;
done

A Perl script would be more efficient, but it requires that you sort your file on the e-mail addresses first. In your case, this sorting is easy, since the addresses are in the first column (start of the lines).
So do
Code:

sort your_file > sorted_file;
first.
Then: use a script like this.
You may need to change the path to the Perl interpreter and possibly add some error/input checks.
Code:

#!/usr/bin/perl -w
$prevAdr="";  # we'll store the previous address in this var
$prevLine="";
open(FILE, $ARGV[0]) || die "Cannot open file $ARGV[0]";
while(<FILE>)
{
  chomp();
  $line=$_;
  @elem=split(/,/, $line); # splits the line up on the "," field separator
  if ($elem[0] eq $prevAdr)
  {
      print "$prevLine\n";
      print "$line\n";
  }
  $prevLine=$line;
  $prevAdr=$elem[0];
}
close(FILE);

Please note that:
1. I haven't tested this script, so it can be buggy.
2. For addresses that occur more than 2 times (this isn't checked), the script will print some entries twice. So you may want to pipe the output of this script into a "sort -u" or "uniq".


timmeke 03-08-2006 02:05 AM

After reading the man page of the "uniq" command (I had never used it before), it seems that it can do the same trick I just described.
Code:

uniq -d
or
Code:

uniq -D
seem to be your friends...
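[Editor's note: a quick sketch of the -d vs -D difference on the thread's sample data (GNU uniq assumed; remember uniq only sees adjacent duplicates, hence the sort):]

```shell
cat > file.csv <<'EOF'
user@isp.com,500,200,100
test@aol.com,5431,3015,3561
casper@earthlink.net,4301,hah,mofo
user@isp.com,3051,01001,ajksdf,dadsf
test@aol.com,3013,34,6,61
EOF

cut -d',' -f1 file.csv | sort | uniq -d   # one line per duplicated address
cut -d',' -f1 file.csv | sort | uniq -D   # every occurrence of each duplicated address
```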

archtoad6 03-20-2006 06:57 AM

realized: Thanks for the sample.

timmeke: As I said, RT(F)M uniq

indeed uniq -D will do what you want, provided you sort the input 1st:
Code:

F=<name_of_target_file>
# S is a var. for flexibility -- -t, would also work
S=","
sort $F  | uniq -t"$S" -D -W1

It worked here on your sample.

archtoad6 03-20-2006 07:18 AM

BTW, my 1st piece of code also works, provided L=2. L=1 shows all lines.

