LinuxQuestions.org
Old 10-11-2012, 10:33 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Rep: Reputation: Disabled
linking between two files


Hi guys,

I have two files that I need to "link" (not combine) and get a count on a specific value in a specific column, let me explain...

FileA is a csv that has two columns:

Code:
61443, 97336
68473, 59775
12345, 67890
23159, 09895
09785, 13844
FileB is a csv with many columns, but I only care about the values in the 5th column:

Code:
79, 478394554984562, 0, 1, 89705, 1, 0, 89657943793...
79, 478394554984563, 0, 1, 00001, 0, 1, 65894584945...
79, 478394554984564, 0, 1, 67890, 0, 0, 23987593872...
79, 478394554984565, 0, 1, 67890, 0, 1, 09347934433...
79, 478394554984566, 0, 1, 11114, 0, 0, 67539849393...
79, 478394554984567, 0, 1, 67890, 1, 1, 65894584945...
79, 478394554984568, 0, 1, 67890, 0, 0, 12893490543...
79, 478394554984569, 0, 1, 48760, 0, 1, 56804850333...
In FileA my input value (which I know beforehand) is 12345. The "linking" value to FileB (which I don't know beforehand, but read from FileA) is 67890. Then I need to get a count of all the occurrences of the linking value 67890 in FileB.

So I can't just do a grep on FileB for 67890, because I don't know 67890 before searching for the 12345 in FileA.

There's one additional problem to this. Sometimes 12345 in FileA might not correspond to one and only one linking value. It could correspond to both 67890 and say 11111. And I won't know if there are two beforehand either. So then I need two counts, one on 67890 in FileB and another (second) count on 11111 in FileB. BTW, there can never be more than two linking values.

can you guys help me?

Tabitha

Last edited by atjurhs; 10-11-2012 at 10:35 PM.
 
Old 10-12-2012, 01:20 AM   #2
nugat
Member
 
Registered: Sep 2012
Posts: 122

Rep: Reputation: 31
Hi,

What I would recommend is using a (Bash) array and a loop. Since you know the input, pass it as a command line argument to this script. Then use awk to match on the input in the first column in FileA, print column two, if a match is found. This result is saved to an array.

Then loop thru the array, and for each link found in FileA, use awk to match on it in the 5th column in FileB.

Try it:

Code:
#!/bin/bash
[ $# -ne 1 ] && echo "Usage: $0 <input>" && exit 1
input=$1

# files
fileA='fileA.txt'
fileB='fileB.txt'

declare -a links
links=($(awk -F, "\$1 ~ /$input/{print \$2}" $fileA|sed -e 's|^[[:space:]]*||'))
if [ ${#links[*]} -lt 1 ]; then
  echo "Input \`$input' not found in first column of $fileA"
  exit 1
fi
echo "Found ${#links[*]} links for $input in $fileA: ${links[*]}"

for link in ${links[*]}; do
  echo -n "Getting count of '$link' in $fileB: "
  awk -F, "\$5 ~ /$link/{print \$5}" $fileB|grep -c .
done
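
A side note on the lookup above: `$1 ~ /$input/` is a regular-expression match, so an input of 12345 would also match a row whose first column is 123456. The sketch below compares the lookup with an exact string comparison instead. The helper name `lookup_links`, the temp-file paths, and the extra `123456, 55555` row are assumptions added for illustration and are not part of the original script or data.

```shell
#!/bin/bash
# Sketch only: a few rows from the first post plus one made-up row (123456).
tmp=$(mktemp -d)
cat > "$tmp/fileA.txt" <<'EOF'
61443, 97336
68473, 59775
12345, 67890
123456, 55555
EOF

# lookup_links KEY FILE (hypothetical helper)
# Prints column 2 of every row whose column 1 is exactly KEY.
# -F' *, *' treats the comma plus any surrounding spaces as the separator,
# and ($1 "") forces a string comparison so leading zeros are not ignored.
lookup_links() {
  awk -F' *, *' -v key="$1" '($1 "") == key { print $2 }' "$2"
}

# The regex form used in the script, for comparison.
regex_links=$(awk -F, "\$1 ~ /12345/{print \$2}" "$tmp/fileA.txt" | sed -e 's|^[[:space:]]*||')
exact_links=$(lookup_links 12345 "$tmp/fileA.txt")

echo "regex match: $regex_links"
echo "exact match: $exact_links"
```

With this sample data the regex form returns both 67890 and 55555, while the exact comparison returns only 67890.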
 
Old 10-12-2012, 03:45 PM   #3
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Original Poster
Rep: Reputation: Disabled
I'm not sure I understand what "links" is doing in your script; there are too many things in it that are over my head. I mean, I understand how it is being used afterwards, but there are parts of it whose function I don't understand.

Tabitha

Last edited by atjurhs; 10-12-2012 at 04:15 PM.
 
Old 10-12-2012, 04:54 PM   #4
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Original Poster
Rep: Reputation: Disabled
here's what I get back:

[codes_directory]$ sh searching_script.bash 12345
Found 0 link for 12345 in fileA.txt
Getting counts of '67890' in fileB.txt: 4

which is the right answer for the number of 67890 occurrences, and if I give 12345 a second assignment in FileA.txt like 11111 the script will also find the number of those occurrences. perfect!

but there is a slight bug. the script should not give back "Found 0 link for 12345 in fileA.txt". it found the linking value of 67890, so it should give back 1, and if there was a second occurrence of a linking value, like the 11111, then it should have given back 2.

thanks so much!

Tabitha
 
Old 10-12-2012, 05:09 PM   #5
nugat
Member
 
Registered: Sep 2012
Posts: 122

Rep: Reputation: 31
Quote:
Originally Posted by atjurhs
I'm not sure I understand what "links" is doing in your script; there are too many things in it that are over my head. I mean, I understand how it is being used afterwards, but there are parts of it whose function I don't understand.
I am using "links" as a bash array. that basically means a variable that consists of a list of elements. e.g., you could define them individually, using an index number, like this:

Code:
links[0]=12345
links[1]=67890
you could then display the contents of the entire array with this:
Code:
$ echo ${links[*]}
12345 67890
or a specific element, using its index number:
Code:
$ echo ${links[1]}
67890
you can also print the number of elements in the array with:
Code:
$ echo ${#links[*]}
2
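
To tie the array syntax above back to the script, here is a small sketch of looping over the elements one at a time. The values 67890 and 11111 are the two linking values from the first post; the variable names `joined` and `n` are assumptions for illustration. Quoting `"${links[@]}"` hands each element to the loop as its own word, which is a slightly safer habit than the unquoted `${links[*]}`.

```shell
#!/bin/bash
# Sketch only: the two linking values mentioned in the first post.
links=(67890 11111)

# "${links[*]}" joins all elements into one string, separated by spaces.
joined="${links[*]}"

# "${links[@]}" expands to one word per element, which suits a for loop.
n=0
for link in "${links[@]}"; do
  n=$((n + 1))
  echo "element $n: $link"
done

echo "all at once: $joined"
echo "count: ${#links[@]}"
```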
 
Old 10-12-2012, 05:12 PM   #6
nugat
Member
 
Registered: Sep 2012
Posts: 122

Rep: Reputation: 31
Quote:
Originally Posted by atjurhs
here's what I get back:

[codes_directory]$ sh searching_script.bash 12345
Found 0 link for 12345 in fileA.txt
Getting counts of '67890' in fileB.txt: 4

which is the right answer for the number of 67890 occurrences, and if I give 12345 a second assignment in FileA.txt like 11111 the script will also find the number of those occurrences. perfect!

but there is a slight bug. the script should not give back "Found 0 link for 12345 in fileA.txt". it found the linking value of 67890, so it should give back 1, and if there was a second occurrence of a linking value, like the 11111, then it should have given back 2.
hmmm..i think the bug was introduced when you created "searching_script.bash". it clearly finds the input in FileA b/c it finds the related value in FileB. i'm guessing there is a typo in this line:

Code:
echo "Found ${#links[*]} links for $input in $fileA: ${links[*]}"
can you have a look?
 
Old 10-12-2012, 07:03 PM   #7
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Original Poster
Rep: Reputation: Disabled
Hi nugat,

Code:
 links=($(awk -F, "\$1 ~ /$input/{print \$2}" $fileA|sed -e 's|^[[:space:]]*||'))
I much appreciate the explanation of the links output. That helps! What I really don't understand is the sed part of the statement.

I thought I checked it closely, but I'll have to wait till Monday to check the echo statement again.....

I certainly will get back with you.

Thanks sooooo much,

Tabby
 
Old 10-12-2012, 08:10 PM   #8
nugat
Member
 
Registered: Sep 2012
Posts: 122

Rep: Reputation: 31
Quote:
Originally Posted by atjurhs
Code:
 links=($(awk -F, "\$1 ~ /$input/{print \$2}" $fileA|sed -e 's|^[[:space:]]*||'))
I much appreciate the explanation of the links output. That helps! What I really don't understand is the sed part of the statement.
all that sed statement is doing is stripping off the leading white space at the beginning of what awk is outputting.

here is what the value would look like, enclosed in single quotes, before the sed:

' 67890'

and after sed:

'67890'

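I am using a POSIX character class there for white space matching (the [[:space:]] portion). That will match on any white-space character, such as a space or a tab. And the asterisk after it means to match zero or more of them, so any run of leading white space is removed and a value with no leading white space is left unchanged.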

Why strip that space in the first place? It is cleaner. When we then look for that value in the second file, we don't want to be worrying about white space surrounding the value.
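
The stripping can be tried on its own, outside the script. This is a minimal sketch; the function name `strip_leading` is an assumption for illustration, and the input strings are made up to show a mix of spaces and a tab.

```shell
#!/bin/bash
# strip_leading (hypothetical helper): remove leading white space from
# each line read on standard input, using the same sed expression as above.
strip_leading() {
  sed -e 's|^[[:space:]]*||'
}

# A value with a space, a tab and another space in front of it.
padded=$(printf ' \t 67890\n' | strip_leading)

# A value with no leading white space passes through unchanged,
# because * also matches zero characters.
plain=$(printf '67890\n' | strip_leading)

echo "'$padded'"
echo "'$plain'"
```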
 
Old 10-12-2012, 11:54 PM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,424

Rep: Reputation: 2823
How about something like:
Code:
awk -vval=12345 -F" *, *" 'FNR==NR{if($1 == val){val2 = $2;nextfile};next}$5 == val2{sum++;print}END{print "Total lines containing",val2,"is",sum}' FileA FileB
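
The one-liner above keeps a single linking value. For the case from the first post where 12345 maps to two linking values, one possible variation is to store every linking value in an awk array and tally each one separately. This is only a sketch: the function name `count_links`, the temp files, the second `12345, 11114` row, and the shortened FileB rows are assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: abbreviated sample data; 12345 is given two linking values.
tmp=$(mktemp -d)
cat > "$tmp/FileA" <<'EOF'
61443, 97336
12345, 67890
12345, 11114
EOF
cat > "$tmp/FileB" <<'EOF'
79, 478394554984562, 0, 1, 89705, 1, 0
79, 478394554984564, 0, 1, 67890, 0, 0
79, 478394554984565, 0, 1, 67890, 0, 1
79, 478394554984566, 0, 1, 11114, 0, 0
79, 478394554984567, 0, 1, 67890, 1, 1
79, 478394554984568, 0, 1, 67890, 0, 0
EOF

# count_links KEY FILEA FILEB (hypothetical helper)
count_links() {
  awk -v val="$1" -F' *, *' '
    # First file: remember every linking value paired with the key.
    FNR == NR { if (($1 "") == val) count[$2] = 0; next }
    # Second file: tally column 5 when it is one of the linking values.
    $5 in count { count[$5]++ }
    # Print "linking-value count" for each one found in the first file.
    END { for (link in count) print link, count[link] }
  ' "$2" "$3" | sort
}

result=$(count_links 12345 "$tmp/FileA" "$tmp/FileB")
echo "$result"
```

With this sample data the output is one line per linking value: `11114 1` and `67890 4`.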
 
Old 10-15-2012, 10:07 AM   #10
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Original Poster
Rep: Reputation: Disabled
nugat, you were absolutely correct! I had typed

Code:
 echo "Found ${#link[*]} ......
without the "s" on links

instead of

Code:
 echo "Found ${#links[*]} ......
with the "s"

you've really been sooooo much help, thank you so much, and I've learned a few bash coding things as well. It's kinda hard to learn this stuff without getting to take a class on it. Other than the O'Reilly pubs on bash, sed, and awk, are there other books that are more teaching/tutorial to help me learn this more easily?

muah, Tabby

Last edited by atjurhs; 10-15-2012 at 10:13 AM.
 
Old 10-15-2012, 10:54 AM   #11
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Original Poster
Rep: Reputation: Disabled
Ok, so as it turns out I goofed slightly, the 12345 val is in the 2nd column (not the 1st) of FileA.

thinking that the
Code:
 awk -F
was doing the searching when defining links, I first tried just changing

Code:
 links=($(awk -F, "\$1 ~ /$input/{print \$2}"   ....
to

Code:
 links=($(awk -F, "\$2 ~ /$input/{print \$2}"   ....
that didn't work. I also know that it is not in the very bottom awk statement, because I can easily change its column assignment to anything I want

Code:
  awk -F, "\$27 ~ /$link/{print \$27}" $fileB|grep -c .
and it runs fine. I really thought the answer would be in that first awk statement, but as much as I play around with it, I can't get it to come out right

Tabby

Last edited by atjurhs; 10-15-2012 at 12:57 PM.
 
Old 10-15-2012, 06:43 PM   #12
nugat
Member
 
Registered: Sep 2012
Posts: 122

Rep: Reputation: 31
Hi,

So if you mean that the input you need to match on is in the 2nd column of FileA, and you want its paired value in the first column, then take that value and search for it in FileB, then you just need to swap the $1 and the $2 in the first awk statement, e.g.:

Code:
links=($(awk -F, "\$2 ~ /$input/{print \$1}" $fileA|sed -e 's|^[[:space:]]*||'))
Btw, the $1 and the $2 are internal awk variables representing the 1st and 2nd fields of each line. By default fields are white-space separated, but the -F, option used here makes the comma the separator. does that work?
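
Since the script passes `-F,` to awk, it may help to see how the field separator changes what `$1` and `$2` contain. A small sketch, using a made-up line in the swapped layout (linking value first, input second); the variable names are assumptions for illustration.

```shell
#!/bin/bash
line='67890, 12345'

# Default separator (white space): the comma stays attached to field 1.
default_first=$(printf '%s\n' "$line" | awk '{print $1}')

# With -F, the comma is the separator, so field 1 is clean ...
comma_first=$(printf '%s\n' "$line" | awk -F, '{print $1}')

# ... but field 2 keeps the space that followed the comma.
comma_second=$(printf '%s\n' "$line" | awk -F, '{print $2}')

echo "default \$1: '$default_first'"
echo "-F, \$1:     '$comma_first'"
echo "-F, \$2:     '$comma_second'"
```

That leftover space on the later fields is what the sed step in the script removes.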
 
Old 10-15-2012, 08:29 PM   #13
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,258

Rep: Reputation: 2328
Quote:
teaching/tutorial
Try these
http://rute.2038bug.com/index.html.gz
http://tldp.org/LDP/Bash-Beginners-G...tml/index.html
http://www.tldp.org/LDP/abs/html/
 
Old 10-16-2012, 10:22 AM   #14
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Original Poster
Rep: Reputation: Disabled
nugat, that worked perfectly!

chrism01, thanks for the reads, I'll start looking through them to try and learn more. Learning this on my own w/o a formal class or someone right there to explain things is difficult, but you guys help sooooo much.

thanks again guys!

Tabby
 
Old 10-16-2012, 11:15 AM   #15
atjurhs
Member
 
Registered: Aug 2012
Posts: 179

Original Poster
Rep: Reputation: Disabled
next I'd like to print off the number of occurrences for any one pair.

I tried writing an echo and > statement both before and after the "done" at the end of the script, where I thought I should be echoing $link - that didn't work

so I tried an fprintf statement in the last awk line - that didn't work

then I tried to wrap the whole script in another awk statement and a > to a text file - that also didn't work

what's the correct way?
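
The thread ends without a reply to this question. One common approach, offered here only as a sketch, is to capture each count in a variable, echo the pair together with its count, and redirect the output of the whole loop by putting `> file` directly after `done`. The output file name `counts.txt`, the hard-coded `links` array, the second linking value 11114, and the shortened fileB rows are assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: abbreviated sample data and a hypothetical output file.
tmp=$(mktemp -d)
fileB="$tmp/fileB.txt"
out="$tmp/counts.txt"
cat > "$fileB" <<'EOF'
79, 478394554984562, 0, 1, 89705, 1, 0
79, 478394554984564, 0, 1, 67890, 0, 0
79, 478394554984565, 0, 1, 67890, 0, 1
79, 478394554984566, 0, 1, 11114, 0, 0
79, 478394554984567, 0, 1, 67890, 1, 1
79, 478394554984568, 0, 1, 67890, 0, 0
EOF

input=12345
links=(67890 11114)   # stands in for what the first awk lookup would return

for link in "${links[@]}"; do
  # Count rows whose 5th column equals the linking value; n + 0 prints 0
  # when there are no matches.
  count=$(awk -F' *, *' -v link="$link" '($5 "") == link { n++ } END { print n + 0 }' "$fileB")
  echo "$input $link $count"
done > "$out"          # everything echoed inside the loop lands in the file

cat "$out"
```

Each line of the file then reads input, linking value, count. Placing the redirection after `done` writes the file once, rather than overwriting it on every pass through the loop.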
 
  

