LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices

Reply
 
Search this Thread
Old 11-20-2012, 04:02 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 133

Rep: Reputation: Disabled
"stictching" together files with changing linking values


Hi guys,

I have three data files that I need to "stitch" together with "linking values" that change between files.

Here's what I mean by "stitching" and "linking value" and maybe there's a coding term I should use instead, idk?

Beginning with fileA, I know that I want all the data in all the files that is associated with the "knownName" parameter in column 2, in this example the value is 6102. fileA also tells me that the value of 6102 is linked to "objectIdentifier_" value of 17447 in column 1.

Now in fileB column 1 the "objectIdentifier_" single value of 17447 is linked to many different "dNumCount_" values listed in the 2nd column, 2764201, 2764379, 2765099, 2818365 (in a full length file there would be hundreds of them).

Now in fileC those same "dNumCount_" values of fileB show up in column 1 but they are called "DcrldNumber".

What I need to do is to "stich together" fileA->fileB->fileC-> and create an outputfile having whatever columns of data that I need to pull from from fileA and fileB and fileC filtering out the rest, probably using a using a print statement, idk?

so far I have a tool (courtesy of grail) that begins the stitching process but it won't go further than fileB because the tool doesn't know how to handle the multiple occurances listed for 17447

these example files are all comma deliminated, but some of the files are space deliminated, and these example files have header lines, but some of the files will not

fileA.dat
Code:
objectIdentifier_ , knownName
17432 , 18123
17433 , 34512
17401 , 18123
17447 , 6102 
17452 , 18123
17503 , 6122
17511 , 11688
17899 , 5150
fileB.dat
Code:
objectIdentifier_ , dNumCount_,  completed_, confid_ 
0 , 2764106 , 1 , 0.0
17398 , 2764108 , 1  , 0.931 
17398 , 2764111 , 0  , 0.0 
17447 , 2764201 , 1  , 0.947 
17493 , 2764228 , 1  , 0.979 
0 , 2764322 , 1 , 0.0
17447 , 2764379 , 1  , 0.879
17447 , 2765099 , 1  , 0.988
17485 , 2765777 , 1  , 0.464
17450 , 2768881 , 1  , 0.999
17447 , 2818365 , 1  , 0.918
17551 , 2995401 , 0  , 0.0
17398 , 2996103 , 1  , 0.788
fileC.dat
Code:
DcrIdNumber , type , RdB, SdB, uVal,  count , use, ...   
2764106 , string_a , 580 , 284 , -8991718 , 298 , 1,  ... 
2764108 , string_a , 790 , 111 , -5695475 , 670 , 0,  ...
2764111 , string_b , 420 , 227 , -8991989 , 831 , 1,  ...
2764201 , string_c , 826 , 110 , -6050178 , 298 , 1,  ... 
2764228 , string_a , 226 , 528 , -3986151 , 514 , 0,  ...
2764322 , string_c , 224 , 969 , -8976960 , 171 , 1,  ...
2764379 , string_a , 747 , 961 , -3814686 , 218 , 1,  ...
2765099 , string_a , 798 , 775 , -6594104 , 199 , 1,  ...
2765777 , string_a , 400 , 830 , -7203293 , 950 , 1,  ...
2768881 , string_b , 719 , 698 , -6284989 , 115 , 1,  ...
2818365 , string_b , 655 , 411 , -8595573 , 824 , 1,  ...
2995401 , string_c , 897 , 210 , -8856596 , 128 , 0,  ...
2996103 , string_a , 510 , 323 , -5834545 , 764 , 1,  ...
outputfile.dat
Code:
objectIdentifier_, DcrIdNumber,  stringType, RdB, SdB , uVal, count , use, confid_   
17447 , 2764201 , string_c ,  826 , 110 , -6050118, 298 , 1 , 1 , 0.947 
17447 , 2764379 , string_a ,  747 , 961 , -3814686, 218 , 1 , 1 , 0.879 
17447 , 2765099 , string_a ,  798 , 775 , -6594104, 199 , 1 , 1 , 0.988 
17447 , 2818365 , string_b ,  655 , 411 , -8595573, 824 , 1 , 1 , 0.918
grail's script
Code:
awk -vval=6102 -F" *, *" 'FNR==NR{if($1 == val){val2 = $2;nextfile};next}$5 == val2{sum++;print(val,$0)}' fileA fileB > outputfile.out
any help, would be greatly appreciated, thanks soooo much!

Tabitha
Attached Images
File Type: jpg a.jpg (22.9 KB, 12 views)
File Type: jpg b.jpg (51.6 KB, 12 views)
File Type: jpg c.jpg (76.9 KB, 13 views)

Last edited by atjurhs; 11-21-2012 at 08:59 AM. Reason: added example data
 
Old 11-20-2012, 05:37 PM   #2
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,446

Rep: Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829
Quote:
Originally Posted by atjurhs View Post
Here's what I mean by "stitching" and "linking value" and maybe there's a coding term I should use instead, idk?
What you call "stiching" is known as a "Join" in SQL, or Relational Algebra. The "linking value" would be "Join Field" or Foreign Key (though Foreign Key implies a bit more than Join Field).

There is a coreutils command call join, but it requires the data be sorted on the join field.

Quote:
What I need to do is to "stich together" fileA->fileB->fileC-> and create a fileD having whatever columns of data that I want to pull from from fileA and fileB and fileC using a print statement

so far I have a tool (courtesy of grail) that begins the stitching process but it won't go further than fileB because the tool doesn't know how to handle the multiple occurances listed for 17447
It's a bit unclear how you want to handle multiple occurrences? Anyway, if you could post some sample data in text format, and sample output it would a lot easier to help.
 
1 members found this post helpful.
Old 11-20-2012, 05:37 PM   #3
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,261

Rep: Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028
If I was going to have to do that with flat files, I'd use Perl to map each file as a hash so that the key values are the keys to each file hash.
To be honest though, this is a classic problem to solve using a DB+SQL.
 
Old 11-21-2012, 09:28 AM   #4
atjurhs
Member
 
Registered: Aug 2012
Posts: 133

Original Poster
Rep: Reputation: Disabled
Hi ntubski and Chris,

Thank you guys so much for helping!!!

ntubski, I posted the example data set in the first post, I guess I should have done that first - whoooops.

Chris, I read about databases and SQL statements on line last night, and you are right that that would be a great way to solve the proble, too bad I don't have that option.

in my girlish thinking of the psuedo code would be, but there's probably a better way???

1. user inputs "knownName" value, either written in the code or typed in at the keyboard
2. read fileA once to find the "objectIdentifier_" which will be join_key(k) in this case 17447
3. read fileB looking for join_key(k) in order to find "dNumCount_" which will be join_key(j) in these example files the first occurence is 2818365
4. read fileC looking for "DcrIdNumber" join_key(j)
5. and print to an outputfile the lines of data from fileB and fileC
6. loop through fileB and fileC for the j+1 join_key, then j+2 join_key, then j+3 join_key, etc.etc. writing those lines of data to the outputfile until end of fileB.

ntbuski, maybe fileB and fileC could be sorted on join_key(k) and then use the "join" command, but I'm not sure because they are diffenert lengths idk?

Tabby

Last edited by atjurhs; 11-21-2012 at 09:50 AM. Reason: added idea of sorting and join
 
Old 11-21-2012, 11:18 AM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,446

Rep: Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829
Quote:
Originally Posted by atjurhs View Post
Chris, I read about databases and SQL statements on line last night, and you are right that that would be a great way to solve the proble, too bad I don't have that option.
Actually...
Code:
#!/bin/sh
name=6102
sqlite3 <<EOF
-- The fields that will be compared need to declared as NUMERIC otherwise the extra whitespace gets in the way.
CREATE TABLE a(objectIdentifier_ NUMERIC, knownName NUMERIC);
CREATE TABLE b(objectIdentifier_ NUMERIC, dNumCount_ NUMERIC,  completed_, confid_);
CREATE TABLE c(DcrIdNumber NUMERIC , stringType , RdB, SdB, uVal,  count , use, etc);

.mode csv
.import fileA.dat a
.import fileB.dat b
.import fileC.dat c
.mode list

SELECT a.objectIdentifier_, DcrIdNumber,  stringType, RdB, SdB , uVal, count , use, confid_
FROM a, b, c
WHERE (a._rowid_ == 1 AND b._rowid_ == 1 AND c._rowid_ == 1) OR -- first line contains header names
(a.knownName == $name AND
 a.objectIdentifier_ == b.objectIdentifier_ AND
 b.dNumCount_ == c.DcrIdNumber);
EOF
 
1 members found this post helpful.
Old 11-23-2012, 01:37 PM   #6
atjurhs
Member
 
Registered: Aug 2012
Posts: 133

Original Poster
Rep: Reputation: Disabled
Hi ntubski, I hope you had a nice Thanksgiving!

I ran your script by using sh ntubki.sh and I got back

Code:
objectIdentifier_ ,DcrIdNumber , stringType, RdB, SdB , uVal, count , use, confid_
which is obviously not what I was hoping for, so I looked in the bins of my system and I didn't find an sqlite3 executeable.

so I guess I'm back to looking for an AWK of bash way of doing this, yes???
 
Old 11-23-2012, 02:48 PM   #7
atjurhs
Member
 
Registered: Aug 2012
Posts: 133

Original Poster
Rep: Reputation: Disabled
I can run grail's awk script once to find the correlation from my known 6102 to the first join key 17447 and

I kinda think that if there is a way for grail's awk script to "loop over" the many occrances of 17447 in fileB so then it will know the new join key(S)that relate to 17447 in fileC it could join them one by one to their match in fileC, yes, idk, maybe?


Code:
awk -vval=6102 -F" *, *" 'FNR==NR{if($1 == val){val2 = $2;nextfile};next}$5 == val2{sum++;print(val,$0)}' fileA fileB > outputfile.out

Last edited by atjurhs; 11-23-2012 at 02:49 PM.
 
Old 11-23-2012, 06:37 PM   #8
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,446

Rep: Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829
Quote:
Originally Posted by atjurhs View Post
Hi ntubski, I hope you had a nice Thanksgiving!
Yes, but in October (I'm Canadian).

Quote:
I ran your script by using sh ntubki.sh and I got back

Code:
objectIdentifier_ ,DcrIdNumber , stringType, RdB, SdB , uVal, count , use, confid_
which is obviously not what I was hoping for, so I looked in the bins of my system and I didn't find an sqlite3 executeable.
If you don't have sqlite3 installed I would expect to see something like
Code:
./ntubski.sh: 3:  sqlite3: not found
Since you got some output, it seems like you do have sqlite3 installed, but maybe the format of the .csv files is a bit inconsistent. That seems like the main barrier to handling problems like this with sqlite3: it doesn't have a good builtin csv parser.

The sort and join commands don't really handle full csv either; I had to filter out the whitespace around the commas to make things work (it's also too bad sort lacks a --header option):
Code:
#!/bin/bash
name=6102
objId=$(awk -F' *, *' -vname=$name '$2 == name {print $1}' fileA.dat)
join --header -t, -1 2 -2 1 -o 1.1,2.1,2.2,2.3,2.4,2.5,2.6,2.7,1.4 \
    <(head -1 fileB.dat
    awk -F, -vobjId="$objId" '{gsub(/ *, */, ",")} objId == $1' fileB.dat | sort -t, -k2,2) \
    <(head -1 fileC.dat
    sed '1d; s/ *, */,/g' fileC.dat | sort -t, -k1,1)
 
Old 11-26-2012, 10:13 AM   #9
atjurhs
Member
 
Registered: Aug 2012
Posts: 133

Original Poster
Rep: Reputation: Disabled
Good morning ntubski, I ran your script below, well sort of:

Quote:
Originally Posted by ntubski View Post
The sort and join commands don't really handle full csv either; I had to filter out the whitespace around the commas to make things work (it's also too bad sort lacks a --header option):

Code:
#!/bin/bash
name=6102
objId=$(awk -F' *, *' -vname=$name '$2 == name {print $1}' fileA.dat)
join --header -t, -1 2 -2 1 -o 1.1,2.1,2.2,2.3,2.4,2.5,2.6,2.7,1.4 \
    <(head -1 fileB.dat
    awk -F, -vobjId="$objId" '{gsub(/ *, */, ",")} objId == $1' fileB.dat | sort -t, -k2,2) \
    <(head -1 fileC.dat
    sed '1d; s/ *, */,/g' fileC.dat | sort -t, -k1,1)
my machine does not recognize the "header" option of join, it came back with:

Code:
join: unrecognized option `--header'
so I re-wrote that line as

Code:
 join -t, -1 2 -2 1 -o 1.1,2.1,2.2,2.3,2.4,2.5,2.6,2.7,1.4 \
and removed the header lines from the files.

ran the modified scirpt ./ntubski_mod.sh and it ran, well sort of, it didn't error out, but there was also no output file?

so I removed all the commas spaces * , * and replaced them with just a single space (I wrote a script that does this very quickly), so that now all the files have this format:

Code:
 
0 2764106 1 0.0
17398 2764108 1 0.931 
17398  2764111 0 0.0 
17447 2764201 1 0.947 
17493 2764228 1 0.979 
0 2764322 1 0.0
17447 2764379 1 0.879
17447 2765099 1 0.988
17485 2765777 1 0.464
17450 2768881 1 0.999
17447 2818365 1 0.918
17551 2995401 0 0.0
17398 2996103 1  0.788
ran the script again ./ntubski_mod_mod.sh, and again it doesn't error out, but it doesn't produce an output file either. Any ideas as to the problem?

again, thanks so much for your help with this!!!

Tabitha

Last edited by atjurhs; 11-26-2012 at 10:18 AM.
 
Old 11-26-2012, 11:40 AM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,446

Rep: Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829
Quote:
Originally Posted by atjurhs View Post
my machine does not recognize the "header" option of join, it came back with:

Code:
join: unrecognized option `--header'
Hmm, do you know what version of join you have? Does join --version give you anything?

Quote:
and removed the header lines from the files.

ran the modified scirpt ./ntubski_mod.sh and it ran, well sort of, it didn't error out, but there was also no output file?
You would also have modify the other parts of the code because some of it is there to remove the first line so that it doesn't go through sort, but then add it back so it does go through join.

Just to clarify though, the script I wrote just puts the result to standard out, not to a file. By "no output file", do you mean no output whatsoever?


Quote:
so I removed all the commas spaces * , * and replaced them with just a single space (I wrote a script that does this very quickly), so that now all the files have this format:
That might be a better approach, but you also have to change the parts in the script that were told to use a comma as the delimiter: the -t option for join and sort, and the "-F," for awk. Things look quite a bit simpler without having to account for headers and commas:
Code:
#!/bin/bash
name=6102
objId=$(awk -vname=$name '$2 == name {print $1}' fileA.dat.space)
join -1 2 -2 1 -o 1.1,2.1,2.2,2.3,2.4,2.5,2.6,2.7,1.4 \
    <(awk -vobjId="$objId" 'objId == $1' fileB.dat.space | sort -k2,2) \
    <(sort -k1,1 fileC.dat.space)
The .space files were produced with
Code:
for file in file?.dat ; do sed '1d; s/ *, */ /g' $file > $file.space ; done
 
Old 11-26-2012, 03:42 PM   #11
atjurhs
Member
 
Registered: Aug 2012
Posts: 133

Original Poster
Rep: Reputation: Disabled
wohhoo, super, it runs !!!

it would be really helpful if there was a switch or case statement (but I don't see that sort of cmnd in awk) that would enable it to handle a header line if one occurred in the file

I also added a redirect to an output file that now I can do other stuff with

many thanks!

btw, the join --version gave me back GNU coreutils 5.97

thanks again, Tabitha

Last edited by atjurhs; 11-26-2012 at 03:56 PM.
 
Old 11-26-2012, 07:13 PM   #12
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,261

Rep: Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028
In the long run you'll find these useful to bookmark (if you haven't already)
http://rute.2038bug.com/index.html.gz
http://tldp.org/LDP/Bash-Beginners-G...tml/index.html
http://www.tldp.org/LDP/abs/html/
http://www.grymoire.com/Unix/

I'd also add that if you will be doing a lot of data processing, install/learn some sort of SQL tool and/or learn Perl.
Data munging is one of Perl's strengths.

Good luck
 
Old 11-26-2012, 09:36 PM   #13
atjurhs
Member
 
Registered: Aug 2012
Posts: 133

Original Poster
Rep: Reputation: Disabled
I totally agree that an SQL DB would be a great solution tooooo bad I don't have su on my system so that probably won't work for me but I'll ask....

to me Perl looked way more scary than AWK and bash, but I have been thinking about Python???
 
Old 11-27-2012, 05:59 AM   #14
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,261

Rep: Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028Reputation: 2028
https://en.wikipedia.org/wiki/There%...e_way_to_do_it
 
Old 11-27-2012, 09:44 AM   #15
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,446

Rep: Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829Reputation: 829
Quote:
Originally Posted by atjurhs View Post
it would be really helpful if there was a switch or case statement (but I don't see that sort of cmnd in awk) that would enable it to handle a header line if one occurred in the file
awk has a switch statement, but I don't think you would need it just to remove a header.

Quote:
btw, the join --version gave me back GNU coreutils 5.97
Ah, the --header option was added in version 8.5: NEWS item.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
nano -- changing colors for files without a "syntax" SilversleevesX Linux - Newbie 1 10-07-2012 02:00 AM
LXer: Debian: contempt for "end user" values has to stop! LXer Syndicated Linux News 0 08-13-2009 05:20 PM
"Permission denied" and "recursive directory loop" when searching for string in files mack1e Linux - Newbie 5 06-12-2008 07:38 AM
mambo: changing "Poll" to say "Encuesta" eantoranz Linux - Software 2 03-28-2006 11:35 AM
"X-MS" cant open because "x-Multimedia System" cant access files at "smb&qu ponchy5 Linux - Networking 0 03-29-2004 11:18 PM


All times are GMT -5. The time now is 06:30 PM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration