"stitching" together files with changing linking values
Hi guys,
I have three data files that I need to "stitch" together with "linking values" that change between files. Here's what I mean by "stitching" and "linking value" (maybe there's a coding term I should use instead, idk?).

Beginning with fileA, I know that I want all the data in all the files that is associated with the "knownName" parameter in column 2; in this example the value is 6102. fileA also tells me that the value 6102 is linked to the "objectIdentifier_" value 17447 in column 1. Now in fileB column 1, the single "objectIdentifier_" value 17447 is linked to many different "dNumCount_" values listed in the 2nd column: 2764201, 2764379, 2765099, 2818365 (in a full-length file there would be hundreds of them). In fileC those same "dNumCount_" values from fileB show up in column 1, but there they are called "DcrIdNumber".

What I need to do is "stitch together" fileA -> fileB -> fileC and create an output file having whatever columns of data I need to pull from fileA and fileB and fileC, filtering out the rest, probably using a print statement, idk? So far I have a tool (courtesy of grail) that begins the stitching process, but it won't go further than fileB because the tool doesn't know how to handle the multiple occurrences listed for 17447.

These example files are all comma-delimited, but some of the files are space-delimited, and these example files have header lines, but some of the files will not.

fileA.dat
Code:
objectIdentifier_ , knownName
fileB.dat
Code:
objectIdentifier_ , dNumCount_, completed_, confid_
fileC.dat
Code:
DcrIdNumber , type , RdB, SdB, uVal, count , use, ...
desired output
Code:
objectIdentifier_, DcrIdNumber, stringType, RdB, SdB , uVal, count , use, confid_
Code:
awk -vval=6102 -F" *, *" 'FNR==NR{if($1 == val){val2 = $2;nextfile};next}$5 == val2{sum++;print(val,$0)}' fileA fileB > outputfile.out
Tabitha
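One thing worth noting about the one-liner above: per the description, knownName (6102) sits in column 2 of fileA, yet the script tests $1 against it. Below is a sketch of the fileA->fileB stitch with that column assumption corrected. The sample rows are invented to match the headers described in the post (only the 17447/6102 pairing and the dNumCount_ values are taken from the thread):

```shell
# Invented sample data shaped like the headers described in the post.
cat > fileA.dat <<'EOF'
objectIdentifier_ , knownName
17447 , 6102
18220 , 7311
EOF

cat > fileB.dat <<'EOF'
objectIdentifier_ , dNumCount_, completed_, confid_
17447 , 2764201, 1, 0.90
17447 , 2764379, 1, 0.80
18220 , 9999999, 0, 0.10
17447 , 2765099, 0, 0.70
17447 , 2818365, 1, 0.95
EOF

# fileA: knownName is column 2 and objectIdentifier_ is column 1,
# so match $2 and remember $1; then print every fileB row whose
# column 1 carries that objectIdentifier_.
awk -v val=6102 -F" *, *" '
  FNR == NR { if ($2 == val) { id = $1; nextfile } next }
  $1 == id  { print val, $0 }
' fileA.dat fileB.dat > outputfile.out

cat outputfile.out
```

The header lines fall out for free here: "knownName" never equals 6102 and "objectIdentifier_" never equals 17447, so they simply don't match.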
There is a coreutils command called join, but it requires the data be sorted on the join field.
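To make that point concrete, here is a minimal sketch of sort + join on made-up, already whitespace-normalized rows (file names b.csv/c.csv and all values except the IDs quoted in the thread are invented):

```shell
# Hypothetical comma-only rows: fileB-style (id, dNum, ...) and
# fileC-style (dNum, type, ...).
cat > b.csv <<'EOF'
17447,2764201,1,0.90
17447,2818365,1,0.95
EOF

cat > c.csv <<'EOF'
2764201,typeX,10,20
2818365,typeY,11,21
9999999,typeZ,0,0
EOF

# join requires both inputs sorted on the join field:
# field 2 of b.csv, field 1 of c.csv.
sort -t, -k2,2 b.csv > b.sorted
sort -t, -k1,1 c.csv > c.sorted

# -1 2 -2 1: pair field 2 of file1 with field 1 of file2;
# the join field is printed first, then the rest of each line.
join -t, -1 2 -2 1 b.sorted c.sorted > joined.out
cat joined.out
```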
If I were going to do that with flat files, I'd use Perl to map each file into a hash, so that the join values are the keys of each file's hash.
To be honest though, this is a classic problem to solve using a DB+SQL.
Hi ntubski and Chris,
Thank you guys so much for helping!!! ntubski, I posted the example data set in the first post, I guess I should have done that first - whoooops. Chris, I read about databases and SQL statements online last night, and you are right that that would be a great way to solve the problem; too bad I don't have that option.

In my girlish thinking the pseudo code would be (but there's probably a better way???):
1. user inputs the "knownName" value, either written in the code or typed in at the keyboard
2. read fileA once to find the "objectIdentifier_", which will be join_key(k), in this case 17447
3. read fileB looking for join_key(k) in order to find "dNumCount_", which will be join_key(j); in these example files the first occurrence is 2818365
4. read fileC looking for the "DcrIdNumber" join_key(j)
5. and print to an output file the lines of data from fileB and fileC
6. loop through fileB and fileC for the j+1 join_key, then j+2, then j+3, etc. etc., writing those lines of data to the output file until the end of fileB.

ntubski, maybe fileB and fileC could be sorted on join_key(k) and then use the "join" command, but I'm not sure because they are different lengths, idk? Tabby
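The six steps above can be sketched as a single awk pass over all three files. The sample rows below are invented to match the headers in the first post, and the printed columns are just one plausible choice, not the exact desired-output layout:

```shell
# Invented sample files shaped like the described headers.
cat > fileA.dat <<'EOF'
objectIdentifier_ , knownName
17447 , 6102
EOF

cat > fileB.dat <<'EOF'
objectIdentifier_ , dNumCount_, completed_, confid_
17447 , 2764201, 1, 0.90
17447 , 2818365, 1, 0.95
18220 , 9999999, 0, 0.10
EOF

cat > fileC.dat <<'EOF'
DcrIdNumber , type , RdB, SdB, uVal, count , use
2764201 , alpha , 1, 2, 3, 4, y
2818365 , beta , 5, 6, 7, 8, n
9999999 , gamma , 0, 0, 0, 0, n
EOF

# Step 2: fileA gives join_key(k); step 3: fileB gives the whole set of
# join_key(j) values at once; steps 4-6: print each fileC row whose
# column 1 is one of those j values.
awk -v name=6102 -F" *, *" '
  FILENAME == ARGV[1] { if ($2 == name) id = $1; next }
  FILENAME == ARGV[2] { if ($1 == id) keys[$2] = 1; next }
  $1 in keys          { print id ", " $0 }
' fileA.dat fileB.dat fileC.dat > stitched.out

cat stitched.out
```

Storing every j value in the keys array is what lets one k fan out to many j's, which is exactly the "multiple occurrences of 17447" problem.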
Code:
#!/bin/sh
Hi ntubski, I hope you had a nice Thanksgiving!
I ran your script by using sh ntubski.sh and I got back
Code:
objectIdentifier_ ,DcrIdNumber , stringType, RdB, SdB , uVal, count , use, confid_
so I guess I'm back to looking for an AWK or bash way of doing this, yes???
I can run grail's awk script once to find the correlation from my known 6102 to the first join key 17447, and
I kinda think that if there is a way for grail's awk script to "loop over" the many occurrences of 17447 in fileB, so that it knows the new join key(s) that relate to 17447 in fileC, it could join them one by one to their match in fileC, yes, idk, maybe?
Code:
awk -vval=6102 -F" *, *" 'FNR==NR{if($1 == val){val2 = $2;nextfile};next}$5 == val2{sum++;print(val,$0)}' fileA fileB > outputfile.out
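One simple way to "loop over" all the occurrences is to split the job in two: first collect every dNumCount_ that fileB lists for 17447 into a key file, then pull the matching fileC lines with grep. Sample rows are invented; -w guards against a key matching inside a longer number, though a key that happens to appear in a later column of fileC could still slip through:

```shell
# Invented samples shaped like fileB and fileC from the first post.
cat > fileB.dat <<'EOF'
objectIdentifier_ , dNumCount_, completed_, confid_
17447 , 2764201, 1, 0.90
17447 , 2818365, 1, 0.95
18220 , 9999999, 0, 0.10
EOF

cat > fileC.dat <<'EOF'
DcrIdNumber , type , RdB, SdB, uVal, count , use
2764201 , alpha , 1, 2, 3, 4, y
2818365 , beta , 5, 6, 7, 8, n
9999999 , gamma , 0, 0, 0, 0, n
EOF

# Collect every dNumCount_ linked to 17447 ...
awk -v id=17447 -F" *, *" '$1 == id { print $2 }' fileB.dat > keys.txt

# ... then pull the fileC lines containing one of those keys.
# -F: fixed strings, -w: whole words only, -f: patterns from a file.
grep -wFf keys.txt fileC.dat > matches.out
cat matches.out
```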
Quote:
Code:
./ntubski.sh: 3: sqlite3: not found

The sort and join commands don't really handle full csv either; I had to filter out the whitespace around the commas to make things work (it's also too bad sort lacks a --header option):
Code:
#!/bin/bash
Good morning ntubski, I ran your script below, well sort of:
Code:
join: unrecognized option `--header'
Code:
join -t, -1 2 -2 1 -o 1.1,2.1,2.2,2.3,2.4,2.5,2.6,2.7,1.4 \
I ran the modified script ./ntubski_mod.sh and it ran, well sort of; it didn't error out, but there was also no output file? So I removed all the " *, *" separators and replaced them with just a single space (I wrote a script that does this very quickly), so that now all the files have this format:
Code:
Again, thanks so much for your help with this!!! Tabitha
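Since the installed join (coreutils 5.97) predates --header, the usual workaround is to strip the header with tail -n +2 before sorting, so join never sees it. A sketch with invented headered files (names hb.csv/hc.csv are made up):

```shell
# Hypothetical comma-only files that carry a header line.
cat > hb.csv <<'EOF'
objectIdentifier_,dNumCount_
17447,2764201
17447,2818365
EOF

cat > hc.csv <<'EOF'
DcrIdNumber,type
2764201,alpha
2818365,beta
EOF

# tail -n +2 drops line 1 (the header) so sort and join only
# ever see data rows.
tail -n +2 hb.csv | sort -t, -k2,2 > hb.sorted
tail -n +2 hc.csv | sort -t, -k1,1 > hc.sorted

join -t, -1 2 -2 1 hb.sorted hc.sorted > joined.csv
cat joined.csv
```

If a header is wanted on the result, it can be echoed into the output file before the join line runs.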
Just to clarify though, the script I wrote just puts the result on standard out, not into a file. By "no output file", do you mean no output whatsoever?
Code:
#!/bin/bash
Code:
for file in file?.dat ; do sed '1d; s/ *, */ /g' "$file" > "$file.space" ; done
Woohoo, super, it runs :) !!!
It would be really helpful if there was a switch or case statement (but I don't see that sort of command in awk) that would enable it to handle a header line if one occurred in the file. I also added a redirect to an output file that now I can do other stuff with :) many thanks! btw, join --version gave me back GNU coreutils 5.97. Thanks again, Tabitha
In the long run you'll find these useful to bookmark (if you haven't already):
http://rute.2038bug.com/index.html.gz
http://tldp.org/LDP/Bash-Beginners-G...tml/index.html
http://www.tldp.org/LDP/abs/html/
http://www.grymoire.com/Unix/
I'd also add that if you will be doing a lot of data processing, install/learn some sort of SQL tool and/or learn Perl. Data munging is one of Perl's strengths. Good luck :)
I totally agree that an SQL DB would be a great solution; tooooo bad I don't have su on my system, so that probably won't work for me :( but I'll ask....
To me Perl looked way more scary than AWK and bash, but I have been thinking about Python???