deleting columns over multiple files
Hi guys,
I have a directory full of files, maybe a hundred or two. These files are suppppper large, and I only need a few of the columns of data. So I'd like to delete a bunch of the columns and only keep the data I need. I've found several ways to delete multiple columns of data in a single file, but I need to do that to lots of files. Can someone please explain a simple, general way to take the functionality of an awk statement and loop it over many files? thanks! Tabitha |
the basic construct is:
Code:
for filename in <list>; do
    # ... process "$filename" here ...
done
the best solution depends on where your files are |
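To make the construct concrete, here is a minimal sketch (the glob *409.txt and the column numbers 3, 16, 21, 22 are taken from the question below; adjust both to taste). It writes each result to a new .processed file so the originals are untouched:

```shell
#!/bin/bash
# Loop an awk command over many files, writing each result to
# <name>.processed so the original files are left intact.
process_files() {
    local filename
    for filename in "$@"; do
        awk '{ $3=$16=$21=$22=""; print }' "$filename" > "$filename.processed"
    done
}
```

Called as, e.g., `process_files *409.txt`. Note that emptying a field this way leaves its delimiter behind; the posts below deal with cleaning those up.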
so something like this would remove the 3rd, 16th, 21st, and 22nd columns across hundreds of files?
Code:
for filename in *409.txt; do
How does filename get past into the awk command? |
You meant passed, not past......
The awk syntax is awk '<commands>' filename, so just add "$filename" after the awk command string |
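In other words, the shell variable is handed to awk as an ordinary file argument. A tiny illustration (the file here is a throwaway temp file standing in for the loop variable):

```shell
#!/bin/bash
# awk's input file is just the argument after the program string,
# so a quoted shell variable works there like anywhere else.
filename=$(mktemp)                 # stand-in for the loop variable
printf '1 2 3\n4 5 6\n' > "$filename"
awk '{ print $2 }' "$filename"     # prints the 2nd column of each line
rm -f "$filename"
```

Quoting "$filename" matters if any file name contains spaces.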
Actually the loop is not really necessary, since awk accepts multiple file arguments and processes them in sequence. The code should be slightly corrected (mainly it is missing an equal sign before the empty string), but it is not clear to me what the actual field separator is: blank spaces, commas, or both?
Here is my suggestion, applied to files whose fields (columns) are separated by blank spaces:
Code:
awk '{ $3=$16=$21=$22=""; gsub(/ +/,OFS); gsub(/^ +| +$/,""); print > ( FILENAME ".processed" ) }' *409.txt
Then, to restore the original names:
Code:
rename ".processed" "" *.processed
A final note: don't try to redirect the output of awk directly onto the original input file (that is, using the internal variable FILENAME alone), because with large files the input is buffered and at some point (maybe after a few tens of lines) it might be completely lost and the resulting output hopelessly truncated. |
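On that last note, a common safe pattern is to write to a temporary file and replace the original only if awk succeeded. A sketch, reusing the same awk program as above (space-separated fields assumed):

```shell
#!/bin/bash
# Never redirect awk output onto its own input file: write to a
# temporary file first, and overwrite the original only on success.
safe_rewrite() {
    local f=$1
    awk '{ $3=$16=$21=$22=""; gsub(/ +/,OFS); gsub(/^ +| +$/,""); print }' \
        "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}
```

Called as, e.g., `safe_rewrite somefile409.txt`. The && guarantees the original survives if awk fails partway through.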
Hi guys, well the output wasn't exactly right.
Here's what my input file *409.txt looks like. Note that even though many of the columns have no data or only partial data, the file is a csv file:
Code:
head1, head2, head3, head4, head5, head6, head7, head8, head9, head10
16987423, 1960, 103, 0, 1333275624, -3.6742382, 0.01, 0, stringA, 56723867
 , , , , 67332,5555555, , 0.04, ,
 , , , , ,5555556, 79633 , -68.1 , ,
 , , , , ,5555559, , , ,
 , , , , ,5555561, , , ,
 , , , , ,5555562, , , ,
 , , , , ,5555569, , , ,
58767422, 2101, 965, 0, 13333845222, -0.144383, 0.01, 0, stringB, 98673890
and this "block" of data repeats (just different numbers and strings) till EOF, and that's a looooooong ways cause the files are REALLY BIG.
After executing:
Code:
awk '{ $1=$2=$3=$4""; gsub(/ +/,OFS); gsub(/^ +| +$/,""); print > ( FILENAME ".processed" ) }' *409.txt
there is no header line, and the output looks like:
Code:
0,0,0,0,1333275624, -3.6742382, 0.01, 0, stringA, 56723867
 , , , , 67332,5555555, , 0.04, ,
 , , , , ,5555556, 79633, -68.1 ,
 , , , , ,5555559, , , ,
 , , , , ,5555561, , , ,
 , , , , ,5555568, , , ,
 , , , , ,5555569, , , ,
0,0,0,0,13333845222, -0.144383, 0.01, 0, stringB, 98673890
So it looks like it didn't delete the first four columns of the data. It looks like it just replaced the data at the top and bottom of each "block" of data with zeros, and the number of commas on each row stayed the same. My main goal here is to reduce the file size by removing columns of data I don't need, so having the zeros messes that up. These files are too large to suck into Excel, but if I could, I would simply highlight the columns I don't want and delete them; that is the functionality I want to have with this script. Thanks sooooo much guys for your help! Tabitha |
Well.. the posted example clarifies that the fields are comma separated (not space separated), so with a slight modification the suggested code should work as expected. That said, I'd like to point out that you repeated the typo (?) in your line of code, that is:
Code:
$1=$2=$3=$4""
whereas it should be:
Code:
$1=$2=$3=$4=""
The rest of the code serves to adjust the unwanted delimiters (if any). However, I still have some doubts about the appearance of the desired output. Please, can you post an example of the output based on the sample lines posted above? Here I copy/paste your input into CODE tags, which preserve the true spacing of the file:
Code:
16987423, 1960, 103, 0, 1333275624, -3.6742382, 0.01, 0, stringA, 56723867 |
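To see why that one missing = matters: awk parses $1=$2=$3=$4"" as chained assignments of the value of $4 concatenated with "" (i.e. $4's value as a string), so fields 1 through 3 each receive a copy of field 4 instead of being emptied. A quick demonstration:

```shell
#!/bin/bash
# Missing "=": fields 1-3 are assigned the VALUE of field 4 ("d").
printf 'a b c d e\n' | awk '{ $1=$2=$3=$4""; print }'    # -> d d d d e
# With "=": fields 1-4 are all emptied; their separators remain.
printf 'a b c d e\n' | awk '{ $1=$2=$3=$4=""; print }'   # -> "    e" (4 spaces)
```

That is exactly why the posted output had the leading fields replaced by copies of the fourth column (the zeros) rather than deleted.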
so I'm guessing that the comma-based delimiter gets stipulated in this part of the script:
Code:
gsub(/ +/,OFS)
I'm thinking that something like this would handle the commas:
Code:
gsub(/, +/,OFS)
Tabby |
Quote:
Code:
gsub(/ +/,OFS)
In a true CSV file, e.g.
Code:
field 1,field 2,field 3,field 4,field 5,field 6,field 7,field 8,field 9
you set the field separator to a comma instead. Emptying the unwanted fields leaves the delimiters behind:
Code:
$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; print}' file
so you then squeeze the runs of commas down to one:
Code:
$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); print}' file
and finally trim the leading/trailing ones:
Code:
$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print}' file |
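For reference, here are the three variants run on that sample line, with the output each one should produce (worth re-verifying on your own data):

```shell
#!/bin/bash
line='field 1,field 2,field 3,field 4,field 5,field 6,field 7,field 8,field 9'
# 1) Emptied fields leave their delimiters behind:
printf '%s\n' "$line" | awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; print}'
# -> ,,field 3,,field 5,,,field 8,
# 2) Squeeze runs of two-or-more commas down to one:
printf '%s\n' "$line" | awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); print}'
# -> ,field 3,field 5,field 8,
# 3) Also trim the leading and trailing commas:
printf '%s\n' "$line" | awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print}'
# -> field 3,field 5,field 8
```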
Quote:
A safer bet would be to use find, either with a while+read loop with null separators:
Code:
while IFS='' read -r -d '' fname; do
or by making the awk program an executable script (script.awk) starting with:
Code:
#!/usr/bin/awk -f
and running it via find:
Code:
find . -maxdepth 1 -type f -name '*.409.txt' -exec script.awk '{}' \+ |
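Joined up, the find + while-read variant might look like this (a sketch, reusing an awk program in the shape discussed earlier in the thread; the null-separated names survive spaces and newlines in filenames):

```shell
#!/bin/bash
# Feed null-terminated file names from find into a read loop;
# each file is processed by awk into a companion .processed file.
find . -maxdepth 1 -type f -name '*409.txt' -print0 |
while IFS='' read -r -d '' fname; do
    awk 'BEGIN{FS=OFS=","}{$3=$16=$21=$22=""; print}' "$fname" > "$fname.processed"
done
```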
Awk is quite confusing to test. Anyway in bash you could make it work like this:
Code:
#!/bin/bash |
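The body of that script did not survive here. Purely as an illustration of the pure-bash approach (a hypothetical sketch, not the original poster's script), deleting comma-separated columns 1, 2, 4, 6, 7 and 9 from stdin could look like:

```shell
#!/bin/bash
# Hypothetical sketch: read each comma-separated line into an array,
# keep only the wanted columns, and re-join with commas.
delete_cols() {
    local -a fields out
    local i
    while IFS=',' read -r -a fields; do
        out=()
        for i in "${!fields[@]}"; do
            case $((i + 1)) in
                1|2|4|6|7|9) ;;               # columns to drop
                *) out+=("${fields[i]}") ;;
            esac
        done
        (IFS=','; printf '%s\n' "${out[*]}")
    done
}
```

Used as, e.g., `delete_cols < input409.txt > input409.txt.processed`. Bash loops like this are far slower than awk on very large files, which is why the awk variants elsewhere in the thread are preferable.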
colucix, I ran your last script
Code:
awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print}' file
so I tried
Code:
awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print > ( FILENAME ".processed" ) }' *409.txt
after removing all spaces, and all but one comma per column, so that it is a "true" *.csv file like: field1,field2,field3, etc.
In the output:
column 1 has the header and first row of data from column 3; then from row 2 till the EOF it has column 1 data
column 2 has the header and first row of data from column 5; then from row 2 till the EOF it has column 2 data
column 3 has the header and first row of data from column 8; then from row 2 till the EOF it has column 3 data
column 4 has the header and first row of data from column 10; then from row 2 till the EOF it has column 4 data
column 5 has the header and first row of data from column 11; then from row 2 till the EOF it has column 5 data
column 6 has the header and first row of data from column 12; then from row 2 till the EOF it has column 6 data
So essentially the script only worked on the first row of data, but it did not delete any of the data columns below the first row. The rest of the columns have the same pattern of being 6 columns off in the header and first row of data. |
konsolebox,
for consistency in the data file I am using, I removed the same column numbers as colucix did, i.e. 1 2 4 6 7 9. Here are the results of running your bash script on the file, after removing all spaces and all but one comma per column, so that it is a "true" *.csv file like: field1,field2,field3, etc.
column 1 is identical to the input file's column 1
column 2 has the header and first row of data from column 4; then from row 2 till the EOF it has column 2 data
column 3 has the header and first row of data from column 6; then from row 2 till the EOF it has column 3 data
column 4 has the header and first row of data from column 9; then from row 2 till the EOF it has column 4 data
column 5 has the header and first row of data from column 11; then from row 2 till the EOF it has column 5 data
column 6 has the header and first row of data from column 12; then from row 2 till the EOF it has column 6 data
So essentially the script only worked on the first row of data, but it was off in index by 1, and it did not delete any of the data columns below the first row. The rest of the columns have the same pattern of being 6 columns off in the header and first row of data. |
@atjurhs: With respect to the results, could you give me a sample of the input and the output?
|
all the data, both input and output, is on a private LAN, so I'll do my best typing to reflect both.
The input data is very well depicted in post number 7, and here's an example of the output data. It has varied amounts of spaces and commas in between each column that has data. Some of the columns contain no data, just separated by multiple spaces and commas. Other columns do have data that may or may not be separated by multiple spaces or commas.
Code:
1, 4242, 3.42323e+23, 0.1, 0, 0,5,294875, 8438393, 394,,,,,,,,
 ,0, ,0,,,, 0.487564, , ,0, 0,0, 87563,,,,,,,,,
 , 0 , ,1, ,,,,,,,,,,,,,,,, 0, , , , 5,
 ,1, ,,,,,,5241,,,,, , , 0.4543e-3 , 0 ,
 111111111, 1, 1000,,,, 9576336e+10, 0.1, 0, 0, ,,, , 8438393, 001,
If by hand I delete all the spaces, and replace all multiple occurrences of commas (two or more consecutive) with a single comma, then the data looks like that below. Note the 1st and 5th rows are missing an initial comma, or you could say that the 2nd through 4th rows have an extra comma at the front.
Code:
rc11,rc12,rc13,rc14,rc15, etc.
,rc21,rc22,rc23,rc24,rc25, etc.
,rc31,rc32,rc33,rc34,rc35, etc.
,rc41,rc42,rc43,rc44,rc45, etc.
rc51,rc52,rc53,rc54,rc55, etc.
etc. etc. etc.
I really hope this helps you! and I REALLY appreciate you helping me! Tabitha |
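The by-hand cleanup described above (drop all spaces, collapse runs of commas) can be scripted. A sketch with sed, assuming no field legitimately contains a space:

```shell
#!/bin/bash
# Remove every space, then squeeze two-or-more consecutive
# commas down to a single comma.
clean_csv() {
    sed 's/ //g; s/,\{2,\}/,/g'
}
```

Used as, e.g., `clean_csv < input409.txt > cleaned409.txt`, after which the comma-based awk commands from earlier in the thread should behave predictably.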