LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 09-17-2012, 07:55 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Rep: Reputation: Disabled
deleting columns over multiple files


Hi guys,

I have a directory full of files, like maybe a hundred or two.

These files are super large, and I only need a few of the columns of data. So I'd like to delete a bunch of the columns and keep only the data I need.

I've found several ways to delete multiple columns of data on a single file, but I need to do that to lots of files.

Can someone please explain to me a simple, general way to take the functionality of an awk statement and loop it over many files?

thanks!

Tabitha
 
Old 09-17-2012, 08:23 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 738
The basic construct is:
Code:
for filename in <list>; do
     <stuff--eg your AWK command(s)>
done
There are a number of ways of implementing <list>, e.g.:
  • a simple wildcard (*), which acts on every file in the current directory
  • the output of something like ls | grep something
  • reading from a file containing the filenames

The best choice depends on where your files are.
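A minimal runnable sketch of the wildcard variant (the sample file, the *.txt glob, the ".new" suffix, and the one-column awk program are all illustrative stand-ins, not requirements):

```shell
# Create a sample space-separated file so the sketch runs as-is.
printf 'a b c\nd e f\n' > sample.txt

# The basic construct: run an awk command on every .txt file in the
# current directory, writing each result next to its input file.
for filename in *.txt; do
    awk '{ print $1 }' "$filename" > "$filename.new"   # stand-in awk program
done
```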
 
Old 09-17-2012, 08:43 PM   #3
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
so something like this would remove the 3rd, 16th, 21st, and 22nd columns across hundreds of files?

Code:
for filename in *409.txt
do
   awk 'BEGIN { FS=" "; OFS=" " } {$3=$16=$21=$22"";gsub(",+",",",$0)}1'
done

How does filename get past into the awk command?

Last edited by atjurhs; 09-17-2012 at 08:45 PM.
 
Old 09-17-2012, 10:09 PM   #4
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 738Reputation: 738Reputation: 738Reputation: 738Reputation: 738Reputation: 738Reputation: 738
You meant passed, not past...

The awk syntax is
awk '<commands>' filename

so just add "$filename" after the awk command string.
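A runnable sketch of that; the sample file and the line-counting awk program are illustrative stand-ins, and the point is only how "$filename" gets passed to awk:

```shell
# Create a sample file matching the *409.txt pattern so this runs as-is.
printf '1 2\n3 4\n5 6\n' > demo409.txt

# Each iteration hands the current file to awk via "$filename"
# (quoted, in case a name ever contains spaces).
for filename in *409.txt; do
    awk 'END { print FILENAME ": " NR " lines" }' "$filename"
done
```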
 
Old 09-18-2012, 04:36 AM   #5
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
Actually the loop is not really necessary, since awk accepts multiple file arguments and can process all the files in sequence. The code needs a slight correction (mainly, it is missing an equal sign before the empty string), but it is not clear to me what the actual field separator is: blank spaces, commas, or both?

Here is my suggestion applied to files whose fields (columns) are separated by blank spaces:
Code:
awk '{ $3=$16=$21=$22=""; gsub(/  +/,OFS); gsub(/^ +| +$/,""); print > ( FILENAME ".processed" ) }' *409.txt
The second gsub statement removes blank spaces at the beginning or end of each line, in case you remove the first or the last columns. Furthermore, the output is redirected by awk itself to a file whose name is the original name with the suffix ".processed" appended. If you want to rename (overwrite) the original files after having checked the results, you can simply do
Code:
rename ".processed" "" *.processed
Beware that the syntax of the rename command may differ across systems, so please check your man page to find the correct form.

A final note: don't try to redirect the output of awk directly back to the original input file, that is, using the internal variable FILENAME alone. With large files the input is buffered, and at some point (maybe after a few tens of lines) it might be lost completely, leaving the output hopelessly truncated.
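A runnable sketch of the safe pattern that note implies: write to a temporary file, and replace the original only after awk exits successfully. The sample file, the *.csv glob, and the choice of emptying field 2 are all illustrative:

```shell
# Create a sample CSV so the sketch runs as-is.
printf 'a,b,c\nd,e,f\n' > sample.csv

# Never redirect awk back onto its own input: write to a temp file,
# then move it over the original only on success.
for f in *.csv; do
    awk 'BEGIN { FS = OFS = "," }
         { $2 = ""; gsub(/,,+/, OFS); gsub(/^,+|,+$/, ""); print }' "$f" > "$f.tmp" &&
        mv "$f.tmp" "$f"
done
```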

Last edited by colucix; 09-18-2012 at 06:34 AM. Reason: Typo in suggested code
 
Old 09-21-2012, 02:32 PM   #6
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
Hi guys, well the output wasn't exactly right.

here's what my input file *409.txt looks like. Note that even though many of the columns have no data, or only partial data, the file is a CSV file.



head1, head2, head3, head4, head5, head6, head7, head8, head9, head10
16987423, 1960, 103, 0, 1333275624, -3.6742382, 0.01, 0, stringA, 56723867
, , , , 67332,5555555, , 0.04, , ,
, , , , ,5555556, 79633 , -68.1 , ,
, , , , ,5555559, , , ,
, , , , ,5555561, , , ,
, , , , ,5555562, , , ,
, , , , ,5555569, , , ,
58767422, 2101, 965, 0, 13333845222, -0.144383, 0.01, 0, stringB, 98673890




and this "block" of data repeats (just with different numbers and strings) till EOF, and that's a looooooong way because the files are REALLY BIG

after executing:
Code:
awk '{ $1=$2=$3=$4""; gsub(/  +/,OFS); gsub(/^ +| +$/,""); print > ( FILENAME ".processed" ) }' *409.txt
the output file that I got has in it:

no header line
0,0,0,0,1333275624, -3.6742382, 0.01, 0, stringA, 56723867
, , , , 67332,5555555, , 0.04, , ,
, , , , ,5555556, 79633, -68.1 , ,
, , , , ,5555559, , , ,
, , , , ,5555561, , , ,
, , , , ,5555568, , , ,
, , , , ,5555569, , , ,
0,0,0,0,13333845222, -0.144383, 0.01, 0, stringB, 98673890

so it looks like it didn't delete the first four columns of the data. It looks like it just replaced the data at the top and bottom of each "block" of data with zeros, and the number of commas on each row stayed the same.

My main goal here is to reduce the file size by removing columns of data I don't need, so having the zeros messes that up.

these files are too large to pull into Excel, but if I could, I would simply highlight the columns I don't want and delete them. That is the functionality I want from this script.

thanks sooooo much guys for your help!

Tabitha
 
Old 09-21-2012, 05:26 PM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
Well.. the posted example clarifies that the fields are comma-separated (not space-separated), so with a slight modification the suggested code should work as expected. That said, I'd like to point out that you repeated the typo (?) in your line of code, that is:
Code:
$1=$2=$3=$4""
that should be instead:
Code:
$1=$2=$3=$4=""
At this point, it should be clear why you got zeros in place of the first three fields: the fourth field was 0, and the (erroneous) statement assigned the value of $4 (followed by an empty string) to $1, $2 and $3. In the corrected statement, the empty string is assigned to $1, $2, $3 and $4.

The rest of the code serves to clean up the unwanted delimiters (if any). However, I still have some doubts about what the desired output should look like. Can you please post an example of the output based on the sample lines posted above?

Here I copy/paste your input into CODE tags, which preserve the true spacing of the file:
Code:
16987423, 1960, 103, 0, 1333275624, -3.6742382, 0.01, 0, stringA, 56723867
        ,     ,    ,  , 67332,5555555,        , 0.04,  ,        ,
        ,     ,    ,  ,     ,5555556, 79633   , -68.1  ,        , 
        ,     ,    ,  ,     ,5555559,         ,        ,        ,
        ,     ,    ,  ,     ,5555561,         ,        ,        ,
        ,     ,    ,  ,     ,5555562,         ,        ,        ,        
        ,     ,    ,  ,     ,5555569,         ,        ,        ,
58767422, 2101, 965, 0, 13333845222, -0.144383, 0.01, 0, stringB, 98673890
Please do the same when posting the desired output, since the exact spacing is very relevant for suggesting suitable awk code. Thanks.
 
Old 09-21-2012, 06:33 PM   #8
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
so I'm guessing that the comma delimiter gets stipulated in this part of the script

Code:
gsub(/  +/,OFS)
and that this was the version for space-separated files?

I'm thinking that
Code:
gsub(/,  +/,OFS)
would be used for comma-separated?

Tabby

Last edited by atjurhs; 09-21-2012 at 06:34 PM.
 
Old 09-22-2012, 03:35 AM   #9
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
Code:
gsub(/  +/,OFS)
This substitutes every sequence of two or more spaces (that resulted from reducing some fields to empty strings in a file whose field separators are blank spaces) with the actual value of the Output Field Separator.

In a true CSV file, e.g.
Code:
field 1,field 2,field 3,field 4,field 5,field 6,field 7,field 8,field 9
suppose we want to remove fields 1, 2, 4, 6, 7 and 9, e.g.
Code:
$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; print}' file
,,field 3,,field 5,,,field 8,
From the obtained output we then want to replace the runs of multiple commas in the interior of the line with a single comma AND remove the remaining commas at the edges. For the first task we do:
Code:
$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); print}' file
,field 3,field 5,field 8,
and for the second one we add another gsub call:
Code:
$ awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print}' file
field 3,field 5,field 8
My doubt is that in your file you have some null fields made of blank spaces (even in the 5th column), hence I don't really know what the exact output should be (that is why I asked). Anyway, the suggested code using comma as the separator should give you a clue and maybe lead to something close to the desired output. Hope this clarifies things a little.

Last edited by colucix; 09-22-2012 at 03:36 AM.
 
Old 09-23-2012, 09:50 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Quote:
Originally Posted by colucix View Post
Actually the loop is not really necessary, since awk accepts multiple arguments and it's able to process all the files in sequence.
I think this depends on the total number of files passed. The command line has a limit on the length of the command it can run, and globbing patterns are expanded before the command is executed. If a pattern matches too many files, the expanded command can exceed that limit.

A safer bet would be to use find with a while+read loop and null separators.

Code:
while IFS='' read -r -d '' fname; do

	awk '....' "$fname"

done < <( find . -maxdepth 1 -type f -name '*409.txt' -print0 )
Actually, perhaps a better option overall would be to set up your awk command as a standalone script, and run it directly with find.

script.awk:
Code:
#!/usr/bin/awk -f
....your awk commands go here...
Make it executable and use it like this:
Code:
find . -maxdepth 1 -type f -name '*409.txt' -exec script.awk '{}' +
The + at the end means it will send as many files to the script at once as the system will allow. This makes it as efficient as possible within the limits of the system. Note that the script has to be able to handle the multiple input arguments, though that shouldn't be a problem with awk.
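A filled-in sketch of that setup. The script body here (dropping the first two comma-separated fields and writing each result to FILENAME.processed, as suggested earlier in the thread) is only an illustrative example, as is the sample input file; it assumes awk lives at /usr/bin/awk:

```shell
# Write an executable standalone awk script.
cat > script.awk <<'EOF'
#!/usr/bin/awk -f
BEGIN { FS = OFS = "," }
{
    $1 = $2 = ""              # empty the unwanted fields
    gsub(/,,+/, OFS)          # collapse the comma runs left behind
    gsub(/^,+|,+$/, "")       # trim commas at the line edges
    print > (FILENAME ".processed")
}
EOF
chmod +x script.awk

# Sample input, then hand every matching file to the script at once.
printf 'a,b,c,d\n' > demo409.txt
find . -maxdepth 1 -type f -name '*409.txt' -exec ./script.awk '{}' +
```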

Last edited by David the H.; 09-23-2012 at 09:51 AM.
 
Old 09-23-2012, 10:44 AM   #11
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Awk can be confusing to debug. Anyway, in bash you could make it work like this:
Code:
#!/bin/bash

# Change columns here. This is repetitive within the loop, but just for convenience.
REMOVE=(1 2 3 4)

# Extension of output file's name.
OUTPUTEXT='out'

IFS=','

for FILE; do
	while read LINE; do
		read -a FIELDS <<< "${LINE//, /,}"
		for I in "${REMOVE[@]}"; do
			unset "FIELDS[$I]"
		done
		LINE="${FIELDS[*]}"
		echo "${LINE//,/, }"
	done < "$FILE" > "$FILE.$OUTPUTEXT"
done
bash script.sh file1 file2 ...
 
Old 09-24-2012, 12:14 PM   #12
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
colucix, I ran your last script
Code:
awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print}' file
field 3,field 5,field 8
and it said it could not find the file ???

so I tried
Code:
awk 'BEGIN{FS=OFS=","}{$1=$2=$4=$6=$7=$9=""; gsub(/,,+/,OFS); gsub(/^,+|,+$/,""); print > ( FILENAME ".processed" ) }' *409.txt
and the script found the file and ran, BUT the header and 1st row of data were shifted in the output file; see below.

After removing all spaces, and all but one comma per column, so that it is a "true" *.csv file like field1,field2,field3, etc.:

column 1 has the header and first row of data from column 3 then from row 2 till the EOF it has column 1 data
column 2 has the header and first row of data from column 5 then from row 2 till the EOF it has column 2 data
column 3 has the header and first row of data from column 8 then from row 2 till the EOF it has column 3 data
column 4 has the header and first row of data from column 10 then from row 2 till the EOF it has column 4 data
column 5 has the header and first row of data from column 11 then from row 2 till the EOF it has column 5 data
column 6 has the header and first row of data from column 12 then from row 2 till the EOF it has column 6 data

so essentially the script only worked on the first row of data, but it did not delete any of the data columns below the first row.

the rest of the columns have the same pattern of being 6 columns off in the header and first row of data.

Last edited by atjurhs; 09-24-2012 at 05:10 PM. Reason: better clarification of output file
 
Old 09-24-2012, 03:06 PM   #13
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
konsolebox

For consistency in the data file I am using, I removed the same columns as colucix did, i.e. 1 2 4 6 7 9. Here are the results of running your bash script on the file.

After removing all spaces, and all but one comma per column, so that it is a "true" *.csv file like field1,field2,field3, etc.:

column 1 is identical to the input file's column 1
column 2 has the header and first row of data from column 4 then from row 2 till the EOF it has column 2 data
column 3 has the header and first row of data from column 6 then from row 2 till the EOF it has column 3 data
column 4 has the header and first row of data from column 9 then from row 2 till the EOF it has column 4 data
column 5 has the header and first row of data from column 11 then from row 2 till the EOF it has column 5 data
column 6 has the header and first row of data from column 12 then from row 2 till the EOF it has column 6 data

so essentially the script only worked on the first row of data, but it was off in index by 1, and it did not delete any of the data columns below the first row.

the rest of the columns have the same pattern of being 6 columns off in the header and first row of data

Last edited by atjurhs; 09-24-2012 at 05:10 PM. Reason: better clarification of output file
 
Old 09-24-2012, 08:20 PM   #14
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235Reputation: 235Reputation: 235
@atjurhs: With respect to the results, could you give me a sample of the input and the output?

Last edited by konsolebox; 09-24-2012 at 08:22 PM.
 
Old 09-24-2012, 09:58 PM   #15
atjurhs
Member
 
Registered: Aug 2012
Posts: 168

Original Poster
Rep: Reputation: Disabled
All the data, both input and output, is on a private LAN, so I'll do my best to type out something that reflects both.

the input data is very well depicted in post number 7

and here's an example of the output data. It has varying amounts of spaces and commas between the columns that have data. Some of the columns contain no data and are just separated by multiple spaces and commas. Other columns do have data, which may or may not be separated by multiple spaces or commas.

1, 4242, 3.42323e+23, 0.1, 0, 0,5,294875, 8438393, 394,,,,,,,,
,0, ,0,,,, 0.487564, , ,0, 0,0, 87563,,,,,,,,, , 0 ,
,1, ,,,,,,,,,,,,,,,, 0, , , , 5,
,1, ,,,,,,5241,,,,, , , 0.4543e-3 , 0 , 111111111,
1, 1000,,,, 9576336e+10, 0.1, 0, 0, ,,, , 8438393, 001,

If by hand I delete all the spaces, and replace every run of two or more consecutive commas with a single comma, then the data looks like the sample below. Note that the 1st and 5th rows are missing an initial comma, or you could say that the 2nd through 4th rows have an extra comma at the front.

rc11,rc12,rc13,rc14,rc15, etc.
,rc21,rc22,rc23,rc24,rc25, etc.
,rc31,rc32,rc33,rc34,rc35, etc.
,rc41,rc42,rc43,rc44,rc45, etc.
rc51,rc52,rc53,rc54,rc55, etc.
etc.
etc.
etc.

I really hope this helps you! and REALLY appreciate you helping me!

Tabitha
 