LinuxQuestions.org
Old 09-26-2012, 12:38 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Rep: Reputation: Disabled
deleting every other row, sometimes


Hi guys,

I have some REALLY big files that I am trying to make smaller by deleting a bunch of unwanted rows of data. Below is what the data file looks like, only the real files are much, much larger, maybe 200-900 MB in size.

For ease of reference, I listed line numbers in the far left column of this example.

Code:
1  header_1,header_2,header_3,header4,header5,header6,header7,header8,header9
2  546 ,564871239 ,0.3654 ,1234567 ,1  ,-36592672 ,1 ,0 ,4856730000
3      ,          ,5398   ,        ,   ,          ,1 ,0 ,  
4  546 ,999999999 ,0.3654 ,1234567 ,0  ,-36592672 ,1 ,0 ,1.27819321  
5      ,          ,10101  ,        ,   ,          ,1 ,0 ,
6  546 ,-0.000001 ,564871239 ,0.3654 ,1234567 ,0  ,-36592672 ,1 ,0 ,1.27819321  
7      ,          ,9613   ,        ,   ,          ,1 ,0 ,
8  546 ,564829411 ,0.3654 ,1234567 ,0  ,-36592672 ,1 ,0 ,1.27819321    
9      ,          ,65764  ,        ,   ,          ,1 ,0 ,
10 546 ,321765987 ,0.3654 ,-999999 ,0  ,-36592672 ,1 ,0 ,1.27819321 
11     ,          ,5398   ,        ,1  ,          ,1 ,0 ,
12 546 ,123456789 ,0.3654 ,7810011 ,0  ,-36592672 ,1 ,0 ,99.9999999  
13     ,          ,5398   ,        ,   ,          ,1 ,0 ,  
14 546 ,564871239 ,0.3656 ,1234567 ,0  ,-36592672 ,1 ,0 ,1.27819321  
15 1234567 ,      ,6061   ,        ,   ,          ,  ,  ,12345
I want to get rid of those rows that are mainly filled with commas, BUT if that row's column 3 has the specific value, then I want to keep the entire row of numbers in the row just above it. So let's say my specified value is 5398. Then using the input file above I should get lines 2, 10, and 12, and the output should look like this:

Code:
header_1,header_2,header_3,header4,header5,header6,header7,header8,header9
546 ,564871239 ,0.3654 ,1234567 ,1  ,-36592672 ,1 ,0 ,4856730000
546 ,321765987 ,0.3654 ,-999999 ,0  ,-36592672 ,1 ,0 ,1.27819321 
546 ,123456789 ,0.3654 ,7810011 ,0  ,-36592672 ,1 ,0 ,99.9999999
What I'm thinking is that the tool runs down the 3rd column and looks for the specified value. Once it finds 5398 it knows what line it is on, so it can skip down to the EOF, then jump back up to the top of the file and read down the file again looking for that line number minus 1, then write that line off to another file. Oh, and I also want to keep the header line.

but there are probably better ways to do this???

thanks soooo much for whatever help you guys have!

Tabitha
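
The two-pass idea described in this post can be sketched with awk reading the same file twice. This is only a rough sketch, assuming a comma-separated file with space-padded fields and no line numbers in the data; the function name two_pass_filter and its arguments are made up for illustration and do not come from the thread.

```shell
# Hypothetical helper (not from the thread): two passes over the same file.
# Usage: two_pass_filter VALUE FILE
two_pass_filter() {
    awk -F',' -v want="$1" '
        NR == FNR {                       # first pass: note matching line numbers
            ref = $3
            gsub(/[ \t]/, "", ref)        # strip padding from the 3rd field
            if (FNR > 2 && (ref "") == (want "")) keep[FNR - 1] = 1
            next
        }
        FNR == 1 || (FNR in keep)         # second pass: header and remembered rows
    ' "$2" "$2"
}
```

Something like `two_pass_filter 5398 data.csv > data.csv.out` would write the header plus each row sitting directly above a match. Reading a 200-900 MB file twice costs extra time, which is one reason to look for a single-pass approach.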

Last edited by atjurhs; 09-26-2012 at 12:55 PM.
 
Old 09-26-2012, 01:05 PM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,472

Rep: Reputation: 2858
I am not sure I follow as you have asked for 2 things and then ignored 1?

Initially you said:
Quote:
I want to get rid of those rows that are mainly filled with commas
And then the second criterion was:
Quote:
BUT if that row's column 3 has the specific value, then I want to keep the entire row of numbers in the row just above it.
The rest of your example only refers to the second criterion. Does this mean the first is no longer required?

Potentially this can be solved with something like awk, but we need to confirm your requirements before continuing.

Also, what have you tried by way of getting the desired output with your smaller example?
 
Old 09-26-2012, 01:32 PM   #3
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
Yeah, I think you are right: there's no need to delete anything as long as the "wanted" data gets written off to an output file, so it's more of the second case.

please let me restate the problem.

In the example above, I want to look at the 3rd column of an odd-numbered row (except for the header line, these are the rows that are mostly filled with commas). If the value in that odd row's 3rd column is a match for the specified value (in this example 5398), then I want to save off the entire even-numbered row immediately preceding the odd-numbered row that contained 5398 in the 3rd column. That's how I came up with the solution of rows 2, 10, and 12, along with the header, being written off to an output file.

Hopefully, I did a better job of explaining the problem this time?

Taby
 
Old 09-26-2012, 01:56 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,472

Rep: Reputation: 2858
That does seem clearer. What efforts have you made to solve this problem yourself and where are you getting stuck?

General solution would be:

1. Print header
2. Save each line read in
3. If current line has 3rd field equal to value, print saved value (ie previous line)
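
The three steps above can be sketched in awk. This is a minimal sketch only, assuming a comma-separated file with space-padded fields like the example (without the line numbers); the function name keep_rows_before_match and its arguments are invented for illustration.

```shell
# Hypothetical helper (not from the thread): print the header, then every
# row whose following row has VALUE in column 3.
# Usage: keep_rows_before_match VALUE FILE
keep_rows_before_match() {
    awk -F',' -v want="$1" '
        NR == 1 { print; next }           # 1. print header
        {
            ref = $3
            gsub(/[ \t]/, "", ref)        # strip padding from the 3rd field
            # 3. current 3rd field equals the value: print the saved line
            if (NR > 2 && (ref "") == (want "")) print prev
            prev = $0                     # 2. save each line read in
        }
    ' "$2"
}
```

Only one previous line is held in memory at a time, so the size of the input file should not matter much.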
 
3 members found this post helpful.
Old 09-26-2012, 06:31 PM   #5
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
So here's the modified code for that:
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 3 ]] || {
	echo "You need bash version 3.0 or higher to run this script."
	exit 1
}

# Set columns to keep (columns starts from 1)
CONFIG_KEEP=(1 2 4 6 7 9)
CONFIG_REF=3
CONFIG_REF_VALUE=1234567

# Extension of output file's name.
OUTPUTEXT='out'

IFS=','

KEEP=()
for I in "${CONFIG_KEEP[@]}"; do
	(( J = I - 1 ))
	KEEP[J]=$J
done

function get_valid_fields {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	for I in "${!FIELDS[@]}"; do
		[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
	done
}

function get_valid_fields_and_ref {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	read REF <<< "${FIELDS[CONFIG_REF - 1]}"
	for I in "${!FIELDS[@]}"; do
		[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
	done
}

for FILE; do
	if read LINE; then
		get_valid_fields
		echo "${FIELDS[*]}"
		if read LINE; then
			get_valid_fields
			HOLD="${FIELDS[*]}"
			while read LINE; do
				get_valid_fields_and_ref
				[[ $REF = "$CONFIG_REF_VALUE" ]] && echo "$HOLD"
				HOLD="${FIELDS[*]}"
			done
		fi
	fi < "$FILE" > "$FILE.$OUTPUTEXT"
done
Quote:
Originally Posted by grail
General solution would be:

1. Print header
2. Save each line read in
3. If current line has 3rd field equal to value, print saved value (ie previous line)
Agreed.
 
Old 09-26-2012, 07:11 PM   #6
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
If we are only to take the reference value from odd-numbered lines, it could be done this way:
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 3 ]] || {
	echo "You need bash version 3.0 or higher to run this script."
	exit 1
}

# Set columns to keep (columns starts from 1)
CONFIG_KEEP=(1 2 4 6 7 9)
CONFIG_REF=3
CONFIG_REF_VALUE=1234567

# Extension of output file's name.
OUTPUTEXT='out'

IFS=','

KEEP=()
for I in "${CONFIG_KEEP[@]}"; do
	(( J = I - 1 ))
	KEEP[J]=$J
done

function get_valid_fields {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	for I in "${!FIELDS[@]}"; do
		[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
	done
}

function get_valid_ref {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	read REF <<< "${FIELDS[CONFIG_REF - 1]}"
}

for FILE; do
	if read LINE; then
		get_valid_fields
		echo "${FIELDS[*]}"
		LINE_NO=1
		while read LINE; do
			if (( (++LINE_NO % 2) == 0 )); then
				get_valid_fields
				HOLD="${FIELDS[*]}"
			else
				get_valid_ref
				[[ $REF = "$CONFIG_REF_VALUE" ]] && echo "$HOLD"
			fi
		done
	fi < "$FILE" > "$FILE.$OUTPUTEXT"
done
Although, comparing the outputs with the former script, I don't think referring only to odd-numbered lines is right. But I'm not sure about that.

---- Add ----
Another way:
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 3 ]] || {
	echo "You need bash version 3.0 or higher to run this script."
	exit 1
}

# Set columns to keep (columns starts from 1)
CONFIG_KEEP=(1 2 4 6 7 9)
CONFIG_REF=3
CONFIG_REF_VALUE=1234567

# Extension of output file's name.
OUTPUTEXT='out'

IFS=','

KEEP=()
for I in "${CONFIG_KEEP[@]}"; do
	(( J = I - 1 ))
	KEEP[J]=$J
done

function get_valid_fields {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	for I in "${!FIELDS[@]}"; do
		[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
	done
}

function get_valid_ref {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	read REF <<< "${FIELDS[CONFIG_REF - 1]}"
}

for FILE; do
	if read LINE; then
		get_valid_fields
		echo "${FIELDS[*]}"
		for (( ;; )); do
			read LINE || break
			get_valid_fields
			HOLD="${FIELDS[*]}"
			read LINE || break
			get_valid_ref
			[[ $REF = "$CONFIG_REF_VALUE" ]] && echo "$HOLD"
		done
	fi < "$FILE" > "$FILE.$OUTPUTEXT"
done

Last edited by konsolebox; 09-26-2012 at 07:37 PM.
 
Old 09-26-2012, 07:45 PM   #7
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Another way is to skip each reference line that matches, so that it is never held for printing later.
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 3 ]] || {
	echo "You need bash version 3.0 or higher to run this script."
	exit 1
}

# Set columns to keep (columns starts from 1)
CONFIG_KEEP=(1 2 4 6 7 9)
CONFIG_REF=3
CONFIG_REF_VALUE=1234567

# Extension of output file's name.
OUTPUTEXT='out'

IFS=','

KEEP=()
for I in "${CONFIG_KEEP[@]}"; do
	(( J = I - 1 ))
	KEEP[J]=$J
done

function get_valid_fields {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	for I in "${!FIELDS[@]}"; do
		[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
	done
}

function get_valid_fields_and_ref {
	FIELDS=()
	read -a FIELDS <<< "$LINE"
	read REF <<< "${FIELDS[CONFIG_REF - 1]}"
	for I in "${!FIELDS[@]}"; do
		[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
	done
}

for FILE; do
	if read LINE; then
		get_valid_fields
		echo "${FIELDS[*]}"
		if read LINE; then
			get_valid_fields
			HOLD="${FIELDS[*]}"
			while read LINE; do
				get_valid_fields_and_ref
				if [[ $REF = "$CONFIG_REF_VALUE" ]]; then
					echo "$HOLD"
					read LINE || break
					get_valid_fields
				fi					
				HOLD="${FIELDS[*]}"
			done
		fi
	fi < "$FILE" > "$FILE.$OUTPUTEXT"
done
 
Old 09-27-2012, 03:22 PM   #8
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
oooops guys, I goofed up, hopefully it won't be a big change ...

so let me restate the problem again:

In the example above, I want to look at the 3rd column of an odd-numbered row (except for the header line, these are the rows that are mostly filled with commas). If the value in that row's 3rd column is a match for the specified value (in this example 5398), then I want to save off the entire even-numbered row immediately preceding the odd-numbered row that contained 5398 in the 3rd column AND I also need to save off the odd-numbered row that contained 5398. So in the example above, the output that I need is

Code:
header_1,header_2,header_3,header4,header5,header6,header7,header8,header9
2  546 ,564871239 ,0.3654 ,1234567 ,1  ,-36592672 ,1 ,0 ,4856730000
3      ,          ,5398   ,        ,   ,          ,1 ,0 ,
10 546 ,321765987 ,0.3654 ,-999999 ,0  ,-36592672 ,1 ,0 ,1.27819321
11     ,          ,5398   ,        ,1  ,          ,1 ,0 ,
12 546 ,123456789 ,0.3654 ,7810011 ,0  ,-36592672 ,1 ,0 ,99.9999999
13     ,          ,5398   ,        ,   ,          ,1 ,0 ,
sorry that I goofed things up

Last edited by atjurhs; 09-27-2012 at 03:23 PM.
 
Old 09-27-2012, 06:26 PM   #9
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Well it's ok. There was also a bug in the former posts.
Code:
#!/bin/bash

[[ BASH_VERSINFO -ge 3 ]] || {
	echo "You need bash version 3.0 or higher to run this script."
	exit 1
}

# Set columns to keep (columns starts from 1)
CONFIG_KEEP=(1 3 6 7 9)
CONFIG_REF=3
CONFIG_REF_VALUE=5398
CONFIG_TRIM=true # or false

# Extension of output file's name.
OUTPUTEXT='out'

IFS=','

KEEP=()
for I in "${CONFIG_KEEP[@]}"; do
	(( J = I - 1 ))
	KEEP[J]=$J
done

if [[ $CONFIG_TRIM = true ]]; then
	shopt -s extglob

	function get_valid_fields {
		FIELDS=()
		read -a FIELDS <<< "$LINE"
		for I in "${!FIELDS[@]}"; do
			if [[ -z ${KEEP[I]} ]]; then
				unset "FIELDS[$I]"
			else
				FIELDS[I]=${FIELDS[I]##+([[:blank:]])}
				FIELDS[I]=${FIELDS[I]%%+([[:blank:]])}
			fi
		done
	}

	function get_valid_fields_and_ref {
		FIELDS=()
		read -a FIELDS <<< "$LINE"
		IFS=$' \t' read REF <<< "${FIELDS[CONFIG_REF - 1]}"
		for I in "${!FIELDS[@]}"; do
			if [[ -z ${KEEP[I]} ]]; then
				unset "FIELDS[$I]"
			else
				FIELDS[I]=${FIELDS[I]##+([[:blank:]])}
				FIELDS[I]=${FIELDS[I]%%+([[:blank:]])}
			fi
		done
	}
else
	function get_valid_fields {
		FIELDS=()
		read -a FIELDS <<< "$LINE"
		for I in "${!FIELDS[@]}"; do
			[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
		done
	}

	function get_valid_fields_and_ref {
		FIELDS=()
		read -a FIELDS <<< "$LINE"
		IFS=$' \t' read REF <<< "${FIELDS[CONFIG_REF - 1]}"
		for I in "${!FIELDS[@]}"; do
			[[ -z ${KEEP[I]} ]] && unset "FIELDS[$I]"
		done
	}
fi


for FILE; do
	if read LINE; then
		get_valid_fields
		echo "${FIELDS[*]}"
		for (( ;; )); do
			read LINE || break
			get_valid_fields
			EVEN="${FIELDS[*]}"
			read LINE || break
			get_valid_fields_and_ref
			if [[ $REF = "$CONFIG_REF_VALUE" ]]; then
				echo "$EVEN"
				echo "${FIELDS[*]}"
			fi
		done
	fi < "$FILE" > "$FILE.$OUTPUTEXT"
done
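
One visible difference from the earlier scripts is that the reference field is now read with IFS set to space and tab, so padding such as the spaces after 5398 is stripped before the comparison. A small sketch of that idea in isolation, written for plain sh rather than bash; the name trim_blanks is hypothetical and is not part of the script above.

```shell
# Hypothetical helper (not from the thread): strip leading and trailing
# spaces and tabs from one value, using read with IFS set to space and tab.
trim_blanks() {
    tab=$(printf '\t')
    IFS=" $tab" read -r trimmed <<EOF
$1
EOF
    printf '%s\n' "$trimmed"
}
```

With this, `trim_blanks '5398   '` prints 5398, which then compares equal to the configured value, whereas the padded string would not.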
 
Old 09-28-2012, 09:46 AM   #10
atjurhs
Member
 
Registered: Aug 2012
Posts: 190

Original Poster
Rep: Reputation: Disabled
WOW! There's a bit of looping going on in that.

unfortunately our server went down, so there'll be no working on this little project till Monday
 
Old 09-28-2012, 11:07 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,472

Rep: Reputation: 2858
Or a little awk:
Code:
awk -F" *, *" 'NR==1;$3 == 5398{print x RT $0}{x=$0}' file
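
Note that RT is a gawk variable (the text that matched the record separator), so with other awk implementations it is an empty string and the two rows would be joined on one line. A hedged, more portable sketch of the same idea follows; it also limits the test to odd-numbered rows, as in the restated problem. The function name keep_pairs and its arguments are made up for illustration, and the data is assumed to have no line numbers in it.

```shell
# Hypothetical helper (not from the thread): print the header, then each
# odd-numbered row with VALUE in column 3 together with the row above it.
# Usage: keep_pairs VALUE FILE
keep_pairs() {
    awk -F' *, *' -v want="$1" '
        NR == 1 { print; next }                          # header
        NR % 2 == 1 && $3 == want { print prev; print }  # pair: row above, then match
        { prev = $0 }                                    # remember the current row
    ' "$2"
}
```

For example, `keep_pairs 5398 file > file.out` would be one way to produce the header followed by the matching pairs of rows.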
 
1 members found this post helpful.
  


All times are GMT -5.
