LinuxQuestions.org
Linux - Newbie: This Linux forum is for members who are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!

Old 11-26-2018, 07:21 PM   #1
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Rep: Reputation: Disabled
Linux command line for large data set?


I am having an issue with a very large data set and am looking for advice on how to approach it. Thank you for your help.

The problem: I have a large Excel file of data with over 1,000 columns and over 40,000 rows. I need to identify the rows in which any cell has a value greater than 199, and delete every row that has no such cell, so that I am left with only the rows where at least one cell has a value >199.

I also have the same data as a text file, so I was thinking the best approach might be to use the Linux command line rather than Excel (which is bulky to work with given the number of rows and columns). But I am a novice at Linux and awk, so I was looking for general advice on how to approach this.

Thanks so much

PS
Example image of the data set attached. Here I would want only the rows that have highlighted cells (because those are >200), but I can't just use the sort function or complicated if/then statements, because with so many columns in my data set that is too time-consuming...

Last edited by annarizzi; 11-26-2018 at 07:36 PM.
 
Old 11-26-2018, 08:12 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,135

Rep: Reputation: 2935
The attachment is not there, probably because you don't have enough posts. Use the text file - presumably comma-delimited. Try something like this.
Code:
awk -F"," '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt
It stops processing each record once it finds a field above 199 - more efficient. Creates a new file, so the original is untouched.
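As a quick sanity check (not part of the thread), the same one-liner run on a small made-up CSV keeps exactly the rows with a field above 199; sample.txt and sample_filtered.txt are placeholder names, not the OP's files:

```shell
# Four made-up rows; only the 2nd and 4th contain a value above 199.
printf '10,20,30\n5,250,7\n1,2,3\n100,300,50\n' > sample.txt

awk -F"," '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' sample.txt > sample_filtered.txt

cat sample_filtered.txt
# 5,250,7
# 100,300,50
```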
 
1 members found this post helpful.
Old 11-27-2018, 01:52 AM   #3
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
The kind of solution provided by syg00 is certainly what you need, but you may have to replace -F"," with something else like -F";"...
Please tell us what delimiter is used in your text file (or post the first 3 rows in [CODE] tags) so we can see it for ourselves.
 
Old 11-27-2018, 03:57 AM   #4
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
It is tab-delimited.
I will try it out and see if it works.
Thanks
 
Old 11-27-2018, 04:53 AM   #5
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
OK, so you definitely need to change "-F", but I'll let you work out how.
 
Old 11-28-2018, 07:41 PM   #6
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
Linux command line for large data set?

Hi syg00

so I ran the code that you gave me (altering it for my tab-delimited file) and I think it did the opposite of what I wanted.

awk -F"/t" '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt

I want to KEEP any row that has at least one cell with a value >199, and REMOVE any row without at least one such cell. Does that make sense? I think the way you wrote the script, it is actually filtering out the rows I want (rows with >199) and leaving me with only rows with values <199. What needs to be altered to keep the rows I want?

Thanks so much for your help!

Last edited by annarizzi; 11-28-2018 at 07:43 PM.
 
Old 11-28-2018, 08:01 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,135

Rep: Reputation: 2935
You have the field separator incorrect, so the entire line (record) will be treated as one value. Change the specification as below - note the backslash.
Code:
awk -F"\t" '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt
 
Old 11-29-2018, 07:04 PM   #8
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
I don't know what is going on, but I fixed the code, reran it, and it still isn't working.
As a reminder, I have about 46,000 rows of data and roughly 2,000 columns. I want to scan each row for a cell >199; if one is present, I want to keep that entire row, and if absent, delete that entire row. So each row in the output should have at least one value >199...

I ran the following:

awk -F"\t" '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt


and I got a sheet of 40,000 rows with <199 in every cell.

Then I ran:

awk '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt


and I got the same result: 40,000 rows with <199 in every cell.


Then I ran:

awk '{for (i=1;i<=NF;i++) {if ($i > 199) {print i ""$i}}}' input.txt > filtered.txt


and that didn't work at all; that was advice from another student.


So I am confused, because I think the first two examples are running but giving me the opposite of what I want, which is strange because the script looks correct. Any thoughts would be appreciated... I have also tried $i > 2000, which gives the same result, and I have tried flipping it to $i < 199, which also doesn't change anything...
 
Old 11-29-2018, 07:38 PM   #9
berndbausch
Senior Member
 
Registered: Nov 2013
Location: Tokyo
Distribution: Redhat/Centos, Ubuntu, Raspbian, Fedora, Alpine, Cirros, OpenSuse/SLES
Posts: 3,456

Rep: Reputation: 905
I can't resist an awk question.
Code:
$ cat input.txt
32 6543 87 089 564 32
12 54 67 98 32
a b c d e
$ awk '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt
32 6543 87 089 564 32
a b c d e
The awk program is copied and pasted from your last post without any change. You can see that the program works when the columns contain numbers.
When a column contains a string, awk converts the number 199 to the string "199" and performs a string comparison (reference).

Perhaps your cells contain non-printable characters that turn the numbers into text. You can check this as follows:
Code:
awk '{ for (i=1;i<=NF;i++) print $i " " typeof($i) }' <input.txt | more
Numeric columns have a type of strnum, strings have a type of string.
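The lexical-comparison trap is easy to reproduce; in this made-up example the "x" is just a stand-in for whatever stray non-printable character may be in the real file:

```shell
# "90" is recognized as a number, so 90 > 199 is evaluated numerically.
echo '90'  | awk '{ if ($1 > 199) print "kept"; else print "dropped" }'
# dropped

# "90x" is a string, so awk compares "90x" > "199" character by character
# ("9" sorts after "1"), and the row is wrongly kept.
echo '90x' | awk '{ if ($1 > 199) print "kept"; else print "dropped" }'
# kept
```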

Last edited by berndbausch; 11-29-2018 at 07:40 PM.
 
Old 11-29-2018, 07:43 PM   #10
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,135

Rep: Reputation: 2935
Let's see the data - well, a usable subset anyway. Run this and post the output - all of it, as-is. If the data are restricted, don't.
Code:
head -n 3 input.txt | sed -nr 's/^(.{80}).*/\1/p' | cat -T
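For anyone following along: head takes the first 3 lines, sed keeps only the first 80 characters of each, and cat -T (GNU coreutils) renders each tab as ^I so the delimiter becomes visible. A tiny made-up example of the last step:

```shell
# cat -T makes tabs visible as the two characters ^I.
printf '12\t500\t3\n' | cat -T
# 12^I500^I3
```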
 
Old 11-30-2018, 04:00 AM   #11
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
Quote:
Originally Posted by berndbausch View Post
Perhaps your cells contain non-printable characters that turn the numbers into text. You can check this as follows:
Code:
awk '{ for (i=1;i<=NF;i++) print $i " " typeof($i) }' <input.txt | more
Numeric columns have a type of strnum, strings have a type of string.
NB: typeof() is only available from GNU Awk >=4.2

@annarizzi: please use [CODE] tags (see my sig)
 
Old 11-30-2018, 07:17 AM   #12
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
I think I have a simple solution. You need to force numeric comparison with the following:
Code:
awk -F"\t" '{ for (i=1;i<=NF;i++) {if ($i+0 > 199) {print; next}} }' input.txt > filtered.txt
 
1 members found this post helpful.
Old 11-30-2018, 03:51 PM   #13
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by l0f4r0 View Post
I think I have a simple solution. You need to force numeric comparison with the following:
Code:
awk -F"\t" '{ for (i=1;i<=NF;i++) {if ($i+0 > 199) {print; next}} }' input.txt > filtered.txt
YIPPEE, l0f4r0, you SOLVED IT! Thank you so, so much! This is brilliant!
Have a great weekend!!!

Happy coding-
Anna
 
  

