LinuxQuestions.org
Linux - Newbie: This Linux forum is for members who are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!

Old 11-26-2018, 07:21 PM   #1
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Rep: Reputation: Disabled
Linux command line for large data set?


I am having an issue with a very large data set and am looking for advice on how to approach it. Thank you for your help.

The problem: I have a large Excel file of data with over 1,000 columns and over 40,000 rows. I need to identify the rows in which any cell has a value greater than 199, and delete every row that has no such cell, so that I am left with only the rows where at least one cell has a value >199.

I also have the same data as a text file, so I was thinking the best approach might be to use the Linux command line rather than Excel (which is bulky to work with given the number of rows and columns). But I am a novice at Linux and awk, so I was looking for general advice on how to approach this.

Thanks so much

PS
Example image of the data set attached. Here I would want only the rows that have highlighted cells (because those are >200), but I can't just use the sort function or complicated if/then statements, because with so many columns in my data set that is too time-consuming...

Last edited by annarizzi; 11-26-2018 at 07:36 PM.
 
Old 11-26-2018, 08:12 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,135

Rep: Reputation: 2935
The attachment is not there, probably because you don't have enough posts. Use the text file - presumably comma-delimited. Try something like this.
Code:
awk -F"," '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt
It stops processing each record once it finds a field above 199 - more efficient. Creates a new file, so the original is untouched.
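As a quick sanity check (not part of the thread), the same one-liner run on a small made-up CSV keeps exactly the rows with a field above 199; sample.txt and sample_filtered.txt are placeholder names, not the OP's files:

```shell
# Four made-up rows; only the 2nd and 4th contain a value above 199.
printf '10,20,30\n5,250,7\n1,2,3\n100,300,50\n' > sample.txt

awk -F"," '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' sample.txt > sample_filtered.txt

cat sample_filtered.txt
# 5,250,7
# 100,300,50
```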
 
1 members found this post helpful.
Old 11-27-2018, 01:52 AM   #3
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
The kind of solution provided by syg00 is certainly what you need, but you may have to replace -F"," with something else like -F";"...
Please tell us what delimiter is used in your text file (or post the first 3 rows in [CODE] tags) so we can see it for ourselves.
 
Old 11-27-2018, 03:57 AM   #4
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
It is tab-delimited.
I will try it out and see if it works.
Thanks
 
Old 11-27-2018, 04:53 AM   #5
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
OK, so you definitely need to change "-F", but I'll let you work out how.
 
Old 11-28-2018, 07:41 PM   #6
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
Linux command line for large data set?

Hi syg00

so I ran the code that you gave me (altering it for my tab-delimited file) and I think it did the opposite of what I wanted.

awk -F"/t" '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt

I want to KEEP any row that has at least one cell with a value >199, and REMOVE any row without at least one such cell. Does that make sense? I think the way you wrote the script, it is actually filtering out the rows I want (rows with >199) and leaving me with only rows with values <199. What needs to be altered to keep the rows I want?

Thanks so much for your help!

Last edited by annarizzi; 11-28-2018 at 07:43 PM.
 
Old 11-28-2018, 08:01 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,135

Rep: Reputation: 2935
You have the field separator incorrect, so the entire line (record) will be treated as one value. Change the specification as below - note the backslash.
Code:
awk -F"\t" '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt
 
Old 11-29-2018, 07:04 PM   #8
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
I don't know what is going on, but I fixed the code, reran it, and it still isn't working.
As a reminder, I have about 46,000 rows of data and roughly 2,000 columns. I want to scan each row for a cell >199; if one is present, I want to keep that entire row, and if absent, delete that entire row. So each row in the output should have at least one value >199...

I ran the following:

awk -F"\t" '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt


and I got a sheet of 40,000 rows with <199 in every cell.

Then I ran:

awk '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt > filtered.txt


and I got the same result: 40,000 rows with <199 in every cell.


Then I ran:

awk '{for (i=1;i<=NF;i++) {if ($i > 199) {print i ""$i}}}' input.txt > filtered.txt


and that didn't work at all; that was advice from another student.


So I am confused, because I think the first two examples are running but giving me the opposite of what I want, which is strange because the script looks correct. Any thoughts would be appreciated... I have also tried $i > 2000, which gives the same result, and I have tried flipping it to $i < 199, which also doesn't change anything...
 
Old 11-29-2018, 07:38 PM   #9
berndbausch
Senior Member
 
Registered: Nov 2013
Location: Tokyo
Distribution: Redhat/Centos, Ubuntu, Raspbian, Fedora, Alpine, Cirros, OpenSuse/SLES
Posts: 3,456

Rep: Reputation: 905
I can't resist an awk question.
Code:
$ cat input.txt
32 6543 87 089 564 32
12 54 67 98 32
a b c d e
$ awk '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt
32 6543 87 089 564 32
a b c d e
The awk program is copied and pasted from your last post without any change. You can see that the program works when the columns contain numbers.
When a column contains a string, awk converts the number 199 to the string "199" and performs a string comparison (reference).

Perhaps your cells contain non-printable characters that turn the numbers into text. You can check this as follows:
Code:
awk '{ for (i=1;i<=NF;i++) print $i " " typeof($i) }' <input.txt | more
Numeric columns have a type of strnum, strings have a type of string.
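The lexical-comparison trap is easy to reproduce; in this made-up example the "x" is just a stand-in for whatever stray non-printable character may be in the real file:

```shell
# "90" is recognized as a number, so 90 > 199 is evaluated numerically.
echo '90'  | awk '{ if ($1 > 199) print "kept"; else print "dropped" }'
# dropped

# "90x" is a string, so awk compares "90x" > "199" character by character
# ("9" sorts after "1"), and the row is wrongly kept.
echo '90x' | awk '{ if ($1 > 199) print "kept"; else print "dropped" }'
# kept
```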

Last edited by berndbausch; 11-29-2018 at 07:40 PM.
 
Old 11-29-2018, 07:43 PM   #10
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,135

Rep: Reputation: 2935
Let's see the data - well, a usable subset anyway. Run this and post the output - all of it, as-is. If the data are restricted, don't.
Code:
head -n 3 input.txt | sed -nr 's/^(.{80}).*/\1/p' | cat -T
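For anyone following along: head takes the first 3 lines, sed keeps only the first 80 characters of each, and cat -T (GNU coreutils) renders each tab as ^I so the delimiter becomes visible. A tiny made-up example of the last step:

```shell
# cat -T makes tabs visible as the two characters ^I.
printf '12\t500\t3\n' | cat -T
# 12^I500^I3
```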
 
Old 11-30-2018, 04:00 AM   #11
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
Quote:
Originally Posted by berndbausch View Post
Perhaps your cells contain non-printable characters that turn the numbers into text. You can check this as follows:
Code:
awk '{ for (i=1;i<=NF;i++) print $i " " typeof($i) }' <input.txt | more
Numeric columns have a type of strnum, strings have a type of string.
NB: typeof() is only available from GNU Awk >=4.2

@annarizzi: please use [CODE] tags (see my sig)
 
Old 11-30-2018, 07:17 AM   #12
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 854

Rep: Reputation: 286
I think I have a simple solution. You need to force numeric comparison with the following:
Code:
awk -F"\t" '{ for (i=1;i<=NF;i++) {if ($i+0 > 199) {print; next}} }' input.txt > filtered.txt
 
1 members found this post helpful.
Old 11-30-2018, 03:51 PM   #13
annarizzi
LQ Newbie
 
Registered: Nov 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by l0f4r0 View Post
I think I have a simple solution. You need to force numeric comparison with the following:
Code:
awk -F"\t" '{ for (i=1;i<=NF;i++) {if ($i+0 > 199) {print; next}} }' input.txt > filtered.txt
YIPPEE, l0f4r0, you SOLVED IT! Thank you so, so much! This is brilliant!
Have a great weekend!!!

Happy coding-
Anna
 
  

