Linux - NewbieThis Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
I am having an issue with a very large data set and I am looking for advice about how to approach the following issue. Thank you for your help in this matter.
the problem: I have a large excel file of data, there are over 1000 columns and over 40,000 rows. I have to identify where a given row has a value of >199 in any given cell. If any row does not have >199 in any given cell then I want to delete those rows. So that I am left with only rows where at least one cell has a value of >199.
I also have the same data file as a text file, so I was thinking that the best approach might be to use linux command line to do this problem rather than using the excel file (which is bulky to work with given the number of rows and columns). But I am a novice at linux and awk so I was looking for general advice of how to approach this issue?
thanks so much
PS
Example image of data set attached. Here I would want only the rows that have highlighted cells (because those are >200) but I can't just use the sort function or complicated if than statements because there are so many columns in my data set, so that is too time consuming...
The kind of solution provided by syg00 is certainly what you need but maybe you will have to modify -F"," with something else like -F";"...
Please tell us what delimiter is used in your text file (or provide the first 3 rows in [CODE] tags) so we see it by ourselves.
I want to KEEP any rows that have any cells containing value that is >199. I want to REMOVE any rows of data without at least one cell with a value of >199. Does that make sense? I think the way you wrote the above script it is actually filtering out the ones I want (rows with >199) and leaving me with one rows with values <199. What needs to be altered to keep the values I want?
You have the field separator incorrect, so the entire line (record) will be treated as one value. Change the specification as below - note the back-slash.
I don't know what is going on but I fixed the code and reran it and it still isn't working.
Reminder I have about 46000 rows of data and roughly 2000 columns. I want to scan each row for a cell of >199, if present then I want to keep that entire row, if absent I want to delete that entire row. So each row in the output should have at least one value of >199...
and that didn't work at all, that was advice from another student.
So I am confused because I think the first two examples are working but giving me the opposite of what I want, which is weird because it should be saying the correct thing. any thoughts would be appreciated...I have also tried to run the $i > 2000 that gives the same result, and I have tried flipping it to $i <199 that also doesn't change anything...
$ cat input.txt
32 6543 87 089 564 32
12 54 67 98 32
a b c d e
$ awk '{for (i=1;i<=NF;i++) {if ($i > 199) {print ; next}}}' input.txt
32 6543 87 089 564 32
a b c d e
The awk program is copied-and-pasted from your last post without any change. You see that the program works when the columns contain numbers.
When a column contains a string, awk converts the number 199 to string "199" and performs a string comparison (reference).
Perhaps your cells contain non-printable characters that turn the numbers into text. You can check this as follow:
Code:
awk '{ for (i=1;i<NF;i++) print $i " " typeof($i) }' <input.txt | more
Numeric columns have a type of strnum, strings have a type of string.
Last edited by berndbausch; 11-29-2018 at 07:40 PM.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.