Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
| Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
 |
GNU/Linux Basic Guide
This 255-page guide will provide you with the keys to understand the philosophy of free software, teach you how to use and handle it, and give you the tools required to move easily in the world of GNU/Linux. Many users and administrators will be taking their first steps with this GNU/Linux Basic guide and it will show you how to approach and solve the problems you encounter.
Click Here to receive this Complete Guide absolutely free. |
|
 |
|
08-17-2012, 02:19 AM
|
#16
|
|
Member
Registered: Mar 2012
Location: Italy
Distribution: Slackware+Debian
Posts: 257
Rep:
|
I just answered the original question, is possible to make it efficient ?
Answare yes but.. What I wrote above. Then he has to understand if is feasible or not
Btw sorting is intended as "on the fly sorting" so file has not to be already sorted
Last edited by Celyr; 08-17-2012 at 02:21 AM.
|
|
|
|
08-17-2012, 04:28 AM
|
#17
|
|
Member
Registered: Sep 2003
Posts: 382
Rep:
|
This is largely just a formalization of a lot of what people have discussed so far, but with asort() and the "in" operator used to reproduce the output order described in the original problem description.
Contents of compare.gawk:
Code:
END {
# Load the file of test values.
test_values_file_name = "test.txt" ;
values_cnt = 0 ;
while ( ( getline value_line < test_values_file_name ) > 0 )
{
++values_cnt ;
values[ values_cnt ] = value_line ;
}
asort( values , sorted_values ) ;
# Load the file of ranges.
value_ranges_file_name = "dummy_sorted.txt" ;
range_cnt = 0 ;
while ( ( getline range_line < value_ranges_file_name ) > 0 )
{
++range_cnt ;
split( range_line , fields ) ;
range_min_val[ range_cnt ] = fields[ 2 ] ;
range_max_val[ range_cnt ] = fields[ 3 ] ;
}
# Compare the values to the ranges.
for ( sorted_val_num = 1 ; sorted_val_num <= values_cnt ; sorted_val_num++ )
{
sorted_val = sorted_values[ sorted_val_num ] ;
for ( range_num = 1 ; range_num <= range_cnt ; range_num++ )
{
if ( sorted_val < range_min_val[ range_num ] )
{
break ;
}
if ( ( sorted_val >= range_min_val[ range_num ] ) && ( sorted_val <= range_max_val[ range_num ] ) )
{
matches_a_range[ sorted_val ] = 1 ;
break ;
}
}
}
# Display the result of the comparisons.
for ( value_num = 1 ; value_num <= values_cnt ; value_num++ )
{
test_val = values[ value_num ] ;
if ( test_val in matches_a_range )
result = "YES" ;
else
result = "NO" ;
printf "%-10d %s\n" , test_val , result ;
}
}
Code:
sort -k2,3n < dummy.txt > dummy_sorted.txt
gawk -f compare.gawk < /dev/null
reproduces the original output order.
I wouldn't be surprised if there might be a way to create a higher level "index" of the values, or ranges, as they are loaded, looking at intervals between ranges, possibly more use of "in" with arrays. Especially if there's some profile of how values or ranges might look.
If the OP is really committed to using some form of awk ( hopefully gawk ), another possibility might be to look for a public repository for gawk extensions. Maybe someone has given out extension(s) that implement better data structures, some type of data trees...
That might allow the approach to be turned around, index the values, look at a range, quickly find all the values from test.txt that fit in the range and mark them as "YES".
Or as Celyr effectively said, don't use awk.
Last edited by kakaka; 08-17-2012 at 05:19 AM.
|
|
|
1 members found this post helpful.
|
08-17-2012, 04:57 AM
|
#18
|
|
Member
Registered: Sep 2003
Posts: 382
Rep:
|
To make sure gawk can handle good size arrays...
Ran a simple gawk program that assigns a value to each element of a 10 million element array.
No complaints from gawk.
For reference, the machine I was using has 6 GB of physical memory.
|
|
|
|
08-17-2012, 06:32 AM
|
#19
|
|
Member
Registered: Apr 2012
Distribution: RedHat
Posts: 40
Original Poster
Rep:
|
very well said .. i just need to keep "test.txt" file in the same order as it is so that i can tell how many YES's and NO's were labeled in first and second half of that file. But we can always create an index and sort this file and revert back for counting purposes. "dummy.txt" can be sorted as i am using it as a reference to check against. sorry for not being clear and thanks for pointing it out!!!
Quote:
Originally Posted by kakaka
I always wonder, how far away from the original problem statement, can the solution be?
If someone asked a question in French, it wouldn't seem legitimate to answer it in German, unless you knew the person also spoke German.
In the same way, if someone asks a question about awk, it wouldn't seem legitimate to provide a solution in Perl.
In this case, unless I know it's OK, I'd avoid providing an output in an order different than what's shown, so I wouldn't sort the test.txt file.
I could see using a copy of dummy.txt that's sorted, e.g. dummy_sorted.txt just to get sorted ranges into the awk program. But I wouldn't replace dummy.txt with a sorted version. Especially since the 1st and 4th fields might have some significance in a larger situation of which this problem might only be a part.
Naturally it's up to the OP to accept or reject a proposed solution. I just sometimes think it could save time to have something that could be used to very quickly/easily specify what type of solutions are considered meaningful for a particular question. Maybe some special check boxes added to the UI for the forum, to be used when starting a thread, to say how exactly the solution has to match the original problem statement.
|
Last edited by ip_address; 08-17-2012 at 06:58 AM.
Reason: minor modification
|
|
|
|
08-17-2012, 06:47 AM
|
#20
|
|
Member
Registered: Apr 2012
Distribution: RedHat
Posts: 40
Original Poster
Rep:
|
which language do you suggest for solving such kind of problems and why?
Quote:
Originally Posted by Celyr
I'm sorry for the lack of code but I won't even try to do that in bash scripting or awk.
|
|
|
|
|
08-17-2012, 07:23 AM
|
#21
|
|
Member
Registered: Mar 2012
Location: Italy
Distribution: Slackware+Debian
Posts: 257
Rep:
|
Just because I don't have the knowledge to do it, I can give you pseudo code if you wish 
|
|
|
|
08-17-2012, 07:40 AM
|
#22
|
|
Member
Registered: Apr 2012
Distribution: RedHat
Posts: 40
Original Poster
Rep:
|
Sure .. please go ahead. any feedback is useful. Thanks!!!
Quote:
Originally Posted by Celyr
Just because I don't have the knowledge to do it, I can give you pseudo code if you wish 
|
|
|
|
|
08-17-2012, 08:19 AM
|
#23
|
|
Member
Registered: Mar 2012
Location: Italy
Distribution: Slackware+Debian
Posts: 257
Rep:
|
Well, assuming that you are seeking for time efficiency I would:
1) Create a file containing a bucket of the ranges
2) index every line of test.txt and write the answare
1)
Code:
Create an array of boolean and initialize it to false (make sure is not a byte per boolean but a bit per boolen)
for each line in dummy.txt
read 2ndcol
read 3rdcol
while 2ndcol <= 3rdcol (be careful with bounds)
array[2ndcol]=true
increment 2ndcol
endwhile
endfor
Dump all the array to a file
Note: you can run 1 only once and you can update it very easly if you are adding ranges, but if you are removing ranges you have to rebuild it. The complexity of this is O(n*m) where n is number of ranges and m is the average length of a range
2)
Code:
Read the array of boolean from the file (in C i would use fwrite\fread)
for each line in text.txt
read 1stcol
if array[1stcol] == true (make sure to not go out of bounds, you can use a simple if 1stcol < size before)
YES
else
NO
endfor
The complexity is O(n) where n is the number of lines in text
Note: memory usage is maxrange bits where maxrange is the highest number in range in your example the highest is 90435, even if the largest would be 904350 (ten times bigger) the memory usage would be around 110Kbyte
Last edited by Celyr; 08-17-2012 at 08:21 AM.
|
|
|
1 members found this post helpful.
|
08-17-2012, 08:22 AM
|
#24
|
|
Senior Member
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 4,554
|
Quote:
Originally Posted by ip_address
which language do you suggest for solving such kind of problems and why?
|
The key point is that you have a wide range of excellent choices ... and, thanks to the #!shebang feature of Linux shells, the user will neither know nor care which one you used. You have a choice of true programming languages which were expressly designed for that purpose. Bash scripting was not. (Only the Korn shell ever attempted to integrate a true programming language.)
"Look! I can paddle upstream using a toothpick!"
"O-kay-y-y-y ... now, do you expect to get Brownie points for that? Next time, try using that thingy on the back of the dinghy. It's called an outboard motor ..."
|
|
|
|
08-17-2012, 10:49 AM
|
#25
|
|
Senior Member
Registered: Nov 2005
Distribution: Debian
Posts: 2,017
|
Quote:
Originally Posted by Celyr
1) Create a file containing a bucket of the ranges
...
The complexity of this is O(n*m) where n is number of ranges and m is the average length of a range
...
Note: memory usage is maxrange bits where maxrange is the highest number in range in your example the highest is 90435, even if the largest would be 904350 (ten times bigger) the memory usage would be around 110Kbyte
|
Actually the complexity is O(maxrange) because that's how long it takes to initialize all the buckets.
PS are meaningful names (eg "intervals.txt" instead of "dummy.txt") too much to ask for? 
|
|
|
|
08-17-2012, 10:56 AM
|
#26
|
|
Member
Registered: Mar 2012
Location: Italy
Distribution: Slackware+Debian
Posts: 257
Rep:
|
O(n*m) could be bigger, then reading it from ram and writing is O(1)
Last edited by Celyr; 08-17-2012 at 10:58 AM.
|
|
|
|
08-17-2012, 11:04 AM
|
#27
|
|
Senior Member
Registered: Nov 2005
Distribution: Debian
Posts: 2,017
|
Quote:
Originally Posted by Celyr
O(n*m) could be bigger, then reading it from ram and writing is O(1)
|
No, n*m counts the number of times you write true to a bucket, obviously this must be less than or equal to the number of buckets. You have to write true or false to every bucket, that takes O(maxrange).
|
|
|
|
08-17-2012, 11:06 AM
|
#28
|
|
Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 6,314
|
The solution presented by Tink and myself still functions if only the dummy.txt file is sorted.
Also, if we want to use a "proper" language, see how this fits (also no sorting required on either file):
Code:
ruby -ane 'BEGIN{a=[]};s="NO";if $FILENAME == "dummy.txt"; a<<($F[1]..$F[2]);else s = "YES" if ! a.detect{|x| x === $F[0] }.nil?;puts "$F[0]\t#{s}";end' dummy.txt test.txt
|
|
|
1 members found this post helpful.
|
08-17-2012, 11:07 AM
|
#29
|
|
Member
Registered: Mar 2012
Location: Italy
Distribution: Slackware+Debian
Posts: 257
Rep:
|
Overlapping ranges.
We are both right 
|
|
|
|
08-17-2012, 11:30 AM
|
#30
|
|
Senior Member
Registered: Nov 2005
Distribution: Debian
Posts: 2,017
|
Quote:
Originally Posted by Celyr
Overlapping ranges.
We are both right 
|
Or both wrong  I guess the most correct would be O(maxrange + n*m).
|
|
|
|
| Thread Tools |
Search this Thread |
|
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
All times are GMT -5. The time now is 12:56 PM.
|
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|