Challenging for loop

colucix · 10-04-2011, 12:39 PM

Quote:

Originally Posted by messinwu

colucix - I think I found a bug of sorts in the code you wrote for me. The 'minlat' variable seems to not change sometimes; it is being re-used in the next block of text, thus resulting in incorrect calculations.

Yeah, sorry! I forgot to reset the min and max values. Here is a corrected version (the fix is the four assignments that follow the print statement):

Code:

#!/usr/bin/awk -f

BEGIN {
  minlat = 90
  maxlat = -90
  minlon = 180
  maxlon = -180
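  OFMT = "%.6f"
  CONVFMT = "%.6f"   # the differences are concatenated with ",", and concatenation uses CONVFMT, not OFMT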
}

/<way /,/<\/way>/ {

  while ( $0 ~ /<nd ref=/ ) {
  
    c++
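    # "+ 0" forces a number: gensub() returns a string, and without it
    # the comparisons below would be string comparisons
    lat[c] = gensub(/.*="([^ ]+) .*/,"\\1",1) + 0
    lon[c] = gensub(/.*="[^ ]+ ([^"]+)".*/,"\\1",1) + 0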
    
    if ( lon[c] < minlon ) minlon = lon[c]
    if ( lon[c] > maxlon ) maxlon = lon[c]
    if ( lat[c] < minlat ) minlat = lat[c]
    if ( lat[c] > maxlat ) maxlat = lat[c]
    
    if ( (getline) <= 0 ) break   # stop at end of input instead of looping forever
  
  }
  
  if ( $0 ~ /<tag k="name"/ ) {
    street=gensub(/.*v="([^"]+).*/,"\\1",1)
    print street ",", maxlon-minlon ",", maxlat-minlat
    # Reset for the next <way>: BEGIN runs only once, before any input is read
    minlat = 90
    maxlat = -90
    minlon = 180
    maxlon = -180
  }

}

messinwu · 10-04-2011, 12:49 PM

Hi Rod - thanks for chiming in. You might be right. If they change the formatting, my script will break. Then I'll have to see what they changed, and change my script to accommodate. I'm really not sure what a "real" XML parser would do for me. Would it recognize that they changed the tags on the fly, etc.? I looked at the links that sundial provided, and to be honest, they're all complete Greek to me. I don't understand almost anything on those pages. I'm a real estate broker, not a computer programmer, hence why I came here to get a little help. I would love to do things the "right" way, but you have to understand that different people with different skillsets will require different levels of spoon-feeding.

Thus far, I've been fairly successful with using bash to do basic automated logins using curl for data scraping to assist with my specialized niche. I know there are other/better ways of doing some of the stuff my scripts do, but if they work, I'm happy. I'm completely open to any level of tutoring anyone wants to provide.

Wow, although I did get a fast and accurate answer to my problem, I'm not sure this is the right forum for me. Maybe I posted my question in the wrong category, since I'm not an experienced programmer. I do feel like I was "pounced" on by disapproving peers. Perhaps you have fallen into the 'you should use this very complex tool because it can do so much more than that barbaric simple tool' syndrome just the same...

messinwu · 10-04-2011, 12:51 PM

colucix - thank you, I figured that out as well, and was about to post it. I want you to know that I greatly appreciate your kind assistance today.

Along that line though, why didn't the values get reset since they're at the beginning of the script? I guess it's because they're outside of the loop which identifies the lat/lon variables. I tried sticking the reset code in a couple other places, but it only seems to work being at the end like you have it.

colucix · 10-04-2011, 12:54 PM

You're welcome!

theNbomr · 10-04-2011, 01:08 PM

Sorry, I didn't intend to pounce, merely to point to ways to produce better code. People who reply here often see things which you would overlook, and see it as helpful to point them out. Please don't be offended by that; it is not the intended reaction.

With respect to XML parsing, I tried to explain that the way in which the XML is formatted should not be built into your parser. If the XML were generated as one long line of text, a proper parser would not break because of it. The way XML is often laid out for human visualization does not convey any information. Whitespace between the attributes inside a tag, and the line breaks and indentation between elements in data like yours, serve only human readability and can change without affecting the content. It is realistic to expect the content itself to stay consistent. For instance, the following two fragments of your XML data are exactly equal with respect to their content:

Code:

  <nd ref="41.4415540 -97.0669980"/>
  <nd ref="41.4415510 -97.0676330"/>
  <nd ref="41.4415480 -97.0682330"/>
  <nd ref="41.4415450 -97.0688240"/>
  <tag k="highway" v="residential"/>
  <tag k="name" v="West 5th Street"/>
  <tag k="tiger:cfcc" v="A41"/>
  <tag k="tiger:county" v="Colfax, NE"/>

Code:

  <nd ref="41.4415540 -97.0669980"/><nd ref="41.4415510 -97.0676330"/><nd ref="41.4415480 -97.0682330"/><nd ref="41.4415450 -97.0688240"/>
<tag k="highway" 
v="residential"/><tag k="name"                                                          v="West 5th Street"/>
<tag 
k="tiger:cfcc" v="A41"/><tag 
k="tiger:county" v="Colfax, NE"/>

A proper parser should be unaffected by this. It is difficult to write such a parser.
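
To make that concrete, here is a minimal Python sketch (Python being just one language that ships with an XML parser). It feeds both fragments above to the standard-library xml.etree.ElementTree module and compares what comes out. The enclosing <way> element and the helper name content() are assumptions added for the example; the fragments as posted have no root element, and XML requires one.

```python
import xml.etree.ElementTree as ET

# The two fragments from the post, each wrapped in an assumed <way> root
# (XML needs a single root element; the fragments as posted have none).
PRETTY = """<way>
  <nd ref="41.4415540 -97.0669980"/>
  <nd ref="41.4415510 -97.0676330"/>
  <nd ref="41.4415480 -97.0682330"/>
  <nd ref="41.4415450 -97.0688240"/>
  <tag k="highway" v="residential"/>
  <tag k="name" v="West 5th Street"/>
  <tag k="tiger:cfcc" v="A41"/>
  <tag k="tiger:county" v="Colfax, NE"/>
</way>"""

COMPACT = """<way>
  <nd ref="41.4415540 -97.0669980"/><nd ref="41.4415510 -97.0676330"/><nd ref="41.4415480 -97.0682330"/><nd ref="41.4415450 -97.0688240"/>
<tag k="highway"
v="residential"/><tag k="name"                v="West 5th Street"/>
<tag
k="tiger:cfcc" v="A41"/><tag
k="tiger:county" v="Colfax, NE"/>
</way>"""


def content(xml_text):
    """Return (element name, attributes) for each child, ignoring layout."""
    root = ET.fromstring(xml_text)
    return [(child.tag, dict(child.attrib)) for child in root]


# Layout differs, but the parsed elements and attributes are identical.
same = content(PRETTY) == content(COMPACT)
print("same content:", same)
```

A line-oriented script would see these as two very different inputs; the parser sees the same eight elements with the same attributes.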

--- rod.

messinwu · 10-04-2011, 01:55 PM

Hmm, I see your point. I will look into using an XML parser, but thus far everything I've found makes no sense to me. I guess I tend to learn best by example, so that's what I look for when searching Google.

theNbomr · 10-04-2011, 05:56 PM

Well, when I said 'It is difficult to write such a parser', I left out an additional point, which is that to write something of that complexity in bash would be just plain ridiculous (apologies to those who've already done so, and I'm sure someone must have). Given that, it would seem that using a more complete programming language is necessary. Since you say you're doing a bit of programming already, it probably makes sense that you can be more productive with a more powerful language anyway. As a sort of part-time programmer, I'd suggest looking at some other scripting language such as Perl or Python (this is where others will jump in to complete the list of 50 or so other candidates). sundialsvcs has given you examples of Perl modules which can be used to robustly parse XML, and I'm sure most languages will have similar modules available. You'll have to just choose a language. In most cases, it will be a bit painful at first, just like it was to learn bash, but by now you probably have a little bit to build on.

With respect to XML parsing in specific, there are some generalities to explain. In whatever language you use, the XML parser module (a language-agnostic description) will have some documented API (application programmer's interface), which is a collection of function calls and/or variables to read/write to extract data from your XML source. Happily, for XML, these tend to follow either of two somewhat standard forms. One form is that the XML parser reads the XML data, and as it does so, it calls bits of your code to hand off chunks of data that your program wants. At each of these callbacks, you can do whatever is necessary with the data (like print it to a file). Another style is that the parser just swallows the whole thing, breaking it into component pieces, and then provides a collection of functions to navigate around in and extract specified data from the XML data.
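
As a rough illustration of those two styles, here is a minimal Python sketch using only standard-library modules: xml.sax for the callback style and xml.etree.ElementTree for the swallow-it-whole style. Both compute the same street name, longitude extent and latitude extent that the awk script prints. The SAMPLE data (including its <osm> root), the class name WayHandler and the function names are assumptions made up for the example, not anything from your real file.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Assumed sample input: one <way> inside a made-up <osm> root element.
SAMPLE = """<osm>
  <way>
    <nd ref="41.4415540 -97.0669980"/>
    <nd ref="41.4415450 -97.0688240"/>
    <tag k="name" v="West 5th Street"/>
  </way>
</osm>"""


class WayHandler(xml.sax.ContentHandler):
    """Callback style: the parser calls these methods as it reads."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.lats, self.lons, self.street = [], [], None

    def startElement(self, name, attrs):
        if name == "way":
            # Fresh state for every way, so nothing leaks into the next one.
            self.lats, self.lons, self.street = [], [], None
        elif name == "nd":
            lat, lon = attrs["ref"].split()
            self.lats.append(float(lat))
            self.lons.append(float(lon))
        elif name == "tag" and attrs.get("k") == "name":
            self.street = attrs["v"]

    def endElement(self, name):
        if name == "way" and self.street and self.lats:
            self.results.append((self.street,
                                 max(self.lons) - min(self.lons),
                                 max(self.lats) - min(self.lats)))


def extents_callbacks(xml_text):
    handler = WayHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results


def extents_tree(xml_text):
    """Tree style: parse everything first, then navigate the result."""
    results = []
    for way in ET.fromstring(xml_text).iter("way"):
        name_tag = way.find("tag[@k='name']")
        coords = [nd.get("ref").split() for nd in way.findall("nd")]
        if name_tag is None or not coords:
            continue
        lats = [float(lat) for lat, _ in coords]
        lons = [float(lon) for _, lon in coords]
        results.append((name_tag.get("v"),
                        max(lons) - min(lons),
                        max(lats) - min(lats)))
    return results


for street, dlon, dlat in extents_tree(SAMPLE):
    print(f"{street}, {dlon:.6f}, {dlat:.6f}")
```

The callback version never holds more than one way in memory, which matters for very large files; the tree version is usually easier to read and to change.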

What is nice about this is that once you've done this with one language, you can apply what you know to almost any language that has an XML parser. Nicer still is that documentation for one parser applies fairly well to same-style parsers, even ones written for a different language.

If you do choose to take the jump, there are plenty of people in these forums and elsewhere who can provide guidance along the way. Since you seem to be inclined to self start, you'll probably find that helpful people will gravitate to the questions you ask. Good luck.

--- rod.

ntubski · 10-04-2011, 10:15 PM

Just for fun, here is a shell-with-xmlstarlet solution:

Code:

#!/bin/sh
xml ed \
    -i //way/nd -t attr -n lat -v '' \
    -i //way/nd -t attr -n lon -v '' \
    -u //way/nd/@lat -x 'substring-before(../@ref, " ")' \
    -u //way/nd/@lon -x 'substring-after (../@ref, " ")' \
    "$1" \
    | \
    xml sel -T -t -m //way -v 'tag[@k="name"]/@v' -o ': ' \
    -v 'math:highest(nd/@lat) - math:lowest(nd/@lat)' \
    -o ', ' \
    -v 'math:highest(nd/@lon) - math:lowest(nd/@lon)' \
    --nl

XPath could really use some higher order functions: a function that operates on strings will ignore all but the first node when given a nodeset, making it pretty useless; so I had to break up the latitude and longitude into their own attributes in a separate step.
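
For comparison, this is the per-node step that is awkward in XPath 1.0 but takes one line in a general-purpose language. A minimal Python sketch, assuming the same ref="lat lon" layout; the two nodes are taken from earlier in the thread and wrapped in an assumed <way> element.

```python
import xml.etree.ElementTree as ET

# Assumed sample: two nodes from the thread inside a <way> wrapper.
way = ET.fromstring(
    '<way>'
    '<nd ref="41.4415540 -97.0669980"/>'
    '<nd ref="41.4415510 -97.0676330"/>'
    '</way>'
)

# Split the ref attribute of EVERY node, not just the first one.
pairs = [tuple(float(x) for x in nd.get("ref").split()) for nd in way.iter("nd")]
lats = [lat for lat, _ in pairs]
lons = [lon for _, lon in pairs]

print(max(lats) - min(lats), max(lons) - min(lons))
```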

theNbomr · 10-05-2011, 11:07 AM

Nice tip, ntubski. xmlstarlet looks like the definitive solution for the OP. I never considered the possibility that a bash-friendly tool already existed for XML parsing. I think I will have to give it a spin.

--- rod.