Thank you ntubski.
I know that this is not valid XML; I only posted the critical part to explain my problem.
I will check out xmlstarlet, but I would prefer a solution that uses standard Linux commands.
Is that possible?
XML is not readily parsed with simple regex tools. That is the reason why tools like xmlstarlet and proper XML parser modules for scripting languages like Perl & Python were created. If your XML is known to always use a constant tag-per-line format, you will probably be able to solve your problem with AWK. With luck, it can be done as a one-liner suitable for embedding in a broader script.
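As a rough sketch of the AWK route, assuming a strict one-tag-per-line layout and assuming the goal is to drop the first, third, etc. child line inside each item block. The sample file and its tag names are invented for illustration and are not the poster's real data:

```shell
# Hypothetical sample input, one tag per line (an assumption).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item id="1">
<a_1>12</a_1>
<a_2>34</a_2>
<a_3>56</a_3>
<a_4>78</a_4>
</item>
EOF

# Count the child lines inside each <item> block and keep only the even ones.
# The <item> and </item> lines, and anything outside a block, pass through.
result=$(awk '
  /<item[ >]/ { inside = 1; n = 0; print; next }   # opening tag: reset counter
  /<\/item>/  { inside = 0; print; next }          # closing tag: leave block
  inside      { if (++n % 2 == 0) print; next }    # keep 2nd, 4th, ... child
              { print }                            # outside any block
' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

This only holds up while the generator keeps to one tag per line; two tags on one line, or a tag split across lines, will silently give wrong output.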
When solving problems like yours, it is helpful to explain verbosely what pattern of matching/deleting/substitution you are trying to accomplish. Use terms that describe the target text and the relationships to surrounding text. For example "text matching one alpha character followed by an underscore and one or more numeric characters, all enclosed in '<' & '>'". Using such language will force you to unambiguously identify the patterns, and once you have done this, the translation to code will be much easier. It is something like a mental specification of the problem, and working from a specification is always much more productive than making stuff up on the fly.
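To illustrate, the verbal specification quoted above translates almost word for word into an extended regular expression. A small sketch, with test lines made up for the purpose:

```shell
# "one alpha character, an underscore, one or more numeric characters,
#  all enclosed in < and >"
pattern='<[[:alpha:]]_[[:digit:]]+>'

# Invented test lines: only the first one satisfies the specification.
matches=$(printf '%s\n' '<a_3>78</a_3>' '<ab_3>78</ab_3>' '<a_x>78</a_x>' |
          grep -E "$pattern")
printf '%s\n' "$matches"
```

Each phrase of the description maps to one piece of the pattern, which is why writing the description first makes the code easier.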
I thought when you said "delete odd lines" it might mean something like "delete elements in 'item' tags where the tagname is suffixed with an odd number character". However, in your sample output, I see the tags '<a_3>78</a_3>', so my hypothesis about your intention must be incorrect. I cannot see any unambiguous pattern that could translate the input to the supplied output.
I guess it is unlikely that you have written the XML generator, but it is worth mentioning that the format is not well chosen, since the tag names appear to contain information about the content of the tag. Numeric indices applied to the tagname would be better implemented as attributes to the tag. This will reduce the complexity of any attached DTD and simplify the work of any parser. It probably makes the XML generator simpler as well.
I highly suggest you carefully read again what ntubski and theNbomr posted. As always, use the right tool for the job.
Regex-based tools are designed for use on line-oriented text, but xml is tag-oriented. You can never be completely assured that a sed or awk solution will always parse it accurately.
FYI, xmlstarlet uses standard XPath expressions to match and modify entries. It takes a bit of learning (and I'm still pretty much a novice at it), but it really is quite clean and flexible once you know what you are doing.
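A minimal sketch of what that could look like, assuming the aim is to delete the odd-positioned child elements of each item. The <items> root element and the tag names are made up so that the sample is well-formed; adjust the XPath to the real document:

```shell
# Hypothetical well-formed sample; the <items> root and tag names are assumptions.
doc=$(mktemp)
cat > "$doc" <<'EOF'
<items>
  <item id="1">
    <a_1>12</a_1>
    <a_2>34</a_2>
    <a_3>56</a_3>
    <a_4>78</a_4>
  </item>
</items>
EOF

# ed = edit mode, -O = omit the XML declaration, -d = delete matching nodes.
# The XPath selects every odd-positioned child element of each <item>.
if command -v xmlstarlet >/dev/null 2>&1; then
  edited=$(xmlstarlet ed -O -d '/items/item/*[position() mod 2 = 1]' "$doc")
  printf '%s\n' "$edited"
else
  echo 'xmlstarlet is not installed' >&2
fi
rm -f "$doc"
```

Because the match is made on elements rather than on lines, this does not care how the file is wrapped or indented.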
$ sed -rn '/<item /,/<\/item>/{/<item /be; /<\/item>/be; n}; :e;p' infile
is pretty simple. The flags -rn tell sed to use extended regular expressions (-r; you can omit it, since this script does not need them) and not to print every line to standard output automatically (-n).
/<item /,/<\/item>/{commands} == "Execute the given commands for each line from the one matching the first regular expression, /<item /, through the one matching the second, /<\/item>/, inclusive." This is a so-called address range.
/<item /be; /<\/item>/be; == "For lines matching either regular expression, branch (goto) to the label :e (see the last two commands)." This effectively prints the opening and closing item tags, because the `p' (print) command follows the label `e'.
n -- read the next line into the buffer, discarding the current line (because of -n it is not printed). This command does the real work: it skips the first, third, etc. lines inside the current item. The lines read in their place get printed by the `p' command at the end.
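To see the walk-through in action, here is the same command run against a small made-up input. This assumes GNU sed and an item with an even number of child lines; the sample data is invented, not the original poster's file:

```shell
# Hypothetical input, one tag per line (an assumption).
infile=$(mktemp)
cat > "$infile" <<'EOF'
<item id="1">
<a_1>12</a_1>
<a_2>34</a_2>
<a_3>56</a_3>
<a_4>78</a_4>
</item>
EOF

# Same command as above: item tags branch straight to the print,
# other lines in the range are replaced by their successor via `n`.
output=$(sed -rn '/<item /,/<\/item>/{/<item /be; /<\/item>/be; n}; :e;p' "$infile")
printf '%s\n' "$output"
rm -f "$infile"
```

With this input, the a_1 and a_3 lines are dropped and the item tags plus the a_2 and a_4 lines are printed.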
BTW, I completely agree with previous speakers: line-oriented tools are not good for xml/html/.*ml.