[SOLVED] Perl regex not matching across multiple lines despite ms flags

gfarrell · 08-17-2010, 10:26 AM

Sorted it out in AWK I think (thanks for pointing me towards it, worked out how to use it after a couple of mins):

Code:

awk 'BEGIN{LVL=0; INFN=0} /function/ {INFN=1; if(LVL==1) sub(/function/, "(&");} /{/ {LVL++} /}/ {LVL--; if(LVL==1) sub(/}/, "&)"); if(LVL==1) INFN=0;}1' $file

The "level 1" stuff is because the class opens with a curly brace so we look for all "level 1" functions (class methods).
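To make the "level" idea concrete, here is a minimal sketch (an illustration only, not the script above) that prints the brace depth in effect at the start of each line. The name `brace_depth` and the sample input are made up, and the sketch shares the one-liner's assumption of at most one `{` and one `}` per line.

```shell
# Print the brace depth in effect at the start of each input line.
# Assumption: at most one "{" and one "}" per line, as in the one-liner above.
brace_depth() {
    awk '{ printf "%d: %s\n", lvl, $0 }
         /[{]/ { lvl++ }
         /[}]/ { lvl-- }'
}

# Hypothetical sample input: a class literal with one method.
printf '%s\n' 'var Klass = {' '  method: function method() {' '    return 1;' '  },' '  other: 2' '}' | brace_depth
# 0: var Klass = {
# 1:   method: function method() {
# 2:     return 1;
# 2:   },
# 1:   other: 2
# 1: }
```

Lines reported at depth 1 that contain `function` are the ones the one-liner treats as class methods.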

If you spot any problems with this please let me know.

[EDIT]

Spotted a problem, corrected below. Seems to have worked, thanks guys.

Code:

#!/bin/bash

# Quote the glob so the shell does not expand it before find sees it.
# Note: this loop still splits on whitespace, so paths must not contain spaces.
for file in $(find /Users/gid/Sites/uikit_testing/app/webroot/js/ -name '*.js')
do
	echo "Processing $file"
	cp "$file" "/Users/gid/Desktop/uikitBackup/$(basename "$file")"
	mv "$file" "$file.tmp"
	awk 'BEGIN{LVL=0; INFN=0} /function/ {INFN=1; if(LVL==1) sub(/function/, "(&");} /{/ {LVL++} /}/ {LVL--; if(LVL==1 && INFN==1) sub(/}/, "&)"); if(LVL==1) INFN=0;}1' "$file.tmp" > "$file"
	rm "$file.tmp"
done

konsolebox · 08-17-2010, 10:38 AM

Odd. Neither of the scripts you presented seems to work.

Code:

awk 'BEGIN{LVL=0; INFN=0} /function/ {INFN=1; if(LVL==1) sub(/function/, "(&");} /{/ {LVL++} /}/ {LVL--; if(LVL==1) sub(/}/, "&)"); if(LVL==1) INFN=0;}1' ...
awk 'BEGIN{LVL=0; INFN=0} /function/ {INFN=1; if(LVL==1) sub(/function/, "(&");} /{/ {LVL++} /}/ {LVL--; if(LVL==1 && INFN==1) sub(/}/, "&)"); if(LVL==1) INFN=0;}1' ...

Can you show us the output?

gfarrell · 08-17-2010, 10:47 AM

Yeah sure, one of my class files is interfering with it somehow (and I can't work it out) but it's working in almost all circumstances.

Script:

Code:

#!/bin/bash

# Quote the glob so the shell does not expand it before find sees it.
# Note: this loop still splits on whitespace, so paths must not contain spaces.
for file in $(find /Users/gid/Sites/uikit_testing/app/webroot/js/ -name '*.js')
do
	echo "Processing $file"
	cp "$file" "/Users/gid/Desktop/uikitBackup/$(basename "$file")"
	mv "$file" "$file.tmp"
	awk 'BEGIN{LVL=0; INFN=0} /function/ {INFN=1; if(LVL==1) sub(/function/, "(&");} /{/ {LVL++} /}/ {LVL--; if(LVL==1 && INFN==1) sub(/}/, "&)"); if(LVL==1) INFN=0;}1' "$file.tmp" > "$file"
	rm "$file.tmp"
done

Input:

Code:

/**
 * UIKit initialiser class
*/

var UIKit = {
        init: function init() {
                this.registry.each(function(entry, key) {
                        $$(entry.selector).each(function(item, index){
                                if(item.hasClass("no-replace") || $chk(item.retrieve("uikit"))) {
                                        return;
                                }
                                if(eval("window."+entry["class"]) !== undefined) {
                                        var ui = eval("new " + entry["class"] + "(item)");
                                        item.store("uikit", ui);
                                }
                        });
                }, this);
        },
        
        registry: new Hash(),
        
        register: function register(className, selector) {
                this.registry.set(className, {
                        "class":        className,
                        "selector":     selector
                });
        },
        
        enhance: function enhance(element, uiclass) {
                if(!(element.hasClass("no-replace") || $chk(element.retrieve("uikit")))) {
                        if(!$chk(uiclass)) {
                                for(var name in this.registry) {
                                        var item = this.registry[name];
                                        if($$(item.selector).contains(element)) {
                                                uiclass = eval(item["class"]);
                                                break;
                                        }
                                }
                                
                                if(!$chk(uiclass)) { return false; }
                        }
                        var ui = new uiclass(element);
                        element.store("uikit", ui);
                        return true;
                }
                return false;
        } 
}

/**
 * UIKit retrieval function.
 *
 * @param       string|element  el.     The DOM element or element id of the element for which you want the UIKit to be retrieved.
 * @return      UI|false        The UI derived object or false if none is set.
*/

function $UI(el) {
        //First check if el is an instanceof UI
        if(el instanceof UI) {
                return el;
        } else {
                //Otherwise get it
                el = $(el);
                
                if(!el) {
                        throw new UIKitException("Invalid argument for $UI.");
                }
                
                if(!$chk(el.retrieve("uikit"))) {
                        throw new UIKitException("Passed object does not have an associated UI object.");
                } else {
                        return el.retrieve("uikit");
                }
        }
}

Output:

Code:

/**
 * UIKit initialiser class
*/

var UIKit = {
	init: (function init() {
		this.registry.each(function(entry, key) {
			$$(entry.selector).each(function(item, index){
				if(item.hasClass("no-replace") || $chk(item.retrieve("uikit"))) {
					return;
				}
				if(eval("window."+entry["class"]) !== undefined) {
					var ui = eval("new " + entry["class"] + "(item)");
					item.store("uikit", ui);
				}
			});
		}, this);
	}),
	
	registry: new Hash(),
	
	register: (function register(className, selector) {
		this.registry.set(className, {
			"class":	className,
			"selector":	selector
		});
	}),
	
	enhance: (function enhance(element, uiclass) {
		if(!(element.hasClass("no-replace") || $chk(element.retrieve("uikit")))) {
			if(!$chk(uiclass)) {
				for(var name in this.registry) {
					var item = this.registry[name];
					if($$(item.selector).contains(element)) {
						uiclass = eval(item["class"]);
						break;
					}
				}
				
				if(!$chk(uiclass)) { return false; }
			}
			var ui = new uiclass(element);
			element.store("uikit", ui);
			return true;
		}
		return false;
	}) 
}

/**
 * UIKit retrieval function.
 *
 * @param	string|element	el.	The DOM element or element id of the element for which you want the UIKit to be retrieved.
 * @return	UI|false	The UI derived object or false if none is set.
*/

function $UI(el) {
	//First check if el is an instanceof UI
	if(el instanceof UI) {
		return el;
	} else {
		//Otherwise get it
		el = $(el);
		
		if(!el) {
			throw new UIKitException("Invalid argument for $UI.");
		}
		
		if(!$chk(el.retrieve("uikit"))) {
			throw new UIKitException("Passed object does not have an associated UI object.");
		} else {
			return el.retrieve("uikit");
		}
	})
}

As you can see, it works quite well (I think...)

grail · 08-17-2010, 11:03 AM

Whatever you put between the // is what will be looked for, so if you had:

Code:

someName: function someName... }

someOtherName: function someOtherName... }

You could change it so:

Code:

awk 'BEGIN{RS=""}/^someName:/{sub(/function/,"(&");$NF=$NF ")"}1' file

This will only change the block that has 'someName:' at its start (with RS="" each blank-line-separated block is one record).

gfarrell · 08-17-2010, 11:19 AM

Thanks for that, but unfortunately it's not versatile enough (which is why I was using that nice regex pattern). I worked it out (see earlier post).

konsolebox · 08-17-2010, 11:25 AM

@gfarrell OK, the code seems to work fine, except that it still doesn't work on lines that contain multiple }'s. I had an idea earlier that it could also be done in awk, but this was really the limit I was expecting.

@grail There's a problem with the RS="" method if a section contains blank lines.
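A quick way to see the RS="" issue, as a rough sketch: in paragraph mode awk starts a new record at every empty line, so a function body with an empty line inside it arrives as two records. `count_records` and the sample input are made up for illustration.

```shell
# Count how many records awk sees in paragraph mode (RS="").
count_records() {
    awk 'BEGIN { RS = "" } END { print NR }'
}

# One hypothetical function body with an empty line in the middle.
printf '%s\n' 'a: function a() {' '  x();' '' '  y();' '},' | count_records
# 2
```

Any `sub()` aimed at the closing brace then only ever sees the second half of the function.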

gfarrell · 08-17-2010, 11:30 AM

I think you just worked out my problem, multiple braces in a line! Thanks =]

(Not that I know how to fix it (or really need to now)).

The other problem I had was with doc-block comments, but it was largely not an issue.
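For the multiple-braces case, one possible direction (a sketch only, not something tested on the real files) is to count every brace on a line instead of bumping the level once per line. `gsub()` returns the number of replacements it made, and replacing with "&" leaves the text unchanged. The name `depth_after_each_line` and the sample input are made up.

```shell
# Track brace depth by counting every "{" and "}" on each line.
depth_after_each_line() {
    awk '{ opens  = gsub(/[{]/, "&")    # number of "{" on this line
           closes = gsub(/[}]/, "&")    # number of "}" on this line
           lvl += opens - closes
           printf "%d: %s\n", lvl, $0 }'
}

printf '%s\n' 'a = { b: { c: 1 } };' 'f = function() { if (x) { y(); }' '}' | depth_after_each_line
# 0: a = { b: { c: 1 } };
# 1: f = function() { if (x) { y(); }
# 0: }
```

This keeps the depth right, but it still cannot say which brace on a line is the one that closes the method, and it still counts braces inside strings and comments.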

grail · 08-17-2010, 11:33 AM

So using the data you provided, the following worked but needed two exceptions:

Code:

awk 'BEGIN{RS="";ORS="\n\n"}/: function/{sub(/function/,"(&");sub(/},$/,"}),")}1' file

The exceptions:

1. Although some lines in the input appear blank, they actually contain whitespace, so I had to remove those through vim before running it.

2. The enhance function is not terminated the same way as the others, i.e. it does not finish with }, and so would need to be changed manually.
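The first exception could probably be scripted instead of done by hand. A minimal sketch, assuming "remove" means stripping the whitespace so that those lines become truly empty; `blank_whitespace_lines` is a made-up name:

```shell
# Turn whitespace-only lines into empty ones so RS="" treats them as separators.
blank_whitespace_lines() {
    sed 's/^[[:space:]]*$//'
}

# The middle line holds a space and a tab; after the filter awk sees 2 records.
printf 'a\n \t\nb\n' | blank_whitespace_lines | awk 'BEGIN { RS = "" } END { print NR }'
# 2
```

Note that this also turns whitespace-only lines inside a function body into separators, which splits that function across records.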

I will see if I can come up with a fuller solution tomorrow, as it's 2am and I am tired.

grail · 08-17-2010, 11:37 AM

Just ran yours too and found that it supplies an extra round closing bracket at the end of your $UI function.
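The extra bracket can be reproduced on a tiny made-up input. Tracing the posted logic, any top-level function sets INFN while LVL is still 0, so the last nested block that drops back to level 1 gets the ")". `wrap_methods` is a hypothetical wrapper around the corrected one-liner, with the braces written as `[{]` and `[}]` for portability.

```shell
# The corrected one-liner from earlier in the thread, wrapped in a function.
wrap_methods() {
    awk 'BEGIN{LVL=0; INFN=0} /function/ {INFN=1; if(LVL==1) sub(/function/, "(&");} /[{]/ {LVL++} /[}]/ {LVL--; if(LVL==1 && INFN==1) sub(/[}]/, "&)"); if(LVL==1) INFN=0;}1'
}

# A top-level function (not a class method) with one nested block.
printf '%s\n' 'function f(a) {' '  if (a) {' '    g();' '  }' '}' | wrap_methods
# function f(a) {
#   if (a) {
#     g();
#   })
# }
```

The stray ")" lands on the closing brace of the inner `if`, which matches the `})` near the end of `$UI` in the posted output.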

gfarrell · 08-17-2010, 11:44 AM

Not when I ran it it didn't...

konsolebox · 08-17-2010, 11:49 AM

OK, here are just some sad points. I'm sorry to have to tell you these.

I had the same problem for a purpose of my own, but I gave up and started thinking about solving it in another language, like perl or parrot. Awk really has its limits. It's not only about counting the total braces and then deducting as each block closes. It's also about knowing whether a brace is part of an ordinary string or of something else entirely. Also, what if there are 4 braces in a line, 3 of which are part of the current function while the 4th belongs to the container block holding the function? How can you tell that it belongs to the container block when you're only in the context of the function?

This can probably still be solved, but that would mean imitating a real language parser, and doing that in awk is, IMO, no longer practical: you would have to read single characters rather than phrases or lines, since otherwise you can't tell where compound statements or blocks end or are separated.

P.S. Maybe first running another script, something like HTML Tidy for awk scripts, and then using your methods would do the trick.
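As a rough illustration of how quickly this turns into parser work, here is a sketch that drops double-quoted strings and `//` comments from a copy of each line before counting braces. It is deliberately incomplete (no single-quoted strings, regex literals, or /* */ comments), and `count_code_braces` and the sample input are made up.

```shell
# Count braces only in what is (roughly) code, ignoring "..." strings and // comments.
count_code_braces() {
    awk '{ code = $0
           gsub(/"([^"\\]|\\.)*"/, "", code)   # drop double-quoted strings
           sub(/\/\/.*/, "", code)             # drop // comments
           lvl += gsub(/[{]/, "&", code) - gsub(/[}]/, "&", code)
           printf "%d: %s\n", lvl, $0 }'
}

printf '%s\n' 'var s = "}{{";' 'f = function() { // {' '}' | count_code_braces
# 0: var s = "}{{";
# 1: f = function() { // {
# 0: }
```

Every construct left out of the sketch is another special case to handle, which is the point being made above.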

gfarrell · 08-17-2010, 12:03 PM

While I realise that the method I used was not perfect (and the regex I used was; it worked during testing, just not in a bash script with perl), it worked for my purpose and therefore I'm happy enough with it. In terms of the code files I was parsing, none of those problems were encountered except in one file, which I did manually; it still saved me time on the other 29 files. I really can't be bothered to write a proper code parser because then, as you say, I'd basically be writing an interpreter, and I have absolutely no interest in doing that.

I hope you manage to work out the problems you were encountering but for me it's done its job.

grail · 08-17-2010, 12:04 PM

OK, I am really going this time, but I did run yours several times and was unable to get it not to print the extra bracket.

This is a little untidy due to tiredness, but it seems to work.

You can take any bits that help you.

Code:

awk '/: function/{sub(/function/,"(function");f=1}
     f{if(/{/)a++;
       if(/}/)a--;
       if(!a){sub(/}/,"&)");
              f=0}
}1' file

David the H. · 08-17-2010, 07:58 PM

Well, this conversation really moved beyond me while I was away. Time to bow out, I think.

Quote:

Originally Posted by konsolebox

I searched again. It appears that it can also be done in sed:

Yes, of course I know it's possible. However, as you just demonstrated, it takes a lot of mucking about with the hold buffer to build up the line before you can run the regex on it. But if sed had s & m switches similar to perl's (hmm, that doesn't sound right...), then you'd be able to simply choose to treat the newlines like any other character straight out of the starting gate. Much easier to grasp conceptually and more flexible overall.

Time to learn more about perl, I guess.

ghostdog74 · 08-17-2010, 08:35 PM

Code:

awk -vRS="}" '/myFunction:/{
    gsub(/.*myFunction:/,"myFunction: (")
    gsub(/{.[^}]*/,"...\n"); 
    print $0RT")" 
}' file