LinuxQuestions.org


schmitta 09-24-2013 03:30 PM

regular expressions
 
Is there a program for which you enter a regular expression and it tells you in english what it will perform?

sycamorex 09-24-2013 03:58 PM

Perhaps, something like that would help:

http://regex101.com/
or

http://www.myezapp.com/apps/dev/regexp/show.ws

schmitta 09-25-2013 12:11 AM

Thanks sycamorex! The myezapp site did the trick and gave me what I wanted. They did not have flex, but the Java version seems to work identically. Thanks again! Alvin....

schmitta 09-25-2013 01:10 AM

I am trying to get a label that is defined as starting with a letter then following zero to 7 letters or numbers followed by a colon. I would also like it to be the first item in the statement. I tried:

lines ::= labeldef statement
labeldef ::= [A-Za-z][A-Za-z0-9]{0,7}:

but it allows labels longer than 8 characters, matching only the last 8 chars before the colon plus the colon. How do I write the regular expression to allow only eight chars total and give an error for 9 or more characters (or for not starting with a letter)? Maybe I need to precede it with a null or get it to start at the first character on the line. Thank you. Alvin.... (Note: I am not in college doing this for a course; it is for my business.)

Firerat 09-25-2013 02:34 AM

it appears fine here, at least in bash

Code:

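Check=""
for i in a {1..10};do
  # drop the trailing colon, append the next piece, put the colon back
  Check=${Check%:}${i}:
  # print via %s so the test string is never taken as a printf format
  printf '%s ' "$Check"
  [[ $Check =~ [A-Za-z][A-Za-z0-9]{0,7}: ]] \
    && echo True \
    || echo False
done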

gets the following output

Code:

a: True
a1: True
a12: True
a123: True
a1234: True
a12345: True
a123456: True
a1234567: True
a12345678: False
a123456789: False
a12345678910: False


pan64 09-25-2013 02:52 AM

starting means ^, so you would need to write: ^[A-Za-z][A-Za-z0-9]{0,7}:

Firerat 09-25-2013 03:11 AM

good point pan64 :)

Still, it should work without ^
Code:

Check="";for i in 0 {1..10};do  Check=${Check%:}${i}:;  printf "$Check ";  [[ $Check =~ [A-Za-z][A-Za-z0-9]{0,7}: ]]    && echo True    || echo False; done
results in all False
But 100% agree, ^ should be used, along with $ on the end
Code:

Check="";for i in z {1..10};do  Check=${Check%:}${i}:;  printf "$Check ";  [[ $Check =~ ^[A-Za-z][A-Za-z0-9]{0,7}:$ ]]    && echo True    || echo False; done

As I mentioned earlier, your regexp is working in bash.

Do you have sample code where it is not working?

schmitta 09-25-2013 02:59 PM

I was using the software at:

http://regex101.com/
or

http://www.myezapp.com/apps/dev/regexp/show.ws

to test with. The myezapp program shows the match with a colored bar. For "testabc0:" all was colored. For "testabc12:", "estabc12:" was colored as a match. The label needs to be at the beginning of the line, so I will use the ^ first. Other tokens can follow, so I will leave off the $. I just tried it with http://regex101.com and now it seems to work correctly. I have: ^[A-Za-z][A-Za-z0-9]{0,7}: which now seems to work, rejecting "testabc01:" and " testabc0:" (not first in the line). Thank you for your help. Do you know if flex will reject it and just pass the no match through, or will it give me some way to flag it as an error? Thanks. Alvin...

jpollard 09-25-2013 03:34 PM

Quote:

Originally Posted by Firerat (Post 5034366)
good point pan64 :)

Still, it should work without ^

No, it should work exactly as shown: in "aaab0123456:" it would match "b0123456:", even though that is preceded by the string "aaa". Only by giving the ^ do you specify that it must match from the beginning of the string.
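To see which part of the string an unanchored pattern actually picks up, bash can print what it matched through its BASH_REMATCH array. A small sketch, assuming a bash with [[ =~ ]] support; the helper name show_match is made up for this illustration:

```shell
#!/usr/bin/env bash
# show_match is a hypothetical helper: it prints the substring that the
# given extended regex matched in the given text, or "no match".
show_match() {
  local regex=$1 text=$2
  if [[ $text =~ $regex ]]; then
    printf '%s\n' "${BASH_REMATCH[0]}"
  else
    printf 'no match\n'
  fi
}

unanchored='[A-Za-z][A-Za-z0-9]{0,7}:'
anchored='^[A-Za-z][A-Za-z0-9]{0,7}:'

show_match "$unanchored" 'aaab0123456:'   # prints b0123456:
show_match "$anchored"   'aaab0123456:'   # prints no match
show_match "$anchored"   'testabc0:'      # prints testabc0:
```

The earlier loop test never showed this because its test strings contained only one letter, so there was no later letter at which a shorter match could start.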

One other note - it is part of a flex scanner/tokenizer. Now specifying the ^ will identify it as valid, but usually this would be counted as a SEMANTIC error, rather than a syntax error. Leaving the ^ off would allow the action part to make more detailed analysis and determine the difference between a valid label, and a label that is too long, and thus provide better error diagnostics for the user to be able to make corrections faster, and more accurately.

Habitual 09-25-2013 06:00 PM

also http://www.gskinner.com/RegExr/

schmitta 09-26-2013 08:22 PM

Should I leave the ^ in or out? I changed mine to ^[A-Za-z][A-Za-z0-9]{0,7}[ \t\n] to catch a blank or tab between the labeldef and the next token on the line, or just the labeldef on the line. But how will I catch a label that is too long? It will probably just pass through, if flex works as I think.

jpollard 09-27-2013 03:45 AM

It depends on the grammar. Don't forget that a scanner is only supposed to recognize tokens. Whether the grammar treats white space as a token or just as a separator makes a difference.

The scanner can easily consume white space, if it isn't significant, with a very simple rule. For instance, if the grammar specifies a label as:

Code:

label : symbol ':' {whatever to do with a label}
      .

Then the scanner only has to identify what a symbol is. Length of a symbol is not really relevant to the grammar. The action part of the grammar can look at the length and decide if it is too long, report an error (label length too long) that is specific to the label.

If all symbols are limited then the scanner can identify the error, but still not abort scanning - translators work best by identifying as many errors as possible, and not terminate on the first one.

The scanner could identify the label with:
Code:

id   [A-Za-z][A-Za-z0-9]*
ws   [ \t]
nl   \n
coln ":"
%%
{id}    { return ID; }      /* a letter, then letters or digits */
{coln}  { return COLON; }
..... /* other tokens */
{ws}    ;  /* discard */
{nl}    { linecount++; }

Now the whitespace is discarded - but the newline is checked for to maintain a line count for error messages. This allows the grammar to identify whether something is a label or not with:
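The same division of labour can be tried out without flex. This is only a rough bash sketch of the idea, not what flex generates: the function name tokenize, the tokens array and the OTHER token are illustrative assumptions.

```shell
#!/usr/bin/env bash
# tokenize is a hypothetical stand-in for the generated scanner. It appends
# tokens to the global array "tokens" and keeps the current line number in
# the global "linecount", the way the flex rules above would.
linecount=1
tokens=()
tokenize() {
  local rest=$1 tok
  local id_re='^[A-Za-z][A-Za-z0-9]*'
  local ws_re='^[[:blank:]]+'
  while [[ -n $rest ]]; do
    if [[ $rest =~ $id_re ]]; then
      tok=${BASH_REMATCH[0]}
      tokens+=("ID($tok)")
    elif [[ $rest =~ $ws_re ]]; then
      tok=${BASH_REMATCH[0]}            # spaces and tabs: discard
    else
      tok=${rest:0:1}                   # any other single character
      case $tok in
        :)     tokens+=("COLON") ;;
        $'\n') linecount=$((linecount + 1)) ;;
        *)     tokens+=("OTHER($tok)") ;;
      esac
    fi
    rest=${rest:${#tok}}                # consume what was just recognized
  done
}

tokenize $'loop:  next\nend:'
printf '%s\n' "${tokens[@]}"
echo "now on line $linecount"
```

Note that the scanner here knows nothing about labels. It only reports ID and COLON, and it is the grammar that decides an ID followed by a COLON is a label.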

Code:

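label:  ID COLON    { /* assumes the scanner hands over the identifier text
                         as a char * value; strlen needs <string.h> */
                      if (strlen($1) > 8) {
                            /* print error message with linecount */
                            errorcount++;
                      }
                      /* do other stuff with label - update symbol table... */
                    }
        ;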
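The effect of doing the length check in the action, rather than in the pattern, can be sketched in bash as follows. The function name check_labels and the message wording are made up for the example; the limit of 8 comes from the original question.

```shell
#!/usr/bin/env bash
# check_labels is a hypothetical helper: it reads source lines on stdin,
# accepts any "symbol:" at the start of a line, and only afterwards checks
# the length, so an over-long label gets its own specific error message.
errorcount=0
check_labels() {
  local line label lineno=0
  local label_re='^([A-Za-z][A-Za-z0-9]*):'
  while IFS= read -r line; do
    lineno=$((lineno + 1))
    [[ $line =~ $label_re ]] || continue      # no label on this line
    label=${BASH_REMATCH[1]}
    if (( ${#label} > 8 )); then
      printf 'line %d: label "%s" is too long (%d characters, limit 8)\n' \
        "$lineno" "$label" "${#label}" >&2
      errorcount=$((errorcount + 1))
      continue                                # keep going to find more errors
    fi
    printf 'line %d: label %s\n' "$lineno" "$label"
  done
}

check_labels <<'EOF'
start: LET X = 1
testabc01: PRINT X
verylonglabel: GOTO start
EOF
echo "errors: $errorcount"
```

Both over-long labels are reported in a single run, which is the point about not stopping at the first error.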

It really depends on how you decide to handle things, and how complex the grammar is. Scanners generated by flex are ok - though sometimes they are not terribly clear and sometimes it is easier to create one by hand (especially simple ones).
Flex is good to use when speed of implementation is more important, but it does assume you are already familiar with how scanners are used and, if you are interfacing it with bison, with the details of that (such as generating the appropriate include files...).

They are supposed to make it easier for the parser to handle things and separate semantics from tokenizing - and help make error messages more meaningful and parsing recovery possible. The simplest error handling is to abort on the first error - but that makes USING the application/translator/... much harder as you have to keep re-running the application just to find the next error.

schmitta 09-28-2013 01:49 PM

Thanks jpollard. I really appreciate the extra effort you took to make those points. Alvin...

schmitta 09-30-2013 04:14 PM

I am writing a BASIC interpreter. I was going to write the BNF and run it through flex and bison, but I am not sure they would be appropriate for writing an interpreter. The original idea was to generate the interpreter with flex and bison and compile the C code to run on an MCU. The MCU has 170k words of flash and 56kb of RAM in a Harvard architecture. But I am considering using flex and bison to write a C program that would run on a PC and generate an intermediate pseudo-assembler language to be interpreted on the MCU. I am including DCL (declare) statements in the BASIC, which I would like to find on an initial pass through the code. Pass zero would also find forward branches and the WEND in WHILE ... WEND statements. Can multiple passes be done with flex and bison, or is there a better way? The PC program would be written in Java so as to run under Windows, Mac and Linux systems. Any ideas you have would be greatly appreciated.
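Whatever tool ends up doing the parsing, a pass zero like the one described is mostly bookkeeping. A rough bash sketch, assuming one statement per line with the keyword as the first word and no line numbers in front; the names pass_zero, wend_for and dcl_lines are invented here, and forward branches are left out:

```shell
#!/usr/bin/env bash
# pass_zero is a hypothetical pre-pass: it reads BASIC source on stdin,
# notes which lines hold DCL statements, and records for each WHILE line
# the line number of its matching WEND.
wend_for=()        # wend_for[while_line] = wend_line
dcl_lines=()       # line numbers of DCL statements
pass_zero() {
  local line word lineno=0 top
  local -a stack=()                 # line numbers of still-open WHILEs
  while IFS= read -r line; do
    lineno=$((lineno + 1))
    read -r word _ <<< "$line"      # first word of the statement
    case $word in
      DCL)   dcl_lines+=("$lineno") ;;
      WHILE) stack+=("$lineno") ;;
      WEND)
        if (( ${#stack[@]} == 0 )); then
          echo "line $lineno: WEND without WHILE" >&2
          continue
        fi
        top=$(( ${#stack[@]} - 1 ))
        wend_for[${stack[top]}]=$lineno   # pair with the innermost WHILE
        unset "stack[$top]"
        ;;
    esac
  done
  for top in "${stack[@]}"; do
    echo "line $top: WHILE without WEND" >&2
  done
}

pass_zero <<'EOF'
DCL X
WHILE X < 10
WHILE Y < 3
WEND
WEND
EOF
for w in "${!wend_for[@]}"; do
  echo "WHILE at line $w ends at line ${wend_for[$w]}"
done
```

A later pass can then look up where each loop ends without scanning ahead.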

jpollard 09-30-2013 04:54 PM

The first pass of a compiler translates the source into something more amenable to analysis. This is the intermediate language that can be in one of many forms - a parse tree plus symbol table, or multiple parse trees and symbol tables (this is the one I'm most familiar with, but there are others). Consider the inclusion of subroutines/functions for instance. The ability to include "precompiled" parse trees (or whatever the intermediate language is) allows for global optimization of the code. After the optimization pass, a third pass can be made that generates the optimized assembler.

You might want to look into the LLVM/Clang compilers (http://llvm.org/) - they are designed for this.

Another back end (which I have only read about, not used) is the one used for Android: the Dalvik bytecode interpreter. It is supposed to provide an efficient interpreter with a minimum of actual code. It allows a higher-level language to be converted into a smaller size (good), though it is slower than native code (bad). It is MUCH easier to generate good code for, and the code runs on anything that the interpreter runs on (making it easier to develop for).

This is also the goal of the Forth language (http://www.forth.org/) - a very small interpreter, with an easily parsed language, to run on very small processors.


All times are GMT -5.