Parsing Challenge - Fixing Broken Syntax

I have thousands of lines of code that use a particular non-standard syntax. I need to compile the code with a different compiler that does not support this syntax. I have tried to automate the necessary changes, but I am not very good with regex and similar tools, and I have failed.
Here is what I want to achieve: currently in my code an object's methods and variables are called/accessed with the following possible syntaxes:
call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2
Instead I want this to be:
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
And I want to make these changes without affecting the following possible occurrences of "."s:
Decimal numbers:
a = 1.0
b = 1.d0
Logical operators (note possible spaces and method calls):
if (a.or.b) then
if ( a .and. .not.(obj.l1(1.d0)) ) then
Anything that is commented (the exclamation point "!" is used for this purpose)
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
Anything that is in quotes (i.e. string literals)
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
Does anyone know how to approach this? I guess regex is the natural approach, but I am open to anything. (In case anyone cares: the code is written in Fortran. ifort is happy with the "." syntax; gfortran isn't.)
Isolating method calls is pretty easy. The difficult part will be matching things like obj1.var = obj2.var2 without matching b = 1.d0. I'm not sure you'll be able to write a pattern tight enough to change what you want without changing more than what you want. – emsimpson92
1 hour ago
Maybe you could try doing it in two steps
– emsimpson92
1 hour ago
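The distinction raised in the comments above can be illustrated with a small Python sketch. It relies on one fact: a Fortran name starts with a letter, while a numeric literal such as 1.d0 starts with a digit. The names MEMBER_DOT and convert_members are made up for this illustration, and the sketch assumes that strings, comments and dotted operators such as .or. have already been masked in an earlier step, which is one way to read the "two steps" suggestion.

```python
import re

# A name, a dot, then the start of another name. Fortran names begin with a
# letter, so the digit-led "1." in "1.d0" cannot match this pattern.
MEMBER_DOT = re.compile(r"\b([A-Za-z]\w*)\.(?=[A-Za-z])")


def convert_members(code: str) -> str:
    """Replace the dot in name.name with %, leaving numeric literals alone.

    Assumes the input holds no strings, comments or dotted operators;
    a.or.b would be converted too, so those must be masked beforehand.
    """
    return MEMBER_DOT.sub(r"\1%", code)


print(convert_members("obj1.var = obj2.var2"))  # obj1%var = obj2%var2
print(convert_members("b = 1.d0"))              # b = 1.d0
```

On its own this pattern is not a complete solution; it only shows that the decimal-number case can be separated from the member-access case by looking at what precedes the dot.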
2 Answers
You can't do this 100% robustly without a language parser (e.g. the following will fail in some cases if you have " inside double quoted strings - easily handled but just one of many possible failures not covered by your use cases) but this will handle what you've shown us so far and a bit more. It uses GNU awk for gensub() and the 3rd arg to match().
Sample Input:
$ cat file
call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj.l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1.var()
Expected Output:
$ cat out
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()
The Script:
$ cat tst.awk
{
    # give us the ability to use @<any other char> strings as a
    # replacement/placeholder strings that cannot exist in the input.
    gsub(/@/,"@=")
    # ignore all !s inside double-quoted strings
    while ( match($0,/("[^"]*)!([^"]*")/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@-" a[2] substr($0,RSTART+RLENGTH)
    }
    # ignore all !s inside single-quoted strings
    while ( match($0,/('[^']*)!([^']*')/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@-" a[2] substr($0,RSTART+RLENGTH)
    }
    # Now we can separate comments from what comes before them
    comment = gensub(/[^!]*/,"",1)
    $0 = gensub(/!.*/,"",1)
    # ignore all .s inside double-quoted strings
    while ( match($0,/("[^"]*)\.([^"]*")/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@#" a[2] substr($0,RSTART+RLENGTH)
    }
    # ignore all .s inside single-quoted strings
    while ( match($0,/('[^']*)\.([^']*')/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@#" a[2] substr($0,RSTART+RLENGTH)
    }
    # convert all logical operators like a.or.b to a@#or@#b so the .s wont get replaced later
    while ( match($0,/\.([[:alpha:]]+)\./,a) ) {
        $0 = substr($0,1,RSTART-1) "@#" a[1] "@#" substr($0,RSTART+RLENGTH)
    }
    # convert all obj.var and similar to obj%var, etc.
    while ( match($0,/\<([[:alpha:]]+[[:alnum:]_]*)[.]([[:alpha:]]+[[:alnum:]_]*)\>/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "%" a[2] substr($0,RSTART+RLENGTH)
    }
    # Convert all @#s in the precomment text back to .s
    gsub(/@#/,".")
    # Add the comment back
    $0 = $0 comment
    # Convert all @-s back to !s
    gsub(/@-/,"!")
    # Convert all @=s back to @s
    gsub(/@=/,"@")
    print
}
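The step that separates a comment from the code before it can also be written as a character scan, which sidesteps the quote-inside-a-string failure mentioned at the top of this answer. The following is a minimal Python sketch, not part of the awk script; find_comment_start is a hypothetical name, and it assumes free-form source in which a string never continues onto the next line.

```python
def find_comment_start(line: str) -> int:
    """Return the index of the first '!' outside a string literal, or -1."""
    quote = None  # the quote character that opened the current string, if any
    for i, ch in enumerate(line):
        if quote is not None:
            # Inside a string: only the same quote character closes it, so
            # a ' inside "..." (or a " inside '...') is ordinary text.
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "!":
            return i
    return -1


line = 'c = "it\'s here!" ! a real comment'
start = find_comment_start(line)
print(line[:start])  # code part
print(line[start:])  # comment part
```

A doubled quote such as 'don''t' is seen as two adjacent strings, which is harmless here because the scan only needs to know whether it is inside a string at each '!'.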
Running The Script And Its Output:
$ awk -f tst.awk file
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()
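The placeholder trick used at the start and end of the script (turn every @ into @= first, so that @- and @# cannot occur in the input) may be easier to see in isolation. Here is a small Python sketch of just that round trip; escape_at and unescape_at are names invented for this illustration.

```python
def escape_at(text: str) -> str:
    """Rewrite every '@' as '@=' so that '@-' and '@#' cannot occur in text."""
    return text.replace("@", "@=")


def unescape_at(text: str) -> str:
    """Undo escape_at; call this only after all placeholders are restored."""
    return text.replace("@=", "@")


line = 'c = "mail me @-home!"'
work = escape_at(line)          # the literal "@-" becomes "@=-"
work = work.replace("!", "@-")  # hide the '!' behind a placeholder
# ... edits that must not see the '!' would run here ...
work = work.replace("@-", "!")  # restore the placeholder first
assert unescape_at(work) == line
```

The order of the last two steps matters: placeholders are restored before the @= escape is undone, otherwise an input that already contained @- could not be told apart from a hidden character.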
Have you looked into solving the problem with flex? It uses regular expressions, but is more advanced, as it tries different patterns and returns the longest matching option. The rules could look like this:
%%                     /* rule part of the program */
!.*\n                  ECHO;             /* ignore comments */
\"[^"\n]*\"|'[^'\n]*'  ECHO;             /* ignore strings */
[^A-Za-z_][0-9]+\.     ECHO;             /* ignore numbers */
".and."|".or."|".not." ECHO;             /* ignore logical operators */
\.                     printf("%%");     /* now, replace the . by % */
[^.]                   ECHO;             /* ignore everything else */
%%                     /* invoke the program */
int main() {
    yylex();
    return 0;
}
You may have to modify the third rule. Currently it ignores any . that occurs after one or more digits, provided the character before the digits is not a letter from A to Z or a to z, and not the character _. If there are more legal characters in identifiers, you can add them.
If everything is correct, you should be able to turn that into a program. Copy it into a file called lex.l and execute:
$ flex -o lex.yy.c lex.l
$ gcc -o lex.out lex.yy.c -lfl
Then you have the executable lex.out. You can use it on the command line:
cat unreplaced.txt | ./lex.out > replaced.txt
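For readers without flex at hand, the same rule-per-token idea can be sketched in Python with one regular expression. This is not a translation of the flex program: Python tries the alternatives in the order written instead of picking the longest match, names are matched as whole tokens so that digits inside a name such as obj12 are not mistaken for a number, and the list of dotted operators is limited to the standard Fortran ones. TOKEN and dots_to_percent are hypothetical names.

```python
import re

# Dotted operators and logical literals of standard Fortran.
_OPS = r"(?:and|or|not|eqv|neqv|eq|ne|lt|le|gt|ge|true|false)"

TOKEN = re.compile(
    rf"""
    (?P<keep>
        !.*                            # comment: runs to the end of the line
      | "[^"\n]*" | '[^'\n]*'          # string literal
      | \.{_OPS}\.                     # dotted operator or logical literal
      | \d+(?:\.(?!{_OPS}\.)\d*)?      # number: 1, 1.0, or the 1. of 1.d0
      | \.\d+                          # number without a leading digit
      | [A-Za-z_]\w*                   # name or keyword
    )
    | (?P<dot>\.)                      # any dot left over is a member access
    """,
    re.VERBOSE | re.IGNORECASE,
)


def dots_to_percent(source: str) -> str:
    """Replace member-access dots with %, copying every other token as is."""
    return TOKEN.sub(lambda m: "%" if m.group("dot") else m.group(0), source)


print(dots_to_percent("if ( a .and. .not.(obj.l1(1.d0)) ) then"))
# if ( a .and. .not.(obj%l1(1.d0)) ) then
```

A member that happens to be named like an operator (for example obj.or.x) stays ambiguous under this scheme, so the result is still worth checking by hand or with a compiler.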
Put all of your input examples in one file so we have just one input file to test with, and provide the expected output for that file. Add a more complicated example that has to be handled, e.g. c="I am a string!"; b=a1.var() would be tricky because in that case the ! is not the start of a comment. – Ed Morton
2 hours ago