m4: Input processing

 
 3.5 How 'm4' copies input to output
 ===================================
 
 As 'm4' reads the input token by token, it will copy each token directly
 to the output immediately.
 
    The exception is when it finds a word with a macro definition.  In
 that case 'm4' will calculate the macro's expansion, possibly reading
 more input to get the arguments.  It then inserts the expansion in front
 of the remaining input.  In other words, the resulting text from a macro
 call will be read and parsed into tokens again.
 
    'm4' expands a macro as soon as possible.  If it finds a macro call
 when collecting the arguments to another, it will expand the second call
 first.  This process continues until there are no more macro calls to
 expand and all the input has been consumed.
 
    For a running example, examine how 'm4' handles this input:
 
      format(`Result is %d', eval(`2**15'))
 
 First, 'm4' sees that the token 'format' is a macro name, so it collects
 the tokens '(', '`Result is %d'', ',', and ' ', before encountering
 another potential macro.  Sure enough, 'eval' is a macro name, so the
 nested argument collection picks up '(', '`2**15'', and ')', invoking
 the eval macro with the lone argument of '2**15'.  The expansion of
 'eval(2**15)' is '32768', which is then rescanned as the five tokens
 '3', '2', '7', '6', and '8'; and combined with the next ')', the format
 macro now has all its arguments, as if the user had typed:
 
      format(`Result is %d', 32768)
 
 The format macro expands to 'Result is 32768', and we have another round
 of scanning for the tokens 'Result', ' ', 'is', ' ', '3', '2', '7', '6',
 and '8'.  None of these are macros, so the final output is
 
      =>Result is 32768
 
    As a more complicated example, we will contrast an actual code
 example from the Gnulib project(1), showing both a buggy approach and
 the desired results.  The user desires to output a shell assignment
 statement that takes its argument and turns it into a shell variable by
 converting it to uppercase and prepending a prefix.  The original
 attempt looks like this:
 
      changequote([,])dnl
      define([gl_STRING_MODULE_INDICATOR],
        [
          dnl comment
          GNULIB_]translit([$1],[a-z],[A-Z])[=1
        ])dnl
        gl_STRING_MODULE_INDICATOR([strcase])
      =>  
      =>        GNULIB_strcase=1
      =>  
 
    Oops - the argument did not get capitalized.  And although the manual
 is not able to easily show it, both lines that appear empty actually
 contain two trailing spaces.  By stepping through the parse, it is easy
 to see what happened.  First, 'm4' sees the token 'changequote', which
 it recognizes as a macro, followed by '(', '[', ',', ']', and ')' to
 form the argument list.  The macro expands to the empty string, but
 changes the quoting characters to something more useful for generating
 shell code (unbalanced '`' and ''' appear all the time in shell scripts,
 but unbalanced '[]' tend to be rare).  Also in the first line, 'm4' sees
 the token 'dnl', which it recognizes as a builtin macro that consumes
 the rest of the line, resulting in no output for that line.
 
    The second line starts a macro definition.  'm4' sees the token
 'define', which it recognizes as a macro, followed by a '(',
 '[gl_STRING_MODULE_INDICATOR]', and ','.  Because an unquoted comma was
 encountered, the first argument is known to be the expansion of the
 single-quoted string token, or 'gl_STRING_MODULE_INDICATOR'.  Next, 'm4'
 sees '<NL>', ' ', and ' ', but this whitespace is discarded as part of
 argument collection.  Then comes a rather lengthy single-quoted string
 token, '[<NL>    dnl comment<NL>    GNULIB_]'.  This is followed by the
 token 'translit', which 'm4' recognizes as a macro name, so a nested
 macro expansion has started.
 
    The arguments to the 'translit' are found by the tokens '(', '[$1]',
 ',', '[a-z]', ',', '[A-Z]', and finally ')'.  All three string arguments
 are expanded (or in other words, the quotes are stripped), and since
 neither '$' nor '1' need capitalization, the result of the macro is
 '$1'.  This expansion is rescanned, resulting in the two literal
 characters '$' and '1'.
 
    Scanning of the outer macro resumes, and picks up with '[=1<NL>  ]',
 and finally ')'.  The collected pieces of expanded text are
 concatenated, with the end result that the macro
 'gl_STRING_MODULE_INDICATOR' is now defined to be the sequence
 '<NL>    dnl comment<NL>    GNULIB_$1=1<NL>  '.  Once again, 'dnl' is
 recognized and avoids a newline in the output.
 
    The final line is then parsed, beginning with ' ' and ' ' that are
 output literally.  Then 'gl_STRING_MODULE_INDICATOR' is recognized as a
 macro name, with an argument list of '(', '[strcase]', and ')'.  Since
 the definition of the macro contains the sequence '$1', that sequence is
 replaced with the argument 'strcase' prior to starting the rescan.  The
 rescan sees '<NL>' and four spaces, which are output literally, then
 'dnl', which discards the text ' comment<NL>'.  Next comes four more
 spaces, also output literally, and the token 'GNULIB_strcase', which
 resulted from the earlier parameter substitution.  Since that is not a
 macro name, it is output literally, followed by the literal tokens '=',
 '1', '<NL>', and two more spaces.  Finally, the original '<NL>' seen
 after the macro invocation is scanned and output literally.
 
    Now for a corrected approach.  This rearranges the use of newlines
 and whitespace so that less whitespace is output (which, although
 harmless to shell scripts, can be visually unappealing), and fixes the
 quoting issues so that the capitalization occurs when the macro
 'gl_STRING_MODULE_INDICATOR' is invoked, rather then when it is defined.
 It also adds another layer of quoting to the first argument of
 'translit', to ensure that the output will be rescanned as a string
 rather than a potential uppercase macro name needing further expansion.
 
      changequote([,])dnl
      define([gl_STRING_MODULE_INDICATOR],
        [dnl comment
        GNULIB_[]translit([[$1]], [a-z], [A-Z])=1dnl
      ])dnl
        gl_STRING_MODULE_INDICATOR([strcase])
      =>    GNULIB_STRCASE=1
 
    The parsing of the first line is unchanged.  The second line sees the
 name of the macro to define, then sees the discarded '<NL>' and two
 spaces, as before.  But this time, the next token is '[dnl
 comment<NL>  GNULIB_[]translit([[$1]], [a-z], [A-Z])=1dnl<NL>]', which
 includes nested quotes, followed by ')' to end the macro definition and
 'dnl' to skip the newline.  No early expansion of 'translit' occurs, so
 the entire string becomes the definition of the macro.
 
    The final line is then parsed, beginning with two spaces that are
 output literally, and an invocation of 'gl_STRING_MODULE_INDICATOR' with
 the argument 'strcase'.  Again, the '$1' in the macro definition is
 substituted prior to rescanning.  Rescanning first encounters 'dnl', and
 discards ' comment<NL>'.  Then two spaces are output literally.  Next
 comes the token 'GNULIB_', but that is not a macro, so it is output
 literally.  The token '[]' is an empty string, so it does not affect
 output.  Then the token 'translit' is encountered.
 
    This time, the arguments to 'translit' are parsed as '(',
 '[[strcase]]', ',', ' ', '[a-z]', ',', ' ', '[A-Z]', and ')'.  The two
 spaces are discarded, and the translit results in the desired result
 '[STRCASE]'.  This is rescanned, but since it is a string, the quotes
 are stripped and the only output is a literal 'STRCASE'.  Then the
 scanner sees '=' and '1', which are output literally, followed by 'dnl'
 which discards the rest of the definition of
 'gl_STRING_MODULE_INDICATOR'.  The newline at the end of output is the
 literal '<NL>' that appeared after the invocation of the macro.
 
    The order in which 'm4' expands the macros can be further explored
 using the trace facilities of GNU 'm4' (⇒Trace).
 
    ---------- Footnotes ----------
 
    (1) Derived from a patch in
 <http://lists.gnu.org/archive/html/bug-gnulib/2007-01/msg00389.html>,
 and a followup patch in
 <http://lists.gnu.org/archive/html/bug-gnulib/2007-02/msg00000.html>