  1. #1

    ANTLR - When to use Parser Rules vs Lexer Rules?

    In trying to take the lessons from the videos and write my own parser for Excel formulas, I'm suddenly realizing I really don't understand when to use a lexer rule and when to use a parser rule. Can someone explain when something stops being lexing and starts being parsing?

    In my mind there seems to be a spectrum, where some things are clearly parser rules and others are clearly lexer rules. But it's the fuzzy line where lexer rules transition into parser rules that I feel like I don't get. What's to say I don't parse at a level of complexity higher or lower than my chosen point?
    Last edited by sdiguana; 06-24-2015 at 06:53 PM.

  2. #2
    Most times you want to parse text, you're going to do three steps.
    1) Turn characters into tokens
    2) Turn tokens into an abstract syntax tree
    3) Use the abstract syntax tree to do some kind of processing

    The lexer is responsible for the first step, and its *only* job is to create a "token stream" from text. It is not responsible for understanding the semantics of your language; it is only interested in understanding the syntax of your language.

    For example, syntax is the rule that an identifier must only use letters, numbers and underscores - as long as it doesn't start with a number. The responsibility of the lexer is to understand this rule. In this case, the lexer would accept the sequence of characters "asd_123" but reject "12dsadsa" (assuming that there isn't another rule in which this text is valid). When it sees the valid text example, it may emit a token into the token stream such as IDENTIFIER(asd_123).
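
    In ANTLR 4 notation, that identifier rule might look like this (a minimal sketch; the token name is the one used in the examples below):
    Code:
    // A letter or underscore, followed by any mix of letters, digits and underscores.
    // "asd_123" matches; "12dsadsa" cannot start a match because it begins with a digit.
    IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]* ;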

    Note that I said "identifier" which is the general term for things like variable names, function names, namespace names, etc. The parser would be the thing that would understand the context in which that identifier appears, so that it would then further specify that token as being a certain thing's name.

    (sidenote: the token is just a unique name given to an element of the token stream. The lexeme is the text that the token was matched from. I write the lexeme in parentheses next to the token. For example, NUMBER(123). In this case, this is a NUMBER token with a lexeme of '123'. However, with some tokens, such as operators, I omit the lexeme since it's redundant. For example, I would write SEMICOLON for the semicolon token, not SEMICOLON( ; )).

    When the lexer is done, it will output a token stream (assuming that there were not any errors). It would take a string such as:
    Code:
    var test = 1 + 2;
    And turn it into a token stream of:
    Code:
    KEYWORD_VAR IDENTIFIER(test) OPERATOR_ASSIGNMENT NUMBER(1) OPERATOR_ADDITION NUMBER(2) SEMICOLON
    Notice that in this example, I'm ignoring whitespace in the lexer (as in, the lexer just reads whitespace and discards it).
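
    For reference, here's a hypothetical ANTLR 4 lexer grammar that could produce that exact token stream (rule names chosen to match the tokens above - when two rules match text of the same length, ANTLR picks the one listed first, so KEYWORD_VAR must come before IDENTIFIER):
    Code:
    KEYWORD_VAR         : 'var' ;               // listed before IDENTIFIER so 'var' wins
    OPERATOR_ASSIGNMENT : '=' ;
    OPERATOR_ADDITION   : '+' ;
    SEMICOLON           : ';' ;
    NUMBER              : [0-9]+ ;
    IDENTIFIER          : [a-zA-Z_] [a-zA-Z_0-9]* ;
    WS                  : [ \t\r\n]+ -> skip ;  // read whitespace, discard it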

    What is important here is that the token stream is FLAT and has no structure or hierarchy. It is literally just an array of tokens.

    Semantics are the meaning of your language, and understanding it is the parser's job. The parser is responsible for creating structure from the "flat" token stream. Almost always, a parser will emit an abstract syntax tree. Remember that the lexer has no concept of the structure or meaning of your language, whereas your parser does. Taking our example from earlier, the parser would extract meaning from our statement, such as understanding that this is an ASSIGNMENT node between a NEW_VARIABLE and an EXPRESSION, and that the EXPRESSION node is an ADDITION node between NUMBER(1) and NUMBER(2) nodes.

    For example, the tree would look something like:
    Code:
    (assignment ((newVariable IDENTIFIER(test)) (expression (add NUMBER(1) NUMBER(2)))))
    Take a close look at the shift in language. With the lexer, we only spoke of simple TOKENS that described what things are, but with the parser we speak in terms of what things mean. You would never use a lexer to understand the concept of addition; yet you would never use a parser to understand the concept of what defines a number, or a string literal, or a semicolon.

    So we've introduced, in the parser, the concept of assignment, newVariable, expression and add nodes. These aren't tokens, these are nodes in our abstract syntax tree. ANTLR sometimes makes this a little more complicated, since tokens can be used as nodes - but nodes cannot be used as tokens. Notice in this example how I am differentiating tokens from nodes - tokens are uppercase whereas nodes are not. However, they're both used in the AST.
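
    ANTLR 4 grammars use the same case convention: lexer rules start uppercase, parser rules start lowercase. A minimal sketch of parser rules that could give our statement its structure (the rule names here are illustrative, not from a real grammar):
    Code:
    statement  : assignment SEMICOLON ;   // every statement must end with a semicolon
    assignment : KEYWORD_VAR IDENTIFIER OPERATOR_ASSIGNMENT expression ;
    expression : expression OPERATOR_ADDITION expression   // becomes the 'add' node
               | NUMBER
               | IDENTIFIER
               ;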

    The parser is also responsible for enforcing the semantic rules of our language. For example, imagine that every statement must end with a semicolon - or that the block "invalid(true) { alert("HI"); }" is invalid, since the IDENTIFIER(invalid) token is not a language keyword such as if, while, for, etc.

    You may have multiple passes of a parser, which simplify the tree. Perhaps you do some optimization during processing, too, such as converting the literal expression "1 + 1" into an equivalent node of "2". Then, all the way at the end, you simply navigate your tree and do whatever it is your language does: emit bytecode, populate a data structure, etc.

    To answer your question: if you need to figure out what something syntactically is, use a lexer rule. If you need to extract meaning from a sequence of already-lexed tokens, use a parser rule. Parsers are intelligent, lexers are dumb. Because of that, many things are possible with parsers that are impossible with lexers. Does that make sense?
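
    Applied to your Excel-formula case, the split might look something like this (a hypothetical sketch - the rule names are mine, not from any real Excel grammar; the lexer rules name what the pieces are, the parser rules say what they mean):
    Code:
    grammar ExcelFormula;

    formula : '=' expr EOF ;
    expr    : expr ('*'|'/') expr                // meaning: multiplication/division
            | expr ('+'|'-') expr                // meaning: addition/subtraction
            | FUNC_NAME '(' (expr (',' expr)*)? ')'
            | CELL_REF
            | NUMBER
            ;

    // CELL_REF is listed first, so on a tie like "A1" (which FUNC_NAME could
    // also match) the cell reference wins.
    CELL_REF  : [A-Z]+ [0-9]+ ;                  // e.g. A1, BC23
    FUNC_NAME : [A-Z] [A-Z0-9]* ;                // e.g. SUM, IF
    NUMBER    : [0-9]+ ('.' [0-9]+)? ;
    WS        : [ \t]+ -> skip ;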
    Need any help? Feel free to PM me - or send an email directly to nelson@3dbuzz.com!

  3. #3
    It does, thanks Nelson. I thought I was understanding it well... until I tried it out myself. This definitely clears it up for me, though. In my Excel grammar, I was starting to do combinations as lexer rules that were more like parser rules, and the mental red light went off, but I couldn't explain to myself why it looked wrong.

    I picked up the ANTLR reference book as well, and it is also helping a ton.

    On a tangentially related note: I have to say that for ANTLR 4, having to go through a test rig instead of the interpreter is kind of painful. I can see why you didn't spend too long on it in the videos for Ch 9. It took me most of an evening to get everything sorted out before I was happy I could work in v4.
