World Library  
Flag as Inappropriate
Email this Article

Scannerless parsing

Article Id: WHEBN0003235935
Reproduction Date:

Title: Scannerless parsing  
Author: World Heritage Encyclopedia
Language: English
Subject: Syntax Definition Formalism, Scannerless Boolean Parser, Compiler construction, Identifier, Compiler
Collection: Parsing Algorithms
Publisher: World Heritage Encyclopedia
Publication
Date:
 

Scannerless parsing

In computer science, scannerless parsing (also called lexerless parsing) refers to performing both tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeline of a lexer followed by a parser, executing concurrently. It also refers to the associated grammar: using a single formalism to express both the lexical (word level) grammar and phrase level grammar used to parse a language.

Dividing processing up into a lexer followed by a parser is generally viewed as better design because it is more modular, and scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include TeX, most wiki grammars, makefiles, simple per application scripting languages, and Perl 6.

Contents

  • Advantages 1
  • Disadvantages 2
  • Required extensions 3
  • Implementations 4
  • Notes 5
  • Further reading 6

Advantages

  • Only one metasyntax is needed
  • Non-regular lexical structure is handled easily
  • "Token classification" is unneeded which removes the need for design accommodations such as "the lexer hack" and language keywords (such as "while" in C)
  • Grammars can be compositional (can be merged without human intervention)

Disadvantages

  • Since the lexical scanning and syntactic parsing processing is combined, the resulting parser tends to be harder to understand and debug for more complex languages
  • Most parsers of character-level grammars are nondeterministic
  • There is no guarantee that the resulting grammar is unambiguous

Required extensions

Eelco Visser identified five key extensions to classical context-free syntax which handle almost all common non-context-free constructs arising in practice:

  • Follow restrictions, a limited form of "longest match"
  • Reject productions, a limited form of negative matching (as found in boolean grammars)
  • Preference attributes to handle the dangling else construct in C-like languages
  • Per-production transitions rather than per-nonterminal transitions in order to facilitate:
    • Associativity attributes, which prevent a self-reference in a particular production of a nonterminal from producing that same production
    • Precedence/priority rules, which prevent self-references in higher-precedence productions from producing lower-precedence productions

Implementations

  • SGLR is a parser for the modular Syntax Definition Formalism SDF, and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system.
  • JSGLR, a pure Java implementation of SGLR, also based on SDF.
  • TXL supports character-level parsing.
  • dparser generates ANSI C code for scannerless GLR parsers.
  • Spirit allows for both scannerless and scanner-based parsing.
  • SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.
  • Laja is a two phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
  • The Perl 6 Grammars feature of the general purpose programming language Perl 6.
  • PyParsing is a scannerless parser written in pure Python.
  • META II Has built in token parsers functions.
  • TREE-META Like META II also is scanerless having builtin lexer functions.
  • CWIC Compiler for Writing and Implementing Compilers. Has token rules as part of its language. Rules in CWIC were compiled into boolean functions returning success or failure.

Notes

  • ^ This is because parsing at the character level makes the language recognized by the parser a single context-free language defined on characters, as opposed to a context-free language of sequences of strings in regular languages. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.

Further reading

This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply. World Heritage Encyclopedia content is assembled from numerous content providers, Open Access Publishing, and in compliance with The Fair Access to Science and Technology Research Act (FASTR), Wikimedia Foundation, Inc., Public Library of Science, The Encyclopedia of Life, Open Book Publishers (OBP), PubMed, U.S. National Library of Medicine, National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health (NIH), U.S. Department of Health & Human Services, and USA.gov, which sources content from all federal, state, local, tribal, and territorial government publication portals (.gov, .mil, .edu). Funding for USA.gov and content contributors is made possible from the U.S. Congress, E-Government Act of 2002.
 
Crowd sourced content that is contributed to World Heritage Encyclopedia is peer reviewed and edited by our editorial staff to ensure quality scholarly research articles.
 
By using this site, you agree to the Terms of Use and Privacy Policy. World Heritage Encyclopedia™ is a registered trademark of the World Public Library Association, a non-profit organization.
 


Copyright © World Library Foundation. All rights reserved. eBooks from Project Gutenberg are sponsored by the World Library Foundation,
a 501c(4) Member's Support Non-Profit Organization, and is NOT affiliated with any governmental agency or department.