|
Comments
Did you read today's front page stories & breaking news?
SYS-CON.TV
|
General Java JavaCC - The Evolution of New Wave Parser Generation Technology
JavaCC - The Evolution of New Wave Parser Generation Technology
By: Sriram Sankar
Mar. 1, 2002 12:00 AM
The technology to automatically generate a parser from this syntax specification has existed for around 20 years and is now mature enough to use in a product setting. A parser generator is a software program that accepts a syntax specification as input and generates a parser for that syntax as output. JavaCC (Java Compiler Compiler) is a parser generator that embodies the state of the art in parser generation technology and generates parsers in Java. Sreeni Viswanadha and I jointly developed JavaCC while we were at Sun Microsystems. Robert Duncan also made some important contributions. Although JavaCC is owned by Sun Microsystems and Sreeni and I are currently at WebGain, we continue to maintain JavaCC. JavaCC is available for a free download from WebGain’s Web site, as well as from Sun Microsystems. The use of parsers and parser generators is ubiquitous. There’s always a need for parsers in any large software system. Products such as WebLogic, SQL/JDBC, JPython, Cloudscape, Velocity, BeanShell, JFactor, and Dynamo all use parsers generated by JavaCC or its main competitor – ANTLR. Especially now, with so many text processing applications being built, it’s quite valuable to possess the skills to use parser generators – the time saved by using a parser generator can be immense. As an extreme example, I recently had to write an application that extracted all the type names from a Java program. What took a couple of hours to do with JavaCC would’ve taken a month to do from scratch. This article provides a background on parser generation technology, a summary of how JavaCC evolved, and a tutorial on its use. Parsing, Semanticization, et al. Consider the following Java class: class Interval_1 { This is a correct Java class, which is accepted by the Java compiler, and bytecode will be produced. Let’s modify this class as follows: class Interval_2 { This is no longer correct because a type name is used (nosuchtype) that doesn’t correspond to any declared type. However, this class is syntactically correct (i.e., it conforms to the syntax of the Java language). Let’s make another modification: class Interval_3 { This isn’t a correct Java class either because it uses a semicolon where it should have used a comma. This class is wrong in a manner different from that of Interval_2 in that it isn’t syntactically correct. Parsing catches errors such as the one in Interval_3. Parsing won’t catch the error in Interval_2. To catch that error, processing beyond parsing is required. This requires analyzing the data structures produced by the parsing process. The term semanticization is used to describe the analysis that takes place after parsing on programs to catch errors such as in Interval_2. Typically, parsing can detect errors that can be described by a grammar for the language. Depending on the sophistication of the parser and the complexity of the grammar, the delineation of responsibility between parsing and semanticization can vary. The Evolution of Parsing Technologies • Turing languages/Turing machines: The Turing machine is an abstraction of a computer program. Every computer program has a corresponding Turing machine. Turing languages are those that can be recognized by Turing machines. For the purposes of this article, this definition is mainly theoretical; it simply provides an upper limit to the complexity of computer languages. All programming languages fit under the umbrella of Turing languages. • Context-free languages: Those languages that can be described by grammars, which consist of a set of productions, each describing a particular aspect of the language. You must have seen such grammars in your Java texts. An example of a production is: methodcall ::= name "(" parameterlist ")" This says that a method call starts with a name followed by an open parenthesis, then a list of parameters, and then the closing parenthesis. To complete this definition, other productions must define name and parameterlist. Note: Java is not in itself context-free; it’s just that a large portion of the language can be described using a context-free grammar. Java language rules such as “A variable must be declared before it can be used” aren’t context-free (and can’t be described by a grammar). Such rules are processed in the semanticization phase. • Regular expressions: The simplest of the three language forms described here, regular expressions can be recognized by finite-state machines. A finite-state machine is a graph where the nodes are the different states that the machine may be in, and the directed edges between the states are the characters allowed as the next character in the input to transition from one state to the other. Regular expressions (and finite-state automata) can be described by a simple grammar (which is less expressive than a context-free grammar). The Java identifier is an example of a regular expression that can be described by the following grammar: identifier ::= firstcharacter othercharacter* This says that an identifier is composed of a first character followed by 0 or more (this is what the asterisk indicates) other characters. To complete this definition, firstcharacter and othercharacter need to be defined. A main simplification in a regular grammar as compared to a context-free grammar is that the regular grammar can’t be recursive (i.e., a production can’t use itself directly or indirectly in its description). A significant step in the evolution of parsing technologies was the development of algorithms and tools that could convert context-free grammars and regular grammars into programs that could check for the correctness of an input with respect to the grammar. Since then, most computer languages have been designed so that programs are a sequence of tokens, each of which is a regular expression, and the tokens in turn are structured according to a context-free grammar. Not all of the tokens participate in the context-free structure; some are discarded. In the case of Java, discarded tokens include comments and spaces. The following Java program example illustrates this: // This is a simple Java class The tokens in this Java program are listed below: 1. Comment: // This is a simple Java class Before the context-free grammar is matched, some tokens are discarded. In this example tokens 1, 3, 5, 7, 8, 10, and 13 are discarded (the discarded tokens are referred to as “white space”), leaving behind the tokens “class”, “Simple”, “{“, “int”, “dummy”, “;”, and “}”. These tokens are then matched with the context-free grammar for Java. Using JavaCC – A Simple Example 1. The class specification: A Java class into which the parser is generated by JavaCC, this class will have the fields and methods described in the class specification. In addition, a method is generated that corresponds to each production in the grammar. In this example the class specification states that the class is called Simple and it contains a main method that simply creates a parser object (of type Simple) and invokes it on the standard input. The method Input, which is called by the main method, is generated by JavaCC to correspond to one of the productions in the grammar specification. 2. The lexical specification: Describes the tokens present in the input file using regular expressions. In this example the tokens are quite simple and can therefore be expressed as strings (the simplest form of regular expressions). Six tokens are described: the space, tab, newline, carriage return, left brace, and right brace. The first four are specified in a SKIP declaration, which classifies them as white space (tokens to be discarded), while the left and right braces are specified as the tokens that participate in the (context-free) parsing process. Tokens that participate in this process are called terminals. The brace tokens are given the names LBRACE and RBRACE to make it possible to use them in the grammar. An input file matches this lexical specification only if it contains a sequence of tokens from this list of six tokens. 3. The grammar specification: It’s the context-free grammar that describes the input. In this example there are two productions – the first defines Input; the second, MatchedBraces. These are called nonterminals . The first production states that Input is a MatchedBraces followed by the end of the file. <EOF> is a special token (and therefore a terminal) that is matched by the end of the file. This states that the input file can only contain MatchedBraces, with nothing following. The second production defines MatchedBraces recursively in terms of itself. It says that it starts with a left brace (<LBRACE>), followed optionally by another MatchedBraces, and ends with a right brace (<RBRACE>). The brackets [...] indicate that their content is optional. Now consider the following input file: { This file has the following tokens: <LBRACE>, newline, space, <LBRACE>, newline, space, <RBRACE>, newline, <RBRACE>, <EOF>. The process of matching the lexical specification is successful and the white space is discarded, leaving behind only: <LBRACE> <LBRACE> <RBRACE> <RBRACE> <EOF> The parsing process then successfully matches these tokens with the grammar as follows:
Input The diagram shown above is called a parse tree. You can try this example yourself. First create a grammar file with the above contents. Call the file Simple.jj (you can call it anything, but by convention you should give it the jj extension). Download JavaCC from www.webgain.com and install it on your computer. Add the bin directory within this installation into your path. Now you should be able to run “javacc” from your command prompt (DOS window or UNIX xterm). Type: javacc Simple.jj It produces a bunch of Java files as output. Compile these Java files by typing: javac *.java Now create different kinds of input files – both good and bad inputs – and then parse them using the generated parser by typing: java Simple < InputFile At this point you’ve taken the biggest and most important step toward learning JavaCC. The rest of the learning process will happen very quickly. Actions It’s also necessary to be able to make the parser do something useful on a successful parse beyond simply declaring success. For this purpose JavaCC provides actions, Java code that’s inserted at an appropriate location in the grammar and executed as the parser matches that portion of the grammar. Actions may be arbitrarily complex and may invoke library methods. There are two general kinds of actions: lexical and parser. Lexical actions are executed after a successful match of a token. Parser actions are embedded within the grammar and are executed as the parser goes past a point in the parsing process. The grammar file in Listing 2 extends the previous one with both lexical and parser actions. These actions are simply print statements; try running the parser generated from this grammar file with the inputs you created for the previous exercise to see how the actions are executed. Once you complete this exercise, you should be able to go through the sequence of examples in the JavaCC download and systematically learn more details about JavaCC. Once you’ve gone through the examples, you should be ready to embark on any complex parsing application. Lookahead and Ambiguity Resolution Consider the lexical specification for the Java language. It defines all the Java reserved words as well as Java identifiers. What should the parser do when it encounters the sequence “interface”? The obvious answer is that it should recognize it as the token corresponding to the reserved word “interface”. However, there are other ways to recognize it, such as an identifier “interface” or the reserved word “int” followed by the identifier “erface”. The ambiguity resolution scheme for lexical specifications is that the longest match is preferred (so, in the above example, “interface” is preferred over “int”), and when there are multiple matches of the same length, the match that occurs first in the grammar file is used. In the example above, the reserved word “interface” is selected over the identifier “interface” if the lexical specification for the reserved word occurs earlier than that of identifier. Ambiguities in grammar specifications are resolved by looking at the next available token and proceeding with the first successful match. This resolution scheme is referred to as using a lookahead of one token. Consider the grammar for demand import declarations in Java. void DemandImportDeclaration() : On seeing the input “import x.*;”, the parser wil match the "." with the one in the production for Name. It will fail the parse because it expects an identifier and not a "*". This grammar is therefore wrong because the ambiguity is resolved incorrectly. The parse tree describing this parse process is shown in the following diagram. DemandImportDeclaration To correct this problem, we must specify that the parser looks ahead more tokens before it decides how to parse. In this example a lookahead of two tokens will work. This can be specified as a global setting (so that a lookahead of two tokens is used for all choices) or a lookahead of two tokens can be specified for this particular point only. The global setting causes the parser to become rather inefficient – typically, most choices can be made correctly with a lookahead of one – so the latter approach is the correct one. The grammar, modified with the additional lookahead specification, is shown below. void DemandImportDeclaration() : void Name() : There are other more complex ways to specify lookaheads, but we won’t go into them here. You can study the JavaCC documentation and examples for more information. Advanced Topics • Special tokens: Sometimes it’s preferable to process some white space tokens during the parsing process. JavaCC provides a way to specify tokens so they don’t get discarded but still don’t participate in the parsing process. • Lexical states: This allows you to split your lexical specification into different categories within which matches and ambiguity resolution take place independently of the other states. This allows you to conveniently write lexical specifications for (say) mail files, where the string “From” may be a reserved word in one context and a simple text in another. • Java code productions: This allows you to write actual Java code for a production instead of a grammar. Sometimes it isn’t possible to write a production using the context-free syntax, and Java code productions become invaluable. However, this feature must be used sparingly. • JJTree: This is a JavaCC preprocessor that inserts parse tree building actions into a grammar based on a parse tree specification. The result of the parsing process is a tree data structure that captures the essence of the parsing process. Additional processing is then performed on this tree (no additional actions are added by hand into the grammar). Conclusion Like other parser generators, JavaCC reads a grammar specification and converts the code into a Java program that recognizes matches to the grammar. In addition to the parser generator itself, JavaCC also provides other standard capabilities related to parser generation, such as tree building (via a tool called JJTree), actions, and debugging. JavaCC strongly supports the industry trend toward standardizing the syntax of important data formats such as XML. By giving developers a powerful tool for automating the creation of Java code that can recognize a standardized language structure, JavaCC will play a valuable role in the evolution of open systems. Reader Feedback: Page 1 of 1
Your Feedback
Latest Cloud Developer Stories
Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
|
SYS-CON Featured Whitepapers
Most Read This Week
Breaking Cloud Computing News
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||