In this section, we describe the low-level lexical structure of Haskell. Most of the details may be skipped in a first reading of the report.
These notational conventions are used for presenting syntax:
[pattern] | optional |
{pattern} | zero or more repetitions |
(pattern) | grouping |
pat1 | pat2 | choice |
pat<pat'> | difference: elements generated by pat except those generated by pat' |
fibonacci | terminal syntax in typewriter font |
Because the syntax in this section describes lexical syntax, all whitespace is expressed explicitly; there is no implicit space between juxtaposed symbols. BNF-like syntax is used throughout, with productions having the form:
nonterm -> alt1 | alt2 | ... | altn
Care must be taken in distinguishing metalogical syntax such as | and [...] from concrete terminal syntax (given in typewriter font) such as | and [...], although usually the context makes the distinction clear.
Haskell uses the Latin-ISO-8859-1 character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.
program -> { lexeme | whitespace }
lexeme -> varid | conid | varsym | consym | literal | special | reservedop | reservedid
literal -> integer | float | char | string
special -> ( | ) | , | ; | [ | ] | _ | ` | { | }
whitespace -> whitestuff {whitestuff}
whitestuff -> whitechar | comment | ncomment
whitechar -> newline | vertab | formfeed | space | tab | nonbrkspc
newline -> a newline (system dependent)
space -> a space
tab -> a horizontal tab
vertab -> a vertical tab
formfeed -> a form feed
nonbrkspc -> a non-breaking space
comment -> -- {any} newline
ncomment -> {- ANYseq {ncomment ANYseq} -}
ANYseq -> {ANY}<{ANY} ( {- | -} ) {ANY}>
ANY -> any | newline | vertab | formfeed
any -> graphic | space | tab | nonbrkspc
graphic -> large | small | digit | symbol | special | : | " | '
small -> ASCsmall | ISOsmall
ASCsmall -> a | b | ... | z
ISOsmall -> à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï | ð | ñ | ò | ó | ô | õ | ö | ø | ù | ú | û | ü | ý | þ | ÿ | ß
large -> ASClarge | ISOlarge
ASClarge -> A | B | ... | Z
ISOlarge -> À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | Ø | Ù | Ú | Û | Ü | Ý | Þ
symbol -> ASCsymbol | ISOsymbol
ASCsymbol -> ! | # | $ | % | & | * | + | . | / | < | = | > | ? | @ | \ | ^ | | | - | ~
ISOsymbol -> ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | ­ | ® | ¯ | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ | × | ÷
digit -> 0 | 1 | ... | 9
octit -> 0 | 1 | ... | 7
hexit -> digit | A | ... | F | a | ... | f
Characters not in the category ANY are not valid in Haskell programs and should result in a lexing error. Comments are valid whitespace. An ordinary comment begins with two consecutive dashes (--) and extends to the following newline. A nested comment begins with {- and ends with -}; it can be between any two lexemes. All character sequences not containing {- nor -} are ignored within a nested comment. Nested comments may be nested to any depth: any occurrence of {- within the nested comment starts a new nested comment, terminated by -}. Within a nested comment, each {- is matched by a corresponding occurrence of -}. In an ordinary comment, the character sequences {- and -} have no special significance, and, in a nested comment, the sequence -- has no special significance. Nested comments are used for compiler pragmas, as explained in Appendix E.
If some code is commented out using a nested comment, then any occurrence of {- or -} within a string or within an end-of-line comment in that code will interfere with the nested comments.
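The comment rules above can be illustrated with a minimal sketch. The name answer is hypothetical and introduced only for this example; the fragment assumes a current Haskell compiler such as GHC, whose treatment of these particular comments agrees with the rules given here.

```haskell
-- An ordinary comment runs to the end of the line; {- has no special meaning here.

{- A nested comment
   {- may contain another nested comment -}
   and -- has no special significance inside it.
-}

-- A nested comment may appear between any two lexemes.
answer :: Int
answer = {- ignored by the lexer -} 42
```

Only the definition of answer survives lexing; everything else is whitespace.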
varid -> (small {small | large | digit | ' | _})<reservedid>
conid -> large {small | large | digit | ' | _}
reservedid -> case | class | data | default | deriving | do | else | if | import | in | infix | infixl | infixr | instance | let | module | newtype | of | then | type | where
specialid -> as | qualified | hiding
An identifier consists of a letter followed by zero or more letters, digits, underscores, and single quotes. Identifiers are lexically distinguished into two classes: those that begin with a lower-case letter (variable identifiers) and those that begin with an upper-case letter (constructor identifiers). Identifiers are case sensitive: name, naMe, and Name are three distinct identifiers (the first two are variable identifiers, the last is a constructor identifier). Some identifiers, here indicated by specialid, have special meanings in certain contexts but can be used as ordinary identifiers.
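As a small illustration of these identifier rules, the following sketch declares the three distinct identifiers mentioned above and uses the specialid names as ordinary variables. The type Label and all values are hypothetical, chosen only for the example.

```haskell
-- Name begins with an upper-case letter, so it is a constructor identifier.
data Label = Name | Other

-- name and naMe begin with a lower-case letter: two distinct variable identifiers.
name :: Int
name = 1

naMe :: Int
naMe = 2

-- Digits, underscores, and single quotes may follow the first letter.
step_1' :: Int
step_1' = 3

-- The specialid names are ordinary identifiers outside import declarations.
as, qualified, hiding :: Int
as = 10
qualified = 20
hiding = 30
```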
varsym -> ( symbol {symbol | :} )<reservedop>
consym -> (: {symbol | :})<reservedop>
reservedop -> .. | :: | = | \ | | | <- | -> | @ | ~ | =>
specialop -> - | !
Operator symbols are formed from one or more symbol characters, as defined above, and are lexically distinguished into two classes: those that start with a colon (constructors) and those that do not (functions). Some operators, here indicated by specialop, have special meanings in certain contexts but can be used as ordinary operators.
The sequence -- immediately terminates a symbol; thus +--+ parses as the symbol + followed by a comment.
Other than the special syntax for prefix negation, all operators are infix, although each infix operator can be used in a section to yield partially applied operators (see Section 3.5). All of the standard infix operators are just predefined symbols and may be rebound.
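The two classes of operator symbols and the use of sections can be sketched as follows. The operators :+: and <+>, the type Expr, and the functions eval and increment are hypothetical names introduced only for this illustration.

```haskell
-- A constructor operator: its symbol starts with a colon.
infixr 5 :+:
data Expr = Lit Int | Expr :+: Expr

-- A function operator: its symbol does not start with a colon.
infixl 6 <+>
(<+>) :: Int -> Int -> Int
x <+> y = x + y

-- Constructor operators can be used in patterns, function operators cannot.
eval :: Expr -> Int
eval (Lit n)   = n
eval (l :+: r) = eval l <+> eval r

-- A section yields a partially applied operator.
increment :: Int -> Int
increment = (<+> 1)

-- The specialop - used as an ordinary binary operator.
difference :: Int
difference = 10 - 3
```

For instance, eval (Lit 1 :+: Lit 2 :+: Lit 3) adds the three literals, and increment 41 applies the section to 41.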
Although case is a reserved word, cases is not. Similarly, although = is reserved, == and ~= are not. At each point, the longest possible lexeme is read, using a context-independent deterministic lexical analysis (i.e. no lookahead beyond the current character is required). Any kind of whitespace is also a proper delimiter for lexemes.
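A brief sketch of the longest-lexeme rule: because the lexer reads cases and ~= as single lexemes, neither collides with the reserved word case or the reserved operators = and ~. The definitions are hypothetical and serve only as an illustration.

```haskell
-- cases is one lexeme, not the reserved word case followed by s.
cases :: Int
cases = 3

-- ~= is one lexeme, so it may be defined as an ordinary operator.
(~=) :: Int -> Int -> Bool
a ~= b = abs (a - b) <= 1

-- == is likewise a single operator lexeme, distinct from the reserved =.
sameCount :: Bool
sameCount = cases == 3
```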
In the remainder of the report six different kinds of names will be used:
varid (variables)
conid (constructors)
tyvar -> varid (type variables)
tycon -> conid (type constructors)
tycls -> conid (type classes)
modid -> conid (modules)
Variables and type variables are represented by identifiers beginning with small letters, and the other four by identifiers beginning with capitals; also, variables and constructors have infix forms, the other four do not. Namespaces are also discussed in Section 1.4.
External names may optionally be qualified in certain circumstances by prepending them with a module identifier. This applies to variable, constructor, type constructor and type class names, but not type variables or module names. Qualified names are discussed in detail in Section 5.1.2.
qvarid -> [modid .] varid
qconid -> [modid .] conid
qtycon -> [modid .] tycon
qtycls -> [modid .] tycls
qvarsym -> [modid .] varsym
qconsym -> [modid .] consym
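As a rough illustration of qualified names, the sketch below qualifies a variable, an operator symbol, a type constructor, and a constructor with the module identifier Prelude. It assumes the standard Prelude is implicitly imported, so its entities are available under their qualified names; the defined names are hypothetical. Since this is lexical syntax, no whitespace appears between the module identifier, the dot, and the name.

```haskell
-- Qualified variable (qvarid) and qualified operator in a section (qvarsym).
doubled :: [Int]
doubled = Prelude.map (Prelude.* 2) [1, 2, 3]

-- A qualified operator used infix.
total :: Int
total = 1 Prelude.+ 2

-- Qualified type constructor (qtycon) and qualified constructor (qconid).
flag :: Prelude.Bool
flag = Prelude.True
```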
decimal -> digit{digit}
octal -> octit{octit}
hexadecimal -> hexit{hexit}
integer -> decimal | 0o octal | 0O octal | 0x hexadecimal | 0X hexadecimal
float -> decimal . decimal[(e | E)[- | +]decimal]
There are two distinct kinds of numeric literals: integer and floating. Integer literals may be given in decimal (the default), octal (prefixed by 0o or 0O) or hexadecimal notation (prefixed by 0x or 0X). Floating literals are always decimal. A floating literal must contain digits both before and after the decimal point; this ensures that a decimal point cannot be mistaken for another use of the dot character. Negative numeric literals are discussed in Section 3.4. The typing of numeric literals is discussed in Section 6.3.1.
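The literal forms can be sketched as follows; the three integer literals denote the same value, and the floating literals have digits on both sides of the decimal point. The names are hypothetical and the type annotations are chosen only for the example.

```haskell
-- The same integer written in decimal, octal, and hexadecimal notation.
decimalLit, octalLit, hexLit :: Int
decimalLit = 255
octalLit   = 0o377
hexLit     = 0xFF

-- Floating literals: digits are required before and after the decimal point.
plainFloat, scaledFloat :: Double
plainFloat  = 0.0015
scaledFloat = 1.5e-3
```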
2.5 Character and String Literals
char -> ' (graphic<' | \> | space | escape<\&>) '
string -> " {graphic<" | \> | space | escape | gap} "
escape -> \ ( charesc | ascii | decimal | o octal | x hexadecimal )
charesc -> a | b | f | n | r | t | v | \ | " | ' | &
ascii -> ^cntrl | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US | SP | DEL
cntrl -> ASClarge | @ | [ | \ | ] | ^ | _
gap -> \ whitechar {whitechar} \
Character literals are written between single quotes, as in 'a', and strings between double quotes, as in "Hello".
Escape codes may be used in characters and strings to represent special characters. Note that a single quote ' may be used in a string, but must be escaped in a character; similarly, a double quote " may be used in a character, but must be escaped in a string. \ must always be escaped. The category charesc also includes portable representations for the characters "alert" (\a), "backspace" (\b), "form feed" (\f), "new line" (\n), "carriage return" (\r), "horizontal tab" (\t), and "vertical tab" (\v).
Escape characters for the ISO-8859-1 character set, including control characters such as \^X, are also provided. Numeric escapes such as \137 are used to designate the character with decimal representation 137; octal (e.g. \o137) and hexadecimal (e.g. \x37) representations are also allowed. Numeric escapes that fall outside the range of the ISO standard are undefined and thus non-portable.
Consistent with the "consume longest lexeme" rule, numeric escape characters in strings consist of all consecutive digits and may be of arbitrary length. Similarly, the one ambiguous ASCII escape code, "\SOH", is parsed as a string of length 1. The escape character \& is provided as a "null character" to allow strings such as "\137\&9" and "\SO\&H" to be constructed (both of length two). Thus "\&" is equivalent to "" and the character '\&' is disallowed. Further equivalences of characters are defined in Section 6.1.2.
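The escape rules above can be checked with a short sketch. All names are hypothetical; fromEnum is used here only as a convenient way to observe the code of a character.

```haskell
-- Lengths of strings built from escapes.
sohLength, numericLength, soThenH :: Int
sohLength     = length "\SOH"     -- the single character SOH
numericLength = length "\137\&9"  -- character 137 followed by '9'
soThenH       = length "\SO\&H"   -- the character SO followed by 'H'

-- The escape \& contributes no character to a string.
nullEscapeIsEmpty :: Bool
nullEscapeIsEmpty = "\&" == ""

-- The three numeric escape forms and a control-character escape.
decimalCode, octalCode, hexCode, controlCode :: Int
decimalCode = fromEnum '\137'   -- decimal 137
octalCode   = fromEnum '\o137'  -- octal 137, i.e. decimal 95
hexCode     = fromEnum '\x37'   -- hexadecimal 37, i.e. decimal 55
controlCode = fromEnum '\^X'    -- control-X, i.e. decimal 24

-- A single quote needs no escape in a string, nor a double quote in a character.
quoteInString :: String
quoteInString = "it's"

doubleQuoteChar :: Char
doubleQuoteChar = '"'
```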
A string may include a "gap" (two backslants enclosing white characters), which is ignored. This allows one to write long strings on more than one line by writing a backslant at the end of one line and at the start of the next. For example,
"Here is a backslant \\ as well as \137, \
\a numeric escape character, and \^X, a control character."
String literals are actually abbreviations for lists of characters (see Section 3.7).
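A minimal sketch of both points: a string written with a gap equals the same string written on one line, and a string literal equals the corresponding list of characters. The names are hypothetical.

```haskell
-- The gap (backslant, white characters, backslant) is ignored.
gapped :: String
gapped = "A long string \
         \split across two lines."

plain :: String
plain = "A long string split across two lines."

-- A string literal abbreviates a list of characters.
asList :: String
asList = ['H', 'e', 'l', 'l', 'o']
```

Here gapped == plain holds, as does asList == "Hello".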