RegexSearch 2.5 : Manual

1  Introduction

RegexSearch is a Java application that performs find and find-and-replace searches for regular expressions on multiple text files. It is distributed under version 3 of the GNU General Public License; for details, see the file license.txt that is included in the RegexSearch distribution.

RegexSearch has the following features:

The website of the RegexSearch project is at http://regexsearch.sourceforge.net/ .

2  Requirements

RegexSearch is a Java application that requires a Java runtime environment that supports Java 1.6, such as Sun's Java Runtime Environment (JRE), version 6.0 or later.

3  Contents of the distribution

The following files are included in the distribution:

regexSearch.jar
The executable JAR (Java archive) file of the RegexSearch application.
regexSearch.conf
The configuration file for RegexSearch, which contains the default values for the configuration properties.
license.txt
A copy of the GNU General Public License, version 3 — the license under which RegexSearch is distributed.
doc/manual.html, doc/images/valid-xhtml11.png, doc/scripts/toc.js, doc/style/regexSearch-manual.css, doc/style/puckfist-manual.css, doc/style/puckfist-common.css
This manual, its image file, script and stylesheets. Any modifications to the manual for the latest version of RegexSearch will appear in the online version of the document, to which there is a link on the RegexSearch website.
dtd/searchParams.dtd
The DTD (document type definition) of a RegexSearch search-parameter document. (RegexSearch does not use the DTD; it is provided only for reference.)
images/regexSearch.png, images/regexSearch.ico
A 48×48-pixel PNG image and a 48×48-pixel Windows-format icon that can be used to customise a desktop icon for the RegexSearch application.

4  Installing and running RegexSearch

Note:   RegexSearch is distributed without an automated means of installation. Because of this, the instructions below assume that you have a basic knowledge of environment variables and command lines appropriate to the system on which RegexSearch is to be installed.

RegexSearch consists of a single JAR (Java archive) file, regexSearch.jar. It is recommended that RegexSearch be used with a configuration file, regexSearch.conf, which contains user preferences. Of the files listed in the contents of the distribution, only the JAR file is required.

The installation of RegexSearch consists of two stages: copying the JAR file — and, optionally, the default configuration file — to your system, and providing a means of invoking the JAR file. The more advanced topic of configuring RegexSearch after installation is dealt with in the section on configuration.

The first stage is simple: copy the file regexSearch.jar to a suitable location on your system. The next stage involves providing the means by which the RegexSearch application is run. The recommended way of running RegexSearch is to invoke the java launcher tool from a command line, which may be included in a batch file. Configuration properties (including the location of a configuration file) may be specified in the command line.

4.1  Linux/UNIX

Assuming that your PATH environment variable includes the path to the java tool and that you have copied regexSearch.jar to the directory /home/slothrop/bin/regexsearch/, the command

java -jar /home/slothrop/bin/regexsearch/regexSearch.jar

will run the RegexSearch application.

The file regexSearch.png can be used as the icon for the RegexSearch application.

4.2  Windows

The RegexSearch application does not require a console window, so you can use the javaw launcher rather than the java launcher unless you particularly want a console window. Assuming that your PATH environment variable includes the path to the javaw tool and that you have copied regexSearch.jar to the directory C:\Program Files\RegexSearch\, the command

javaw -jar "C:\Program Files\RegexSearch\regexSearch.jar"

will run the RegexSearch application.

The file regexSearch.ico can be used as the icon for the RegexSearch application.

4.3  Uninstalling RegexSearch

RegexSearch does not have an automated means of uninstallation. To remove it from your system, delete the file regexSearch.jar from the location to which you copied it when you installed RegexSearch. If you want to remove RegexSearch completely, you should also delete the configuration file, regexSearch.conf, which may be at its default location, and any search-parameter files that you created.

5  Configuration

When it starts up, RegexSearch gets its configuration from two sources: properties in the command line that is used to run the Java launcher, and a configuration file whose location may be explicitly specified.

The recommended method of setting the properties in a configuration file is with the Preferences command. Command-line properties must necessarily be edited manually; the form of the property values is given in the appendix on configuration properties, and it can also be inferred from the sample configuration file.

5.1  Command-line properties

When RegexSearch is run by means of the java launcher, configuration properties may be specified in the command line using the standard Java form -Dname="value"; eg, -Dapp.appearance.textViewViewableSize="96, 32". (The quotation marks around the value aren't necessary if the value doesn't contain spaces.) RegexSearch's command-line configuration properties all have the prefix "app.". A list of all the properties that are recognised by RegexSearch is given in the appendix on configuration properties.

5.1.1  The app.configPath property

One particular property, app.configPath, is used to specify the directory that contains a configuration file:

If the configuration file were located in a directory named config in the user's home directory, the sample command lines given above would become:

Linux/UNIX: java -Dapp.configPath="~/config" -jar /home/slothrop/bin/regexsearch/regexSearch.jar
Windows: javaw -Dapp.configPath="~/config" -jar "C:\Program Files\RegexSearch\regexSearch.jar"

5.2  Configuration file

The configuration file, which must be named regexSearch.conf, is an XML file that is ordinarily written by RegexSearch but can be edited manually if you know what you're doing. (It can also be edited manually if you don't know what you're doing, but this is discouraged.) RegexSearch doesn't require a configuration file: it uses a default value for any configuration property that is missing from the source(s) of configuration. Similarly, if it finds a property value to be invalid, RegexSearch will display a message to this effect and use its default value.

A configuration file takes precedence over configuration properties in the command line; that is, if the same property is specified as a command-line property and in a configuration file, the value from the configuration file is used.
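
The precedence rule can be pictured as a simple lookup. The following is a minimal sketch in Python, not RegexSearch's own code: the function name resolve_property, the three dictionaries and all the values in them are hypothetical, and it is assumed that the "app." prefix has already been removed from command-line property names.

```python
def resolve_property(name, config_file, command_line, defaults):
    """Return the effective value of a configuration property.

    Lookup order follows the manual: the configuration file wins over
    the command line, and a built-in default is used if neither source
    supplies a value. The three mappings are hypothetical stand-ins.
    """
    if name in config_file:
        return config_file[name]
    if name in command_line:
        return command_line[name]
    return defaults[name]


# Hypothetical values, for illustration only.
defaults = {"appearance.paramFieldNumColumns": "40"}
command_line = {"appearance.paramFieldNumColumns": "48"}
config_file = {"appearance.paramFieldNumColumns": "56"}

print(resolve_property("appearance.paramFieldNumColumns",
                       config_file, command_line, defaults))
```

The sketch leaves out validation: as described above, an invalid value would be reported and replaced by the default.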

If the configuration has changed when you exit the application normally (ie, using the File > Exit command or an equivalent), RegexSearch will save its configuration to a configuration file. If a configuration file was read on startup, it will overwrite that file; otherwise, it will write a configuration file to the default directory described above.

A configuration file can be written explicitly with the Save Configuration command within the Preferences dialog.

5.2.1  Location of the configuration file

When it starts up, RegexSearch is informed of the location of the configuration file with the app.configPath property, which may be set on the command line that runs the java launcher. The existence of a system property with the key app.configPath determines the locations that are searched for a configuration file:

6  Search parameters

The parameters of a search consist of:

The search parameters are stored collectively as an XML file. The DTD of a search-parameter file (searchParams.dtd) is included in the RegexSearch distribution. It is provided only for reference because RegexSearch does not validate search-parameter files against the DTD.

A search-parameter file may contain multiple file sets, targets and replacements, and each file set may contain multiple pathnames and pathname filters, though only a single pathname, inclusion filter, exclusion filter, target and replacement are used in a search. For each of those five parameters, RegexSearch's user interface allows you to select from the list of available values, to edit or delete existing values and to add new ones to the list.

When RegexSearch is run, search parameters are read from the file specified by the configuration property path.defaultSearchParameters. The current set of search parameters can be saved with the File > Save Search Parameters command, and files saved in this way can be opened with the File > Open Search Parameters command. When you open a new search-parameter file or exit RegexSearch, you will be prompted to save the current search parameters if a search-parameter file was read, either automatically at startup or explicitly, and the parameters have changed since the file was read. A change to the file-set index or a parameter index is regarded as a change to the file.

6.1  File set

A file set specifies the files that are searched in a find or find-and-replace operation. A file set has a file-set type and, depending on its file-set type, it may also have a pathname and two kinds of pathname filter: an inclusion filter and an exclusion filter.

6.1.1  Pathname

Depending on the file-set type, the pathname of a file set may be direct or indirect. A direct pathname specifies either a single file or a base directory that, in conjunction with inclusion and exclusion filters, defines the scope of a search. An indirect pathname specifies a file that contains a list of files and directories to be searched. A pathname may be absolute or relative; a relative pathname is relative to the current working directory.

6.1.2  Pathname filters

The inclusion filters and exclusion filters of a file set are two kinds of pathname filter. A pathname filter is a set of patterns that determines, usually in conjunction with a base pathname, the files that are searched. A file is included in a search if it matches at least one pattern in the inclusion filter AND none of the patterns in the exclusion filter. The maximum number of patterns in a filter is 64.
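
The inclusion/exclusion rule can be sketched as a predicate. This is an illustration in Python under stated assumptions, not the application's implementation: the names is_searched and suffix_match are hypothetical, and suffix_match is only a stand-in for real pattern matching.

```python
MAX_PATTERNS = 64  # maximum number of patterns in a filter, per the manual


def suffix_match(pattern, pathname):
    """Stand-in matcher for the demo: a pathname matches if it ends with the pattern."""
    return pathname.endswith(pattern)


def is_searched(pathname, inclusion, exclusion, matches=suffix_match):
    """True if pathname matches at least one inclusion pattern
    AND none of the exclusion patterns."""
    if len(inclusion) > MAX_PATTERNS or len(exclusion) > MAX_PATTERNS:
        raise ValueError("a filter may contain at most 64 patterns")
    included = any(matches(p, pathname) for p in inclusion)
    excluded = any(matches(p, pathname) for p in exclusion)
    return included and not excluded


print(is_searched("src/Main.java", [".java"], ["Test.java"]))
print(is_searched("src/MainTest.java", [".java"], ["Test.java"]))
```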

A pattern is a pathname that may include wildcards. There are three wildcards: two filename wildcards and a pathname wildcard.

6.1.2.1  Filename wildcards

The filename wildcards, "?" and "*", have their usual meaning: "?" matches a single character and "*" matches zero or more characters in a pathname component (a filename or directory name). For example, the pattern "foo*.txt" will match the filenames foo.txt, food.txt and football.txt. Following the UNIX convention (but differing from the MS-DOS convention), a dot, ".", has no special significance in patterns: it is matched by the "*" wildcard. Thus, the pattern "foo*" will match the filenames foo, football.txt and food.store.log.
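
The behaviour of the filename wildcards can be sketched by translating a pattern into a regular expression. The following Python is an illustration only; the function name filename_matches is hypothetical, and case-sensitive matching is an assumption that the manual does not confirm.

```python
import re


def filename_matches(pattern, name):
    """Match one pathname component against a pattern with '?' and '*'.

    '?' matches exactly one character, '*' matches zero or more
    characters, and '.' has no special significance.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")   # zero or more characters, never a separator
        elif ch == "?":
            parts.append("[^/]")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), name) is not None


for name in ("foo.txt", "food.txt", "football.txt"):
    print(name, filename_matches("foo*.txt", name))
```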

6.1.2.2  Pathname wildcards

The pathname wildcard, "**", matches zero or more pathname components. Its use in a pathname pattern is analogous to the use of "*" in a filename pattern. By itself, the pattern "**" is the recursive analogue of "*": it matches all files in or below the base directory. Used as a pathname component in a larger pattern, "**" specifies a recursive portion of the pathname that may be bounded above or below by a non-recursive pathname. For example,

A pathname-filter pattern may be either relative to a base directory (in a directory or list file set), or it may be absolute. The pathname components in a pattern are separated with a "/" character (U+002F). (A "\" may be used as the directory separator on the Windows platform, but "/" is recommended because "\" is used as the escape character in filter fields.) A pattern that ends with a directory separator is assumed to be followed by an implicit "**". A pattern that, when appended to its base directory, specifies an existing directory is assumed to be followed by an implicit "/**". A pattern may contain dot and double-dot components ("." and ".."), but only if they appear before the first wildcard in the pattern.

A file is matched against a pathname-filter pattern by converting both the pattern (appended to a base directory, if the pattern is relative) and the pathname of the target file to a canonical form. An error may occur in converting the pattern to canonical form if, for example, the resulting pathname is illegal or access to part of the file system is not permitted.
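
The effect of the "**" wildcard can be sketched by matching a pattern against a pathname one component at a time. The Python below is a simplified illustration, not RegexSearch's algorithm: the function names and the sample patterns are hypothetical, only relative "/"-separated pathnames are handled, and the conversion to canonical form, the implicit "/**" for existing directories and the "." and ".." components are all left out.

```python
import re


def component_matches(pattern, name):
    """Match a single pathname component using the '?' and '*' wildcards."""
    regex = "".join(
        "[^/]*" if c == "*" else "[^/]" if c == "?" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, name) is not None


def _match(pats, parts):
    if not pats:
        return not parts
    if pats[0] == "**":
        # "**" may absorb zero or more pathname components.
        return any(_match(pats[1:], parts[i:]) for i in range(len(parts) + 1))
    return (bool(parts)
            and component_matches(pats[0], parts[0])
            and _match(pats[1:], parts[1:]))


def path_matches(pattern, path):
    """Match a relative pathname against a pattern that may contain '**'."""
    if pattern.endswith("/"):
        pattern += "**"   # a trailing separator implies "**"
    return _match(pattern.split("/"), path.split("/"))


print(path_matches("src/**/*.java", "src/a/b/X.java"))
print(path_matches("*.txt", "sub/notes.txt"))
```

In this sketch "src/**/*.java" also matches "src/Main.java", because "**" is allowed to match zero components.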

6.1.3  File-set types

A file-set type may be one of File, Directory, List or Results. The four types are described below.

6.1.3.1  File

The search is performed on the file specified by the pathname of the file set. No inclusion filter or exclusion filter is applied.

6.1.3.2  Directory

The pathname of the file set specifies the base directory of the search. (A search is not necessarily confined to this directory and the directories below it because the inclusion filter may contain patterns that specify pathnames outside the base directory.) An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern "**" (match all recursively) is assumed. Any relative patterns (see pathname filter) in the inclusion filter and exclusion filter are relative to the base directory.

The files in a directory are searched in order of filename. The ordering is lexicographic (ie, the Unicode values of characters in the filename are compared) and platform-dependent: it is case-sensitive on Linux/UNIX systems, but alphabetic case is ignored on Windows systems. Recursion is specified implicitly by pathname wildcards. A recursive search on a directory is depth-first: files in subdirectories are searched before the files in the directory. Like files, subdirectories are searched in order of name. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after the base directory.
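
The order in which a directory is traversed can be sketched as a recursive generator. The Python below is an illustration under assumptions, not the application's code: the name search_order is hypothetical, filters and absolute patterns are ignored, and lower-casing names is only an approximation of the case-insensitive ordering used on Windows.

```python
import os


def search_order(directory, ignore_case=False):
    """Yield file pathnames depth-first: the files in subdirectories
    come before the files in the directory itself, and both
    subdirectories and files are visited in order of name."""
    key = (lambda e: e.name.lower()) if ignore_case else (lambda e: e.name)
    with os.scandir(directory) as entries_iter:
        entries = sorted(entries_iter, key=key)
    for entry in entries:
        if entry.is_dir():
            yield from search_order(entry.path, ignore_case)
    for entry in entries:
        if entry.is_file():
            yield entry.path
```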

6.1.3.3  List

The pathname of the file set is assumed to specify a text file, each of whose non-empty lines denotes the pathname of a file or directory that is to be searched. A line of the list file may contain a comment, beginning with a ";" character. If a line contains a comment, any characters after the last non-space character before the comment are ignored (eg, the line "simple-filename.txt ; file #23" is parsed as "simple-filename.txt"). Empty lines are ignored. The pathnames are validated before the search starts, and the search will not proceed unless each pathname denotes an existing directory or a regular file.
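
The parsing of a list file can be sketched as follows. This Python is a minimal illustration, not RegexSearch's parser: the name parse_list_file is hypothetical, trailing white space is stripped from every line (the manual specifies this only for lines that contain a comment), and the validation of the pathnames is omitted.

```python
def parse_list_file(lines):
    """Return the pathnames in a list file, dropping comments and empty lines."""
    pathnames = []
    for line in lines:
        line = line.rstrip("\r\n")
        comment_start = line.find(";")
        if comment_start >= 0:
            line = line[:comment_start]   # drop the comment
        line = line.rstrip()              # drop spaces before the comment
        if line:
            pathnames.append(line)
    return pathnames


print(parse_list_file(["simple-filename.txt ; file #23\n", "\n", "; comment only\n"]))
```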

An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern "**" (match all recursively) is assumed.

The search of the pathnames in the list is equivalent to a sequence of searches on file sets whose file-set type is either File or Directory according to whether a pathname specifies a file or directory. If the pathname specifies a directory, the inclusion filter and any exclusion filter are applied to it. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after all the pathnames in the list.

6.1.3.4  Results

The files that will be searched are those from the last list of files to be saved with the Search > Save Results command. The current list of saved results can be viewed with the Search > View Saved Results command. No inclusion filter or exclusion filter is applied.

6.2  Target

The target of the search — the pattern that you are attempting to match in the files that are searched — can be either literal text or a regular expression. The corresponding types of search are referred to below as literal-text search and regular-expression search.

6.3  Replacement

The replacement is an expression that will be used to replace occurrences of the target pattern in a find-and-replace search. The interpretation of the replacement differs according to whether the target is literal text or a regular expression. Both types of replacement may contain metasymbols — special sequences that are introduced with an escape character. By default, the escape character is a backslash, "\", but it can be changed with the general.replacementEscapeCharacter configuration property if, for example, you want to avoid having to escape the backslashes in Windows pathnames. In a replacement, an escape character must always be escaped by prefixing another escape character to it (eg, "\\", if the escape character is "\").

6.3.1  Literal-text replacement

The following metasymbols may appear in a literal-text replacement string. It is assumed that the escape character is "\".

\t Tab character, U+0009
\n Line-feed character, U+000A
\unnnn Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f]
\\ Literal escape character
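
The expansion of these metasymbols can be sketched in Python. The function below is an illustration only: the name expand_literal_replacement is hypothetical, and leaving unrecognised escape sequences unchanged is an assumption, because the manual does not say how they are treated.

```python
import re


def expand_literal_replacement(text, escape="\\"):
    """Expand the tab, line-feed, Unicode and literal-escape metasymbols
    in a literal-text replacement string."""
    esc = re.escape(escape)
    pattern = re.compile(esc + r"(?:(t)|(n)|u([0-9A-Fa-f]{4})|(" + esc + "))")

    def substitute(m):
        if m.group(1):
            return "\t"                          # tab, U+0009
        if m.group(2):
            return "\n"                          # line feed, U+000A
        if m.group(3):
            return chr(int(m.group(3), 16))      # Unicode character U+nnnn
        return escape                            # literal escape character

    return pattern.sub(substitute, text)


print(repr(expand_literal_replacement(r"name\tvalue\u0021")))
```

The escape parameter corresponds to the general.replacementEscapeCharacter configuration property.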

6.3.2  Regular-expression replacement

The following metasymbols may appear in a regular-expression replacement string. It is assumed that the escape character is "\".

\t Tab character, U+0009
\n Line-feed character, U+000A
\unnnn Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f]
\\ Literal escape character
\n Capturing group in the target pattern, where n is the decimal index of the group
\Ln Capturing group in the target pattern, where n is the decimal index of the group. All alphabetic characters in the group are converted to lower case.
\Un Capturing group in the target pattern, where n is the decimal index of the group. All alphabetic characters in the group are converted to upper case.
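
The group references and case conversions can be sketched in Python. This is an illustration under assumptions, not the application's implementation: it uses Python's re module rather than Java's regular-expression engine, the name expand_regex_replacement is hypothetical, the escape character is fixed as "\", and a group index is assumed to extend over all the decimal digits that follow the escape.

```python
import re

# Escape followed by: t, n, uXXXX, another escape, or an optional L/U and a group index.
_TOKEN = re.compile(r"\\(?:(t)|(n)|u([0-9A-Fa-f]{4})|(\\)|([LU]?)(\d+))")


def expand_regex_replacement(replacement, match):
    """Expand the metasymbols of a regular-expression replacement
    against a match of the target pattern."""
    def substitute(m):
        if m.group(1):
            return "\t"
        if m.group(2):
            return "\n"
        if m.group(3):
            return chr(int(m.group(3), 16))
        if m.group(4):
            return "\\"
        text = match.group(int(m.group(6))) or ""
        if m.group(5) == "L":
            return text.lower()     # \Ln: group converted to lower case
        if m.group(5) == "U":
            return text.upper()     # \Un: group converted to upper case
        return text                 # \n with a decimal index: group unchanged

    return _TOKEN.sub(substitute, replacement)


m = re.search(r"(\w+)@(\w+)", "Alice@Example")
print(expand_regex_replacement(r"\L1 at \U2", m))
```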

7  The display

The main display consists of a single window, divided into three areas that are referred to in this document as the text view, control panel and result area.

7.1  Column width and row height

The width and/or height of some text components are specified in logical units of columns and rows. The width of a column and the height of a row are determined by the font that is used to display text within the component: the height of a row is the height of the font, and the width of a column is the width of a zero character (U+0030), or, if the font doesn't have a glyph for the zero character, the width of the glyph that is used for characters that are not defined.

7.2  Text view

The text view is the text area at the top of the display in which the contents of a file are displayed. The text view is not editable. The following attributes of the text view are configurable:

The width of the text view is expanded, if necessary, to fit the width of the main window (which is also determined by the size of other components), so the viewable text-view size property effectively sets the minimum width rather than the displayed width of the text view.

The colours of the text view are also applied to the result area and to the fields in the Search Options dialog.

7.2.1  How tab characters are displayed

When a file containing tab characters (U+0009) is displayed in the text view, RegexSearch uses two configuration properties — tab-width filters and a default tab width — to determine how the tab characters are converted to spaces. A tab-width filter maps a filename filter (a set of patterns that are used to match filenames) to the number of spaces that will be used to replace tab characters when displaying a matching file. If none of the defined tab-width filters matches the file, the default tab width is used. If the tab width is zero, tab characters are not expanded but rendered as a U+2192 (rightwards arrow) character in left-to-right locales or a U+2190 (leftwards arrow) character in right-to-left locales, or as the "not defined" glyph if the font doesn't contain a glyph for the appropriate arrow character.
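
The selection of a tab width can be sketched as a lookup over the tab-width filters. The Python below is an illustration only: the names and the sample filters are hypothetical, and using the first filter that matches is an assumption, because the manual does not say how overlapping filters are resolved.

```python
import re


def _matches(pattern, filename):
    """Match a filename against a pattern with the '*' and '?' wildcards."""
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, filename, re.DOTALL) is not None


def tab_width_for(filename, tab_width_filters, default_tab_width):
    """Return the tab width of the first filter that matches the filename,
    or the default tab width if no filter matches."""
    for patterns, width in tab_width_filters:
        if any(_matches(p, filename) for p in patterns.split()):
            return width
    return default_tab_width


# Hypothetical filters: each pairs a space-separated pattern list with a width.
filters = [("*.cpp *.h", 4), ("*.py", 8)]
print(tab_width_for("main.cpp", filters, 2))
print(tab_width_for("notes.txt", filters, 2))
```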

The filename-filter part of the tab-width filter consists of one or more filename patterns separated by spaces; for instance, "*.cpp *.h". If a filename matches one of the patterns, the tab width that is associated with the filter is used when the file is displayed. A pattern may be a literal filename or it may contain the wildcards "*" and "?", which have their usual meaning: "*" matches zero or more characters and "?" matches a single character.

7.3  File-set controls

The file-set controls, which can be found in the top row of the control panel, consist of a combo box for selecting the file-set type, a group of three buttons for inserting, duplicating and deleting file sets, and a group of four buttons for navigating the list of file sets and changing the position of the current file set in the list.

The index of the current file set and the number of file sets in the list are shown in a box between the two pairs of navigation buttons. "End" indicates that the file-set position is at the end of the list; there is no current file set. A file set may be inserted at the end of the list.

7.3.1  File-set type combo box

The combo box is used to select the file-set type. The pathname field and include and exclude fields are enabled or disabled according to the file-set type.

7.3.2  File-set command buttons

A file set can be added to and removed from the list of file sets with the commands that are associated with the group of three buttons in the top row of the control panel. Each command can also be issued from the keyboard.

7.3.2.1  Insert

The Insert command inserts a new file set into the list at the current file-set index. To add a new file set to the end of the list, first navigate to the end of the list. The Insert command can be issued by pressing the F2 key.

7.3.2.2  Duplicate

The Duplicate command makes a copy of the current file set, inserts the copy into the list after the current index, then selects the copy. The Duplicate command can be issued by pressing the F3 key.

7.3.2.3  Delete

The Delete command deletes the current file set after you have confirmed the deletion. The Delete command can be issued by pressing the F4 key.

7.3.3  File-set navigation buttons

The list of file sets can be navigated and the position of the current file set in the list can be changed with the commands that are associated with the group of four arrow buttons and barred-arrow buttons in the top row of the control panel. Each command can also be issued from the keyboard.

The arrow buttons select the previous or next file set in the list. The current file-set index continues to change while the mouse button is pressed or until the start or end of the list is reached. Holding down the Ctrl key while clicking on or pressing the arrow buttons will move the current file set up or down the list. The Go-to-previous and Go-to-next commands can be issued by pressing the F6 and F7 keys respectively. The Move-up and Move-down commands can be issued by pressing Ctrl+F6 and Ctrl+F7 respectively.

The barred-arrow buttons select the first file set in the list or go to the end of the list (where no file set is selected). The Go-to-start and Go-to-end commands can be issued by pressing the F5 and F8 keys respectively.

7.4  Parameter fields

The five most prominent components in the control panel are referred to as parameter fields, although two of them are text areas rather than fields. Along with the text view and result area, these fields determine the size of the application's main window, and their width (number of columns) can be set with the appearance.paramFieldNumColumns configuration property.

Each parameter field maintains a list of the most recent values that were entered in the field, up to a maximum of 64 values. A parameter field is similar in operation to an editable combo box except that an item is not moved to the top of the list when it is selected. (The order of items in the list may be changed in the editor; see below.)

A value is entered into the field explicitly by pressing Ctrl+Enter or implicitly when

When a value is entered in the field, it is inserted at the top of the list.

The list can be navigated and edited in several ways. Navigation and editing commands are available from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) or by pressing the context-menu key when the field has keyboard focus. The Select Previous Item and Select Next Item commands that are available from the pop-up menu can also be issued by pressing Ctrl+PageUp and Ctrl+PageDown respectively. The Delete command can be issued by pressing Ctrl+Shift+Delete.

All the parameter fields have an Edit command that displays an editor in which the items in the field's list can be edited. The command is available from the field's pop-up menu and can also be issued by pressing Alt+Enter. (For the filter fields, the command can be issued with the Edit button adjacent to the field.) Within the editor, the position of an item in the list can be changed by dragging with the mouse or by pressing Ctrl+Up or Ctrl+Down. The Delete key and Delete button delete the selected item after confirmation, while Shift+Delete and Shift+left-click on the Delete button delete the selected item without confirmation.

7.4.1  Pathname field

A pathname can be entered in the field by typing, by selecting a file using the Browse button adjacent to the field, or by dragging a file or directory from, for example, a file browser and dropping it onto the field or onto other parts of the main window.

7.4.2  Include and exclude fields

The fields contain a pathname filter: a set of patterns separated by spaces. Within the field, the backslash, "\", acts as an escape character to allow the inclusion of space characters in patterns. The escape convention in the filter fields is that a character following a "\" is treated as a literal character, and a single trailing "\" is ignored. Thus, you would use "\ " for a literal space and "\\" for a literal backslash. Because of this, it is recommended that you use a "/" to separate pathname components in patterns on the Windows platform.
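
The escape convention of the filter fields can be sketched as a small tokeniser. The Python below is an illustration, not the application's parser: the name split_filter_field is hypothetical, and treating a run of spaces as a single separator is an assumption.

```python
def split_filter_field(text):
    """Split the content of an include or exclude field into patterns.

    Patterns are separated by spaces, a character that follows a
    backslash is taken literally, and a single trailing backslash
    is ignored.
    """
    patterns, current, in_pattern = [], [], False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 < len(text):
                current.append(text[i + 1])   # escaped character, taken literally
                in_pattern = True
                i += 2
                continue
            # otherwise: a single trailing backslash, which is ignored
        elif ch == " ":
            if in_pattern:                    # a space ends the current pattern
                patterns.append("".join(current))
                current, in_pattern = [], False
        else:
            current.append(ch)
            in_pattern = True
        i += 1
    if in_pattern:
        patterns.append("".join(current))
    return patterns


print(split_filter_field(r"My\ Documents/*.txt *.log"))
```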

The individual patterns of a pathname filter can be edited from the Edit Pattern dialog — the third-level editor that is invoked with the Edit command in the Edit Filter dialog that is invoked by the Edit command in the list editor that is invoked by the Edit command in the include or exclude field. (Got that?) Note that no escape character is used in the pattern field of the Edit Pattern dialog.

7.4.3  Target and replacement fields

The target and replacement fields are actually text areas that can contain multiple lines of text. The replacement field is enabled only if the Replace checkbox is selected.

Text in the target and replacement fields can include tab characters (U+0009) and line-feed characters (U+000A), which are entered in the field by pressing Ctrl+Tab and Enter respectively. Line feeds are not displayed in a special way in the field, so, if your target or replacement isn't behaving as you expected, it may be that you have an unwanted (and invisible) line feed at the end of the field.

The fields may use a tab surrogate to display tab characters. Some characters in the fields may be escaped in two different ways: tabs and line feeds can be escaped separately, and an Escape command can be applied to the field. The Escape command behaves differently in the target field and the replacement field.

7.4.3.1  Tab surrogate

Within the target and replacement fields, tabs are replaced with the character that is specified by the appearance.tabSurrogate configuration property. The default tab surrogate is the tab character (U+0009) itself; in this case, tabs are displayed as a number of spaces up to the next tab stop, and the tab width is specified by the tabWidth.targetAndReplacement configuration property. It is important to understand that the tab surrogate is not just a substitute glyph: it actually replaces each occurrence of the tab character in the field unless tabs are escaped. When the content of the field is used (eg, in a search), the tab surrogate is converted either to a tab character or to a tab sequence ("\t") as appropriate, so you should choose as tab surrogate a character that is unlikely to appear in any target or replacement text.

7.4.3.2  Escaping tabs and line feeds

Tabs and line feeds may be escaped (ie, converted to the escape sequences "\t" and "\n" respectively) in the target and replacement fields by selecting the Tabs Escaped or Line Feeds Escaped item in the field's pop-up menu. (In reality, it is the tab surrogate that is converted to "\t", but the existence of tab surrogates is ignored in this section so as not to complicate matters.) Deselecting the menu item reverses the procedure: each occurrence of "\t" or "\n" is converted to a tab character or line-feed character, even if the "\" is itself escaped with another backslash. You will need to be careful about toggling the escaping of tabs and line feeds if the text contains literal "\t" or "\n" sequences. Within the field, the escaping of tabs and line feeds can be toggled from the keyboard with Ctrl+T and Ctrl+N respectively. Indicators appear alongside a field in which tabs or line feeds are escaped.
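
The toggling of escapes, and the reason for the caution about literal "\t" and "\n" sequences, can be sketched in Python. The functions below are an illustration only, with hypothetical names; like the text above, the sketch ignores tab surrogates.

```python
TAB = ("\t", "\\t")          # (character, escape sequence)
LINE_FEED = ("\n", "\\n")


def escape_chars(text, kind):
    """Convert each occurrence of the character to its escape sequence."""
    char, sequence = kind
    return text.replace(char, sequence)


def unescape_chars(text, kind):
    """Convert each escape sequence back to the character, even if the
    backslash of the sequence is itself escaped by another backslash."""
    char, sequence = kind
    return text.replace(sequence, char)


original = "a\tb"
escaped = escape_chars(original, TAB)
print(repr(escaped))
print(unescape_chars(escaped, TAB) == original)

# Pitfall: a backslash-escaped backslash followed by "t" is also
# converted, leaving a backslash followed by a real tab character.
print(repr(unescape_chars("\\\\t", TAB)))
```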

When a regular-expression search is performed, any tab characters and line-feed characters in the target field are escaped automatically in the target pattern that is used in the search. Tabs and line feeds are also escaped in the list of target or replacement items displayed in the editor, in the Select Item submenu displayed in the field's pop-up menu, and when targets and replacements are saved to a search-parameter file.

7.4.3.3  The Escape command for the target field

The Escape button adjacent to the target field is enabled only when the Regular expression checkbox is selected. The Escape command for the target field prefixes a "\" to each metacharacter in the field. The set of metacharacters on which the command operates is specified by the general.escapedMetacharacters configuration property. The default value of the property is the set of characters that are used in metasymbols outside a character class delimited by square brackets:

  $ ( ) * + . ? [ \ ] ^ { | }

("]" and "}" are not metacharacters but are included in the set for symmetry.)

If tabs or line feeds are escaped in the target field and the backslash character is in the set of escaped metacharacters, the "\" prefix to the escaped tabs and line feeds will itself be escaped by the Escape command. Unless the text contains literal "\t" or "\n" sequences, it may be best to unescape tabs and line feeds before issuing the Escape command.
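
The effect of the Escape command on the target field can be sketched in Python. The function below is an illustration with a hypothetical name; its default set of metacharacters is the default value of general.escapedMetacharacters that is listed above.

```python
DEFAULT_METACHARACTERS = "$()*+.?[\\]^{|}"


def escape_target(text, metacharacters=DEFAULT_METACHARACTERS):
    """Prefix a backslash to each metacharacter in the text."""
    return "".join("\\" + ch if ch in metacharacters else ch for ch in text)


print(escape_target("a.b*c"))

# An escaped tab ("\t") has its backslash escaped as well, because the
# backslash is in the default set of metacharacters.
print(escape_target("\\t"))
```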

7.4.3.4  The Escape command for the replacement field

The Escape command for the replacement field, which can be issued with the button adjacent to the field, prefixes an escape character to each escape character in the field. (The escape character for a replacement is specified by the general.replacementEscapeCharacter configuration property.) If the escape character is "\", it may be best to unescape tabs and line feeds before issuing the Escape command unless the text contains literal "\t" or "\n" sequences.

7.5  Result area

The result area is the text area at the bottom of the display in which the contents of a file are displayed. The result area is not editable. The following attributes of the result area are configurable:

The viewable width of the text view in columns is also applied to the result area. The physical widths of the two text areas are also dependent on their fonts, although both areas will be displayed with the same width because they are expanded, if necessary, to fit the width of the main window.

The maximum number of columns in the result area is fixed at 1024.

The colours of the result area are also applied to the text view.

7.6  Window size

The main window is not directly resizeable but its size can be modified indirectly by means of some of the configuration properties. As was mentioned above, the size of some text components — including the text view and result area — is determined by the font that they use to display text, as well as any properties that explicitly control their dimensions in terms of columns and rows. The two text areas will be expanded to fit the width of the main window, so changes to the properties that determine their size may not always be apparent. Any changes to configuration properties that affect the size of the main window will not take effect until the next time that RegexSearch is run.

8  Commands

RegexSearch's main commands are accessible from its main menu. Some of the commands are also accessible from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) while the mouse cursor is over one of the text areas or the background of the control panel.

8.1  File menu

8.1.1  Open Search Parameters

The Open Search Parameters command brings up a file-selection dialog in which you can choose the file that you want to open. If the file is of the correct format, the search parameters are loaded from it and RegexSearch's display is updated. If the current search parameters were read from a file, either automatically at startup or explicitly, and the parameters have changed since the file was read, you will be asked whether you want to save the current parameters before the new parameters are loaded.

8.1.2  Save Search Parameters

The Save Search Parameters command brings up a file-selection dialog in which you can choose the file to which you want to save the current set of search parameters. A file that is saved in this way can be specified as the default search parameters that will be loaded when RegexSearch starts up.

8.1.3  Exit

This command terminates the application. If you have made changes to search parameters that were read from a file, you will be asked whether you want to save them.

8.2  Edit menu

8.2.1  Edit File

The Edit File command executes a specified system command in a separate process. The command line, which is specified with the editor.command configuration property, may include a placeholder for the pathname of the file that is currently displayed in the text view. The intended purpose of the command line is to open the currently displayed file in a text editor, though it can be used for another purpose.

When using the Edit File command during a find-and-replace search, remember that the file in the text editor will not be synchronised with the file in RegexSearch's buffer, which may subsequently be written back to storage with modifications if replacements have been made in the file, even if the replacements were made before the Edit File command was issued. (If the Edit File command is issued while the Search Options dialog is displayed, the Next File option in the dialog can be used to discard any changes to the current file.)

8.2.2  Edit File - Deferred

This command is available only during a find-and-replace search. It behaves similarly to the Edit File command except that the associated system command (specified with the editor.command configuration property) is not executed until the search of the current file is finished and, if any replacements have been made, the modified file has been written to storage.

8.3  Search menu

8.3.1  Search

When you issue a Search command, RegexSearch first validates the search parameters and displays an error message for the first parameter that is invalid. If the file-set type is List, the specified list file is read and parsed. In a search of multiple files, the files are searched in the order described in the Directory and List file-set types.

Within a file, the search proceeds from the start of the file to the end. If a match of the target expression is found, the search will resume at the first character after the last character in the matched text, or, if a replacement is made, at the first character after the replacement.
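The resumption rule can be observed with the standard java.util.regex classes. The sketch below is an illustration under the assumption that RegexSearch's search loop behaves like Matcher.find() and Matcher.replaceAll(); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResumeSketch {
    // Returns the start offset of each successive match. After a match,
    // find() resumes at the first character after the matched text, so
    // matches never overlap.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    // Replaces every match. The search resumes after each replacement,
    // so the inserted text is not searched again.
    static String replaceAll(String regex, String replacement, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("aa", "aaaaa"));
        System.out.println(replaceAll("a", "aa", "aaa"));
    }
}
```

In the first call the matches start at offsets 0 and 2; the overlapping candidate at offset 1 is never considered. In the second call each "a" is replaced exactly once, giving "aaaaaa".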

When the first match of the target expression is found, the file in which the match occurred is displayed in the text view, and the matched text is highlighted. A Search Options dialog box is displayed; the type of dialog depends on the search mode, find or find-and-replace. As the Search Options dialog is non-modal, the text in the text view can be scrolled while the dialog is displayed.

The options in the Search Options dialog can be selected either by clicking on the appropriate button or by pressing a key or key combination. In addition to the usual Java Alt+<key> combination, each option (apart from Cancel, whose keyboard equivalent is the Escape key) can be selected by pressing the key by itself (ie, without the Alt key).

At the end of a search, the aggregate results are displayed in the result area. The results include a list of any files or directories that were not processed because of an error and a list of files or directories whose pathname could not be converted to canonical form. If the file-writing mode is Use a temporary file, preserve attributes, the results of a find-and-replace search include a separate list of files that were written but whose attributes were not set.

8.3.1.1  The Search Options dialog in find mode

In find mode, the Search Options dialog has four options:

Yes
The search is resumed in the current file. If no more matches are found in the current file, the search proceeds to the next file. If another occurrence is found, the matched text is highlighted and the Search Options dialog is displayed again.
Global
The search is resumed in the current file, and proceeds through all the files in the file set without displaying the Search Options dialog again. As each file is searched, if any matches are found in the file, the number of matches for that file is displayed in the result area. To make the search faster, files are not displayed in the text view during a global search.
A global search in find mode can be used to generate a list of files in which a match of the target expression was found.
Next File
The current file is skipped, and the search is resumed with the next file. If another occurrence is found, the matched text is highlighted and the Search Options dialog is displayed again.
Cancel
The search is terminated.

8.3.1.2  The Search Options dialog in find-and-replace mode

In find-and-replace mode, the Search Options dialog has seven options:

Yes
The occurrence of the matched text is replaced, and the search is resumed in the current file. If no more matches are found in the current file, the search proceeds to the next file. If another occurrence is found, the matched text is highlighted and the Search Options dialog is displayed again.
No
The occurrence of the matched text is not replaced, and the search is resumed in the current file. If no more matches are found in the current file, the search proceeds to the next file. If another occurrence is found, the matched text is highlighted and the Search Options dialog is displayed again.
Preview
The occurrence of the matched text is replaced. The replacement is displayed in the text view, highlighted. Another dialog box is displayed in which you are asked whether you want to keep the replacement, to restore the original text or to cancel the search. In that dialog, the Keep, Restore and Cancel options are equivalent respectively to the Yes, No and Cancel options in the Search Options dialog.
This File
The occurrence of the matched text is replaced, along with all remaining occurrences in the current file. The current file is saved, and the search proceeds to the next file. If another occurrence is found in a subsequent file, the matched text is highlighted and the Search Options dialog is displayed again.
Global
The occurrence of the matched text is replaced, along with all remaining occurrences in the current file and any subsequent files. The search proceeds through all the files in the file set without displaying the Search Options dialog again. To make the search faster, files are not displayed in the text view during a global search.
Next File
The occurrence of the matched text is not replaced. Any changes to the current file are saved and the search is resumed with the next file. If another occurrence is found, the matched text is highlighted and the Search Options dialog is displayed again.
Cancel
The search is terminated. Any changes to the current file are discarded.

8.3.1.3  How files are processed

Some aspects of RegexSearch's behaviour when processing files are worth noting in order that you may avoid the unwanted consequences of that behaviour. RegexSearch assumes that the files it reads during a search are text files that have a specified character set and encoding. It also assumes that certain characters or character sequences in the files are line separators. The implications of these two assumptions are discussed below.

When a file is read during a search, the bytes of the file are converted to 16-bit Unicode according to the configuration property general.charset. A charset is a combination of a character set and a character encoding, such as UTF-8, that maps between sequences of bytes and 16-bit Unicode values.

Within the file, all occurrences of the characters LF (U+000A) and CR (U+000D), and the character sequence (CR, LF) are treated as line separators. The type of line separator is recorded for possible later use. If the file contains more than one type of line separator, the most numerous type of line separator prevails. If the numbers of different types are equal, the precedence from highest to lowest is: LF – CR – CR+LF.
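The rule for choosing the prevailing line separator can be sketched as follows. This is a minimal illustration of the counting and tie-breaking described above, not RegexSearch's own code; the names are hypothetical, and returning LF for a file with no line separators is an assumption.

```java
final class LineSeparatorSketch {
    enum Kind { LF, CR, CRLF }

    // Counts each type of line separator and returns the most numerous.
    // Ties are resolved with the precedence LF, then CR, then CR+LF.
    static Kind detect(String text) {
        int lf = 0;
        int cr = 0;
        int crlf = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    crlf++;
                    i++;    // the LF belongs to the CR+LF pair
                } else {
                    cr++;
                }
            } else if (ch == '\n') {
                lf++;
            }
        }
        // Start with the highest-precedence type; a lower-precedence
        // type wins only if it is strictly more numerous.
        Kind best = Kind.LF;
        int bestCount = lf;
        if (cr > bestCount) {
            best = Kind.CR;
            bestCount = cr;
        }
        if (crlf > bestCount) {
            best = Kind.CRLF;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("a\r\nb\r\nc\n"));
        System.out.println(detect("a\nb\r\n"));
    }
}
```

The first call has two CR+LF separators and one LF, so CR+LF prevails; the second has one of each, so LF wins on precedence.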

In find mode, the processing of a file ends at this point: the processing is internal, and no physical changes are made to the stored file. In find-and-replace mode, a file may be modified as a result of a replacement, and the file written back to storage. If the general.preserveLineSeparator configuration property has the value yes, the file is written with the type of line separator that was detected when it was read; otherwise, it is written with an LF line separator.

The way in which a modified file is written to storage is determined by the general.fileWritingMode configuration property. A file may be written directly, or it may be written first to a temporary file that is renamed after the entire file has been written. If a temporary file is used, the owner, group and permissions of the file may be set to those of the original file on systems that support it. (Linux is the only system that is known to do so.) See the description of the general.fileWritingMode property for more details on its use.
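The temporary-file approach can be sketched with the java.io classes. This is an illustration of the general write-then-rename technique, not RegexSearch's implementation; the class name, method name and the "rs-" prefix are hypothetical, and the setting of owner, group and permissions is not shown.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

final class TempFileWriteSketch {
    // Writes 'text' to a temporary file in the same directory as 'target',
    // then replaces 'target' with the temporary file.
    static void write(File target, String text, String charsetName) throws IOException {
        File directory = target.getAbsoluteFile().getParentFile();
        File temp = File.createTempFile("rs-", ".tmp", directory);
        Writer writer = new OutputStreamWriter(new FileOutputStream(temp), charsetName);
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
        // The existing file is deleted only after the temporary file has
        // been completely written and closed.
        if (target.exists() && !target.delete()) {
            temp.delete();
            throw new IOException("Could not delete " + target);
        }
        if (!temp.renameTo(target)) {
            throw new IOException("Could not rename " + temp + " to " + target);
        }
    }
}
```

Creating the temporary file in the same directory as the target keeps the rename within one file system, which is the usual reason for this design choice.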

8.3.2  Copy Results

The Copy Results command copies the contents of the result area to the system clipboard. The general.copyResultsAsListFile configuration property controls the format of the text that is placed on the clipboard: the results can be either in the form in which they appear in the result area or in a form that is suitable for use as a list file in a new search, with match/replacement counts converted to comments.

8.3.3  Save Results

The Save Results command saves the list of files from the results of the last search (ie, the files in which an occurrence of the target was found). A list of files that is saved with this command can be used as the file set for a further search if you select Results as the file-set type.

8.3.4  View Saved Results

The View Saved Results command displays the last list of files to be saved with the Save Results command, which allows you to see the files that will comprise the file set if Results is selected as the file-set type.

8.4  Options menu

8.4.1  Preferences

The Preferences command brings up a tabbed dialog box in which the configuration properties of RegexSearch can be edited. The properties on the various tabbed pages are described below.

Some of the configuration properties in the Preferences dialog are edited with a spinner — a graphical component that consists of a text field adjacent to a pair of small buttons. The value in the text field may be edited manually, or it may be incremented and decremented by one of the following methods:

Using the last two methods, the amount by which the value is incremented or decremented can be modified by holding down the Ctrl, Shift or Ctrl+Shift keys, which correspond to increments of 10, 100 and 1000 respectively.

General
Character set and encoding
This property specifies the charset — a combined character set and character encoding — that is used to map between the bytes of a file and Unicode values when reading and writing files. Different implementations of Java may support different charsets, though every implementation must support a few common charsets, including ISO 8859-1 (Latin-1) and UTF-8. The combo box lists the charsets that are available in the current implementation. At the top of the list is the value <default charset>, which denotes the platform- and locale-dependent default charset.
The default value is determined at runtime by the Java virtual machine, depending on the locale and platform.
Escaped metacharacters
This is the set of characters that will be escaped (ie, characters that will have "\" prefixed to them) when the Escape command is applied to a regular-expression target.
The default value is $()*+.?[\]^{|}
Replacement escape character
This is the character that is used as the escape character in replacement expressions. Your choice of character is limited to the punctuation characters that are displayed in the combo box.
The default value is \ (backslash, U+005C).
Ignore case of filenames
If you select Yes, alphabetic case will be ignored when matching pathnames against the patterns in an inclusion or exclusion filter and when matching filenames against the patterns in a tab-width filter (eg, the filename pattern "*.txt" will match the filenames foo.txt and BAR.TXT).
The default value is No.
File-writing mode
This property determines how files that are modified during a find-and-replace search are written back to storage.
Direct
The file is written directly to an existing file (ie, the existing file is overwritten). Using this method, the file attributes are preserved but there is a risk that the existing file may be corrupted if there is a system failure while the file is being written.
Use a temporary file
The file is first written to a temporary file. When the temporary file has been written and closed, the existing file is deleted and the temporary file renamed. This is safer than the direct-writing mode but it does not preserve the file attributes on Linux/UNIX. (Files were always written in this way by RegexSearch prior to the introduction of this configuration property in version 2.2.)
Use a temporary file, preserve attributes
With this option, the file is first written to a temporary file, as with the previous option. After the temporary file has been renamed, the Linux/UNIX chmod, chgrp and chown commands are issued with the --reference option, which should set the file's permissions, group and owner to those of the original file. Linux is known to support the --reference option for these three commands; other UNIX-like systems may support it. Because it involves the additional execution of three system commands, this file-writing mode is slower than the other two.
The default value is Use a temporary file.
Preserve line separator type
If you select Yes, a file in which replacements are made during a find-and-replace search will be written with the same type of line separator — LF (U+000A), CR (U+000D) or CR+LF — that it had when it was read. (Files that have more than one type of line separator will be written with the type of line separator that is most numerous.) If this property has the value No, files modified by RegexSearch will be written with LF (UNIX-type) line separators.
The default value is Yes.
Display UNIX-style pathnames
If you select Yes, pathnames are displayed in a reduced "UNIX style" in some parts of the GUI. A pathname is converted from its platform-specific form in two steps:
  1. If the pathname starts with the user's home directory, the latter is replaced by "~".
  2. The file-separator character ("\" on Windows systems) is replaced by "/".
The default value is No.
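The two conversion steps can be sketched as follows. The names are hypothetical, and the check that the home directory ends at a file separator is an added assumption; the manual states only that the pathname starts with the home directory.

```java
final class UnixStylePathSketch {
    // Step 1: replace a leading home directory with "~".
    // Step 2: replace the platform's file separator with "/".
    static String toUnixStyle(String pathname, String homeDir, char fileSeparator) {
        String result = pathname;
        int n = homeDir.length();
        if (n > 0 && result.startsWith(homeDir)
                && (result.length() == n || result.charAt(n) == fileSeparator)) {
            result = "~" + result.substring(n);
        }
        return result.replace(fileSeparator, '/');
    }

    public static void main(String[] args) {
        System.out.println(toUnixStyle("C:\\Users\\ann\\docs\\a.txt", "C:\\Users\\ann", '\\'));
    }
}
```

For the Windows-style pathname in main, the result is "~/docs/a.txt".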
Copy search results as list file
This property controls the format of the text that the Copy Results command places on the system clipboard. If you select No, the results are in the form in which they appear in the result area of the main window. If you select Yes, the results are converted into a form that is suitable for use as a list file in a new search.
The default value is No.
Appearance
Look-and-feel
The look-and-feel (LAF) can be selected from a list of the LAFs that are available on the current system.
The default value is the cross-platform LAF, currently called Metal.
Orientation by locale
The user interface has been prepared programmatically for locales that are associated with languages that have a horizontal, right-to-left orientation (eg, Arabic, Hebrew). This is intended to facilitate the translation of RegexSearch into right-to-left languages. However, as the text displayed by RegexSearch is currently hard-coded in English, using the program in right-to-left locales would merely have the undesirable effect of reversing the layout of components in the GUI. This property controls whether the orientation of components is dependent on locale. If you select No, RegexSearch ignores the locale when laying out components (ie, the GUI is laid out left to right).
The default value is No.
Text antialiasing
This determines the kind of antialiasing that is performed when text is drawn in custom or partially customised user-interface components (eg, in non-editable combo boxes). Note that antialiasing is only a hint in Java; the implementation is not required to perform the chosen antialiasing.
This property has no effect on the antialiasing of text in standard UI components, such as labels and menus, which is determined by the Java implementation and the desktop setting for antialiasing text (often referred to as "font smoothing"). You can override the desktop setting with the unsupported system property awt.useSystemAAFontSettings.
This property does not control text antialiasing in the text view, which is configured independently with the property appearance.textViewTextAntialiasing.
The text antialiasing property can have the following values:
Default
The desktop setting for text antialiasing (font smoothing) is used, if the Java implementation recognises one; otherwise, no antialiasing is performed.
None
No antialiasing is performed.
Standard
This selects pixel-oriented antialiasing rather than subpixel antialiasing. It is suitable for non-LCD displays.
Subpixel, horizontal RGB
Subpixel, horizontal BGR
Subpixel, vertical RGB
Subpixel, vertical BGR
These four options are intended to optimise the rendering of text for LCD displays using subpixel antialiasing with subpixels in the chosen arrangement. Selecting an option that does not correspond to the actual arrangement of subpixels in your LCD display may result in blurred text. The most common arrangement of subpixels is horizontal RGB.
The default value is Default.
Number of columns in parameter fields
This determines the viewable width of the five parameter fields in the control panel. The width of a column is the width of a zero character (U+0030) in the field's font.
The default value is 80.
Number of rows in result area
This determines the viewable height of the result area.
The default value is 4.
Tab surrogate
This is the character that is used in place of a tab character in the target and replacement fields. Its role is described in the section on the tab surrogate. You can enter either a single character or four hexadecimal-digit characters in the tab surrogate field. The hexadecimal digits will be interpreted as a Unicode value. If a character is a control code or it cannot be displayed in the field's font, it is displayed in the field as its four-digit Unicode value.
The default value is the tab character, U+0009.
Text view: viewable size
These are the dimensions (number of columns × number of rows) of the area in which the contents of a file are displayed. The physical size of the text view is also determined by its font.
The default dimensions are 96 × 24.
Text view: maximum number of columns
This is the upper limit of the width of the text view; lines of text displayed in the text view are truncated at this limit. The limit applies only to displayed text: the actual text is not truncated. (The use of this property makes the display of text more efficient.)
The default value is 256.
Text view: text antialiasing
This determines the kind of antialiasing that is performed when text is drawn in the text view. It is independent from the general text antialiasing property in order to allow, for example, a bitmap font to be used in the text view. The values that this property may have are described for the appearance.textAntialiasing property.
The default value is Default.
Text area colours
These are the four colours that are used when drawing text in the text view, the result area and the fields in the Search Options dialog. Clicking on a colour button brings up a colour-selection dialog.
Tab width
Text view: tab-width filters, default tab width
When a file containing tab characters (U+0009) is displayed, RegexSearch uses two properties — a list of tab-width filters and a default tab width — to determine how the tab characters are converted to spaces. A tab-width filter is a filename filter that is mapped to a tab width. The filename filter consists of one or more space-separated filename patterns (eg, "*.c *.cpp *.h *.hpp"). Up to 64 tab-width filters can be specified. Filters are applied in the order in which they appear in the list. The default tab width is used for a file whose name does not match any filter.
New filters can be added to the list, and items in the list can be edited, deleted or their position in the list changed. The Delete key and Delete button delete the selected item after confirmation, while Shift+Delete and Shift+left-click on the Delete button delete the selected item without confirmation. The position of an item in the list can be changed by dragging it with the mouse, or by pressing Ctrl+Up or Ctrl+Down.
The default value of the default tab width is 8.
Target and replacement editors: tab width
This is the tab width that is used in the target and replacement editors if the tab surrogate is the tab character (U+0009) itself.
The default value of the tab width in the target and replacement editors is 8.
Editor
Command
This property can be used to specify a command line that will invoke a text editor in response to the Edit File and Edit File - Deferred commands. The pathname of the file that is currently displayed in the text view can be included in the command line so that the file will be opened in the text editor.
Within the command line, arguments must be separated with one or more spaces, and "%" (U+0025) acts as an escape character. "%f" is a placeholder for the pathname of the file that is to be edited. All other characters that follow "%" are treated as themselves; thus, a literal space is represented by "% " (ie, U+0025, U+0020), and a literal "%" is represented by "%%".
Pathnames in the command line may contain special constructs for system properties, environment variables and the user's home directory.
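The escaping rules for the command line can be sketched with a small parser. This is an illustration of the rules as stated above, not RegexSearch's own parser; the names are hypothetical, the handling of a lone "%" at the end of the line is an assumption, and the special pathname constructs are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class EditorCommandSketch {
    // Splits 'commandLine' into arguments at unescaped spaces.
    // "%f" is replaced by 'pathname'; any other character that follows
    // "%" is taken literally, so "% " is a space and "%%" is "%".
    static List<String> parse(String commandLine, String pathname) {
        List<String> arguments = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inArgument = false;
        for (int i = 0; i < commandLine.length(); i++) {
            char ch = commandLine.charAt(i);
            if (ch == '%' && i + 1 < commandLine.length()) {
                char next = commandLine.charAt(++i);
                if (next == 'f') {
                    current.append(pathname);
                } else {
                    current.append(next);
                }
                inArgument = true;
            } else if (ch == ' ') {
                if (inArgument) {
                    arguments.add(current.toString());
                    current.setLength(0);
                    inArgument = false;
                }
            } else {
                // Assumption: a "%" at the very end of the line is literal.
                current.append(ch);
                inArgument = true;
            }
        }
        if (inArgument) {
            arguments.add(current.toString());
        }
        return arguments;
    }

    public static void main(String[] args) {
        System.out.println(parse("C:\\Program% Files\\ed.exe -n %f", "a b.txt"));
    }
}
```

The call in main yields three arguments: the editor's pathname with its literal space, "-n", and the pathname of the file. A list of this kind could be passed to java.lang.ProcessBuilder to run the command in a separate process.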
File locations
Default search parameters
This is the pathname of the search-parameter file that will be loaded automatically when RegexSearch starts up. The pathname may contain special constructs for system properties, environment variables and the user's home directory.
Fonts
These are the fonts that are used in RegexSearch's display. The parameter text area font applies to the target and replacement fields. Remember that font names may be platform-dependent, so that a configuration that specifies font names may not work across platforms.
The default values of all the font properties are those of the default fonts for the platform and look-and-feel. A default font size is specified by leaving the size field empty (the minimum position on the spinner). A default font is used if no font name is specified in RegexSearch's configuration or if the named font is not available.

Some of the configuration properties will take effect when the Preferences dialog is accepted (by closing it with OK); other properties (eg, the look-and-feel and fonts) will not take effect until the next time that RegexSearch is run.

The configuration file is normally saved automatically when RegexSearch exits, if the configuration has changed. The Save Configuration command in the Preferences dialog can be used to save a configuration file explicitly.

9  Regular expressions

Within RegexSearch, the parsing and matching of regular expressions is performed by the Java regex engine. The purpose of this section is to present a summary of the syntax of Java's regular expressions, which is similar to that of Perl and Python. This section is not intended to be a tutorial on the use of regular expressions; see the references at the end of this section for suggested sources of further information.

Note: There are several differences between the syntax of regular expressions in Java and the syntax of regular expressions in Linux/UNIX tools such as sed and (g)awk.

In a search, the target pattern, replacement pattern and file are all composed of Unicode characters. RegexSearch converts files from bytes to 16-bit Unicode characters according to the scheme described in How files are processed. In particular, the line separators CR and CR+LF are converted to LF before a file is searched. Thereafter, by default, the only line separator recognised during a search is the line feed character (U+000A) unless the (?-d) flag appears in the target pattern.

When selected, the Ignore case checkbox in the main window enables the default form of case-insensitive matching, which applies only to characters in the US-ASCII charset. To apply case-insensitive matching to all Unicode characters, use the (?u) flag in the target pattern.
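The difference between the two forms of case-insensitive matching can be seen in the following sketch. It assumes that the Ignore case checkbox corresponds to Java's Pattern.CASE_INSENSITIVE flag, which the manual does not state explicitly; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class IgnoreCaseSketch {
    // Returns true if 'regex', compiled with 'flags', is found in 'input'.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String upper = "\u00c9COLE";    // E-acute (U+00C9) followed by "COLE"
        // Default case-insensitive matching covers US-ASCII letters only,
        // so e-acute (U+00E9) does not match E-acute.
        System.out.println(found("\u00e9cole", Pattern.CASE_INSENSITIVE, upper));
        // With the (?u) flag, case folding is applied to all Unicode characters.
        System.out.println(found("(?u)\u00e9cole", Pattern.CASE_INSENSITIVE, upper));
    }
}
```

The first call prints false and the second prints true.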

Within a regular expression, all characters are treated as literal characters except for twelve metacharacters — characters that have a special meaning and don't behave normally in regular expressions. The metacharacters are:

  $ ( ) * + . ? [ \ ^ { |

A metacharacter can be escaped — that is, its special meaning can be removed — by prefixing a backslash, "\", to it. An escaped metacharacter represents its corresponding literal character; thus, "\?" represents the character "?", and "\\" represents a literal backslash.

Some metacharacters are used by themselves within regular expressions; others are used to create special sequences called metasymbols. (In the documentation for java.util.regex.Pattern, metasymbols are referred to as constructs.) For example, several alphanumeric characters become metasymbols when preceded by a backslash.

9.1  Simple metacharacters and structural metasymbols

. By default, a dot matches any single character except a newline. The (?s) flag enables a mode in which a dot matches any character including a newline.
^ Matches the beginning of a line.
Example: ^# matches a "#" character at the beginning of a line.
$ Matches the end of a line or the end of the input string (in RegexSearch, the end of a file).
Example: ;$ matches a ";" character at the end of a line or at the end of a file.
\ The backslash has two roles:
  1. When it precedes a metacharacter (including itself), it escapes the metacharacter (ie, removes the special meaning of the metacharacter).
    Example: \* matches a "*" character.
  2. When it precedes some alphanumeric characters, it introduces a metasymbol. (Placing a backslash in front of an alphabetic character for which no metasymbol is defined will result in an error.)
    Example: \t matches a tab character (U+0009).
| The vertical bar separates alternatives.
Example: his|her|its matches any one of the strings "his", "her" or "its".
[ ] Matches one character from a character class — a set of characters enclosed within the square brackets. The set of characters can be specified in a number of ways. It may be:
  • An enumeration of characters.
    Example: [abc].
  • One or more ranges of characters, in which a hyphen, "-", separates the inclusive start and end of a range of contiguous characters.
    Example: [a-z], or [A-Za-z].
  • A union.
    Example: [0-9[A-F]], which is equivalent to [0-9A-F].
  • An intersection, in which the string "&&" separates sets of characters.
    Example: [a-e&&d-h], which is equivalent to [de].
If the first character within the square brackets is a circumflex, "^", the set of characters is negated; that is, the character class matches one character that is not in the set of characters that follows the "^".
Example: [^0-9] matches any character except a (Western) decimal digit; [a-z&&[^ij]] is equivalent to [a-hk-z].
( ) Encloses a capturing group. The set of characters within the parentheses is treated as a unit; eg, ^(foo|bar) matches either "foo" or "bar" at the beginning of a line. The group is called capturing because the text that it matched can be included later in the target pattern or in the replacement by specifying the index of the group in a metasymbol (see \n in Alphanumeric metasymbols).
A cluster — a non-capturing group — can be specified by enclosing a set of characters between "(?:" and ")" (eg, (?:foo|bar) matches either "foo" or "bar" without capturing it).
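Some of the examples above can be tried directly with java.util.regex. In this sketch the class and method names are hypothetical, and the use of Pattern.MULTILINE (so that "^" matches at the beginning of each line) is an assumption about how RegexSearch compiles the target pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StructureSketch {
    // Returns true if the whole of 'input' matches 'regex'.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns the text captured by group 1 of the first match, or null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Returns the number of capturing groups in 'regex'.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(matches("[0-9[A-F]]", "C"));         // union
        System.out.println(matches("[a-e&&d-h]", "d"));         // intersection
        System.out.println(matches("[a-z&&[^ij]]", "i"));       // negation
        System.out.println(firstGroup("^(foo|bar)", "x\nbar baz"));
        System.out.println(groupCount("(?:foo|bar)"));          // a cluster captures nothing
    }
}
```

The capturing group in ^(foo|bar) yields "bar" from the second line of the input, whereas the cluster (?:foo|bar) has no capturing groups at all.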

9.2  Quantifiers

Quantifiers specify how many times the preceding character or group should match. The different types of quantifier are available in three flavours, which Java refers to as greedy, reluctant and possessive. (Greedy quantifiers are also known as maximal, and reluctant quantifiers are also known as lazy or minimal.)

A greedy (maximal) quantifier starts by matching as much as possible of the input string. If this doesn't allow the whole pattern to be matched, the greedy quantifier matches progressively less of the input string until either the whole pattern is matched or the match fails.

A reluctant (minimal) quantifier starts by matching as little as possible of the input string. If this doesn't allow the whole pattern to be matched, the reluctant quantifier matches progressively more of the input string until either the whole pattern is matched or the match fails.

A possessive quantifier starts, like a greedy quantifier, by matching as much as possible of the input string. However, if this doesn't allow the whole pattern to be matched, no backing-up is performed, and the match fails.

Greedy   Reluctant  Possessive  Meaning
*        *?         *+          Matches zero or more times
+        +?         ++          Matches one or more times
?        ??         ?+          Matches once or not at all
{n}      {n}?       {n}+        Matches exactly n times
{n,}     {n,}?      {n,}+       Matches at least n times
{n,m}    {n,m}?     {n,m}+      Matches at least n times but not more than m times
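The three flavours of quantifier can be compared by applying the same pattern, with a different form of "+", to the same input. The class and method names in this sketch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierSketch {
    // Returns the text of the first match of 'regex' in 'input', or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xaaxbbx";
        // Greedy: .+ takes everything, then backs up to the last "x".
        System.out.println(firstMatch("x.+x", input));
        // Reluctant: .+? takes as little as possible, stopping at the next "x".
        System.out.println(firstMatch("x.+?x", input));
        // Possessive: .++ takes everything and never backs up, so the
        // final "x" of the pattern can never be matched.
        System.out.println(firstMatch("x.++x", input));
    }
}
```

The greedy pattern matches "xaaxbbx", the reluctant pattern matches "xaax", and the possessive pattern fails to match, so the method returns null.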

9.3  Alphanumeric metasymbols

\0n The character with octal value 0n, where n is in [0-7]
\0nn The character with octal value 0nn, where n is in [0-7]
\0mnn The character with octal value 0mnn, where m is in [0-3] and n is in [0-7]
\n The sequence matched by the nth capturing group, where n is the decimal index of the group (eg, \1)
\a The alert character (BEL), U+0007
\A The beginning of the input string (in RegexSearch, the beginning of a file)
\b A word boundary
\B Not a word boundary
\cX The control character, Control-X
\d A digit, [0-9]
\D A non-digit, [^0-9]
\e The escape character (ESC), U+001B
\E End the quotation of metacharacters started by \Q
\f The form feed character (FF), U+000C
\n The line feed character (LF), U+000A
\p{prop} Any character in the character class named prop
\P{prop} Any character not in the character class named prop
\Q Quote (escape) metacharacters until \E
\r The carriage return character (CR), U+000D
\s A whitespace character, [ \t\n\x0B\f\r]
\S A non-whitespace character, [^\s]
\t The tab character (HT), U+0009
\unnnn The Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f]
\w A word character, [0-9A-Za-z_]
\W A non-word character, [^\w]
\xnn The character with hexadecimal value 0xnn
\z The end of the input string (in RegexSearch, the end of a file)
\Z The end of the input string (in RegexSearch, the end of a file), apart from a final '\n'
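
As an illustration of three of the metasymbols listed above, the following sketch uses a back-reference (\1), word boundaries (\b) and quotation (\Q…\E). It is written against java.util.regex, not RegexSearch itself; MetasymbolDemo and found are hypothetical names, and the backslashes are doubled only because the patterns appear in Java string literals.

```java
import java.util.regex.Pattern;

// Illustrative only: MetasymbolDemo and found are hypothetical names.
class MetasymbolDemo {

    // Returns true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // \b(\w+)\s+\1\b : a word, whitespace, then the same word again (back-reference \1).
        String repeatedWord = "\\b(\\w+)\\s+\\1\\b";
        System.out.println(found(repeatedWord, "it is is here"));  // prints true
        // The leading \b stops the "is" inside "this" from counting as a word.
        System.out.println(found(repeatedWord, "this is here"));   // prints false

        // Unquoted, + is a quantifier, so the pattern would need "11=2" in the input.
        System.out.println(found("1+1=2", "1+1=2"));               // prints false
        // Between \Q and \E, + is an ordinary character.
        System.out.println(found("\\Q1+1=2\\E", "1+1=2"));         // prints true
    }
}
```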

9.4  Named character classes

Named character classes are metasymbols of the form \p{name} or \P{name}. There are three types of named character class: POSIX, Unicode and Java.

9.4.1  POSIX character classes

Lower A lowercase alphabetic character, [a-z]
Upper An uppercase alphabetic character, [A-Z]
ASCII An ASCII character, [\x00-\x7F]
Alpha An alphabetic character, [\p{Lower}\p{Upper}]
Digit A decimal digit character, [0-9]
Alnum An alphanumeric character, [\p{Alpha}\p{Digit}]
Punct Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Graph A visible character, [\p{Alnum}\p{Punct}]
Print A printable character, [\p{Graph}\x20]
Blank A space or a tab character, [ \t]
Cntrl A control character, [\x00-\x1F\x7F]
XDigit A hexadecimal digit character, [0-9a-fA-F]
Space A whitespace character, [ \t\n\x0B\f\r]
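
The POSIX classes are used like any other named character class, with \p{name} matching a member of the class and \P{name} matching a non-member. The following sketch, which uses java.util.regex directly, shows both forms in a replacement; PosixClassDemo and strip are hypothetical names used only for this illustration.

```java
// Illustrative only: PosixClassDemo and strip are hypothetical names.
class PosixClassDemo {

    // Removes every character of input that is matched by regex.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String input = "Hello, World! 42";

        // \p{Punct} matches the comma and the exclamation mark.
        System.out.println(strip("\\p{Punct}", input));  // prints Hello World 42

        // \P{Alpha} matches everything that is not an ASCII letter.
        System.out.println(strip("\\P{Alpha}", input));  // prints HelloWorld

        // \p{XDigit}+ matches a run of hexadecimal digits.
        System.out.println("CAFE01".matches("\\p{XDigit}+"));  // prints true
    }
}
```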

9.4.2  Unicode character classes

The Unicode character classes are too numerous to list all of them here. They include Unicode character blocks (eg, Greek) and character categories (eg, uppercase letters). When forming a metasymbol, In is prefixed to the name of a Unicode block (eg, \p{InGreek}), and Is is optionally prefixed to the name of a Unicode category (eg, \p{Lu} or \p{IsLu}).

The following table lists abbreviations for values in the Unicode General Category:

L Letter
Lu Letter, uppercase
Ll Letter, lowercase
Lt Letter, titlecase
Lm Letter, modifier
Lo Letter, other
M Mark
Mn Mark, non-spacing
Mc Mark, spacing combining
Me Mark, enclosing
N Number
Nd Number, decimal digit
Nl Number, letter
No Number, other
P Punctuation
Pc Punctuation, connector
Pd Punctuation, dash
Ps Punctuation, open
Pe Punctuation, close
Pi Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po Punctuation, other
S Symbol
Sm Symbol, mathematical
Sc Symbol, currency
Sk Symbol, modifier
So Symbol, other
Z Separator
Zs Separator, space
Zl Separator, line
Zp Separator, paragraph
Cc Other, control
Cf Other, format
Cs Other, surrogate
Co Other, private use
Cn Other, not assigned
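
The following sketch counts the characters of a short string that fall into a Unicode block and into several of the general categories listed above. It uses java.util.regex directly; UnicodeClassDemo and count are hypothetical names, and the Greek word is written with \u escapes so that the example does not depend on the encoding of the source file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: UnicodeClassDemo and count are hypothetical names.
class UnicodeClassDemo {

    // Returns the number of non-overlapping matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Five Greek letters (the first is a capital alpha), followed by " and Rome".
        String input = "\u0391\u03B8\u03AE\u03BD\u03B1 and Rome";

        System.out.println(count("\\p{InGreek}", input));  // block Greek: prints 5
        System.out.println(count("\\p{Lu}", input));       // uppercase letters: prints 2
        System.out.println(count("\\p{IsLu}", input));     // same category with Is prefix: prints 2
        System.out.println(count("\\p{L}", input));        // all letters: prints 12
        System.out.println(count("\\p{Zs}", input));       // space separators: prints 2
    }
}
```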

9.4.3  Java character classes

The Java character classes will probably be of interest only to Java programmers. The name of the character class is formed by substituting "java" for "is" in the name of a method of the java.lang.Character class that begins with "is". For example, the character class javaLetterOrDigit matches the characters for which java.lang.Character.isLetterOrDigit() returns true.
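
For Java programmers, the correspondence between a Java character class and its java.lang.Character method can be checked with a sketch such as the following, which compares the two for a handful of sample characters. JavaClassDemo and matchesClass are hypothetical names, and the sample characters are arbitrary.

```java
import java.util.regex.Pattern;

// Illustrative only: JavaClassDemo and matchesClass are hypothetical names.
class JavaClassDemo {

    private static final Pattern LETTER_OR_DIGIT =
            Pattern.compile("\\p{javaLetterOrDigit}");

    // Returns true if the single character c is matched by \p{javaLetterOrDigit}.
    static boolean matchesClass(char c) {
        return LETTER_OR_DIGIT.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // Arbitrary samples: letters, a digit, an underscore, a space and e-acute.
        char[] samples = { 'a', 'Z', '7', '_', ' ', '\u00E9' };
        for (char c : samples) {
            // The two columns are expected to agree for every sample.
            System.out.println((int) c + ": regex=" + matchesClass(c)
                    + " method=" + Character.isLetterOrDigit(c));
        }
    }
}
```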

9.5  Extended sequences

Extended sequences are metasymbols of the form (?...). The modifiers, [dimsux], and their "off" versions (preceded by a minus sign) can be concatenated within an extended sequence; for example, (?iu-ms) switches on i and u and switches off m and s.

(?:…) Non-capturing group (cluster).
(?>…) Non-capturing group referred to in Perl as a nonbacktracking subpattern.
(?d)
(?-d)
Enable/disable UNIX lines mode.
If enabled, only the UNIX line separator ('\n', U+000A) is recognised by the metacharacters ., ^ and $; otherwise, the following characters and character sequences are recognised as line separators: '\n' (U+000A), '\r' (U+000D), "\r\n" (U+000D, U+000A), U+0085, U+2028, U+2029.
UNIX lines mode is enabled by default.
(?i)
(?-i)
Enable/disable case-insensitive matching.
Case sensitivity is initially specified by the ignore case search parameter, but it can be changed within the target pattern by means of this flag. By default, case-insensitive matching applies only to characters in the US-ASCII charset, but this can be extended to all Unicode characters with the (?u) flag.
(?m)
(?-m)
Enable/disable multiline mode.
In multiline mode, the metacharacters ^ and $ match at the beginning and end, respectively, of a line; otherwise, they match only at the beginning and end of the input string (ie, the file).
Multiline mode is enabled by default.
(?s)
(?-s)
Enable/disable dotall mode.
In dotall mode (known in Perl as single-line mode), the . (dot) metacharacter matches any one character including a line separator; otherwise, . matches any one character except for a line separator.
(?u)
(?-u)
Enable/disable Unicode-aware case folding.
By default, the case-insensitive matching that is controlled by the ignore case search parameter and the (?i) flag applies only to characters in the US-ASCII charset. Using the (?u) flag, case-insensitive matching can be extended to all Unicode characters.
(?x)
(?-x)
Enable/disable comments mode.
In comments mode, whitespace and comments in the target pattern are ignored. A comment starts with a "#" character and ends at the end of the line in which it appears (for a single-line pattern, this is the end of the pattern).
(?=pattern) Positive lookahead: a zero-width assertion that is true if pattern immediately follows the assertion.
(?!pattern) Negative lookahead: a zero-width assertion that is true if pattern does not immediately follow the assertion.
(?<=pattern) Positive lookbehind: a zero-width assertion that is true if pattern immediately precedes the assertion.
(?<!pattern) Negative lookbehind: a zero-width assertion that is true if pattern does not immediately precede the assertion.
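
The following sketch exercises several of the extended sequences above. It uses java.util.regex directly, in which every flag is off until it is switched on in the pattern; this differs from the RegexSearch defaults described above, so the results of the patterns without a flag may not be the same in RegexSearch. ExtendedSequenceDemo and matches are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: ExtendedSequenceDemo and matches are hypothetical names.
class ExtendedSequenceDemo {

    // Returns every non-overlapping match of regex in input, in order.
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // (?i): case-insensitive matching.
        System.out.println(matches("(?i)error", "Error ERROR error"));  // [Error, ERROR, error]

        // (?m): ^ matches at the beginning of each line, not only of the input.
        System.out.println(matches("(?m)^\\w+", "one two\nthree four"));  // [one, three]
        System.out.println(matches("^\\w+", "one two\nthree four"));      // [one]

        // (?s): the dot also matches a line separator.
        System.out.println(matches("(?s)a.b", "a\nb").size());  // 1
        System.out.println(matches("a.b", "a\nb").size());      // 0

        // (?x): whitespace and the trailing comment are ignored.
        System.out.println(matches("(?x) \\d+ \\s* px  # a pixel length", "12 px"));  // [12 px]

        // Positive lookahead: digits followed by "px"; "px" is not part of the match.
        System.out.println(matches("\\d+(?=px)", "10px 20em 30px"));  // [10, 30]

        // Negative lookbehind: a number that is not immediately preceded by '$'.
        System.out.println(matches("(?<!\\$)\\b\\d+", "$15 and 7"));  // [7]
    }
}
```

Because lookahead and lookbehind are zero-width, the text that they test is left unconsumed, and so remains available to a later match or to the replacement string.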

9.6  References

The following sources were used in writing this section: