RegexSearch 3.4 : Manual

1  Introduction

RegexSearch is a Java application that performs find and find-and-replace searches for regular expressions on multiple text files. It is distributed under version 3 of the GNU General Public License; for details, see the file license.txt that is included in the RegexSearch distribution.

RegexSearch has the following features:

The website of the RegexSearch project is at http://regexsearch.sourceforge.net/.

2  Requirements

RegexSearch is a Java application that requires a Java runtime environment that supports Java SE 7 (Java 1.7), such as Oracle's Java Runtime Environment (JRE), version 7 or later.

3  Contents of the distribution

The following files are included in the distribution:

regexSearch.jar The executable JAR (Java archive) file of the RegexSearch application.
regexSearch-config.xml The configuration file for RegexSearch, which contains the default values for the configuration properties.
license.txt A copy of the licence under which RegexSearch is distributed (GNU General Public License, version 3).
dtd/searchParams.dtd The DTD (document type definition) of a RegexSearch search-parameter document. (RegexSearch does not use the DTD; it is provided only for reference.)
images/regexSearch.png
images/regexSearch.ico
A 48×48-pixel PNG image and a 48×48-pixel Windows-format icon that can be used to customise a desktop icon for the RegexSearch application.
manual/manual.html
manual/images/*.png
manual/scripts/*.js
manual/style/*.css
This manual, its image files, scripts and stylesheets. Any modifications to the manual for the latest version of RegexSearch will appear in the online version of the document, to which there is a link on the RegexSearch website.

4  Installing and running RegexSearch

RegexSearch consists of a single executable JAR (Java archive) file, regexSearch.jar, and a configuration file, regexSearch-config.xml, which contains user preferences. The use of the configuration file is optional but recommended. The application can be installed in two ways: with the RegexSearchInstaller program or by copying files from the .zip or .tar.gz archive of the RegexSearch executable distribution.

4.1  Executing a JAR file

Both the RegexSearch application and the installer are executable JAR (Java archive) files that require a Java runtime environment, which includes a program named java for running JAR files — a Java launcher. When you install a Java runtime environment, it may create an association on your system between JAR files and its Java launcher. (Oracle's Java runtime environment on Windows associates JAR files with an additional Java launcher named javaw that runs without a console window.) If so, or if you have created the association yourself, you will be able to run a JAR file directly (eg, by double-clicking on an icon of the JAR file in a file manager). If not, you can run a JAR file by invoking the java launcher tool from a command line and supplying the location of the JAR file as an argument. There are examples below of command lines for running the JAR file of the RegexSearch application under Linux/UNIX and Windows.

4.2  Installing RegexSearch

4.2.1  Installation with the installer program

The installer is an executable JAR (Java archive) file that requires the same Java runtime environment as the RegexSearch application itself. It can be run directly or indirectly in the ways described above.

In the opening display of the installer, you can choose the components that you want to install and the directories in which they will be installed. It is recommended that you install the configuration file in its default directory; the default directories of the other components should also be suitable for most users. If you install the executable file and the configuration file, a file named regexSearch-properties.xml will be generated and written to the same directory as the executable file to inform the RegexSearch application of the location of the configuration file. This file is required only if the configuration file was not installed in the default directory.

Any existing file that has the same name as an installed file will be overwritten without warning except for a configuration file, whose properties will be preserved if they conflict with the properties of the new file.

The final display of the installer has a Show files command that displays a list of files that were installed. If the installation was successful, you might want to keep a list of the files so that you will know where to find them when you uninstall RegexSearch, which does not have an automated means of uninstallation. If the installation failed, you might want to remove any files that were installed.

4.2.2  Direct installation

The direct installation of RegexSearch consists simply of copying the JAR file and, optionally, the default configuration file to suitable locations on your system. If the configuration file is not installed in its default directory, you will need to inform the RegexSearch application of the location of the configuration file, which can be done either on the command line or in a properties file.

4.3  Running RegexSearch

The RegexSearch application is an executable JAR (Java archive) file that requires a Java runtime environment. The JAR file can be run directly or indirectly in the ways described above.

If you run the RegexSearch application from a command line, the command line may contain configuration properties, including the location of a configuration file. The following subsections describe how to run RegexSearch from a command line.

4.3.1  Running under Linux/UNIX

Assuming that your PATH environment variable includes the path to the java tool and that you have copied regexSearch.jar to the directory /home/slothrop/bin/regexsearch/, the command

java -jar /home/slothrop/bin/regexsearch/regexSearch.jar

will run the RegexSearch application.

The file regexSearch.png can be used as the icon for the RegexSearch application.

4.3.2  Running under Windows

The RegexSearch application does not require a console window, so you can use the javaw launcher rather than the java launcher unless you particularly want a console window. Assuming that your PATH environment variable includes the path to the javaw tool and that you have copied regexSearch.jar to the directory C:\Program Files\RegexSearch\, the command

javaw -jar "C:\Program Files\RegexSearch\regexSearch.jar"

will run the RegexSearch application.

The file regexSearch.ico can be used as the icon for the RegexSearch application.

4.4  Uninstalling RegexSearch

RegexSearch does not have an automated means of uninstallation. To remove it from your system, delete the file regexSearch.jar from the location to which it was written when you installed RegexSearch. If you want to remove RegexSearch completely, you should also delete the configuration file, regexSearch-config.xml, which may be at its default location, and any other files that were installed (eg, the manual).

5  Configuration

When it starts up, RegexSearch is configured with configuration properties that are read from two sources: the command line that is used to run the Java launcher and a configuration file whose location may be explicitly specified. If the same property is specified on the command line and in a configuration file, the value from the configuration file takes precedence.

The recommended method of setting the properties in a configuration file is with the Options > Preferences command. For command-line properties, which must be edited manually, the form of the property values is given in an appendix, and it can also be inferred by generating a configuration file with the desired values and inspecting the contents of the file.

5.1  Command-line properties

When RegexSearch is run by means of the java launcher, configuration properties may be specified on the command line using the standard Java form -Dname="value"; eg, -Dapp.appearance.textViewViewableSize="96, 32". (The quotation marks around the value aren't necessary if the value doesn't contain spaces.) RegexSearch's command-line configuration properties all have the prefix "app.". A list of all the properties that are recognised by RegexSearch is given in an appendix.

One particular property, app.configDir, is used to specify the directory that contains a configuration file, as described below. The value of the app.configDir property may contain special constructions for system properties, environment variables and the user's home directory.
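The -Dname="value" form sets a Java system property, so a Java application can read such values through System.getProperties(). The following is a minimal sketch of collecting the properties that carry the "app." prefix; the class PrefixedProperties and its method are hypothetical names and are not taken from the RegexSearch source.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper (not part of RegexSearch): collects the properties
// whose names start with a given prefix, eg "app.".
final class PrefixedProperties {

    static Map<String, String> withPrefix(Properties props, String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            if (name.startsWith(prefix)) {
                result.put(name, props.getProperty(name));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values given as -Dname="value" on the command line of the java
        // launcher appear among the system properties.
        Map<String, String> appProps = withPrefix(System.getProperties(), "app.");
        for (Map.Entry<String, String> entry : appProps.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Running the sketch with, for instance, java -Dapp.configDir="~/config" PrefixedProperties would list the app.configDir property and its value.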

5.2  Configuration file

The configuration file is named regexSearch-config.xml. RegexSearch doesn't require a configuration file: it uses a default value for any configuration property that is missing from the source(s) of configuration. Similarly, if it finds a property value to be invalid, RegexSearch will display a message to this effect and use the default value of the property. If the configuration file contains a property that was specified on the command line, the value from the configuration file is used.
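The order of precedence described in this section (a default value, overridden by a command-line value, overridden in turn by a value from the configuration file) can be pictured as a layered merge. The following is only a sketch under that reading: the class ConfigPrecedence is hypothetical, the maps are assumed to be keyed by property name without the "app." prefix, and the property values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the precedence rules; not RegexSearch code.
final class ConfigPrecedence {

    // Later layers overwrite earlier ones, so the configuration file has the
    // last word and the defaults fill in anything that is missing.
    static Map<String, String> resolve(Map<String, String> defaults,
                                       Map<String, String> commandLine,
                                       Map<String, String> configFile) {
        Map<String, String> result = new HashMap<>(defaults);
        result.putAll(commandLine);
        result.putAll(configFile);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("appearance.textViewViewableSize", "80, 24");   // illustrative value
        defaults.put("general.replacementEscapeCharacter", "\\");

        Map<String, String> commandLine = new HashMap<>();
        commandLine.put("appearance.textViewViewableSize", "96, 32");

        Map<String, String> configFile = new HashMap<>();
        configFile.put("appearance.textViewViewableSize", "120, 40"); // illustrative value

        // The value from the configuration file is the one that is used.
        System.out.println(resolve(defaults, commandLine, configFile));
    }
}
```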

If the configuration has changed when you exit the application normally (ie, using the File > Exit command or an equivalent), RegexSearch will save its configuration to a configuration file. If a configuration file was read on startup, it will overwrite that file; otherwise, it will write a configuration file to the default directory described above, unless the value of the app.configDir property was an empty string.

A configuration file can be written explicitly with the Save configuration command within the Preferences dialog.

5.2.1  Location of the configuration file

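When it starts up, RegexSearch is informed of the location of the configuration file with the app.configDir property, which may be set in two ways: in a properties file named regexSearch-properties.xml that is located in the same directory as regexSearch.jar, or on the command line.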

If the app.configDir property is set both in the properties file and on the command line, the value in the properties file takes precedence.

The regexSearch-properties.xml file is normally written by the installer. If you create the file manually, it should have the following form, with the example pathname replaced by the actual pathname:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="app.configDir">/home/slothrop/.blankaspect/regexSearch</entry>
</properties>
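The form shown above is the standard XML format of Java properties, which can be read with java.util.Properties.loadFromXML(). As a way of checking a manually created file, the following sketch parses the text of such a document and extracts the app.configDir entry. The class PropertiesFileSketch is hypothetical, and the sketch makes no claim about how RegexSearch itself reads the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper (not part of RegexSearch).
final class PropertiesFileSketch {

    // Parses the text of an XML properties document and returns the value of
    // the app.configDir entry, or null if there is no such entry.
    static String configDirOf(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            // Malformed documents are reported as a subclass of IOException.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("app.configDir");
    }

    public static void main(String[] args) {
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "<entry key=\"app.configDir\">/home/slothrop/.blankaspect/regexSearch</entry>\n"
            + "</properties>\n";
        System.out.println(configDirOf(xml));
    }
}
```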

If the configuration file were located in a directory named config in the user's home directory, the sample command lines given above would become:

Linux/UNIX: java -Dapp.configDir="~/config" -jar /home/slothrop/bin/regexsearch/regexSearch.jar
Windows: javaw -Dapp.configDir="~/config" -jar "C:\Program Files\RegexSearch\regexSearch.jar"

The existence and value of the app.configDir property determines the locations that are searched for a configuration file:

6  Search parameters

The parameters of a search consist of:

The search parameters are stored collectively as an XML file. The DTD of a search-parameter file (searchParams.dtd) is included in the RegexSearch distribution. It is provided only for reference because RegexSearch does not validate search-parameter files against the DTD.

A search-parameter file may contain multiple file sets, targets and replacements, and each file set may contain multiple pathnames and pathname filters, though only a single pathname, inclusion filter, exclusion filter, target and replacement are used in a search. For each of those five parameters, RegexSearch's user interface allows you to select from the list of available values, to edit or delete existing values and to add new ones to the list.

When RegexSearch is run, search parameters are read from the file denoted by the configuration property path.defaultSearchParameters. The current set of search parameters can be saved with the File > Save search parameters command, and files saved in this way can be opened with the File > Open search parameters command. When you open a new search-parameter file or exit RegexSearch, you will be prompted to save the current search parameters if a search-parameter file was read, either automatically at startup or explicitly, and the parameters have changed since the file was read. A change to the file-set index or a parameter index is regarded as a change to the file.

6.1  File set

A file set specifies the files that are searched in a find or find-and-replace operation. A file set has a file-set kind and, depending on its file-set kind, it may also have a pathname and two kinds of pathname filter: an inclusion filter and an exclusion filter.

6.1.1  Pathname

Depending on the file-set kind, the pathname of a file set may be direct or indirect. A direct pathname denotes either a single file or a base directory that, in conjunction with inclusion and exclusion filters, defines the scope of a search. An indirect pathname denotes a file that contains a list of files and directories to be searched. A pathname may be absolute or relative; a relative pathname is relative to the current working directory.

6.1.2  Pathname filters

The inclusion filters and exclusion filters of a file set are two kinds of pathname filter. A pathname filter is a set of patterns that determines, usually in conjunction with a base pathname, the files that are searched. A file is included in a search if it matches at least one pattern in the inclusion filter AND none of the patterns in the exclusion filter. The maximum number of patterns in a filter is 64.
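The inclusion rule stated above (at least one inclusion pattern AND no exclusion pattern) can be expressed as a small predicate. In the following sketch the patterns are represented abstractly as java.util.function.Predicate objects; the class FilterLogic is a hypothetical name, and the sketch does not attempt to reproduce how RegexSearch matches a pattern against a pathname.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of the inclusion/exclusion rule; not RegexSearch code.
final class FilterLogic {

    // A file is searched if it matches at least one inclusion pattern
    // and none of the exclusion patterns.
    static boolean isSearched(String pathname,
                              List<Predicate<String>> inclusion,
                              List<Predicate<String>> exclusion) {
        boolean included = inclusion.stream().anyMatch(p -> p.test(pathname));
        boolean excluded = exclusion.stream().anyMatch(p -> p.test(pathname));
        return included && !excluded;
    }

    public static void main(String[] args) {
        List<Predicate<String>> inclusion = new ArrayList<>();
        inclusion.add(name -> name.endsWith(".txt"));
        List<Predicate<String>> exclusion = new ArrayList<>();
        exclusion.add(name -> name.startsWith("backup/"));

        System.out.println(isSearched("notes/todo.txt", inclusion, exclusion));
        System.out.println(isSearched("backup/todo.txt", inclusion, exclusion));
    }
}
```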

A pattern is a pathname that may include wildcards. There are three wildcards: two filename wildcards and a pathname wildcard.

6.1.2.1  Filename wildcards

The filename wildcards, '?' and '*', have their usual meaning: '?' matches a single character and '*' matches zero or more characters in a pathname component (a filename or directory name). For example, the pattern "foo*.txt" will match the filenames foo.txt, food.txt and football.txt. Following the UNIX convention (but differing from the Windows convention), a dot, '.', has no special significance in patterns: it is matched by the '*' wildcard. Thus, the pattern "foo*" will match the filenames foo, football.txt and food.store.log.
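One common way of implementing wildcards of this kind is to translate the pattern into a regular expression. The following is a minimal sketch of that technique, assuming case-sensitive matching; the class FilenameWildcards is hypothetical and is not necessarily how RegexSearch performs the match.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the filename wildcards; not RegexSearch code.
final class FilenameWildcards {

    // Translates a filename pattern into a regular expression: '?' becomes
    // "one character", '*' becomes "zero or more characters", and every
    // other character (including '.') is matched literally.
    static Pattern toRegex(String wildcardPattern) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcardPattern.length(); i++) {
            char ch = wildcardPattern.charAt(i);
            if (ch == '?') {
                regex.append("[^/]");
            } else if (ch == '*') {
                regex.append("[^/]*");
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String wildcardPattern, String filename) {
        return toRegex(wildcardPattern).matcher(filename).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] { "foo.txt", "food.txt", "football.txt", "bar.txt" }) {
            System.out.println(name + ": " + matches("foo*.txt", name));
        }
    }
}
```

Because the wildcards are translated to [^/] and [^/]*, they never match across a directory separator, which keeps a match within a single pathname component.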

6.1.2.2  Pathname wildcards

The pathname wildcard, '**', matches zero or more pathname components. Its use in a pathname pattern is analogous to the use of '*' in a filename pattern. By itself, the pattern '**' is the recursive analogue of '*': it matches all files in or below the base directory. Used as a pathname component in a larger pattern, '**' denotes a recursive portion of the pathname that may be bounded above or below by a non-recursive pathname. For example,

A pathname-filter pattern may be either relative to a base directory (in a directory or list file set), or it may be absolute. The pathname components in a pattern are separated with a '/' character (U+002F). (A '\' may be used as the directory separator on the Windows platform, but '/' is recommended because '\' is used as the escape character in filter fields.) A pattern that ends with a directory separator is assumed to be followed by an implicit '**'. A pattern that, when appended to its base directory, denotes an existing directory is assumed to be followed by an implicit '/**'. A pattern may contain dot and double-dot components ('.' and '..'), but only if they appear before the first wildcard in the pattern.

A file is matched against a pathname-filter pattern by converting both the pattern (appended to a base directory, if the pattern is relative) and the pathname of the target file to a canonical form. An error may occur in converting the pattern to canonical form if, for example, the resulting pathname is illegal or access to part of the file system is not permitted.
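The behaviour of the '**' wildcard can be illustrated by matching a pattern and a pathname component by component. The following sketch assumes that both have already been reduced to '/'-separated relative pathnames; it ignores canonicalisation, absolute patterns, dot components and the implicit '**' rules described above. The class PathnameWildcards is a hypothetical name, not part of RegexSearch.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the '**' pathname wildcard; not RegexSearch code.
final class PathnameWildcards {

    static boolean matches(String pattern, String pathname) {
        return matchFrom(pattern.split("/"), 0, pathname.split("/"), 0);
    }

    // Matches pattern components from index p against pathname components from index c.
    private static boolean matchFrom(String[] pat, int p, String[] path, int c) {
        if (p == pat.length) {
            return c == path.length;
        }
        if (pat[p].equals("**")) {
            // '**' may consume zero or more pathname components.
            for (int next = c; next <= path.length; next++) {
                if (matchFrom(pat, p + 1, path, next)) {
                    return true;
                }
            }
            return false;
        }
        return c < path.length
                && componentMatches(pat[p], path[c])
                && matchFrom(pat, p + 1, path, c + 1);
    }

    // Matches one component that may contain the filename wildcards '?' and '*'.
    private static boolean componentMatches(String pattern, String name) {
        StringBuilder regex = new StringBuilder();
        for (char ch : pattern.toCharArray()) {
            if (ch == '?') {
                regex.append('.');
            } else if (ch == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return Pattern.matches(regex.toString(), name);
    }

    public static void main(String[] args) {
        System.out.println(matches("src/**/*.java", "src/Main.java"));
        System.out.println(matches("src/**/*.java", "src/a/b/Main.java"));
        System.out.println(matches("src/**/*.java", "test/Main.java"));
    }
}
```

In this sketch the pattern "src/**/*.java" matches both src/Main.java (where '**' stands for zero components) and src/a/b/Main.java (where it stands for two), but not test/Main.java.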

6.1.3  File-set kinds

A file-set kind may be one of File, Directory, List, Results or Clipboard. The five kinds of file set are described below.

6.1.3.1  File

The search is performed on the file denoted by the pathname of the file set. No inclusion filter or exclusion filter is applied.

6.1.3.2  Directory

The pathname of the file set denotes the base directory of the search. (A search is not necessarily confined to this directory and the directories below it because the inclusion filter may contain patterns that denote pathnames outside the base directory.) An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern '**' (match all recursively) is assumed. Any relative patterns (see pathname filter) in the inclusion filter and exclusion filter are relative to the base directory.

The files in a directory are searched in order of filename. The ordering is lexicographic (ie, the Unicode values of characters in the filename are compared) and platform-dependent: it is case-sensitive on Linux/UNIX systems, but alphabetic case is ignored on Windows systems. Recursion is specified implicitly by pathname wildcards. A recursive search on a directory is depth-first: files in subdirectories are searched before the files in the directory. Like files, subdirectories are searched in order of name. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after the base directory.
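The platform-dependent ordering of names can be illustrated with two standard comparators. This is only a sketch: the class FilenameOrder is hypothetical, and the choice of comparators is an assumption about how such an ordering might be obtained, not a description of RegexSearch's code. (Java's natural ordering of strings compares UTF-16 code units.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the ordering of names; not RegexSearch code.
final class FilenameOrder {

    // Returns a sorted copy of the names: case is ignored if ignoreCase is
    // true (as on Windows) and significant otherwise (as on Linux/UNIX).
    static List<String> sorted(List<String> names, boolean ignoreCase) {
        Comparator<String> order = ignoreCase
                ? String.CASE_INSENSITIVE_ORDER
                : Comparator.<String>naturalOrder();
        List<String> result = new ArrayList<>(names);
        result.sort(order);
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("b.txt", "A.txt", "a.txt", "B.txt");
        System.out.println(sorted(names, false));
        System.out.println(sorted(names, true));
    }
}
```

With case-sensitive ordering, all names that start with an upper-case letter precede those that start with a lower-case letter, because the upper-case Latin letters have lower character values.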

6.1.3.3  List

The pathname of the file set is assumed to denote a text file, each of whose non-empty lines denotes the pathname of a file or directory that is to be searched. A line of the list file may contain a comment, beginning with a ';' character. If a line contains a comment, any characters after the last non-space character before the comment are ignored (eg, the line "simple-filename.txt ; file #23" is parsed as "simple-filename.txt"). Empty lines are ignored. The pathnames are validated before the search starts, and the search will not proceed unless each pathname denotes an existing directory or a regular file.
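The parsing of a list file, as described above, can be sketched as follows. The class ListFileParser is hypothetical, and the sketch makes two assumptions that the manual does not state: a ';' always starts a comment (ie, it cannot be part of a pathname), and a line that contains only a comment is treated as an empty line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of list-file parsing; not RegexSearch code.
final class ListFileParser {

    // Removes a comment (from the first ';' to the end of the line) and any
    // whitespace that precedes it or ends the line.
    static String parseLine(String line) {
        int commentStart = line.indexOf(';');
        String text = (commentStart < 0) ? line : line.substring(0, commentStart);
        int end = text.length();
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(0, end);
    }

    // Returns the pathnames in the list, ignoring empty lines.
    static List<String> parse(List<String> lines) {
        List<String> pathnames = new ArrayList<>();
        for (String line : lines) {
            String pathname = parseLine(line);
            if (!pathname.isEmpty()) {
                pathnames.add(pathname);
            }
        }
        return pathnames;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "simple-filename.txt ; file #23",
                "",
                "/home/slothrop/docs");
        System.out.println(parse(lines));
    }
}
```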

An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern '**' (match all recursively) is assumed.

The search of the pathnames in the list is equivalent to a sequence of searches on file sets whose file-set kind is either File or Directory according to whether a pathname denotes a file or directory. If the pathname denotes a directory, the inclusion filter and any exclusion filter are applied to it. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after all the pathnames in the list.

6.1.3.4  Results

The files that will be searched are those from the last list of files to be saved with the Search > Save results command. The current list of saved results can be viewed with the Search > View saved results command. No inclusion filter or exclusion filter is applied.

6.1.3.5  Clipboard

This denotes a "pseudo-file": the contents of the system clipboard. If the clipboard contains text, the text will be searched in the same way as if it were read from a file. In a find-and-replace search, if any changes are made to the text, the modified text is put back on the clipboard at the end of the search. No inclusion filter or exclusion filter is applied.

6.2  Target

The target of the search — the pattern that you are attempting to match in the files that are searched — can be either literal text or a regular expression. The corresponding kinds of search are referred to below as literal-text search and regular-expression search.

6.3  Replacement

The replacement is an expression that will be used to replace occurrences of the target pattern in a find-and-replace search. The interpretation of the replacement differs according to whether the target is literal text or a regular expression. Both types of replacement may contain metasymbols — special sequences that are introduced with an escape character. By default, the escape character is a backslash, '\', but it can be changed with the general.replacementEscapeCharacter configuration property if, for example, you want to avoid having to escape the backslashes in Windows pathnames. In a replacement, an escape character must always be escaped by prefixing another escape character to it (eg, '\\', if the escape character is '\').

6.3.1  Literal-text replacement

The following metasymbols may appear in a literal-text replacement string. It is assumed that the escape character is '\'.

\t Tab character, U+0009
\n Line-feed character, U+000A
\unnnn Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f]
\\ Literal escape character
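The effect of these metasymbols can be illustrated by a small routine that expands them in a replacement string. This is a sketch, not RegexSearch's implementation: the class LiteralReplacement is hypothetical, the escape character is assumed to be '\', and the treatment of an unrecognised or incomplete metasymbol as an error is an assumption.

```java
// Hypothetical illustration of literal-text replacement metasymbols; not RegexSearch code.
final class LiteralReplacement {

    static String expand(String replacement) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < replacement.length()) {
            char ch = replacement.charAt(i++);
            if (ch != '\\') {
                out.append(ch);
                continue;
            }
            if (i >= replacement.length()) {
                throw new IllegalArgumentException("Escape character at end of replacement");
            }
            char kind = replacement.charAt(i++);
            switch (kind) {
                case 't':
                    out.append('\t');          // tab, U+0009
                    break;
                case 'n':
                    out.append('\n');          // line feed, U+000A
                    break;
                case '\\':
                    out.append('\\');          // literal escape character
                    break;
                case 'u': {
                    // Exactly four hexadecimal digits must follow.
                    String hex = replacement.substring(i, Math.min(i + 4, replacement.length()));
                    if (!hex.matches("[0-9A-Fa-f]{4}")) {
                        throw new IllegalArgumentException("Malformed Unicode metasymbol");
                    }
                    out.append((char) Integer.parseInt(hex, 16));
                    i += 4;
                    break;
                }
                default:
                    throw new IllegalArgumentException("Unknown metasymbol: \\" + kind);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "\\u0041\\tB" denotes the replacement text \u0041\tB.
        System.out.println(expand("\\u0041\\tB"));
    }
}
```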

6.3.2  Regular-expression replacement

The following metasymbols may appear in a regular-expression replacement string. It is assumed that the escape character is '\'.

\t Tab character, U+0009
\n Line-feed character, U+000A
\unnnn Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f]
\\ Literal escape character
\n Capturing group in the target pattern, where n is the decimal index of the group (eg, \1)
\Ln Capturing group in the target pattern, where n is the decimal index of the group (eg, \L1). All alphabetic characters in the group are converted to lower case.
\Un Capturing group in the target pattern, where n is the decimal index of the group (eg, \U1). All alphabetic characters in the group are converted to upper case.
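The group metasymbols can be illustrated with java.util.regex. The following sketch expands a replacement string for each match of a target pattern; the class RegexReplacement is hypothetical, and it is not RegexSearch's implementation. It assumes that the escape character is '\', that a group index consists of all the decimal digits that follow the metasymbol, that a group that did not take part in the match is replaced with nothing, and that an unrecognised metasymbol is an error. The \unnnn metasymbol is omitted for brevity.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of regular-expression replacement; not RegexSearch code.
final class RegexReplacement {

    static String replaceAll(String text, String target, String replacement) {
        Matcher matcher = Pattern.compile(target).matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start());
            out.append(expand(replacement, matcher));
            last = matcher.end();
        }
        return out.append(text, last, text.length()).toString();
    }

    // Expands the metasymbols in the replacement for the current match.
    static String expand(String replacement, Matcher match) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < replacement.length()) {
            char ch = replacement.charAt(i++);
            if (ch != '\\') {
                out.append(ch);
                continue;
            }
            if (i >= replacement.length()) {
                throw new IllegalArgumentException("Escape character at end of replacement");
            }
            char kind = replacement.charAt(i++);
            boolean digit = (kind >= '0') && (kind <= '9');
            if (kind == 't') {
                out.append('\t');
            } else if (kind == 'n') {
                out.append('\n');
            } else if (kind == '\\') {
                out.append('\\');
            } else if (digit || kind == 'L' || kind == 'U') {
                // Read the decimal group index; for \L and \U it follows the letter.
                int start = digit ? i - 1 : i;
                int end = start;
                while (end < replacement.length()
                        && replacement.charAt(end) >= '0' && replacement.charAt(end) <= '9') {
                    end++;
                }
                if (end == start) {
                    throw new IllegalArgumentException("Missing group index");
                }
                String group = match.group(Integer.parseInt(replacement.substring(start, end)));
                if (group != null) {
                    if (kind == 'L') {
                        group = group.toLowerCase(Locale.ROOT);
                    } else if (kind == 'U') {
                        group = group.toUpperCase(Locale.ROOT);
                    }
                    out.append(group);
                }
                i = end;
            } else {
                throw new IllegalArgumentException("Unknown metasymbol: \\" + kind);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Target (\w+)@(\w+), replacement \U1 at \L2
        System.out.println(replaceAll("Bob@EXAMPLE", "(\\w+)@(\\w+)", "\\U1 at \\L2"));
    }
}
```

Under these assumptions, a replacement such as \10 is read as a reference to group 10 rather than to group 1 followed by a literal '0'.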

7  The display

The main display consists of two windows: the main window and the control dialog.

The main window is always visible. The control dialog may be hidden and made visible with the Hide control dialog / Show control dialog command.

7.1  Column width and row height

The width and/or height of some text components are specified in logical units of columns and rows. The width of a column and the height of a row are determined by the font that is used to display text within the component: the height of a row is the height of the font, and the width of a column is the width of a zero character (U+0030), or, if the font doesn't have a glyph for the zero character, the width of the glyph that is used for characters that are not defined.

7.2  Text view

The text view is the text area at the top of the main window in which the contents of a file are displayed. The text view is not editable. The following attributes of the text view are configurable:

The number of viewable columns in the text view is also used as the number of viewable columns in the result area. If the two areas use different fonts, the actual width of the text view and result area is the wider of the two areas.

The colours of the text view are also applied to the result area and to the fields in the Search options dialog.

7.2.1  How tab characters are displayed

When a file containing tab characters (U+0009) is displayed in the text view, RegexSearch uses two configuration properties — tab-width filters and default tab width — to determine how the tab characters are converted to spaces. A tab-width filter maps a filename filter (a set of patterns that are used to match filenames) to the number of spaces that will be used to replace tab characters when displaying a matching file. If none of the defined tab-width filters matches the file, the default tab width is used. If the tab width is zero, tab characters are not expanded but rendered as a U+2192 (rightwards arrow) character, or as the "not defined" glyph if the font doesn't contain a glyph for the arrow character.

The filename-filter part of the tab-width filter consists of one or more filename patterns separated by spaces; for instance, "*.cpp *.h". If a filename matches one of the patterns, the tab width that is associated with the filter is used when the file is displayed. A pattern may be a literal filename or it may contain the wildcards '*' and '?', which have their usual meaning: '*' matches zero or more characters and '?' matches a single character.
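The conversion of tab characters for display can be sketched as follows. The class TabExpansion is hypothetical, and the sketch rests on an assumption that the manual does not confirm for the text view: that a tab is expanded with spaces up to the next tab stop (a multiple of the tab width) rather than with a fixed number of spaces. It also assumes a single line of text in which every character occupies one column.

```java
// Hypothetical illustration of tab expansion; not RegexSearch code.
final class TabExpansion {

    // Expands the tabs in a single line of text. If the tab width is zero,
    // each tab is rendered as U+2192 (rightwards arrow) instead.
    static String expandTabs(String line, int tabWidth) {
        StringBuilder out = new StringBuilder();
        for (char ch : line.toCharArray()) {
            if (ch != '\t') {
                out.append(ch);
            } else if (tabWidth == 0) {
                out.append('\u2192');
            } else {
                // Assumption: pad with spaces up to the next tab stop.
                int spaces = tabWidth - (out.length() % tabWidth);
                for (int k = 0; k < spaces; k++) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + expandTabs("a\tb", 4) + "]");
        System.out.println("[" + expandTabs("a\tb", 0) + "]");
    }
}
```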

7.3  Result area

The result area is the text area at the bottom of the main window in which the results of a search are displayed. The result area is not editable. The following information is displayed in the result area after a search:

After a search, files that are listed in the result area can be opened with an external editor as though the Edit > Edit file command were issued on the file during a search. The command that invokes the external editor is issued by holding down the Ctrl key and clicking the left mouse button on the chosen pathname in the result area.

The following attributes of the result area are configurable:

The number of viewable columns in the text view is also used as the number of viewable columns in the result area. If the two areas use different fonts, the actual width of the text view and result area is the wider of the two areas.

The maximum number of columns in the result area is fixed at 1024.

The colours of the result area are also applied to the text view.

7.4  File-set controls

The file-set controls, which can be found in the top row of the control dialog, consist of a drop-down list for selecting the file-set kind, a group of three buttons for inserting, duplicating and deleting file sets, and a group of four buttons for navigating the list of file sets and changing the position of the current file set in the list.

The index of the current file set and the number of file sets in the list are shown in a box between the two pairs of navigation buttons. "End" indicates that the file-set position is at the end of the list; there is no current file set. A file set may be inserted at the end of the list.

7.4.1  File-set kind drop-down list

The drop-down list is used to select the file-set kind. The Pathname field and Include and Exclude fields are enabled or disabled according to the file-set kind.

7.4.2  File-set command buttons

A file set can be added to and removed from the list of file sets with the commands that are associated with the group of three buttons in the top row of the control dialog. Each command can also be issued from the keyboard.

7.4.2.1  Insert

The Insert command inserts a new file set into the list at the current file-set index. To add a new file set to the end of the list, first navigate to the end of the list. The Insert command can be issued by pressing the F2 key.

7.4.2.2  Duplicate

The Duplicate command makes a copy of the current file-set, inserts the copy into the list after the current index, then selects the copy. The Duplicate command can be issued by pressing the F3 key.

7.4.2.3  Delete

The Delete command deletes the current file-set after you have confirmed the deletion. The Delete command can be issued by pressing the F4 key.

7.4.3  File-set navigation buttons

The list of file sets can be navigated and the position of the current file set in the list can be changed with the commands that are associated with the group of four arrow buttons and barred-arrow buttons in the top row of the control dialog. Each command can also be issued from the keyboard.

The arrow buttons select the previous or next file set in the list. The current file-set index continues to change while the mouse button is pressed or until the start or end of the list is reached. Holding down the Ctrl key while clicking on or pressing the arrow buttons will move the current file set up or down the list. The Go to previous and Go to next commands can be issued by pressing the F6 and F7 keys respectively. The Move up and Move down commands can be issued by pressing Ctrl+F6 and Ctrl+F7 respectively.

The barred-arrow buttons select the first file set in the list or go to the end of the list (where no file set is selected). The Go to start and Go to end commands can be issued by pressing the F5 and F8 keys respectively.

7.5  Parameter fields

The five most prominent components in the control dialog are referred to as parameter fields. Two of the fields — the Target and Replacement fields — are text areas rather than fields, and the size of these text areas can be configured with the appearance.parameterEditorSize property. The width of these two fields determines the width of the other three parameter fields.

Each parameter field maintains a history list: a list of the most recent values that were entered in the field, up to a maximum of 64 items. A parameter field is similar in operation to a combo box except that an item is not moved to the top of the list when it is selected. (The order of items in the list may be changed in the editor; see below.)

A value is entered into the field explicitly by pressing Ctrl+Enter or implicitly when

When a value is entered in the field, it is inserted at the top of the list.

A history list can be navigated and edited in several ways. Navigation and editing commands are available from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) or by pressing the context-menu key when the field has keyboard focus. The Select previous item and Select next item commands that are available from the pop-up menu can also be issued by pressing Ctrl+PageUp and Ctrl+PageDown respectively. The Delete command can be issued by pressing Ctrl+Shift+Delete.

All the parameter fields have an Edit command that displays an editor in which the items in the field's history list can be edited. The command is available from the field's pop-up menu and can also be issued by pressing Alt+Enter. (For the filter fields, the command can be issued with the Edit button adjacent to the field.) Within the list editor, the position of an item in the list can be changed by dragging it with the mouse, or by pressing Ctrl+Shift+Up or Ctrl+Shift+Down when the list has keyboard focus. The Delete key and Delete button delete the selected item after confirmation, while Shift+Delete and Shift+left-click on the Delete button delete the selected item without confirmation.

7.5.1  Pathname field

A pathname can be entered in the field by typing, by selecting a file using the button adjacent to the field, or by dragging a file or directory from, for example, a file browser and dropping it onto the field, onto other parts of the control dialog or onto the main window.

7.5.2  Include and Exclude fields

The fields contain a pathname filter: a set of patterns separated by spaces. Within the field, the backslash, '\', acts as an escape character to allow the inclusion of space characters in patterns. The escape convention in the filter fields is that a character following a '\' is treated as a literal character, and a single trailing '\' is ignored. Thus, you would use '\ ' (a backslash followed by a space) for a literal space and '\\' for a literal backslash. Because of this, it is recommended that you use a '/' to separate pathname components in patterns on the Windows platform.
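The escape convention of the filter fields can be illustrated with a small routine that splits the content of a field into patterns. The class FilterFieldParser is hypothetical and is not RegexSearch's implementation; it assumes that consecutive unescaped spaces are treated as a single separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the filter-field escape convention; not RegexSearch code.
final class FilterFieldParser {

    static List<String> parsePatterns(String field) {
        List<String> patterns = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < field.length()) {
            char ch = field.charAt(i++);
            if (ch == '\\') {
                // The character after a '\' is literal; a trailing '\' is ignored.
                if (i < field.length()) {
                    current.append(field.charAt(i++));
                }
            } else if (ch == ' ') {
                // An unescaped space ends the current pattern.
                if (current.length() > 0) {
                    patterns.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0) {
            patterns.add(current.toString());
        }
        return patterns;
    }

    public static void main(String[] args) {
        // The Java literal below denotes the field content: My\ Documents/** *.txt
        System.out.println(parsePatterns("My\\ Documents/** *.txt"));
    }
}
```

In the example, the escaped space keeps "My Documents/**" together as one pattern, so the field yields two patterns rather than three.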

The individual patterns of a pathname filter can be edited from the Edit pattern dialog — the third-level editor that is invoked with the Edit command in the Edit filter dialog that is invoked by the Edit command in the list editor that is invoked by the Edit command in the Include or Exclude field. (Got that?) Note that no escape character is used in the Pattern field of the Edit pattern dialog.

The history list of the Exclude field may contain an empty string (ie, "exclude nothing").

7.5.3  Target and Replacement fields

The Target and Replacement fields are actually text areas that can contain multiple lines of text. The Replacement field is enabled only if the Replace check box is selected.

Text in the Target and Replacement fields can include tab characters (U+0009) and line-feed characters (U+000A), which are entered in the field by pressing Ctrl+Tab and Enter respectively. Line feeds are not displayed in a special way in the field, so, if your target or replacement isn't behaving as you expected, it may be that you have an unwanted (and invisible) line feed at the end of the field.

The fields may use a tab surrogate to display tab characters. Some characters in the fields may be escaped in two different ways: tabs and line feeds can be escaped separately, and an Escape command can be applied to the field. The Escape command behaves differently in the Target field and the Replacement field.

7.5.3.1  Tab surrogate

Within the Target and Replacement fields, tabs are replaced with the character that is denoted by the appearance.tabSurrogate configuration property. The default tab surrogate is the tab character (U+0009) itself; in this case, tabs are displayed as a number of spaces up to the next tab stop, and the tab width is denoted by the tabWidth.targetAndReplacement configuration property. It is important to understand that the tab surrogate is not just a substitute glyph: it actually replaces each occurrence of the tab character in the field unless tabs are escaped. When the content of the field is used (eg, in a search), the tab surrogate is converted either to a tab character or to a tab sequence ("\t") as appropriate, so you should choose as tab surrogate a character that is unlikely to appear in any target or replacement text.

7.5.3.2  Escaping tabs and line feeds

Tabs and line feeds may be escaped (ie, converted to the escape sequences "\t" and "\n" respectively) in the Target and Replacement fields by selecting the Tabs escaped or Line feeds escaped item in the field's pop-up menu. (In reality, it is the tab surrogate that is converted to "\t", but the existence of tab surrogates is ignored in this section so as not to complicate matters.) Deselecting the menu item reverses the procedure: each occurrence of "\t" or "\n" is converted to a tab character or line-feed character, even if the '\' is itself escaped with another backslash. You will need to be careful about toggling the escaping of tabs and line feeds if the text contains literal "\t" or "\n" sequences. Within the field, the escaping of tabs and line feeds can be toggled from the keyboard with Ctrl+T and Ctrl+N respectively. Indicators appear alongside a field in which tabs or line feeds are escaped.
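
The reason for the caution about literal "\t" and "\n" sequences can be seen in the following Java sketch, which models the two operations as plain textual substitutions. This is an assumption made for illustration (the class and method names are hypothetical), but it is consistent with the behaviour described above, in which a preceding backslash does not protect a sequence from being unescaped.

```java
final class TabEscapeSketch {

    // Converts tab and line-feed characters to the two-character sequences \t and \n.
    static String escape(String text) {
        return text.replace("\t", "\\t").replace("\n", "\\n");
    }

    // Converts every \t and \n sequence to a tab or line-feed character,
    // without checking whether the backslash is itself escaped.
    static String unescape(String text) {
        return text.replace("\\t", "\t").replace("\\n", "\n");
    }

    public static void main(String[] args) {
        String literal = "x\\ty";                    // field text: x\ty (backslash, then 't')
        String roundTrip = unescape(escape(literal));
        System.out.println(roundTrip.equals(literal));   // prints false
        System.out.println(roundTrip.indexOf('\t'));     // prints 1: the sequence became a tab
    }
}
```

Text that contains no literal "\t" or "\n" sequences survives the round trip unchanged; text that does contain them is altered by it.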

When a regular-expression search is performed, any tab characters and line-feed characters in the Target field are escaped automatically in the target pattern that is used in the search. Tabs and line feeds are also escaped in the list of target or replacement items displayed in the editor, in the Select item submenu displayed in the field's pop-up menu, and when targets and replacements are saved to a search-parameter file.

7.5.3.3  The Escape command for the Target field

The Escape button adjacent to the Target field is enabled only when the Regular expression check box is selected. The Escape command for the Target field prefixes a '\' to each metacharacter in the field. The set of metacharacters on which the command operates is denoted by the general.escapedMetacharacters configuration property. The default value of the property is the set of characters that are used in metasymbols outside a character class delimited by square brackets:

  $ ( ) * + . ? [ \ ] ^ { | }

(']' and '}' are not metacharacters but are included in the set for symmetry.)

If tabs or line feeds are escaped in the Target field and the backslash character is in the set of escaped metacharacters, the '\' prefix to the escaped tabs and line feeds will itself be escaped by the Escape command. Unless the text contains literal "\t" or "\n" sequences, it may be best to unescape tabs and line feeds before issuing the Escape command.
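
As an illustration of what the Escape command does to a target, the following Java sketch prefixes a '\' to each character that belongs to a given set of metacharacters. The class and method names are hypothetical; only the default set of metacharacters is taken from this manual.

```java
import java.util.regex.Pattern;

final class TargetEscapeSketch {

    // The default value of general.escapedMetacharacters, as listed above.
    static final String DEFAULT_METACHARACTERS = "$()*+.?[\\]^{|}";

    // Prefixes a backslash to each character of the text that is in the set.
    static String escape(String text, String metacharacters) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (metacharacters.indexOf(ch) >= 0) {
                out.append('\\');
            }
            out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("1+1=2?", DEFAULT_METACHARACTERS);
        System.out.println(escaped);                              // prints 1\+1=2\?
        System.out.println(Pattern.matches(escaped, "1+1=2?"));   // prints true

        // An escaped tab (field text: x\ty) has its backslash escaped too,
        // so the result matches the three characters \ t y, not a tab.
        System.out.println(escape("x\\ty", DEFAULT_METACHARACTERS));   // prints x\\ty
    }
}
```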

7.5.3.4  The Escape command for the Replacement field

The Escape command for the Replacement field, which can be issued with the button adjacent to the field, prefixes an escape character to each escape character in the field. (The escape character for a replacement is specified by the general.replacementEscapeCharacter configuration property.) If the escape character is '\', it may be best to unescape tabs and line feeds before issuing the Escape command unless the text contains literal "\t" or "\n" sequences.

7.6  Window size

The main window is not directly resizeable but its size can be modified indirectly by means of some of the configuration properties that relate to the text view and result area.

The control dialog is resizeable. The initial size of the dialog is determined by the appearance.parameterEditorSize configuration property, which can be edited in the Preferences dialog. If the control dialog has been resized using the GUI, the value of the appearance.parameterEditorSize property that is written to the configuration file when RegexSearch exits is obtained from the actual dimensions of the Target and Replacement fields, overriding any changes to the configuration property that were made in the Preferences dialog.

As was mentioned above, the size of some text components — including the text view and result area — is determined by the font that they use to display text, as well as any properties that explicitly control their dimensions in terms of columns and rows. Any changes to configuration properties that affect the size of the main window or control dialog will not take effect until the next time that RegexSearch is run.

8  Commands

RegexSearch's main commands are accessible from its main menu. Some of the commands are also accessible from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) while the mouse cursor is over one of the text areas or the background of the control dialog.

8.1  File menu

8.1.1  Open search parameters

The Open search parameters command brings up a file-selection dialog in which you can choose the file that you want to open. If the file has the correct format, the search parameters are loaded from it and the application's display is updated. If the current search parameters were read from a file, either automatically at startup or explicitly, and the parameters have changed since the file was read, you will be asked whether you want to save the current parameters before the new parameters are loaded.

8.1.2  Save search parameters

The Save search parameters command brings up a file-selection dialog in which you can choose the file to which you want to save the current set of search parameters. A file that is saved in this way can be specified as the default search parameters that will be loaded when RegexSearch starts up.

8.1.3  Exit

This command terminates the application. If you have made changes to search parameters that were read from a file, you will be asked whether you want to save them.

8.2  Edit menu

8.2.1  Edit file

The Edit file command executes a specified system command in a separate process. The command line, which is specified with the editor.command configuration property, may include a placeholder for the pathname of the file that is currently displayed in the text view. The intended purpose of the command line is to open the currently displayed file in a text editor, though it can be used for other purposes.

When using the Edit file command during a find-and-replace search, remember that the file in the text editor will not be synchronised with the file in RegexSearch's buffer, which may subsequently be written back to the file system with modifications if replacements have been made in the file, even if the replacements were made before the Edit file command was issued. (If the Edit file command is issued while the Search options dialog is displayed, the Next file option in the dialog can be used to discard any changes to the current file.)

After a search has finished, the external editor can be invoked on files that are selected from those listed in the result area.

8.2.2  Edit file, deferred

This command is available only during a find-and-replace search. It behaves similarly to the Edit file command except that the associated system command (specified with the editor.command configuration property) is not executed until the search of the current file is finished and, if any replacements have been made, the modified file has been written to the file system.

8.3  View menu

8.3.1  Hide control dialog / Show control dialog

If the control dialog is visible, this command is named Hide control dialog and it makes the control dialog invisible. If the control dialog is hidden, this command is named Show control dialog and it makes the control dialog visible.

8.4  Search menu

8.4.1  Search

When you issue a Search command, RegexSearch first validates the search parameters and displays an error message for the first parameter that is invalid. If the file-set kind is List, the specified list file is read and parsed. In a search of multiple files, the files are searched in the order described in the Directory and List file-set kinds.

Within a file, the search proceeds from the start of the file to the end. If a match of the target expression is found, the search will resume at the first character after the last character in the matched text, or, if a replacement is made, at the first character after the replacement.
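
The resumption rule matches the behaviour of java.util.regex.Matcher, whose find() method continues from the end of the previous match and whose appendReplacement() method does not rescan replacement text. The sketch below shows the consequences with the standard Java API; whether RegexSearch itself is structured in this way is an assumption, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SearchResumeSketch {

    // Counts matches; each search resumes after the last character of the previous match.
    static int countMatches(String target, CharSequence text) {
        Matcher matcher = Pattern.compile(target).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Replaces every match with literal replacement text; the replacement is never searched.
    static String replaceEach(String target, String replacement, CharSequence text) {
        Matcher matcher = Pattern.compile(target).matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(countMatches("aa", "aaaaa"));     // prints 2: matches do not overlap
        System.out.println(replaceEach("a", "aa", "aaa"));   // prints aaaaaa
    }
}
```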

When the first match of the target expression is found, the file in which the match occurred is displayed in the text view, and the matched text is highlighted. A Search options dialog box is displayed; the type of dialog depends on the search mode, find or find-and-replace. Because the Search options dialog is non-modal, the text in the text view can be scrolled while the dialog is displayed.

The options in the Search options dialog can be selected either by clicking on the appropriate button or by pressing a key or key combination. In addition to the usual Java Alt+<key> combination, each option (apart from Cancel, whose keyboard equivalent is the Escape key) can be selected by pressing the key by itself (ie, without the Alt key).

At the end of a search, the aggregate results are displayed in the result area. The results include a list of any files or directories that were not processed because of an error and a list of files or directories whose pathname could not be converted to canonical form. If the file-writing mode is "Use a temporary file, preserve attributes", the results of a find-and-replace search include a separate list of files that were written but whose attributes were not set.

8.4.1.1  The Search options dialog in find mode

In find mode, the Search options dialog has four options:

Yes
The search is resumed in the current file. If no more matches are found in the current file, the search proceeds to the next file. If another occurrence is found, the matched text is highlighted and the Search options dialog is displayed again.
Global
The search is resumed in the current file, and proceeds through all the files in the file set without displaying the Search options dialog again. As each file is searched, if any matches are found in the file, the number of matches for that file is displayed in the result area. To make the search faster, files are not displayed in the text view during a global search.
A global search in find mode can be used to generate a list of files in which a match of the target expression was found.
Next file
The current file is skipped, and the search is resumed with the next file. If another occurrence is found, the matched text is highlighted and the Search options dialog is displayed again.
Cancel
The search is terminated.

8.4.1.2  The Search options dialog in find-and-replace mode

In find-and-replace mode, the Search options dialog has seven options:

Yes
The occurrence of the matched text is replaced, and the search is resumed in the current file. If no more matches are found in the current file, the search proceeds to the next file. If another occurrence is found, the matched text is highlighted and the Search options dialog is displayed again.
No
The occurrence of the matched text is not replaced, and the search is resumed in the current file. If no more matches are found in the current file, the search proceeds to the next file. If another occurrence is found, the matched text is highlighted and the Search options dialog is displayed again.
Preview
The occurrence of the matched text is replaced. The replacement is displayed in the text view, highlighted. Another dialog box is displayed in which you are asked whether you want to keep the replacement, to restore the original text or to cancel the search. In that dialog, the Keep, Restore and Cancel options are equivalent respectively to the Yes, No and Cancel options in the Search options dialog.
This file
The occurrence of the matched text is replaced, along with all remaining occurrences in the current file. The current file is saved, and the search proceeds to the next file. If another occurrence is found in a subsequent file, the matched text is highlighted and the Search options dialog is displayed again.
Global
The occurrence of the matched text is replaced, along with all remaining occurrences in the current file and any subsequent files. The search proceeds through all the files in the file set without displaying the Search options dialog again. To make the search faster, the text view is not updated during a global search.
Next file
The occurrence of the matched text is not replaced. Any changes to the current file are saved and the search is resumed with the next file. If another occurrence is found, the matched text is highlighted and the Search options dialog is displayed again.
Cancel
The search is terminated. Any changes to the current file are discarded.

8.4.1.3  How files are processed

Some aspects of RegexSearch's behaviour when processing files are worth noting in order that you may avoid the unintended consequences of that behaviour. RegexSearch assumes that the files it reads during a search are text files that have a specified character set and encoding. It also assumes that certain characters or character sequences in the files are line separators. The implications of these two assumptions are discussed below.

When a file is read during a search, the bytes of the file are converted to 16-bit Unicode according to the configuration property general.characterEncoding. A character encoding, such as UTF-8, maps between sequences of bytes and 16-bit Unicode values.

Within the file, all occurrences of the characters LF (U+000A) and CR (U+000D), and the character sequence (CR, LF) are treated as line separators. The kind of line separator is recorded for possible later use. If the file contains more than one kind of line separator, the most numerous kind of line separator prevails. If the numbers of different kinds are equal, the precedence from highest to lowest is: LF – CR – CR+LF.
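
The rule for choosing the prevailing kind of line separator can be sketched in Java as follows. The class, enum and method names are hypothetical, and the result for text that contains no line separator at all (LF here) is an assumption, as the manual does not specify it.

```java
final class LineSeparatorSketch {

    // Declared in order of precedence, highest first.
    enum Kind { LF, CR, CRLF }

    // Returns the most numerous kind of line separator in the text;
    // ties are resolved in favour of the kind with the higher precedence.
    static Kind detect(CharSequence text) {
        int[] counts = new int[Kind.values().length];
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '\n') {
                counts[Kind.LF.ordinal()]++;
            } else if (ch == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    counts[Kind.CRLF.ordinal()]++;
                    i++;                                  // consume the LF of the CR+LF pair
                } else {
                    counts[Kind.CR.ordinal()]++;
                }
            }
        }
        Kind best = Kind.LF;
        for (Kind kind : Kind.values()) {
            // A strictly greater count is required, so an earlier (higher-precedence)
            // kind is kept when counts are equal.
            if (counts[kind.ordinal()] > counts[best.ordinal()]) {
                best = kind;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("a\r\nb\r\nc\n"));   // prints CRLF (two CR+LF, one LF)
        System.out.println(detect("a\rb\n"));          // prints LF (tie between LF and CR)
    }
}
```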

In find mode, the processing of a file ends at this point: the processing is internal, and no physical changes are made to the stored file. In find-and-replace mode, a file may be modified as a result of a replacement, and the file written back to the file system. If the general.preserveLineSeparator configuration property has the value true, the file is written with the kind of line separator that was detected when it was read; otherwise, it is written with an LF line separator.

The way in which a modified file is written to the file system is determined by the general.fileWritingMode configuration property. A file may be written directly, or it may be written first to a temporary file that is renamed after the entire file has been written. If a temporary file is used, the owner, group and permissions of the file may be set to those of the original file on systems that support it. (Linux is the only system that is known to do so.) See the description of the general.fileWritingMode property for more details on its use.
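
The temporary-file approach can be sketched with the standard java.nio.file API. This is an illustration of the general technique and not RegexSearch's own code: the class name, method name and temporary-file prefix are hypothetical, UTF-8 is assumed in place of the configured character encoding, and the sketch makes no attempt to restore the owner, group or permissions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TempFileWriteSketch {

    // Writes the text to a temporary file in the same directory as the target,
    // then replaces the target with the temporary file.
    static void writeViaTempFile(Path target, String text) throws IOException {
        Path directory = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(directory, "regexsearch", ".tmp");   // hypothetical prefix
        try {
            Files.write(temp, text.getBytes(StandardCharsets.UTF_8));         // assumed encoding
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(temp);   // removes the temporary file if the write or move failed
        }
    }

    public static void main(String[] args) throws IOException {
        Path directory = Files.createTempDirectory("sketch");
        Path target = directory.resolve("sample.txt");
        Files.write(target, "old text\n".getBytes(StandardCharsets.UTF_8));
        writeViaTempFile(target, "new text\n");
        System.out.print(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Because the original file is not touched until the temporary file has been written completely, a failure during writing leaves the original intact, which is the advantage of this mode over writing directly.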

8.4.2  Copy results

The Copy results command copies the contents of the result area to the system clipboard. The general.copyResultsAsListFile configuration property controls the format of the text that is placed on the clipboard: the results can be either in the form in which they appear in the result area or in a form that is suitable for use as a list file in a new search, with match/replacement counts converted to comments.

8.4.3  Save results

The Save results command saves the list of files from the results of the last search (ie, the files in which an occurrence of the target was found). A list of files that is saved with this command can be used as the file set for a further search if you select Results as the file-set kind.

8.4.4  View saved results

The View saved results command displays the last list of files to be saved with the Save results command, which allows you to see the files that will comprise the file set if Results is selected as the file-set kind.

8.5  Options menu

8.5.1  Preferences

The Preferences command brings up a tabbed dialog box in which the configuration properties of RegexSearch can be edited. The properties on the various tabbed pages are described below.

Some of the configuration properties in the Preferences dialog are edited with a spinner — a graphical component that consists of a text field adjacent to a pair of small buttons. The value in the text field may be edited manually, or it may be incremented and decremented by one of the following methods:

Using the last two methods, the amount by which the value is incremented or decremented can be modified by holding down the Ctrl, Shift or Ctrl+Shift keys, which correspond to increments of 10, 100 and 1000 respectively.

General
Character encoding
This property denotes the character encoding that is used to map between the bytes of a file and Unicode values when reading and writing files. Different implementations of Java may support different character encodings, though every implementation must support a few common encodings, including ISO 8859-1 (Latin-1) and UTF-8. The drop-down list contains the character encodings that are available in the current implementation. At the top of the list is the value <default encoding>, which denotes the platform- and locale-dependent default character encoding.
The default value is determined at runtime by the Java virtual machine, depending on the locale and platform.
Escaped metacharacters
This is the set of characters that will be escaped (ie, characters that will have '\' prefixed to them) when the Escape command is applied to a regular-expression target.
The default value is $()*+.?[\]^{|}
Replacement escape character
This is the character that is used as the escape character in replacement expressions. Your choice of character is limited to the punctuation characters that are displayed in the drop-down list.
The default value is \ (backslash, U+005C).
Ignore case of filenames
If you select Yes, alphabetic case will be ignored when matching pathnames against the patterns in an inclusion or exclusion filter and when matching filenames against the patterns in a tab-width filter (eg, the filename pattern "*.txt" will match the filenames foo.txt and BAR.TXT).
The default value is No.
File-writing mode
This property determines how files that are modified during a find-and-replace search are written back to the file system.
Direct
The file is written directly to an existing file (ie, the existing file is overwritten). Using this method, the file attributes are preserved but there is a risk that the existing file may be corrupted if there is a system failure while the file is being written.
Use a temporary file
The file is first written to a temporary file. When the temporary file has been written and closed, the existing file is deleted and the temporary file renamed. This is safer than the direct-writing mode but it does not preserve the file attributes on Linux/UNIX. (Files were always written in this way by RegexSearch prior to the introduction of this configuration property in version 2.2.)
Use a temporary file, preserve attributes
With this option, the file is first written to a temporary file, as with the previous option. After the temporary file has been renamed, the Linux/UNIX chmod, chgrp and chown commands are issued with the --reference option, which should set the file's permissions, group and owner to those of the original file. Linux is known to support the --reference option for these three commands; other UNIX-like systems may support it. Because it involves the additional execution of three system commands, this file-writing mode is slower than the other two.
The default value is Use a temporary file.
Preserve line-separator kind
If you select Yes, a file in which replacements are made during a find-and-replace search will be written with the same kind of line separator — LF (U+000A), CR (U+000D) or CR+LF — that it had when it was read. (Files that have more than one kind of line separator will be written with the kind of line separator that is most numerous.) If this property has the value No, files modified by RegexSearch will be written with LF (UNIX-style) line separators.
The default value is Yes.
Display UNIX-style pathnames
If you select Yes, pathnames are displayed in a reduced "UNIX style" in some parts of the GUI. A pathname is converted from its platform-specific form in two steps:
  1. If the pathname starts with the user's home directory, the latter is replaced by '~'.
  2. The file-separator character ('\' on Windows systems) is replaced by '/'.
The default value is No.
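
The two conversion steps for UNIX-style pathnames can be sketched in Java as follows. The class and method names are hypothetical, and the check that the home directory is followed by a file separator (so that a sibling directory with a longer name is not abbreviated) is an assumption that goes beyond the two steps listed above.

```java
final class UnixStylePathSketch {

    static String toUnixStyle(String pathname, String homeDir, char fileSeparator) {
        String result = pathname;

        // Step 1: replace a leading home directory with '~'.
        // Assumption: the match must end at a pathname-component boundary.
        if (pathname.startsWith(homeDir)
                && (pathname.length() == homeDir.length()
                    || pathname.charAt(homeDir.length()) == fileSeparator)) {
            result = "~" + pathname.substring(homeDir.length());
        }

        // Step 2: replace the platform's file separator with '/'.
        return result.replace(fileSeparator, '/');
    }

    public static void main(String[] args) {
        System.out.println(toUnixStyle("C:\\Users\\ann\\notes\\todo.txt", "C:\\Users\\ann", '\\'));
        // prints ~/notes/todo.txt
    }
}
```

In a running application, the home directory and separator would typically be obtained from System.getProperty("user.home") and java.io.File.separatorChar.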
Select text when focus is gained
If you select Yes, all the text in a text field will be automatically selected when the field gains keyboard focus, regardless of how the focus is transferred.
The default value is Yes.
Save location of main window
If you select Yes, the location of the main window on the screen will be saved to the configuration file when you exit the application. The next time that RegexSearch is run, its main window will be positioned at the previously saved location.
The default value is Yes.
Hide control dialog when searching
If this property has the value Yes and the control dialog has not been explicitly hidden, it is automatically hidden during a search and made visible again when the search ends. If the control dialog is hidden in this way, the Show control dialog command can be used to make it visible during a search.
The default value is No.
Copy search results as list file
This property controls the format of the text that the Copy results command places on the system clipboard. If you select No, the results are in the form in which they appear in the result area of the main window. If you select Yes, the results are converted into a form that is suitable for use as a list file in a new search.
The default value is No.
Appearance
Look-and-feel
The look-and-feel (LAF) can be selected from a list of the LAFs that are available on the current system.
The default value is the cross-platform LAF, currently called Metal.
Text antialiasing
This determines the kind of antialiasing that is performed when text is drawn in custom or partially customised user-interface components (eg, in drop-down lists). Note that antialiasing is only a hint in Java; the implementation is not required to perform the chosen antialiasing.
This property has no effect on the antialiasing of text in standard UI components, such as labels and menus, which is determined by the Java implementation and the desktop setting for antialiasing text (often referred to as "font smoothing"). You can override the desktop setting with the unsupported system property awt.useSystemAAFontSettings.
This property does not control text antialiasing in the text view, which is configured independently with the property appearance.textViewTextAntialiasing.
The text antialiasing property can have the following values:
Default
The desktop setting for text antialiasing (font smoothing) is used, if the Java implementation recognises one; otherwise, no antialiasing is performed.
None
No antialiasing is performed.
Standard
This selects pixel-oriented antialiasing rather than subpixel antialiasing. It is suitable for non-LCD displays.
Subpixel, horizontal RGB
Subpixel, horizontal BGR
Subpixel, vertical RGB
Subpixel, vertical BGR
These four options are intended to optimise the rendering of text for LCD displays using subpixel antialiasing with subpixels in the chosen arrangement. Selecting an option that does not correspond to the actual arrangement of subpixels in your LCD display may result in blurred text. The most common arrangement of subpixels is horizontal RGB.
The default value is Default.
Size of parameter editor
These are the dimensions (number of columns × number of rows) of the Target and Replacement fields in the control dialog. The width of a column is the width of a zero character (U+0030) in the field's font. If the control dialog is resized using the GUI, the size of the parameter editor that was set in the Preferences dialog is overridden by the actual size of the Target and Replacement fields when the configuration properties are saved on exiting RegexSearch.
The default dimensions are 80 × 4.
Number of rows in result area
This determines the viewable height of the result area.
The default value is 4.
Tab surrogate
This is the character that is used in place of a tab character in the Target and Replacement fields. Its role is described in the section on the tab surrogate. You can enter either a single character or four hexadecimal-digit characters in the Tab surrogate field. The hexadecimal digits will be interpreted as a Unicode value. If a character is a control code or it cannot be displayed in the field's font, it is displayed in the field as its four-digit Unicode value.
The default value is the tab character, U+0009.
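
The two forms of input that the Tab surrogate field accepts can be illustrated with a short Java sketch. The class and method names are hypothetical, and rejecting any other input with an exception is an assumption.

```java
final class TabSurrogateSketch {

    // Interprets the text of the field: a single character stands for itself,
    // and four hexadecimal digits are read as a Unicode value.
    static char parse(String fieldText) {
        if (fieldText.length() == 1) {
            return fieldText.charAt(0);
        }
        if (fieldText.matches("[0-9A-Fa-f]{4}")) {
            return (char) Integer.parseInt(fieldText, 16);
        }
        throw new IllegalArgumentException("Expected one character or four hexadecimal digits");
    }

    public static void main(String[] args) {
        System.out.println(parse("00BB") == '\u00BB');   // prints true
        System.out.println(parse("0009") == '\t');       // prints true: the default, U+0009
    }
}
```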
Text view: viewable size
These are the dimensions (number of columns × number of rows) of the area in which the contents of a file are displayed. The physical size of the text view is also determined by its font.
The default dimensions are 96 × 24.
Text view: maximum number of columns
This is the upper limit of the width of the text view; lines of text displayed in the text view are truncated at this limit. The limit applies only to displayed text: the actual text is not truncated. (The use of this property makes the display of text more efficient.)
The default value is 256.
Text view: text antialiasing
This determines the kind of antialiasing that is performed when text is drawn in the text view. It is independent from the general text antialiasing property in order to allow, for example, a bitmap font to be used in the text view. The values that this property may have are described for the appearance.textAntialiasing property.
The default value is Default.
Text area colours
These are the four colours that are used when drawing text in the text view, the result area and the fields in the Search options dialog. Clicking on a colour button brings up a colour-selection dialog.
Tab width
Text view: tab-width filters, default tab width
When a file containing tab characters (U+0009) is displayed, RegexSearch uses two properties — a list of tab-width filters and a default tab width — to determine how the tab characters are converted to spaces. A tab-width filter is a filename filter that is mapped to a tab width. The filename filter consists of one or more space-separated filename patterns (eg, "*.c *.cpp *.h *.hpp"). Up to 64 tab-width filters can be specified. Filters are applied in the order in which they appear in the list. The default tab width is used for a file whose name does not match any filter.
New filters can be added to the list, and items in the list can be edited, deleted or their position in the list changed. The Delete key and Delete button delete the selected item after confirmation, while Shift+Delete and Shift+left-click on the Delete button delete the selected item without confirmation. The position of an item in the list can be changed by dragging it with the mouse, or by pressing Ctrl+Shift+Up or Ctrl+Shift+Down when the list has keyboard focus.
The default value of the default tab width is 8.
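
The way in which a tab width is chosen for a file can be sketched in Java as follows. The class and method names are hypothetical, and the pattern syntax is assumed to be a simple glob in which '*' matches any sequence of characters and '?' matches any single character; the filter values in the example are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class TabWidthFilterSketch {

    // Assumed glob syntax: '*' = any sequence, '?' = any single character.
    static boolean matches(String filenamePattern, String filename) {
        StringBuilder regex = new StringBuilder();
        for (char ch : filenamePattern.toCharArray()) {
            if (ch == '*') {
                regex.append(".*");
            } else if (ch == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return Pattern.matches(regex.toString(), filename);
    }

    // Applies the filters in list order; the first filter with a matching
    // pattern determines the tab width, otherwise the default is used.
    static int tabWidthFor(String filename, Map<String, Integer> filters, int defaultTabWidth) {
        for (Map.Entry<String, Integer> filter : filters.entrySet()) {
            for (String pattern : filter.getKey().split(" +")) {
                if (!pattern.isEmpty() && matches(pattern, filename)) {
                    return filter.getValue();
                }
            }
        }
        return defaultTabWidth;
    }

    public static void main(String[] args) {
        Map<String, Integer> filters = new LinkedHashMap<>();   // preserves list order
        filters.put("*.c *.cpp *.h *.hpp", 4);
        filters.put("*.txt", 2);
        System.out.println(tabWidthFor("main.cpp", filters, 8));   // prints 4
        System.out.println(tabWidthFor("README", filters, 8));     // prints 8
    }
}
```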
Target and replacement editors: tab width
This is the tab width that is used in the target and replacement editors if the tab surrogate is the tab character (U+0009) itself.
The default value of the tab width in the target and replacement editors is 8.
Editor
Command
This property can be used to specify a command line that will invoke a text editor, either
The pathname of the file to be edited can be included in the command line so that the file will be opened in the text editor.
Within the command line, arguments must be separated with one or more spaces, and '%' (U+0025) acts as an escape character. '%f' is a placeholder for the pathname of the file that is to be edited. All other characters that follow '%' are treated as themselves; thus, a literal space is represented by '% ' (ie, U+0025, U+0020), and a literal '%' is represented by '%%'.
Pathnames in the command line may contain special constructions for system properties, environment variables and the user's home directory.
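
The escape convention of the command line can be illustrated with the following Java sketch, which splits a command line into arguments and substitutes the pathname for '%f'. It is not RegexSearch's own parser: the class and method names and the editor pathname are hypothetical, the treatment of a '%' at the very end of the command line (kept as a literal '%') is an assumption, and the special constructions for system properties, environment variables and the home directory are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class EditorCommandSketch {

    static List<String> parse(String commandLine, String pathname) {
        List<String> arguments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inArgument = false;
        for (int i = 0; i < commandLine.length(); i++) {
            char ch = commandLine.charAt(i);
            if (ch == '%' && i + 1 < commandLine.length()) {
                char next = commandLine.charAt(++i);
                // %f is the pathname placeholder; any other character stands for itself
                current.append(next == 'f' ? pathname : String.valueOf(next));
                inArgument = true;
            } else if (ch == ' ') {
                if (inArgument) {                         // one or more spaces end an argument
                    arguments.add(current.toString());
                    current.setLength(0);
                    inArgument = false;
                }
            } else {
                current.append(ch);                       // includes a final lone '%' (assumption)
                inArgument = true;
            }
        }
        if (inArgument) {
            arguments.add(current.toString());
        }
        return arguments;
    }

    public static void main(String[] args) {
        // Hypothetical editor location; note the escaped space in "Program% Files".
        System.out.println(parse("C:/Program% Files/Editor/edit.exe --line 1 %f", "D:/docs/a b.txt"));
        // prints [C:/Program Files/Editor/edit.exe, --line, 1, D:/docs/a b.txt]
    }
}
```

A list of arguments in this form could be passed to java.lang.ProcessBuilder to run the command in a separate process.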
File locations
Default search parameters
This is the pathname of the search-parameter file that will be loaded automatically when RegexSearch starts up. The pathname may contain special constructions for system properties, environment variables and the user's home directory.
Fonts
These are the fonts that are used in RegexSearch's display. Remember that font names may be platform-dependent, so that a configuration that specifies font names may not work across platforms.
The main font is used for various components including labels (static text), menus, buttons and list boxes.
The text field font is used for text fields, spinners and some other text components.
The combo box font is used for drop-down lists and related components.
The parameter editor font is used for the Target and Replacement fields.
The text view and result area fonts are used for the respective text areas.
The default values of all the font properties are those of the default fonts for the platform and look-and-feel. A default font size is specified by leaving the Size field empty (the minimum position on the spinner). A default font is used if no font name is specified in RegexSearch's configuration or if the named font is not available.

Some of the configuration properties will take effect when the Preferences dialog is accepted (by closing it with OK); other properties (eg, the look-and-feel and fonts) will not take effect until the next time that RegexSearch is run.

The configuration file is normally saved automatically when RegexSearch exits, if the configuration has changed. The Save configuration command in the Preferences dialog can be used to save a configuration file explicitly.

9  Regular expressions

Within RegexSearch, the parsing and matching of regular expressions is performed by the Java regex engine. The purpose of this section is to present a summary of the syntax of Java's regular expressions, which is similar to that of Perl and Python. This section is not intended to be a tutorial on the use of regular expressions; see the references at the end of this section for suggested sources of further information.

Note: There are several differences between the syntax of regular expressions in Java and the syntax of regular expressions in Linux/UNIX tools such as sed and (g)awk.

In a search, the target pattern, replacement pattern and file are all composed of Unicode characters. RegexSearch converts files from bytes to 16-bit Unicode characters according to the scheme described in How files are processed. In particular, the line separators CR and CR+LF are converted to LF before a file is searched. Thereafter, by default, the only line separator recognised during a search is the line feed character (U+000A) unless the (?-d) flag appears in the target pattern.

When selected, the Ignore case check box in the control dialog enables the default form of case-insensitive matching, which applies only to characters in the US-ASCII character encoding. To apply case-insensitive matching to all Unicode characters, use the (?u) flag in the target pattern.
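
The effect of the d and u flags can be checked directly with the Java regex engine. The following sketch uses only the standard java.util.regex API; the class and method names are hypothetical, and the sketch shows the behaviour of the engine itself, not the defaults that RegexSearch applies to a target pattern.

```java
import java.util.regex.Pattern;

final class RegexFlagSketch {

    // Returns true if the whole input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // With the d flag, LF is the only line terminator, so '.' matches CR (U+000D).
        System.out.println(fullMatch("(?d).", "\r"));    // prints true
        // Without the d flag, CR is a line terminator and '.' does not match it.
        System.out.println(fullMatch("(?-d).", "\r"));   // prints false

        // Case-insensitive matching is limited to US-ASCII unless the u flag is set.
        // The target is a lower-case e-acute (U+00E9), the input an upper-case one (U+00C9).
        System.out.println(fullMatch("(?i)\u00e9", "\u00c9"));    // prints false
        System.out.println(fullMatch("(?iu)\u00e9", "\u00c9"));   // prints true
    }
}
```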

Within a regular expression, all characters are treated as literal characters except for twelve metacharacters — characters that have a special meaning and don't behave normally in regular expressions. The metacharacters are:

  $ ( ) * + . ? [ \ ^ { |

A metacharacter can be escaped — that is, its special meaning can be removed — by prefixing a backslash, '\', to it. An escaped metacharacter represents its corresponding literal character; thus, '\?' represents the character '?', and '\\' represents a literal backslash.
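A brief sketch of escaping, using java.util.regex directly: the class name EscapeDemo is hypothetical. Bear in mind that the doubled backslashes are required only by Java string literals; in a RegexSearch target pattern a single backslash is typed.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Returns true if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // An unescaped dot is a metacharacter and matches any character.
        System.out.println(matches("a.b", "axb"));    // true
        // An escaped dot matches only a literal '.'.
        System.out.println(matches("a\\.b", "axb"));  // false
        System.out.println(matches("a\\.b", "a.b"));  // true
        // The regex \\ (two backslashes) matches one literal backslash.
        System.out.println(matches("\\\\", "\\"));    // true
    }
}
```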

Some metacharacters are used by themselves within regular expressions; others are used to create special sequences called metasymbols. (In the documentation for java.util.regex.Pattern, metasymbols are referred to as constructs.) For example, several alphanumeric characters become metasymbols when preceded by a backslash.

9.1  Simple metacharacters and structural metasymbols

. By default, a dot matches any single character except a newline. The (?s) flag enables a mode in which a dot matches any character including a newline.
^
Matches the beginning of a line.
Example: ^# matches a '#' character at the beginning of a line.
$
Matches the end of a line or the end of the input string (in RegexSearch, the end of a file).
Example: ;$ matches a ';' character at the end of a line or at the end of a file.
\ The backslash has two roles:
  1. When it precedes a metacharacter (including itself), it escapes the metacharacter (ie, removes the special meaning of the metacharacter).
    Example: \* matches a '*' character.
  2. When it precedes some alphanumeric characters, it introduces a metasymbol. (Placing a backslash in front of an alphabetic character for which no metasymbol is defined will result in an error.)
    Example: \t matches a tab character (U+0009).
|
The vertical bar separates alternatives.
Example: his|her|its matches any one of the strings "his", "her" or "its".
[ ] Matches one character from a character class — a set of characters enclosed within the square brackets. The set of characters can be specified in a number of ways. It may be:
  • An enumeration of characters.
    Example: [abc].
  • One or more ranges of characters, in which a hyphen, '-', separates the inclusive start and end of a range of contiguous characters.
    Example: [a-z], or [A-Za-z].
  • A union.
    Example: [0-9[A-F]], which is equivalent to [0-9A-F].
  • An intersection, in which the string "&&" separates sets of characters.
    Example: [a-e&&d-h], which is equivalent to [de].
If the first character within the square brackets is a circumflex, '^', the set of characters is negated; that is, the character class matches one character that is not in the set of characters that follows the '^'.
Example: [^0-9] matches any character except a (Western) decimal digit; [a-z&&[^ij]] is equivalent to [a-hk-z].
( )
Encloses a capturing group. The set of characters within the parentheses is treated as a unit; eg, ^(foo|bar) matches either "foo" or "bar" at the beginning of a line. The group is called capturing because the text that it matched can be included later in the target pattern or in the replacement by specifying the index of the group in a metasymbol (see \n in Alphanumeric metasymbols).
A cluster — a non-capturing group — can be specified by enclosing a set of characters between '(?:' and ')' (eg, (?:foo|bar) matches either "foo" or "bar" without capturing it).
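The character classes and groups described above can be tried out with a small sketch that calls java.util.regex directly. The class name ClassAndGroupDemo and its helper methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClassAndGroupDemo {
    // Returns true if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns the text captured by group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Returns the number of capturing groups in the regular expression.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(matches("[0-9[A-F]]", "C"));         // true: union
        System.out.println(matches("[a-z&&[^ij]]", "h"));       // true: intersection with a negated set
        System.out.println(matches("[a-z&&[^ij]]", "i"));       // false
        System.out.println(firstGroup("^(foo|bar)", "bar none")); // bar
        System.out.println(groupCount("(foo|bar)"));            // 1: capturing group
        System.out.println(groupCount("(?:foo|bar)"));          // 0: cluster
    }
}
```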

9.2  Quantifiers

Quantifiers specify how many times the preceding character or group should match. The different types of quantifier are available in three flavours, which Java refers to as greedy, reluctant and possessive. (Greedy quantifiers are also known as maximal, and reluctant quantifiers are also known as lazy or minimal.)

A greedy (maximal) quantifier starts by matching as much as possible of the input string. If this doesn't allow the whole pattern to be matched, the greedy quantifier matches progressively less of the input string until either the whole pattern is matched or the match fails.

A reluctant (minimal) quantifier starts by matching as little as possible of the input string. If this doesn't allow the whole pattern to be matched, the reluctant quantifier matches progressively more of the input string until either the whole pattern is matched or the match fails.

A possessive quantifier starts, like a greedy quantifier, by matching as much as possible of the input string. However, if this doesn't allow the whole pattern to be matched, no backing-up is performed, and the match fails.

Quantifiers Meaning
Greedy Reluctant Possessive
* *? *+ Matches zero or more times
+ +? ++ Matches one or more times
? ?? ?+ Matches once or not at all
{n} {n}? {n}+ Matches exactly n times
{n,} {n,}? {n,}+ Matches at least n times
{n,m} {n,m}? {n,m}+ Matches at least n times but not more than m times
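The three flavours of quantifier can be compared by applying them to the same input. The following is a minimal sketch that calls java.util.regex directly; the class name QuantifierDemo and the sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: takes as much as possible, then backs up to the last '>'.
        System.out.println(firstMatch("<.+>", input));   // <a><b>
        // Reluctant: takes as little as possible, so it stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input));  // <a>
        // Possessive: takes everything and never backs up, so the final '>' cannot match.
        System.out.println(firstMatch("<.++>", input));  // null
    }
}
```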

9.3  Alphanumeric metasymbols

\0n The character with octal value 0n, where n is in [0-7]
\0nn The character with octal value 0nn, where n is in [0-7]
\0mnn The character with octal value 0mnn, where m is in [0-3] and n is in [0-7]
\n The sequence matched by the nth capturing group, where n is a decimal number (eg, \1)
\a The alert character (BEL), U+0007
\A The beginning of the input string (in RegexSearch, the beginning of a file)
\b A word boundary
\B Not a word boundary
\cX The control character, Control-X
\d A digit, [0-9]
\D A non-digit, [^0-9]
\e The escape character (ESC), U+001B
\E End the quotation of metacharacters started by \Q
\f The form feed character (FF), U+000C
\n The line feed character (LF), U+000A
\p{prop} Any character in the character class named prop
\P{prop} Any character not in the character class named prop
\Q Quote (escape) metacharacters until \E
\r The carriage return character (CR), U+000D
\s A whitespace character, [ \t\n\x0B\f\r]
\S A non-whitespace character, [^\s]
\t The tab character (HT), U+0009
\unnnn The Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f]
\w A word character, [0-9A-Za-z_]
\W A non-word character, [^\w]
\xnn The character with hexadecimal value 0xnn
\z The end of the input string (in RegexSearch, the end of a file)
\Z The end of the input string (in RegexSearch, the end of a file), apart from a final '\n'
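A few of the metasymbols in the table are illustrated in the following sketch, which calls java.util.regex directly. The class name MetasymbolDemo and the sample strings are hypothetical; the doubled backslashes are required only by Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetasymbolDemo {
    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \b, \w, \s and the back-reference \1 together find a doubled word.
        System.out.println(firstMatch("\\b(\\w+)\\s+\\1\\b", "Paris in the the spring")); // the the
        // \Q ... \E quotes metacharacters, so '+' is treated as a literal character.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2")); // true
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));       // false
        // \Z tolerates a final line feed; \z does not.
        System.out.println(firstMatch("end\\Z", "the end\n"));       // end
        System.out.println(firstMatch("end\\z", "the end\n"));       // null
    }
}
```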

9.4  Named character classes

Named character classes are metasymbols of the form \p{name} or \P{name}. There are three types of named character class: POSIX, Unicode and Java.

9.4.1  POSIX character classes

Lower A lowercase alphabetic character, [a-z]
Upper An uppercase alphabetic character, [A-Z]
ASCII An ASCII character, [\x00-\x7F]
Alpha An alphabetic character, [\p{Lower}\p{Upper}]
Digit A decimal digit character, [0-9]
Alnum An alphanumeric character, [\p{Alpha}\p{Digit}]
Punct Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Graph A visible character, [\p{Alnum}\p{Punct}]
Print A printable character, [\p{Graph}\x20]
Blank A space or a tab character, [ \t]
Cntrl A control character, [\x00-\x1F\x7F]
XDigit A hexadecimal digit character, [0-9a-fA-F]
Space A whitespace character, [ \t\n\x0B\f\r]
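As the bracketed equivalents in the table suggest, the POSIX classes cover only US-ASCII characters by default. The sketch below, which calls java.util.regex directly, shows this; the class name PosixClassDemo is hypothetical.

```java
import java.util.regex.Pattern;

final class PosixClassDemo {
    // Returns true if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{XDigit}+", "CafE01"));  // true
        System.out.println(matches("\\p{Punct}", "!"));         // true
        System.out.println(matches("\\P{Digit}+", "abc"));      // true: \P negates the class
        // U+00C9 (E with acute accent) is upper case but is not in [A-Z].
        System.out.println(matches("\\p{Upper}", "\u00c9"));    // false
    }
}
```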

9.4.2  Unicode character classes

The Unicode character classes are too numerous to list all of them here. They include Unicode character blocks (eg, Greek) and character categories (eg, uppercase letters). When forming a metasymbol, In is prefixed to the name of a Unicode block (eg, \p{InGreek} ), and Is is optionally prefixed to the name of a Unicode category (eg, \p{Lu} or \p{IsLu} ).

The following table lists abbreviations for values in the Unicode General Category:

L Letter
Lu Letter, uppercase
Ll Letter, lowercase
Lt Letter, titlecase
Lm Letter, modifier
Lo Letter, other
M Mark
Mn Mark, non-spacing
Mc Mark, spacing combining
Me Mark, enclosing
N Number
Nd Number, decimal digit
Nl Number, letter
No Number, other
P Punctuation
Pc Punctuation, connector
Pd Punctuation, dash
Ps Punctuation, open
Pe Punctuation, close
Pi Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po Punctuation, other
S Symbol
Sm Symbol, mathematical
Sc Symbol, currency
Sk Symbol, modifier
So Symbol, other
Z Separator
Zs Separator, space
Zl Separator, line
Zp Separator, paragraph
C Other
Cc Other, control
Cf Other, format
Cs Other, surrogate
Co Other, private use
Cn Other, not assigned
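The following sketch, which calls java.util.regex directly, shows a Unicode block and some Unicode categories used as named character classes. The class name UnicodeClassDemo is hypothetical, and the non-ASCII characters are written as \u escapes so that the source file does not depend on a particular encoding.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    // Returns true if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // U+00C9 (E with acute accent) is an uppercase letter.
        System.out.println(matches("\\p{Lu}", "\u00c9"));      // true
        System.out.println(matches("\\p{IsLu}", "\u00c9"));    // true: the Is prefix is optional
        // U+03B1 (Greek small letter alpha) is in the Greek block.
        System.out.println(matches("\\p{InGreek}", "\u03b1")); // true
        // U+20AC (euro sign) is a currency symbol.
        System.out.println(matches("\\p{Sc}", "\u20ac"));      // true
        // \P negates the class: a digit is not a letter.
        System.out.println(matches("\\P{L}", "7"));            // true
    }
}
```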

9.4.3  Java character classes

The Java character classes will probably be of interest only to Java programmers. The name of the character class is formed by substituting 'java' for 'is' in the name of a method of the java.lang.Character class that begins with 'is'. For example, the character class javaLetterOrDigit is equivalent to java.lang.Character.isLetterOrDigit( ).
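The correspondence can be checked with a short sketch that compares the character class with the method from which its name is derived. The class name JavaClassDemo and the method agrees are hypothetical.

```java
import java.util.regex.Pattern;

final class JavaClassDemo {
    private static final Pattern LETTER_OR_DIGIT = Pattern.compile("\\p{javaLetterOrDigit}");

    // Returns true if the character class and Character.isLetterOrDigit agree about c.
    static boolean agrees(char c) {
        boolean byRegex = LETTER_OR_DIGIT.matcher(String.valueOf(c)).matches();
        return byRegex == Character.isLetterOrDigit(c);
    }

    public static void main(String[] args) {
        for (char c : new char[] { 'a', 'Z', '7', '_', ' ', '\u00e9' }) {
            System.out.println((int) c + ": " + agrees(c));
        }
    }
}
```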

9.5  Extended sequences

Extended sequences are metasymbols of the form (?...). The modifiers, [dimsux], and their "off" versions (preceded by a minus sign) can be concatenated within an extended sequence; for example, (?iu-ms) switches on i and u and switches off m and s.

(?:…) Non-capturing group (cluster).
(?>…) Non-capturing group referred to in Perl as a nonbacktracking subpattern.
(?d)
(?-d)
Enable/disable UNIX lines mode.
If enabled, only the UNIX line separator ('\n', U+000A) is recognised by the metacharacters ., ^ and $; otherwise, the following characters and character sequences are recognised as line separators: '\n' (U+000A), '\r' (U+000D), '\r\n' (U+000D, U+000A), U+0085, U+2028, U+2029.
UNIX lines mode is enabled by default.
(?i)
(?-i)
Enable/disable case-insensitive matching.
Case sensitivity is initially determined by the ignore case search parameter, but it can be changed within the target pattern by means of this flag. By default, case-insensitive matching applies only to characters in the US-ASCII character encoding, but this can be extended to all Unicode characters with the (?u) flag.
(?m)
(?-m)
Enable/disable multiline mode.
In multiline mode, the metacharacters ^ and $ match at the beginning and end, respectively, of a line; otherwise, they match only at the beginning and end of the input string (ie, the file).
Multiline mode is enabled by default.
(?s)
(?-s)
Enable/disable dotall mode.
In dotall mode (known in Perl as single-line mode), the . (dot) metacharacter matches any one character including a line separator; otherwise, . matches any one character except for a line separator.
(?u)
(?-u)
Enable/disable Unicode-aware case folding.
By default, the case-insensitive matching that is controlled by the ignore case search parameter and the (?i) flag applies only to characters in the US-ASCII character encoding. Using the (?u) flag, case-insensitive matching can be extended to all Unicode characters.
(?x)
(?-x)
Enable/disable comments mode.
In comments mode, whitespace and comments in the target pattern are ignored. A comment starts with a '#' character and ends at the end of the line.
(?=pattern) Positive lookahead: a zero-width assertion that is true if pattern immediately follows the assertion.
(?!pattern) Negative lookahead: a zero-width assertion that is true if pattern does not immediately follow the assertion.
(?<=pattern) Positive lookbehind: a zero-width assertion that is true if pattern immediately precedes the assertion.
(?<!pattern) Negative lookbehind: a zero-width assertion that is true if pattern does not immediately precede the assertion.
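Some of the extended sequences are illustrated in the following sketch, which calls java.util.regex directly. Note that plain Java leaves multiline mode off unless (?m) is given, whereas in RegexSearch, as stated above, it is enabled by default. The class name ExtendedSequenceDemo and the sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtendedSequenceDemo {
    // Returns the number of matches of the regular expression in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // Without multiline mode, ^ and $ match only at the ends of the input.
        System.out.println(count("^\\w+$", text));              // 0
        System.out.println(count("(?m)^\\w+$", text));          // 3
        // In dotall mode, the dot also matches a line separator.
        System.out.println(firstMatch("one.two", text) != null);     // false
        System.out.println(firstMatch("(?s)one.two", text) != null); // true
        // Comments mode: whitespace and the trailing comment are ignored.
        System.out.println(firstMatch("(?x) \\d+ - \\d+  # a range", "pages 10-20")); // 10-20
        // Lookahead and lookbehind are zero-width: the context is not part of the match.
        System.out.println(firstMatch("\\d+(?= dollars)", "pay 20 dollars")); // 20
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $15"));          // 15
    }
}
```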

9.6  References

The following sources were used in writing this section:

The Java documentation recommends the following book as providing a detailed treatment of the use of regular expressions:

  Friedl, Jeffrey, Mastering regular expressions, 3rd ed., O'Reilly, 2006. ISBN 0596528124.

Appendix A:  Special constructions in pathnames

Where indicated elsewhere in this document, pathname parameters and properties in RegexSearch can contain special constructions for system properties, environment variables and the user's home directory. The special constructions are expanded when the pathname is used.

System properties and environment variables
Java system properties (eg, the user's home directory, user.home) and environment variables (eg, PATH) are referenced by enclosing them between '${' and '}'; that is, they must have the form ${<name>}. A Java system property takes precedence over an environment variable with the same name.
• Example: ${user.home}/projects
• Example: ${HOME}/projects
A Java system property can be specified by prefixing sys. to it.
• Example: ${sys.user.home}/projects
An environment variable can be specified by prefixing env. to it.
• Example: ${env.HOME}/projects
User's home directory
A leading '~' in a pathname is expanded into the user's home directory using the user.home system property, which is usually equivalent to the environment variable $HOME on Linux/UNIX systems or %USERPROFILE% on Windows systems.
• Example: ~/projects
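The expansion rules above can be sketched in Java as follows. This is not RegexSearch's own code: the class PathnameExpander and its methods are hypothetical, and two details are assumptions made for the example, namely that a name that cannot be resolved is left unexpanded and that '~' is expanded only when it is the whole pathname or is followed by '/'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PathnameExpander {
    // Matches ${name}, capturing the name.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    // Resolves a name: "sys." selects a system property and "env." an environment variable;
    // otherwise a system property takes precedence over an environment variable.
    static String lookup(String name) {
        if (name.startsWith("sys.")) {
            return System.getProperty(name.substring(4));
        }
        if (name.startsWith("env.")) {
            return System.getenv(name.substring(4));
        }
        String value = System.getProperty(name);
        return (value != null) ? value : System.getenv(name);
    }

    static String expand(String pathname) {
        // A leading '~' stands for the value of the user.home system property.
        if (pathname.equals("~") || pathname.startsWith("~/")) {
            pathname = System.getProperty("user.home") + pathname.substring(1);
        }
        Matcher m = VARIABLE.matcher(pathname);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = lookup(m.group(1));
            // Assumption: an unresolved name is left as it was written.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("${user.home}/projects"));
        System.out.println(expand("${env.HOME}/projects"));
        System.out.println(expand("~/projects"));
    }
}
```

Matcher.quoteReplacement is used so that any '\' or '$' characters in an expanded value (eg, a Windows pathname) are inserted literally.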

Appendix B:  Configuration properties

The table below lists the configuration properties of RegexSearch. Apart from the app.configDir property, which, for obvious reasons, cannot be used within a configuration file, all properties can be used in the two configuration locations: command-line properties and configuration file.

When used in a -D command-line property, app. must be prefixed to the property key (eg, app.general.mainWindowLocation).

The <index> of an indexed property must be a three-digit decimal-string representation of the zero-based index of the property (eg, the third tab-width filter would be app.tabWidth.fileFilter.002).
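For Java programmers, the form of an indexed key can be sketched as follows. The class IndexedKeyDemo and the method fileFilterKey are hypothetical and are not part of RegexSearch.

```java
import java.util.Locale;

final class IndexedKeyDemo {
    // Builds the command-line key of a tab-width filter from its zero-based index.
    static String fileFilterKey(int index) {
        // %03d pads the index with zeros to three digits; Locale.ROOT keeps the digits ASCII.
        return String.format(Locale.ROOT, "app.tabWidth.fileFilter.%03d", index);
    }

    public static void main(String[] args) {
        System.out.println(fileFilterKey(2));  // app.tabWidth.fileFilter.002
    }
}
```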

When used in a configuration file, the components of the property keys become element names in the XML document hierarchy. (The app prefix of the property key in command-line properties corresponds to the root element of the XML document.) The form of properties in a configuration file was changed in version 2.1 of RegexSearch. Configuration files in the old format can be read, but files are written in the new format.

Any commas (',') or backslash characters ('\') in the name of a font must be escaped by prefixing a '\' character to them.
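The escaping rule for font names can be sketched in Java as follows. The class FontNameEscaper, its method and the sample font name are hypothetical and are not part of RegexSearch.

```java
final class FontNameEscaper {
    // Prefixes a backslash to every backslash and comma in a font name.
    static String escape(String fontName) {
        // Backslashes are escaped first so that those added for commas are not doubled.
        return fontName.replace("\\", "\\\\").replace(",", "\\,");
    }

    public static void main(String[] args) {
        System.out.println(escape("Example Sans, Condensed"));  // Example Sans\, Condensed
    }
}
```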

In the table below, the initial character of an italicised component of a property value denotes its data type according to the following convention:

i integer
p platform-specific pathname, which may contain special constructions
s string
c character

Property key Property value
configDir pPathname
appearance.lookAndFeel sName
appearance.parameterEditorSize iNumColumns, iNumRows
appearance.resultAreaNumRows iNumRows
appearance.tabSurrogate sUnicode4
appearance.textAntialiasing default | none | normal | subpixelHRgb | subpixelHBgr | subpixelVRgb | subpixelVBgr
appearance.textAreaColour.background iRed, iGreen, iBlue
appearance.textAreaColour.highlightBackground iRed, iGreen, iBlue
appearance.textAreaColour.highlightText iRed, iGreen, iBlue
appearance.textAreaColour.text iRed, iGreen, iBlue
appearance.textViewMaxNumColumns iNumColumns
appearance.textViewTextAntialiasing default | none | normal | subpixelHRgb | subpixelHBgr | subpixelVRgb | subpixelVBgr
appearance.textViewViewableSize iNumColumns, iNumRows
editor.command sCommand
font.comboBox sName, plain | bold | italic | boldItalic, iSize
font.main sName, plain | bold | italic | boldItalic, iSize
font.parameterEditor sName, plain | bold | italic | boldItalic, iSize
font.resultArea sName, plain | bold | italic | boldItalic, iSize
font.textField sName, plain | bold | italic | boldItalic, iSize
font.textView sName, plain | bold | italic | boldItalic, iSize
general.characterEncoding sName
general.controlDialogLocation iX, iY
general.copyResultsAsListFile false | true
general.escapedMetacharacters sCharacters
general.ignoreFilenameCase false | true
general.fileWritingMode direct | useTempFile | useTempFilePreserveAttributes
general.hideControlDialogWhenSearching false | true
general.mainWindowLocation iX, iY
general.preserveLineSeparator false | true
general.replacementEscapeCharacter cCharacter
general.selectTextOnFocusGained false | true
general.showUnixPathnames false | true
path.defaultSearchParameters pPathname
tabWidth.default iNumChars
tabWidth.fileFilter.<index> sPatterns : iNumChars
tabWidth.targetAndReplacement iNumChars

Appendix C:  Providing feedback about RegexSearch

The RegexSearch project is hosted by SourceForge. You can submit bug reports, feature requests and suggestions for improvement through the SourceForge website, but the mechanism for doing so may change depending on the facilities that SourceForge provides. For current information, please see the feedback page for Blank Aspect projects.

When reporting a problem with RegexSearch, please try to include enough relevant information to enable the problem to be reproduced. You should include at least the following information:

A Java stack trace, if one is available, would be helpful.

Last modified: 2014-10-14