RegexSearch is a Java application that performs find and
find-and-replace searches for regular expressions on multiple text
files. It is distributed under version 3 of the GNU General Public
License; for details, see the file license.txt that is
included in the RegexSearch distribution.
RegexSearch has the following features:
The website of the RegexSearch project is at http://regexsearch.sourceforge.net/ .
RegexSearch is a Java application that requires a Java runtime environment that supports Java 1.6, such as Sun's Java Runtime Environment (JRE), version 6.0 or later.
The following files are included in the distribution:
Note: RegexSearch is distributed without an automated means of installation. Because of this, the instructions below assume that you have a basic knowledge of environment variables and command lines appropriate to the system on which RegexSearch is to be installed.
RegexSearch consists of a single JAR (Java archive) file,
regexSearch.jar. It is recommended that RegexSearch be
used with a configuration file,
regexSearch.conf, which contains user preferences. Of the files listed in the
contents of the distribution, only the JAR file
is required.
The installation of RegexSearch consists of two stages: copying the JAR file — and, optionally, the default configuration file — to your system, and providing a means of invoking the JAR file. The more advanced topic of configuring RegexSearch after installation is dealt with in the section on configuration.
The first stage is simple: copy the file regexSearch.jar to
a suitable location on your system. The next stage involves providing
the means by which the RegexSearch application is run. The recommended
way of running RegexSearch is to invoke the java launcher
tool from a command line, which may be included in a batch file. Configuration properties (including the location of
a configuration file) may be specified in the command line.
Assuming that your PATH environment variable includes the
path to the java tool and that you have copied
regexSearch.jar to the directory
/home/slothrop/bin/regexsearch/, the command
java -jar /home/slothrop/bin/regexsearch/regexSearch.jar
will run the RegexSearch application.
The file regexSearch.png can be used as the icon for the
RegexSearch application.
The RegexSearch application does not require a console window, so you
can use the javaw launcher rather than the
java launcher unless you particularly want a console
window. Assuming that your PATH environment variable
includes the path to the javaw tool and that you have
copied regexSearch.jar to the directory
C:\Program Files\RegexSearch\, the command
javaw -jar "C:\Program Files\RegexSearch\regexSearch.jar"
will run the RegexSearch application.
The file regexSearch.ico can be used as the icon for the
RegexSearch application.
RegexSearch does not have an automated means of uninstallation. To
remove it from your system, delete the file regexSearch.jar
from the location to which you copied it when you installed RegexSearch.
If you want to remove RegexSearch completely, you should also delete the
configuration file, regexSearch.conf, which may be at its
default location, and any search-parameter
files that you created.
When it starts up, RegexSearch gets its configuration from two sources: properties in the command line that is used to run the Java launcher, and a configuration file whose location may be explicitly specified.
The recommended method of setting the properties in a configuration file is with the Preferences command. Command-line properties must necessarily be edited manually; the form of the property values is given in the appendix on configuration properties, and it can also be inferred from the sample configuration file.
When RegexSearch is run by means of the java launcher,
configuration properties may be specified in the command line using the
standard Java form
-Dname="value"; eg,
-Dapp.appearance.textViewViewableSize="96, 32".
(The quotation marks around the value aren't necessary if the value
doesn't contain spaces.) RegexSearch's command-line
configuration properties all have the prefix
"app.". A list of all the properties that are
recognised by RegexSearch is given in the appendix on configuration properties.
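As an illustration of the -Dname="value" form, the following Python sketch assembles a launcher command line from a dictionary of properties. The function name build_command is hypothetical and is not part of RegexSearch. Because the arguments are passed as a list rather than through a shell, values that contain spaces need no quotation marks.

```python
def build_command(jar_path, properties=None, launcher="java"):
    """Return an argument list that runs a JAR with -Dname=value properties."""
    args = [launcher]
    for name, value in (properties or {}).items():
        # The -D options must come before -jar; anything after the JAR
        # pathname is passed to the application, not to the launcher.
        args.append(f"-D{name}={value}")
    args += ["-jar", jar_path]
    return args


command = build_command(
    "/home/slothrop/bin/regexsearch/regexSearch.jar",
    {"app.appearance.textViewViewableSize": "96, 32"},
)
print(command)
# To launch the application: subprocess.run(command, check=True)
```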
app.configPath property
One particular property, app.configPath, is used to specify
the directory that contains a configuration file:
If the app.configPath property exists and the property
value is the empty string, RegexSearch will start without a
configuration file.
If the app.configPath property exists and the property is
not empty, RegexSearch will look for a configuration file in the
directory specified by the property.
If the app.configPath property does not exist,
RegexSearch will look for a configuration file at two other locations:
the current working directory and then the default directory.
If the configuration file were located in a directory named
config in the user's home directory, the sample command
lines given above would become:
| Linux/UNIX: | java -Dapp.configPath="~/config" -jar /home/slothrop/bin/regexsearch/regexSearch.jar |
| Windows: | javaw -Dapp.configPath="~/config" -jar "C:\Program Files\RegexSearch\regexSearch.jar" |
The configuration file, which must be named
regexSearch.conf, is an XML file that is ordinarily written
by RegexSearch but can be edited manually if you know what you're
doing. (It can also be edited manually if you don't know
what you're doing, but this is discouraged.) RegexSearch
doesn't require a configuration file: it uses a default value for
any configuration property that is missing from the source(s) of
configuration. Similarly, if it finds a property value to be invalid,
RegexSearch will display a message to this effect and use its default
value.
A configuration file takes precedence over configuration properties in the command line; that is, if the same property is specified as a command-line property and in a configuration file, the value from the configuration file is used.
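The order of precedence can be pictured as a three-layer merge: defaults first, then command-line properties, then the configuration file. The sketch below is only a model of that rule. The function name is hypothetical, and it assumes that a command-line property is the configuration-file property with the "app." prefix added, which the examples in this document suggest but do not state outright. Validation of property values is omitted.

```python
PREFIX = "app."


def effective_config(defaults, command_line, config_file):
    """Merge the three sources; later layers override earlier ones."""
    merged = dict(defaults)
    for name, value in command_line.items():
        # Assumption: "-Dapp.x.y=..." on the command line sets property "x.y".
        if name.startswith(PREFIX):
            merged[name[len(PREFIX):]] = value
    # The configuration file has the last word.
    merged.update(config_file)
    return merged


print(effective_config(
    {"general.replacementEscapeCharacter": "\\"},      # default
    {"app.general.replacementEscapeCharacter": "%"},   # command line
    {},                                                # no configuration file
))
```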
If the configuration has changed when you exit the application normally (ie, using the File > Exit command or an equivalent), RegexSearch will save its configuration to a configuration file. If a configuration file was read on startup, it will overwrite that file; otherwise, it will write a configuration file to the default directory described above.
A configuration file can be written explicitly with the Save Configuration command within the Preferences dialog.
When it starts up, RegexSearch is informed of the location of the
configuration file with the app.configPath property, which may
be set on the command line that runs the java launcher.
The existence of a system property with the key
app.configPath determines the locations that are searched
for a configuration file:
If app.configPath exists,
RegexSearch will look for a configuration file in the directory
specified by the property, unless the property value is the empty
string. If a configuration file is not found at the location
specified by the app.configPath property, an error
message is displayed. The value of the app.configPath
property may contain special constructs for
system properties, environment variables and the user's home
directory.
If app.configPath does not exist,
RegexSearch will look for a configuration file in the current working
directory before trying the default directory. The current working
directory (the directory specified by the environment variable
$PWD on Linux/UNIX systems) depends on the system and the
way in which RegexSearch was invoked. The default directory is
${user.home}/.puckfist/regexsearch, where
${user.home} is the user's home directory (the
directory specified by the environment variable $HOME on
Linux/UNIX systems or %USERPROFILE% on Windows systems).
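The lookup rules above can be summarised in a short Python sketch. It is a model of the documented behaviour, not RegexSearch's own code: the function name is hypothetical, the system properties are represented as a plain dictionary, and the special constructs that the property value may contain are not expanded.

```python
import posixpath


def config_search_dirs(system_properties, cwd, user_home):
    """Return the directories searched for regexSearch.conf, in order."""
    if "app.configPath" in system_properties:
        value = system_properties["app.configPath"]
        if value == "":
            return []        # empty value: start without a configuration file
        return [value]       # only the specified directory is tried
    # Property absent: current working directory, then the default directory.
    return [cwd, posixpath.join(user_home, ".puckfist", "regexsearch")]


print(config_search_dirs({}, "/work", "/home/slothrop"))
```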
The parameters of a search consist of a file set, a target and, for a find-and-replace search, a replacement.
The search parameters are stored collectively as an XML file. The DTD
of a search-parameter file (searchParams.dtd) is included
in the RegexSearch distribution. It is provided only for reference
because RegexSearch does not validate search-parameter files against the
DTD.
A search-parameter file may contain multiple file sets, targets and replacements, and each file set may contain multiple pathnames and pathname filters, though only a single pathname, inclusion filter, exclusion filter, target and replacement are used in a search. For each of those five parameters, RegexSearch's user interface allows you to select from the list of available values, to edit or delete existing values and to add new ones to the list.
When RegexSearch is run, search parameters are read from the file
specified by the configuration property path.defaultSearchParameters.
The current set of search parameters can be saved with the File > Save Search Parameters command, and files saved
in this way can be opened with the File > Open Search Parameters command. When you open
a new search-parameter file or exit RegexSearch, you will be prompted to
save the current search parameters if a search-parameter file was read,
either automatically at startup or explicitly, and the parameters have
changed since the file was read. A change to the file-set index or a
parameter index is regarded as a change to the file.
A file set specifies the files that are searched in a find or find-and-replace operation. A file set has a file-set type and, depending on its file-set type, it may also have a pathname and two kinds of pathname filter: an inclusion filter and an exclusion filter.
Depending on the file-set type, the pathname of a file set may be direct or indirect. A direct pathname specifies either a single file or a base directory that, in conjunction with inclusion and exclusion filters, defines the scope of a search. An indirect pathname specifies a file that contains a list of files and directories to be searched. A pathname may be absolute or relative; a relative pathname is relative to the current working directory.
The inclusion filters and exclusion filters of a file set are two kinds of pathname filter. A pathname filter is a set of patterns that determines, usually in conjunction with a base pathname, the files that are searched. A file is included in a search if it matches at least one pattern in the inclusion filter AND none of the patterns in the exclusion filter. The maximum number of patterns in a filter is 64.
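The inclusion and exclusion rule amounts to "any inclusion pattern AND no exclusion pattern". The Python sketch below shows that logic only. The function name is hypothetical, and fnmatchcase from the standard library is used merely as a stand-in matcher: its wildcard rules are not the same as those of RegexSearch.

```python
from fnmatch import fnmatchcase


def is_searched(pathname, inclusion, exclusion, matches=fnmatchcase):
    """True if pathname matches an inclusion pattern and no exclusion pattern."""
    included = any(matches(pathname, pattern) for pattern in inclusion)
    excluded = any(matches(pathname, pattern) for pattern in exclusion)
    return included and not excluded


for name in ("notes.txt", "draft1.txt", "build.log"):
    print(name, is_searched(name, ["*.txt"], ["draft*"]))
```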
A pattern is a pathname that may include wildcards. There are three wildcards: two filename wildcards and a pathname wildcard.
The filename wildcards, "?" and "*", have their
usual meaning: "?" matches a single character and
"*" matches zero or more characters in a pathname component (a
filename or directory name). For example, the pattern
"foo*.txt" will match the filenames foo.txt,
food.txt and football.txt. Following the UNIX
convention (but differing from the MS-DOS convention), a dot,
".", has no special significance in patterns: it is matched by
the "*" wildcard. Thus, the pattern "foo*" will
match the filenames foo, football.txt and
food.store.log.
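The behaviour of the filename wildcards can be reproduced with a small translation to a regular expression. This is a sketch under assumptions: the function names are hypothetical, matching is taken to be case-sensitive, and no characters other than "?" and "*" are treated as special.

```python
import re


def filename_pattern_to_regex(pattern):
    """Translate a filename pattern containing ? and * to a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")    # zero or more characters, no separator
        elif ch == "?":
            parts.append("[^/]")     # exactly one character
        else:
            parts.append(re.escape(ch))   # a dot is an ordinary character
    return re.compile("".join(parts))


def matches_filename(pattern, name):
    return filename_pattern_to_regex(pattern).fullmatch(name) is not None


for name in ("foo", "football.txt", "food.store.log"):
    print(name, matches_filename("foo*", name))
```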
The pathname wildcard, "**", matches zero or more pathname components. Its use in a pathname pattern is analogous to the use of "*" in a filename pattern. By itself, the pattern "**" is the recursive analogue of "*": it matches all files in or below the base directory. Used as a pathname component in a larger pattern, "**" specifies a recursive portion of the pathname that may be bounded above or below by a non-recursive pathname. For example, patterns that contain "**" can select:
all files whose names end in .txt in the base directory and in all directories
below the base directory;
all files whose names end in .xml in all directories named
config that are in or below the directory named
editor in the base directory;
all files whose names end in .java in or below any directory named
test that is below the base directory (or in the base
directory, if it is named test).
A pathname-filter pattern may be either relative to a base directory (in a directory or list file set), or it may be absolute. The pathname components in a pattern are separated with a "/" character (U+002F). (A "\" may be used as the directory separator on the Windows platform, but "/" is recommended because "\" is used as the escape character in filter fields.) A pattern that ends with a directory separator is assumed to be followed by an implicit "**". A pattern that, when appended to its base directory, specifies an existing directory is assumed to be followed by an implicit "/**". A pattern may contain dot and double-dot components ("." and ".."), but only if they appear before the first wildcard in the pattern.
A file is matched against a pathname-filter pattern by converting both the pattern (appended to a base directory, if the pattern is relative) and the pathname of the target file to a canonical form. An error may occur in converting the pattern to canonical form if, for example, the resulting pathname is illegal or access to part of the file system is not permitted.
A file-set type may be one of File, Directory, List or Results. The four types are described below.
The search is performed on the file specified by the pathname of the file set. No inclusion filter or exclusion filter is applied.
The pathname of the file set specifies the base directory of the search. (A search is not necessarily confined to this directory and the directories below it because the inclusion filter may contain patterns that specify pathnames outside the base directory.) An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern "**" (match all recursively) is assumed. Any relative patterns (see pathname filter) in the inclusion filter and exclusion filter are relative to the base directory.
The files in a directory are searched in order of filename. The ordering is lexicographic (ie, the Unicode values of characters in the filename are compared) and platform-dependent: it is case-sensitive on Linux/UNIX systems, but alphabetic case is ignored on Windows systems. Recursion is specified implicitly by pathname wildcards. A recursive search on a directory is depth-first: files in subdirectories are searched before the files in the directory. Like files, subdirectories are searched in order of name. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after the base directory.
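The search order described above can be modelled on an in-memory tree. In this sketch, which is an illustration and not RegexSearch's implementation, a directory is a hypothetical nested dictionary whose values are either further dictionaries (subdirectories) or None (files). Ignoring case is approximated with str.lower.

```python
def search_order(tree, prefix="", ignore_case=False):
    """Return file pathnames in depth-first order, subdirectories first."""
    key = (lambda name: name.lower()) if ignore_case else (lambda name: name)
    dirs = sorted((n for n, v in tree.items() if isinstance(v, dict)), key=key)
    files = sorted((n for n, v in tree.items() if not isinstance(v, dict)), key=key)
    order = []
    for name in dirs:
        # Files in subdirectories come before the files in this directory.
        order += search_order(tree[name], prefix + name + "/", ignore_case)
    order += [prefix + name for name in files]
    return order


tree = {"b.txt": None, "A.txt": None, "src": {"z.txt": None, "lib": {"k.txt": None}}}
print(search_order(tree))
```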
The pathname of the file set is assumed to specify a text file, each of whose non-empty lines denotes the pathname of a file or directory that is to be searched. A line of the list file may contain a comment, beginning with a ";" character. If a line contains a comment, any characters after the last non-space character before the comment are ignored (eg, the line "simple-filename.txt ; file #23" is parsed as "simple-filename.txt"). Empty lines are ignored. The pathnames are validated before the search starts, and the search will not proceed unless each pathname denotes an existing directory or a regular file.
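The parsing of a list file can be sketched as follows. The function name is hypothetical, and the sketch assumes that every ";" starts a comment and that a line consisting only of a comment counts as empty. Validation of the pathnames is not shown.

```python
def parse_list_file(text):
    """Return the pathnames in a list file, without comments and empty lines."""
    pathnames = []
    for line in text.splitlines():
        comment_start = line.find(";")
        if comment_start >= 0:
            # Drop the comment and the spaces between the pathname and it.
            line = line[:comment_start].rstrip()
        if line:
            pathnames.append(line)
    return pathnames


sample = "simple-filename.txt ; file #23\n\n; only a comment\n/var/log\n"
print(parse_list_file(sample))
```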
An inclusion filter and an exclusion filter may be specified. If no inclusion filter is specified, a filter consisting of the single pattern "**" (match all recursively) is assumed.
The search of the pathnames in the list is equivalent to a sequence of searches on file sets whose file-set type is either File or Directory according to whether a pathname specifies a file or directory. If the pathname specifies a directory, the inclusion filter and any exclusion filter are applied to it. If the inclusion filter contains any absolute patterns, the files or directories specified by those patterns are searched after all the pathnames in the list.
The files that will be searched are those from the last list of files to be saved with the Search > Save Results command. The current list of saved results can be viewed with the Search > View Saved Results command. No inclusion filter or exclusion filter is applied.
The target of the search — the pattern that you are attempting to match in the files that are searched — can be either literal text or a regular expression. The corresponding types of search are referred to below as literal-text search and regular-expression search.
The replacement is an expression that will be used to replace
occurrences of the target pattern in a find-and-replace search. The
interpretation of the replacement differs according to whether the
target is literal text or a regular expression. Both types of
replacement may contain metasymbols — special sequences
that are introduced with an escape character. By default, the escape
character is a backslash, "\", but it can be changed with the
general.replacementEscapeCharacter
configuration property if, for example, you want to avoid having to
escape the backslashes in Windows pathnames. In a replacement, an
escape character must always be escaped by prefixing another escape
character to it (eg, "\\", if the escape character is
"\").
The following metasymbols may appear in a literal-text replacement string. It is assumed that the escape character is "\".
| \t | Tab character, U+0009 |
| \n | Line-feed character, U+000A |
| \unnnn | Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f] |
| \\ | Literal escape character |
The following metasymbols may appear in a regular-expression replacement string. It is assumed that the escape character is "\".
| \t | Tab character, U+0009 |
| \n | Line-feed character, U+000A |
| \unnnn | Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f] |
| \\ | Literal escape character |
| \n | Capturing group in the target pattern, where n is the decimal index of the group (eg, \1) |
| \Ln | Capturing group in the target pattern, where n is the decimal index of the group. All alphabetic characters in the group are converted to lower case. |
| \Un | Capturing group in the target pattern, where n is the decimal index of the group. All alphabetic characters in the group are converted to upper case. |
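To show how the metasymbols in the table combine, here is a Python sketch that expands a regular-expression replacement string against a list of captured groups. It is a model, not RegexSearch's code. The function name is hypothetical, groups[0] is assumed to hold the whole match, and a group index is assumed to be the longest run of decimal digits after the escape character.

```python
import re


def expand_replacement(replacement, groups, esc="\\"):
    """Expand the metasymbols of a regular-expression replacement string."""
    out = []
    i = 0
    while i < len(replacement):
        ch = replacement[i]
        if ch != esc:
            out.append(ch)
            i += 1
            continue
        rest = replacement[i + 1:]
        if rest.startswith(esc):
            out.append(esc)                  # literal escape character
            i += 2
        elif rest.startswith("t"):
            out.append("\t")                 # tab, U+0009
            i += 2
        elif rest.startswith("n"):
            out.append("\n")                 # line feed, U+000A
            i += 2
        elif m := re.match(r"u([0-9A-Fa-f]{4})", rest):
            out.append(chr(int(m.group(1), 16)))
            i += 1 + m.end()
        elif m := re.match(r"([LU]?)([0-9]+)", rest):
            text = groups[int(m.group(2))]
            if m.group(1) == "L":
                text = text.lower()
            elif m.group(1) == "U":
                text = text.upper()
            out.append(text)
            i += 1 + m.end()
        else:
            raise ValueError(f"invalid metasymbol at index {i}")
    return "".join(out)


match = re.search(r"(\w+)@(\w+)", "user@Example")
print(expand_replacement(r"\2: \U1", [match.group(0), *match.groups()]))
```

A different escape character can be tried by passing, for example, esc="%", which corresponds to changing the general.replacementEscapeCharacter property.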
The main display consists of a single window, divided into three areas that are referred to in this document as the text view, control panel and result area.
The width and/or height of some text components are specified in logical units of columns and rows. The width of a column and the height of a row are determined by the font that is used to display text within the component: the height of a row is the height of the font, and the width of a column is the width of a zero character (U+0030), or, if the font doesn't have a glyph for the zero character, the width of the glyph that is used for characters that are not defined.
The text view is the text area at the top of the display in which the contents of a file are displayed. The text view is not editable. The following attributes of the text view are configurable:
The width of the text view is expanded, if necessary, to fit the width of the main window (which is also determined by the size of other components), so the viewable text-view size property effectively sets the minimum width rather than the displayed width of the text view.
The colours of the text view are also applied to the result area and to the fields in the Search Options dialog.
When a file containing tab characters (U+0009) is displayed in the text view, RegexSearch uses two configuration properties — tab-width filters and a default tab width — to determine how the tab characters are converted to spaces. A tab-width filter maps a filename filter (a set of patterns that are used to match filenames) to the number of spaces that will be used to replace tab characters when displaying a matching file. If none of the defined tab-width filters matches the file, the default tab width is used. If the tab width is zero, tab characters are not expanded but rendered as a U+2192 (rightwards arrow) character in left-to-right locales or a U+2190 (leftwards arrow) character in right-to-left locales, or as the "not defined" glyph if the font doesn't contain a glyph for the appropriate arrow character.
The filename-filter part of the tab-width filter consists of one or more filename patterns separated by spaces; for instance, "*.cpp *.h". If a filename matches one of the patterns, the tab width of that filter is applied to the file. A pattern may be a literal filename or it may contain the wildcards "*" and "?", which have their usual meaning: "*" matches zero or more characters and "?" matches a single character.
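The selection of a tab width can be sketched in a few lines of Python. This is an illustration under assumptions: the function names are hypothetical, the first matching filter is assumed to win, and a non-zero tab width is assumed to expand tabs to the next tab stop (str.expandtabs), which this document does not state for the text view.

```python
import re


def _matches(pattern, filename):
    """Match a filename against a pattern containing * and ?."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, filename, re.DOTALL) is not None


def tab_width_for(filename, tab_width_filters, default_width):
    """Return the width of the first filter that matches, else the default."""
    for patterns, width in tab_width_filters:
        if any(_matches(p, filename) for p in patterns.split()):
            return width
    return default_width


def render_tabs(line, width, right_to_left=False):
    """Expand tabs, or show them as an arrow if the tab width is zero."""
    if width == 0:
        return line.replace("\t", "\u2190" if right_to_left else "\u2192")
    return line.expandtabs(width)


filters = [("*.cpp *.h", 4), ("Makefile", 8)]
print(tab_width_for("main.cpp", filters, 2))
print(render_tabs("a\tb", 0))
```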
The file-set controls, which can be found in the top row of the control panel, consist of a combo box for selecting the file-set type, a group of three buttons for inserting, duplicating and deleting file sets, and a group of four buttons for navigating the list of file sets and changing the position of the current file set in the list.
The index of the current file set and the number of file sets in the list are shown in a box between the two pairs of navigation buttons. "End" indicates that the file-set position is at the end of the list; there is no current file set. A file set may be inserted at the end of the list.
The combo box is used to select the file-set type. The pathname field and include and exclude fields are enabled or disabled according to the file-set type.
A file set can be added to and removed from the list of file sets with the commands that are associated with the group of three buttons in the top row of the control panel. Each command can also be issued from the keyboard.
The Insert command inserts a new file set into the list at the current file-set index. To add a new file set to the end of the list, first navigate to the end of the list. The Insert command can be issued by pressing the F2 key.
The Duplicate command makes a copy of the current file-set, inserts the copy into the list after the current index, then selects the copy. The Duplicate command can be issued by pressing the F3 key.
The Delete command deletes the current file-set after you have confirmed the deletion. The Delete command can be issued by pressing the F4 key.
The list of file sets can be navigated and the position of the current file set in the list can be changed with the commands that are associated with the group of four arrow buttons and barred-arrow buttons in the top row of the control panel. Each command can also be issued from the keyboard.
The arrow buttons select the previous or next file set in the list. The current file-set index continues to change while the mouse button is pressed or until the start or end of the list is reached. Holding down the Ctrl key while clicking on or pressing the arrow buttons will move the current file set up or down the list. The Go-to-previous and Go-to-next commands can be issued by pressing the F6 and F7 keys respectively. The Move-up and Move-down commands can be issued by pressing Ctrl+F6 and Ctrl+F7 respectively.
The barred-arrow buttons select the first file set in the list or go to the end of the list (where no file set is selected). The Go-to-start and Go-to-end commands can be issued by pressing the F5 and F8 keys respectively.
The five most prominent components in the control panel are referred to
as parameter fields, although two of them are text areas rather
than fields. Along with the text view and result area, these fields
determine the size of the application's main window, and their width
(number of columns) can be set with the appearance.paramFieldNumColumns
configuration property.
Each parameter field maintains a list of the most recent values that were entered in the field, up to a maximum of 64 values. A parameter field is similar in operation to an editable combo box except that an item is not moved to the top of the list when it is selected. (The order of items in the list may be changed in the editor; see below.)
A value is entered into the field explicitly by pressing Ctrl+Enter or implicitly when
When a value is entered in the field, it is inserted at the top of the list.
The list can be navigated and edited in several ways. Navigation and editing commands are available from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) or by pressing the context-menu key when the field has keyboard focus. The Select Previous Item and Select Next Item commands that are available from the pop-up menu can also be issued by pressing Ctrl+PageUp and Ctrl+PageDown respectively. The Delete command can be issued by pressing Ctrl+Shift+Delete.
All the parameter fields have an Edit command that displays an editor in which the items in the field's list can be edited. The command is available from the field's pop-up menu and can also be issued by pressing Alt+Enter. (For the filter fields, the command can be issued with the Edit button adjacent to the field.) Within the editor, the position of an item in the list can be changed by dragging with the mouse or by pressing Ctrl+Up or Ctrl+Down. The Delete key and Delete button delete the selected item after confirmation, while Shift+Delete and Shift+left-click on the Delete button delete the selected item without confirmation.
A pathname can be entered in the field by typing, by selecting a file using the Browse button adjacent to the field, or by dragging a file or directory from, for example, a file browser and dropping it onto the field or onto other parts of the main window.
The fields contain a pathname filter: a set of patterns separated by spaces. Within the field, the backslash, "\", acts as an escape character to allow the inclusion of space characters in patterns. The escape convention in the filter fields is that a character following a "\" is treated as a literal character, and a single trailing "\" is ignored. Thus, you would use "\ " for a literal space and "\\" for a literal backslash. Because of this, it is recommended that you use a "/" to separate pathname components in patterns on the Windows platform.
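The escape convention of the filter fields can be illustrated with a small splitter. The function name is hypothetical, and the sketch assumes that runs of unescaped spaces act as a single separator.

```python
def split_filter_field(text):
    """Split the content of a filter field into its patterns."""
    patterns = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 < len(text):
                current.append(text[i + 1])   # next character is literal
                i += 2
            else:
                i += 1                        # single trailing "\" is ignored
        elif ch == " ":
            if current:
                patterns.append("".join(current))
                current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    if current:
        patterns.append("".join(current))
    return patterns


print(split_filter_field(r"my\ docs/*.txt *.c"))
```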
The individual patterns of a pathname filter can be edited from the Edit Pattern dialog — the third-level editor that is invoked with the Edit command in the Edit Filter dialog that is invoked by the Edit command in the list editor that is invoked by the Edit command in the include or exclude field. (Got that?) Note that no escape character is used in the pattern field of the Edit Pattern dialog.
The target and replacement fields are actually text areas that can contain multiple lines of text. The replacement field is enabled only if the Replace checkbox is selected.
Text in the target and replacement fields can include tab characters (U+0009) and line-feed characters (U+000A), which are entered in the field by pressing Ctrl+Tab and Enter respectively. Line feeds are not displayed in a special way in the field, so, if your target or replacement isn't behaving as you expected, it may be that you have an unwanted (and invisible) line feed at the end of the field.
The fields may use a tab surrogate to display tab characters. Some characters in the fields may be escaped in two different ways: tabs and line feeds can be escaped separately, and an Escape command can be applied to the field. The Escape command behaves differently in the target field and the replacement field.
Within the target and replacement fields, tabs are replaced with the
character that is specified by the appearance.tabSurrogate
configuration property. The default tab surrogate is the tab character
(U+0009) itself; in this case, tabs are displayed as a number of spaces
up to the next tab stop, and the tab width is specified by the
tabWidth.targetAndReplacement configuration property. It
is important to understand that the tab surrogate is not just a
substitute glyph: it actually replaces each occurrence of the tab
character in the field unless tabs are escaped. When the content of the field is
used (eg, in a search), the tab surrogate is converted either to a tab
character or to a tab sequence ("\t") as appropriate, so you
should choose as tab surrogate a character that is unlikely to appear in
any target or replacement text.
Tabs and line feeds may be escaped (ie, converted to the escape sequences "\t" and "\n" respectively) in the target and replacement fields by selecting the Tabs Escaped or Line Feeds Escaped item in the field's pop-up menu. (In reality, it is the tab surrogate that is converted to "\t", but the existence of tab surrogates is ignored in this section so as not to complicate matters.) Deselecting the menu item reverses the procedure: each occurrence of "\t" or "\n" is converted to a tab character or line-feed character, even if the "\" is itself escaped with another backslash. You will need to be careful about toggling the escaping of tabs and line feeds if the text contains literal "\t" or "\n" sequences. Within the field, the escaping of tabs and line feeds can be toggled from the keyboard with Ctrl+T and Ctrl+N respectively. Indicators appear alongside a field in which tabs or line feeds are escaped.
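The reason for caution can be seen in a simplified model of the two menu items. The function names are hypothetical and tab surrogates are ignored, as they are in the paragraph above. Unescaping is a plain textual substitution, so a literal "\t" sequence is converted to a tab even though it was never produced by escaping.

```python
def escape_field(text, tabs=True, line_feeds=True):
    """Convert tab and line-feed characters to the sequences \\t and \\n."""
    if tabs:
        text = text.replace("\t", "\\t")
    if line_feeds:
        text = text.replace("\n", "\\n")
    return text


def unescape_field(text, tabs=True, line_feeds=True):
    """Convert every \\t and \\n sequence back, even after another backslash."""
    if tabs:
        text = text.replace("\\t", "\t")
    if line_feeds:
        text = text.replace("\\n", "\n")
    return text


original = "name\tvalue"
print(unescape_field(escape_field(original)) == original)   # round trip is safe

literal = "C:\\temp"            # contains a literal backslash followed by "t"
print(repr(unescape_field(literal)))   # the "\t" has become a tab character
```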
When a regular-expression search is performed, any tab characters and line-feed characters in the target field are escaped automatically in the target pattern that is used in the search. Tabs and line feeds are also escaped in the list of target or replacement items displayed in the editor, in the Select Item submenu displayed in the field's pop-up menu, and when targets and replacements are saved to a search-parameter file.
The Escape button adjacent to the target field is enabled only when the
Regular expression checkbox is selected. The Escape command for the
target field prefixes a "\" to each metacharacter in the
field. The set of metacharacters on which the command operates is
specified by the general.escapedMetacharacters
configuration property. The default value of the property is the set of
characters that are used in metasymbols outside a character class
delimited by square brackets:
$ ( ) * + . ? [ \ ] ^ { | }
("]" and "}" are not metacharacters but are included in the set for symmetry.)
If tabs or line feeds are escaped in the target field and the backslash character is in the set of escaped metacharacters, the "\" prefix of the escaped tabs and line feeds will itself be escaped by the Escape command. Unless the text contains literal "\t" or "\n" sequences, it may be best to unescape tabs and line feeds before issuing the Escape command.
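The Escape command for the target field can be modelled as a single pass over the text. This sketch uses the default set of metacharacters listed above; the function name is hypothetical, and Python's re module stands in for Java's regular expressions, which treat these escaped characters in the same way.

```python
import re

DEFAULT_METACHARACTERS = "$()*+.?[\\]^{|}"


def escape_target(text, metacharacters=DEFAULT_METACHARACTERS):
    """Prefix a backslash to each metacharacter in the text."""
    return "".join("\\" + ch if ch in metacharacters else ch for ch in text)


escaped = escape_target("1+1=2?")
print(escaped)
print(re.fullmatch(escaped, "1+1=2?") is not None)   # matches the literal text

# An already-escaped tab ("\t" as two characters) has its backslash escaped
# again, so the pattern now matches a backslash followed by "t", not a tab.
print(escape_target("a\\tb"))
```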
The Escape command for the replacement field, which can be issued with
the button adjacent to the field, prefixes an escape character to each
escape character in the field. (The escape character for a replacement
is specified by the general.replacementEscapeCharacter
configuration property.) If the escape character is "\", it
may be best to unescape tabs and line feeds before issuing the Escape
command unless the text contains literal "\t" or
"\n" sequences.
The result area is the text area at the bottom of the display in which the contents of a file are displayed. The result area is not editable. The following attributes of the result area are configurable:
The viewable width of the text view in columns is also applied to the result area. The physical widths of the two text areas are also dependent on their fonts, although both areas will be displayed with the same width because they are expanded, if necessary, to fit the width of the main window.
The maximum number of columns in the result area is fixed at 1024.
The colours of the result area are also applied to the text view.
The main window is not directly resizeable but its size can be modified indirectly by means of some of the configuration properties. As was mentioned above, the size of some text components — including the text view and result area — is determined by the font that they use to display text, as well as any properties that explicitly control their dimensions in terms of columns and rows. The two text areas will be expanded to fit the width of the main window, so changes to the properties that determine their size may not always be apparent. Any changes to configuration properties that affect the size of the main window will not take effect until the next time that RegexSearch is run.
RegexSearch's main commands are accessible from its main menu. Some of the commands are also accessible from a pop-up menu that is activated in a system-dependent manner (eg, by pressing or releasing the right mouse button) while the mouse cursor is over one of the text areas or the background of the control panel.
The Open Search Parameters command brings up a file-selection dialog in which you can choose the file that you want to open. If the file is of the correct format, the search parameters are loaded from it and RegexSearch's display is updated. If the current search parameters were read from a file, either automatically at startup or explicitly, and the parameters have changed since the file was read, you will be asked whether you want to save the current parameters before the new parameters are loaded.
The Save Search Parameters command brings up a file-selection dialog in which you can choose the file to which you want to save the current set of search parameters. A file that is saved in this way can be specified as the default search parameters that will be loaded when RegexSearch starts up.
This command terminates the application. If you have made changes to search parameters that were read from a file, you will be asked whether you want to save them.
The Edit File command executes a specified system command in a separate
process. The command line, which is specified with the editor.command configuration
property, may include a placeholder for the pathname of the file that is
currently displayed in the text view. The intended purpose of the
command line is to open the currently displayed file in a text editor,
though it can be used for another purpose.
When using the Edit File command during a find-and-replace search, remember that the file in the text editor will not be synchronised with the file in RegexSearch's buffer, which may subsequently be written back to storage with modifications if replacements have been made in the file, even if the replacements were made before the Edit File command was issued. (If the Edit File command is issued while the Search Options dialog is displayed, the Next File option in the dialog can be used to discard any changes to the current file.)
This command is available only during a find-and-replace search. It
behaves similarly to the Edit File command
except that the associated system command (specified with the editor.command configuration
property) is not executed until the search of the current file is
finished and, if any replacements have been made, the modified file has
been written to storage.
When you issue a Search command, RegexSearch first validates the search parameters and displays an error message for the first parameter that is invalid. If the file-set type is List, the specified list file is read and parsed. In a search of multiple files, the files are searched in the order described in the Directory and List file-set types.
Within a file, the search proceeds from the start of the file to the end. If a match of the target expression is found, the search will resume at the first character after the last character in the matched text, or, if a replacement is made, at the first character after the replacement.
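The way in which a search resumes after a match corresponds to the default behaviour of the find() method of java.util.regex.Matcher. The following is a minimal sketch of that behaviour, not RegexSearch's own code: the class and method names are illustrative, and the case in which a replacement is made is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are not part of RegexSearch.
final class MatchScanSketch {

    // Returns the start offsets of successive matches, scanning from the start of the text.
    static List<Integer> matchOffsets(String target, CharSequence text) {
        List<Integer> offsets = new ArrayList<Integer>();
        Matcher matcher = Pattern.compile(target).matcher(text);
        // Each call to find() resumes at the first character after the previous match.
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Matches do not overlap: "aa" is found at offsets 0 and 2 of "aaaaa".
        System.out.println(matchOffsets("aa", "aaaaa"));
    }
}
```

Because the search resumes after the matched text, the target aa is found twice in aaaaa, not four times.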
When the first match of the target expression is found, the file in which the match occurred is displayed in the text view, and the matched text is highlighted. A Search Options dialog box is displayed; the type of dialog depends on the search mode, find or find-and-replace. As the Search Options dialog is non-modal, the text in the text view can be scrolled while the dialog is displayed.
The options in the Search Options dialog can be selected either by clicking on the appropriate button or by pressing a key or key combination. In addition to the usual Java Alt+<key> combination, each option (apart from Cancel, whose keyboard equivalent is the Escape key) can be selected by pressing the key by itself (ie, without the Alt key).
At the end of a search, the aggregate results are displayed in the result area. The results include a list of any files or directories that were not processed because of an error and a list of files or directories whose pathname could not be converted to canonical form. If the file-writing mode is Use a temporary file, preserve attributes, the results of a find-and-replace search include a separate list of files that were written but whose attributes were not set.
In find mode, the Search Options dialog has four options:
In find-and-replace mode, the Search Options dialog has seven options:
Some aspects of RegexSearch's behaviour when processing files are worth noting in order that you may avoid the unwanted consequences of that behaviour. RegexSearch assumes that the files it reads during a search are text files that have a specified character set and encoding. It also assumes that certain characters or character sequences in the files are line separators. The implications of these two assumptions are discussed below.
When a file is read during a search, the bytes of the file are converted
to 16-bit Unicode according to the configuration property
general.charset. A charset is a combination of a
character set and a character encoding, such as UTF-8, that maps between
sequences of bytes and 16-bit Unicode values.
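The effect of the charset on the conversion can be illustrated with the standard Java API. This is a sketch only: the class and method names are illustrative, and the charset names stand in for whatever value general.charset has.

```java
import java.nio.charset.Charset;

// Illustrative only: the class and method names are not part of RegexSearch.
final class CharsetDecodeSketch {

    // Converts the bytes of a file to Java's 16-bit chars according to the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The two bytes below are the UTF-8 encoding of U+00E9.
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println(decode(bytes, "UTF-8").length());       // one character
        System.out.println(decode(bytes, "ISO-8859-1").length());  // two characters
    }
}
```

The same bytes yield different text under different charsets, so a target pattern that contains non-ASCII characters may fail to match if the charset does not correspond to the encoding of the files.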
Within the file, all occurrences of the characters LF (U+000A) and CR (U+000D), and the character sequence (CR, LF) are treated as line separators. The type of line separator is recorded for possible later use. If the file contains more than one type of line separator, the most numerous type of line separator prevails. If the numbers of different types are equal, the precedence from highest to lowest is: LF – CR – CR+LF.
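The rule for choosing the prevailing line separator can be sketched as follows. The class and method names are illustrative, and the treatment of a file that contains no line separators (LF is returned) is an assumption of the sketch, not a documented behaviour of RegexSearch.

```java
// Illustrative only: the class and method names are not part of RegexSearch.
final class LineSeparatorSketch {

    // Returns the most numerous line separator in the text; ties are resolved
    // in the order LF, CR, CR+LF. Assumption: LF is returned if there are none.
    static String dominantSeparator(CharSequence text) {
        int lf = 0;
        int cr = 0;
        int crLf = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '\n') {
                lf++;
            } else if (ch == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    crLf++;
                    i++;   // skip the LF of a CR+LF pair
                } else {
                    cr++;
                }
            }
        }
        // A later candidate replaces an earlier one only if it is strictly more numerous.
        String best = "\n";
        int bestCount = lf;
        if (cr > bestCount) {
            best = "\r";
            bestCount = cr;
        }
        if (crLf > bestCount) {
            best = "\r\n";
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(dominantSeparator("a\r\nb\r\nc\n").equals("\r\n"));  // CR+LF is most numerous
        System.out.println(dominantSeparator("a\nb\r").equals("\n"));           // tie between LF and CR
    }
}
```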
In find mode, the processing of a file ends at this point: the
processing is internal, and no physical changes are made to the stored
file. In find-and-replace mode, a file may be modified as a result of a
replacement, and the file written back to storage. If the
general.preserveLineSeparator configuration property has
the value yes, the file is written with the type of line
separator that was detected when it was read; otherwise, it is written
with an LF line separator.
The way in which a modified file is written to storage is determined by
the general.fileWritingMode configuration property. A file
may be written directly, or it may be written first to a temporary file
that is renamed after the entire file has been written. If a temporary
file is used, the owner, group and permissions of the file may be set to
those of the original file on systems that support it. (Linux is the
only system that is known to do so.) See the description of the general.fileWritingMode
property for more details on its use.
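The technique of writing to a temporary file that is then renamed can be sketched with the java.nio.file API. This is not RegexSearch's implementation: the sketch assumes Java 7 or later (RegexSearch itself requires only Java 1.6), the class and method names and the temporary-file prefix are illustrative, and the owner, group and permissions of the original file are not copied.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative only: the class and method names are not part of RegexSearch.
final class TempFileWriteSketch {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Writes the text to a temporary file in the same directory as the target
    // file, then renames the temporary file over the target file.
    static void writeViaTempFile(Path file, String text, Charset charset) throws IOException {
        Path dir = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "sketch", ".tmp");
        try {
            Files.write(temp, text.getBytes(charset));
            Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Has no effect if the rename succeeded.
            Files.deleteIfExists(temp);
        }
    }

    // Demonstration: replaces the contents of a scratch file and reads them back.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("sketch");
            Path file = dir.resolve("sample.txt");
            Files.write(file, "old".getBytes(UTF8));
            writeViaTempFile(file, "new", UTF8);
            String result = new String(Files.readAllBytes(file), UTF8);
            Files.delete(file);
            Files.delete(dir);
            return result;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this approach, a failure while the file is being written leaves the original file untouched, because the original is replaced only after the whole of the temporary file has been written.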
The Copy Results command copies the contents of the result area to the
system clipboard. The general.copyResultsAsListFile
configuration property controls the format of the text that is placed on
the clipboard: the results can be either in the form in which they
appear in the result area or in a form that is suitable for use as a list file in a new search, with
match/replacement counts converted to comments.
The Save Results command saves the list of files from the results of the last search (ie, the files in which an occurrence of the target was found). A list of files that is saved with this command can be used as the file set for a further search if you select Results as the file-set type.
The View Saved Results command displays the last list of files to be saved with the Save Results command, which allows you to see the files that will comprise the file set if Results is selected as the file-set type.
The Preferences command brings up a tabbed dialog box in which the configuration properties of RegexSearch can be edited. The properties on the various tabbed pages are described below.
Some of the configuration properties in the Preferences dialog are edited with a spinner — a graphical component that consists of a text field adjacent to a pair of small buttons. The value in the text field may be edited manually, or it may be incremented and decremented by one of the following methods:
Using the last two methods, the amount by which the value is incremented or decremented can be modified by holding down the Ctrl, Shift or Ctrl+Shift keys, which correspond to increments of 10, 100 and 1000 respectively.
| Tabbed page |
|---|
| General |
| Appearance |
| Tab width |
| Editor |
| File locations |
| Fonts |
Some of the configuration properties will take effect when the Preferences dialog is accepted (by closing it with OK); other properties (eg, the look-and-feel and fonts) will not take effect until the next time that RegexSearch is run.
The configuration file is normally saved automatically when RegexSearch exits, if the configuration has changed. The Save Configuration command in the Preferences dialog can be used to save a configuration file explicitly.
Within RegexSearch, the parsing and matching of regular expressions is performed by the Java regex engine. The purpose of this section is to present a summary of the syntax of Java's regular expressions, which is similar to that of Perl and Python. This section is not intended to be a tutorial on the use of regular expressions; see the references at the end of this section for suggested sources of further information.
Note: There are several differences between the syntax of regular expressions in Java and the syntax of regular expressions in Linux/UNIX tools such as sed and (g)awk.
In a search, the target pattern, replacement pattern and file are all
composed of Unicode characters. RegexSearch converts files from bytes
to 16-bit Unicode characters according to the scheme described in How files are processed. In particular, the
line separators CR and CR+LF are converted to LF before a file is
searched. Thereafter, by default, the only line separator recognised
during a search is the line feed character (U+000A) unless the
(?-d) flag appears in the target pattern.
When selected, the Ignore case checkbox in the main window enables the
default form of case-insensitive matching, which applies only to
characters in the US-ASCII charset. To apply case-insensitive matching
to all Unicode characters, use the (?u) flag in the target
pattern.
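The difference between the two forms of case-insensitive matching can be seen with the Java regex engine directly. The sketch below assumes that the Ignore case checkbox has the same effect as the (?i) flag; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are not part of RegexSearch.
final class IgnoreCaseSketch {

    // Returns true if the whole of the input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \u00E9 and \u00C9 are the lowercase and uppercase forms of e with an acute accent.
        System.out.println(matches("(?i)e", "E"));             // true: US-ASCII letters
        System.out.println(matches("(?i)\u00E9", "\u00C9"));   // false: not in US-ASCII
        System.out.println(matches("(?iu)\u00E9", "\u00C9"));  // true: Unicode case enabled
    }
}
```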
Within a regular expression, all characters are treated as literal characters except for twelve metacharacters — characters that have a special meaning and don't behave normally in regular expressions. The metacharacters are:
$ ( ) * + . ? [ \ ^ { |
A metacharacter can be escaped — that is, its special meaning can be removed — by prefixing a backslash, "\", to it. An escaped metacharacter represents its corresponding literal character; thus, "\?" represents the character "?", and "\\" represents a literal backslash.
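The effect of escaping a metacharacter can be tried with the Java regex engine. In the sketch below, whose class and method names are illustrative, note that a backslash must itself be doubled inside a Java string literal; this doubling belongs to Java source code and does not apply to a pattern that is typed into one of RegexSearch's fields.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are not part of RegexSearch.
final class EscapeSketch {

    // Returns true if the whole of the input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("a?", ""));        // true: ? is a quantifier
        System.out.println(matches("a\\?", "a?"));    // true: the regex a\? has a literal ?
        System.out.println(matches("\\\\", "\\"));    // true: the regex \\ is a literal backslash
        // Pattern.quote() escapes a whole string by enclosing it in \Q and \E.
        System.out.println(matches(Pattern.quote("1+1=2"), "1+1=2"));  // true
        System.out.println(matches("1+1=2", "1+1=2"));                 // false: + is a quantifier
    }
}
```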
Some metacharacters are used by themselves within regular expressions;
others are used to create special sequences called metasymbols.
(In the documentation for java.util.regex.Pattern,
metasymbols are referred to as constructs.) For example,
several alphanumeric characters become metasymbols when preceded by a
backslash.
| . | By default, a dot matches any single character except a newline. The (?s) flag enables a mode in which a dot matches any character including a newline. |
| ^ | Matches the beginning of a line. Example: ^# matches a "#" character at the beginning of a line. |
| $ | Matches the end of a line or the end of the input string (in RegexSearch, the end of a file). Example: ;$ matches a ";" character at the end of a line or at the end of a file. |
| \ | The backslash has two roles: it escapes a metacharacter so that the metacharacter represents its literal character, and it introduces a metasymbol when it precedes certain alphanumeric characters. |
| | | The vertical bar separates alternatives. Example: his|her|its matches any one of the strings "his", "her" or "its". |
| [ ] | Matches one character from a character class: a set of characters enclosed within the square brackets. The set of characters can be specified in a number of ways, such as a negated set, a range or an intersection of sets. Example: [^0-9] matches any character except a (Western) decimal digit; [a-z&&[^ij]] is equivalent to [a-hk-z]. |
| ( ) | Encloses a capturing group. The set of characters within the parentheses is treated as a unit; eg, ^(foo|bar) matches either "foo" or "bar" at the beginning of a line. The group is called capturing because the text that it matched can be included later in the target pattern or in the replacement by specifying the index of the group in a metasymbol (see \n in Alphanumeric metasymbols). A cluster (a non-capturing group) can be specified by enclosing a set of characters between "(?:" and ")"; eg, (?:foo|bar) matches either "foo" or "bar" without capturing it. |
Quantifiers specify how many times the preceding character or group should match. The different types of quantifier are available in three flavours, which Java refers to as greedy, reluctant and possessive. (Greedy quantifiers are also known as maximal, and reluctant quantifiers are also known as lazy or minimal.)
A greedy (maximal) quantifier starts by matching as much as possible of the input string. If this doesn't allow the whole pattern to be matched, the greedy quantifier matches progressively less of the input string until either the whole pattern is matched or the match fails.
A reluctant (minimal) quantifier starts by matching as little as possible of the input string. If this doesn't allow the whole pattern to be matched, the reluctant quantifier matches progressively more of the input string until either the whole pattern is matched or the match fails.
A possessive quantifier starts, like a greedy quantifier, by matching as much as possible of the input string. However, if this doesn't allow the whole pattern to be matched, no backing-up is performed, and the match fails.
| Greedy | Reluctant | Possessive | Meaning |
|---|---|---|---|
| * | *? | *+ | Matches zero or more times |
| + | +? | ++ | Matches one or more times |
| ? | ?? | ?+ | Matches once or not at all |
| {n} | {n}? | {n}+ | Matches exactly n times |
| {n,} | {n,}? | {n,}+ | Matches at least n times |
| {n,m} | {n,m}? | {n,m}+ | Matches at least n times but not more than m times |
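The three flavours of quantifier can be compared by applying the same quantified dot to the same input. The following sketch uses the Java regex engine directly; the class and method names and the sample input are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are not part of RegexSearch.
final class QuantifierSketch {

    // Returns the text of the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes the whole input, then backs off until "foo" can match.
        System.out.println(firstMatch(".*foo", input));    // xfooxxxxxxfoo
        // Reluctant: .*? takes nothing at first, then grows until "foo" can match.
        System.out.println(firstMatch(".*?foo", input));   // xfoo
        // Possessive: .*+ takes the whole input and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));   // null
    }
}
```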
| \0n | The character with octal value 0n, where n is in [0-7] |
| \0nn | The character with octal value 0nn, where n is in [0-7] |
| \0mnn | The character with octal value 0mnn, where m is in [0-3] and n is in [0-7] |
| \n | The sequence matched by the nth capturing group, where n is a decimal number (a back reference; eg, \1) |
| \a | The alert character (BEL), U+0007 |
| \A | The beginning of the input string (in RegexSearch, the beginning of a file) |
| \b | A word boundary |
| \B | Not a word boundary |
| \cX | The control character, Control-X |
| \d | A digit, [0-9] |
| \D | A non-digit, [^0-9] |
| \e | The escape character (ESC), U+001B |
| \E | End the quotation of metacharacters started by \Q |
| \f | The form feed character (FF), U+000C |
| \n | The line feed character (LF), U+000A |
| \p{prop} | Any character in the character class named prop |
| \P{prop} | Any character not in the character class named prop |
| \Q | Quote (escape) metacharacters until \E |
| \r | The carriage return character (CR), U+000D |
| \s | A whitespace character, [ \t\n\x0B\f\r] |
| \S | A non-whitespace character, [^\s] |
| \t | The tab character (HT), U+0009 |
| \unnnn | The Unicode character U+nnnn, where n is a hexadecimal digit character, [0-9A-Fa-f] |
| \w | A word character, [0-9A-Za-z_] |
| \W | A non-word character, [^\w] |
| \xnn | The character with hexadecimal value 0xnn |
| \z | The end of the input string (in RegexSearch, the end of a file) |
| \Z | The end of the input string (in RegexSearch, the end of a file), apart from a final '\n' |
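As an example of combining several of these metasymbols, the pattern \b(\w+)\s+\1\b uses word boundaries, a capturing group and a back reference to find a word that is immediately repeated. The sketch below applies it with the Java regex engine; the class and method names are illustrative, and each backslash is doubled only because the pattern appears in a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are not part of RegexSearch.
final class BackReferenceSketch {

    // A word, some whitespace, then the same word again (\1 refers to group 1).
    static final String REPEATED_WORD = "\\b(\\w+)\\s+\\1\\b";

    // Returns the text of the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(REPEATED_WORD, "this is is a test"));  // is is
        System.out.println(firstMatch(REPEATED_WORD, "this is a test"));     // null
    }
}
```

The leading \b prevents the "is" at the end of "this" from being treated as the first word of the pair.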
Named character classes are metasymbols of the form
\p{name} or \P{name}. There
are three types of named character class: POSIX, Unicode and Java.
| Lower | A lowercase alphabetic character, [a-z] |
| Upper | An uppercase alphabetic character, [A-Z] |
| ASCII | An ASCII character, [\x00-\x7F] |
| Alpha | An alphabetic character, [\p{Lower}\p{Upper}] |
| Digit | A decimal digit character, [0-9] |
| Alnum | An alphanumeric character, [\p{Alpha}\p{Digit}] |
| Punct | Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
| Graph | A visible character, [\p{Alnum}\p{Punct}] |
| Print | A printable character, [\p{Graph}\x20] |
| Blank | A space or a tab character, [ \t] |
| Cntrl | A control character, [\x00-\x1F\x7F] |
| XDigit | A hexadecimal digit character, [0-9a-fA-F] |
| Space | A whitespace character, [ \t\n\x0B\f\r] |
The Unicode character classes are too numerous to list all of them here.
They include Unicode character blocks (eg, Greek) and character
categories (eg, uppercase letters). When forming a metasymbol,
In is prefixed to the name of a Unicode block (eg,
\p{InGreek} ), and Is is optionally prefixed
to the name of a Unicode category (eg, \p{Lu} or
\p{IsLu} ).
The following table lists abbreviations for values in the Unicode General Category:
| L | Letter |
| Lu | Letter, uppercase |
| Ll | Letter, lowercase |
| Lt | Letter, titlecase |
| Lm | Letter, modifier |
| Lo | Letter, other |
| M | Mark |
| Mn | Mark, non-spacing |
| Mc | Mark, spacing combining |
| Me | Mark, enclosing |
| N | Number |
| Nd | Number, decimal digit |
| Nl | Number, letter |
| No | Number, other |
| P | Punctuation |
| Pc | Punctuation, connector |
| Pd | Punctuation, dash |
| Ps | Punctuation, open |
| Pe | Punctuation, close |
| Pi | Punctuation, initial quote (may behave like Ps or Pe depending on usage) |
| Pf | Punctuation, final quote (may behave like Ps or Pe depending on usage) |
| Po | Punctuation, other |
| S | Symbol |
| Sm | Symbol, mathematical |
| Sc | Symbol, currency |
| Sk | Symbol, modifier |
| So | Symbol, other |
| Z | Separator |
| Zs | Separator, space |
| Zl | Separator, line |
| Zp | Separator, paragraph |
| Cc | Other, control |
| Cf | Other, format |
| Cs | Other, surrogate |
| Co | Other, private use |
| Cn | Other, not assigned |
The Java character classes will probably be of interest only to Java
programmers. The name of the character class is formed by substituting
"java" for "is" in the name of a method of the java.lang.Character
class that begins with "is". For example, the character class
\p{javaLetterOrDigit} matches the characters for which
java.lang.Character.isLetterOrDigit() returns true.
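The three types of named character class can be compared by testing single characters against them. The sketch below uses the Java regex engine directly with its default flags; the class and method names and the sample characters are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are not part of RegexSearch.
final class NamedClassSketch {

    // Returns true if the whole of the input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";   // lowercase e with an acute accent
        String alpha = "\u03B1";    // Greek lowercase alpha

        // POSIX classes cover US-ASCII characters only.
        System.out.println(matches("\\p{Lower}", eAcute));   // false
        System.out.println(matches("\\p{Punct}", "_"));      // true

        // Unicode categories, with and without the optional Is prefix.
        System.out.println(matches("\\p{Ll}", eAcute));      // true
        System.out.println(matches("\\p{IsLl}", eAcute));    // true
        System.out.println(matches("\\P{Lu}", "a"));         // true: not an uppercase letter

        // Unicode block, with the In prefix.
        System.out.println(matches("\\p{InGreek}", alpha));  // true

        // Java class, derived from Character.isLetterOrDigit().
        System.out.println(matches("\\p{javaLetterOrDigit}", "7"));  // true
        System.out.println(matches("\\p{javaLetterOrDigit}", "_"));  // false
    }
}
```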
Extended sequences are metasymbols of the form (?...). The
modifiers, [dimsux], and their "off" versions
(preceded by a minus sign) can be concatenated within an extended
sequence; for example, (?iu-ms) switches on i
and u and switches off m and s.
| (?:…) | Non-capturing group (cluster). |
| (?>…) | Non-capturing group referred to in Perl as a nonbacktracking subpattern. |
| (?d) (?-d) | Enable/disable UNIX lines mode. If enabled, only the UNIX line separator ('\n', U+000A) is recognised by the metacharacters ., ^ and $; otherwise, the following characters and character sequences are recognised as line separators: '\n' (U+000A), '\r' (U+000D), "\r\n" (U+000D, U+000A), U+0085, U+2028, U+2029. UNIX lines mode is enabled by default. |
| (?i) (?-i) | Enable/disable case-insensitive matching. Case sensitivity is initially specified by the ignore case search parameter, but it can be changed within the target pattern by means of this flag. By default, case-insensitive matching applies only to characters in the US-ASCII charset, but this can be extended to all Unicode characters with the (?u) flag. |
| (?m) (?-m) | Enable/disable multiline mode. In multiline mode, the metacharacters ^ and $ match at the beginning and end, respectively, of a line; otherwise, they match only at the beginning and end of the input string (ie, the file). Multiline mode is enabled by default. |
| (?s) (?-s) | Enable/disable dotall mode. In dotall mode (known in Perl as single-line mode), the . (dot) metacharacter matches any one character including a line separator; otherwise, . matches any one character except for a line separator. |
| (?u) (?-u) | By default, the case-insensitive matching that is controlled by the ignore case search parameter and the (?i) flag applies only to characters in the US-ASCII charset. Using the (?u) flag, case-insensitive matching can be extended to all Unicode characters. |
| (?x) (?-x) | Enable/disable comments mode. In comments mode, whitespace and comments in the target pattern are ignored. A comment starts with a "#" character and ends at the end of the pattern. |
| (?=pattern) | Positive lookahead: a zero-width assertion that is true if pattern immediately follows the assertion. |
| (?!pattern) | Negative lookahead: a zero-width assertion that is true if pattern does not immediately follow the assertion. |
| (?<=pattern) | Positive lookbehind: a zero-width assertion that is true if pattern immediately precedes the assertion. |
| (?<!pattern) | Negative lookbehind: a zero-width assertion that is true if pattern does not immediately precede the assertion. |
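The effect of some of the flags and assertions in the table can be tried with the Java regex engine. The sketch below is illustrative only (the class and method names and the sample strings are not part of RegexSearch). Note that a pattern compiled by Java with no flags has neither multiline mode nor UNIX lines mode enabled, unlike the defaults that are described above for RegexSearch, so the flags are written out explicitly in the patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class and method names are not part of RegexSearch.
final class ExtendedSequenceSketch {

    // Returns the number of matches of the regex in the input.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // Returns the text of the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Multiline mode: ^ matches at the start of each line, not just of the input.
        System.out.println(count("^b", "a\nb"));        // 0
        System.out.println(count("(?m)^b", "a\nb"));    // 1

        // Dotall mode: the dot also matches a line separator.
        System.out.println(count("a.b", "a\nb"));       // 0
        System.out.println(count("(?s)a.b", "a\nb"));   // 1

        // UNIX lines mode: a lone CR is no longer treated as a line separator.
        System.out.println(count("(?m)^b", "a\rb"));    // 1
        System.out.println(count("(?md)^b", "a\rb"));   // 0

        // Comments mode: whitespace and the trailing comment are ignored.
        System.out.println(count("(?x) a b c # three letters", "abc"));  // 1

        // Lookahead and lookbehind assertions match a position, not text.
        System.out.println(firstMatch("\\d+(?= USD)", "pay 100 USD"));   // 100
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost $42"));      // 42
        System.out.println(count("foo(?!bar)", "foobar foobaz"));        // 1
        System.out.println(count("(?<!un)done", "undone done"));         // 1
    }
}
```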
The following sources were used in writing this section: