String operations
SQL defines two primary character types: character varying(n)
and
character(n)
, where n is a positive integer. Both of these types
can store strings up to n characters (not bytes) in length. An attempt
to store a longer string into a column of these types will result in
an error, unless the excess characters are all spaces, in which case
the string will be truncated to the maximum length. (This somewhat
bizarre exception is required by the SQL standard.) If the string to
be stored is shorter than the declared length, values of type
character will be space-padded; values of type character varying will
simply store the shorter string.
In addition, we provide the text
, or varchar
type, which stores
strings of any length.
Trailing spaces are removed when converting a character value to one of the other string types. Note that trailing spaces are semantically significant in character varying and text values, and when using pattern matching (e.g., LIKE and regular expressions).
String constants (literals)
A string constant in SQL is an arbitrary sequence of characters
bounded by single quotes ('
), for example 'This is a string'
. To
include a single-quote character within a string constant, write two
adjacent single quotes, e.g., 'Dianne''s horse'
. Note that this is
not the same as a double-quote character ("
).
Two string constants that are only separated by whitespace with at least one newline are concatenated and effectively treated as if the string had been written as one constant. For example:
SELECT 'foo'
'bar'
is equivalent to:
SELECT 'foobar'
but:
SELECT 'foo' 'bar'
is not valid syntax.
Escaped characters
We also accepts escaped characters withing string constants, which are
an extension to the SQL standard. Within an escape string, a
backslash character (\
) begins a C-like backslash escape sequence, in
which the combination of backslash and following character(s)
represent a special byte value:
Backslash Escape Sequence | Interpretation |
---|---|
\b | backspace |
\f | form feed |
\n | newline |
\r | carriage return |
\t | tab |
\o, \oo, \ooo | (o = 0–7) octal byte value |
\xh, \xhh (h = 0–9, A–F) | hexadecimal byte value |
\uxxxx, \Uxxxxxxxx (x = 0–9, A–F) | 16 or 32-bit hexadecimal Unicode character value |
Any other character following a backslash is taken literally. Thus, to
include a backslash character, write two backslashes \\
. Also, a
single quote can be included in an escape string by writing \'
, in
addition to the normal way of ''
.
Operations on string values
Operation | Description | Examples |
---|---|---|
|| | String concatenation (infix). Note that concatenation does not strip trailing spaces
from CHAR(N) values, unlike other SQL dialects. If such behavior is desired, an explicit
cast to varchar can be added. | 'Post' || 'greSQL' => PostgreSQL |
string LIKE pattern [ESCAPE escape-character] and
string NOT LIKE pattern [ESCAPE escape-character] | The LIKE expression returns true if the string matches the supplied pattern.
(As expected, the NOT LIKE expression returns false if LIKE returns true. | See below for details. |
| The | See below for details. |
string RLIKE pattern and
string NOT RLIKE pattern | The RLIKE expression returns true if the string matches the supplied pattern. The pattern is a standard Java regular expression. | 'string' RLIKE 's..i.*' => TRUE |
ASCII ( string ) | Returns the numeric code of the first character of the argument. In UTF8 encoding, returns the Unicode code point of the character. In other multibyte encodings, the argument must be an ASCII character. Returns 0 if the string is empty. | ascii('x') => 120 |
CHAR_LENGTH(string) or CHARACTER_LENGTH(string) or LENGTH(string) or LEN(string) | Returns number of characters in the string. | char_length('josé') => 4 |
CHR ( integer ) | Returns a string containing the character with the given code. If the code is incorrect (e.g., negative), the result is an empty string. | chr(65) => A |
CONCAT( string1, ..., stringN) | String concatenation. Can have any number of arguments. | CONCAT('Post', 'greSQL', 1) => PostgreSQL1 |
CONCAT_WS( sep, string1, ..., stringN) | String concatenation with separator sep . Can have any number of arguments. sep is intercalated between all strings. If sep is NULL result is NULL . Other NULL arguments are ignored. | CONCAT_WS(',', 'Post', 'greSQL', NULL, '1') => Post,greSQL,,1 |
INITCAP ( string ) | Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters. | initcap('hi THOMAS') => Hi Thomas |
LEFT ( string, count ) | Returns first count characters in the string. If any argument is NULL , return NULL . | left('abcde', 2) => ab |
LOWER ( string ) | Converts the string to all lower case. | lower('TOM') => tom |
OVERLAY ( string PLACING newsubstring FROM start [ FOR remove ] ) | Replaces the substring of string that starts at the start'th character and extends for remove characters with newsubstring. If count is omitted, it defaults to the length of newsubstring. If 'start' is nott positive, the original string is unchanged. If 'start' is bigger than the length of 'string', the result is the concatenation of the two strings. If 'remove' is negative it is considered 0. | overlay('Txxxxas' placing 'hom' from 2 for 4) => Thomas |
POSITION(substring IN string) | Returns first starting index of the specified substring within string, or zero if it's not present. First character has index 1. | position('om' in 'Thomas') => 3 |
REPEAT ( string, count ) | Repeats string the specified number of times. The result is an empty string for a negative or 0 count. | repeat('Pg', 4) => PgPgPgPg |
REPLACE ( haystack, needle, replacement ) | Replaces all occurrences of needle in haystack with replacement . | replace('abcdefabcdef', 'cd', 'XX') => abXXefabXXef |
RLIKE(string, pattern) | A function equivalent to the RLIKE operator above. | RLIKE('string', 's..i.*') => TRUE |
SPLIT(string [, delimiter]) | Produce an array of strings, by splitting the first argument at each delimiter occurrence.
If the delimiter is empty, return an array with the original string. If the original
string is empty, return an empty array. If either argument is NULL , return NULL .
If delimiter is absent assume it is the string ',' . | SPLIT('a|b|c|', '|') => |
SPLIT_PART(string, delimiter, n) | This function uses 1-based indexing. It extracts the
|
|
SUBSTRING ( string [ FROM start ] [ FOR count ] ) | Extracts the substring of string starting at the "start"'th character if that is specified, and stopping after "count" characters if the value is specified. At least one of "start" or "count" must be provided. If "start" is negative, it is replaced with 1. If "count" is negative the empty string is returned. The index of the first character is 1. | SUBSTRING('Thomas' from 2 for 3) => hom SUBSTRING('Thomas' from 3) => omas SUBSTRING('Thomas' for 2) => Th |
TRIM ( [ LEADING | TRAILING | BOTH ] characters FROM string ) | Remove the specified characters from the specified ends of the string argument | TRIM(both 'xyz' from 'yxTomxx') => Tom TRIM(leading 'xyz' from 'yxTomxx') => Tomxx |
UPPER ( string ) | Converts the string to all upper case. | upper('tom') => TOM |
REGEXP_REPLACE(expr, pat[, repl]) | Replaces occurrences in the string expr that match the regular expression
specified by the pattern pat with the replacement string repl , and returns
the resulting string. If any one of expr , pat , or repl is NULL , the return value is NULL .
If repl is missing, it is assumed to be the empty string. If the regular
expression is invalid, the original string is returned. |
LIKE
string LIKE
pattern [ESCAPE
escape-character]
string NOT LIKE
pattern [ESCAPE
escape-character]
If pattern does not contain percent signs or underscores, then the
pattern only represents the string itself; in that case LIKE
acts
like the equals operator. An underscore (_
) in pattern stands for
(matches) any single character; a percent sign (%
) matches any
sequence of zero or more characters.
Some examples:
'abc' LIKE 'abc' true
'abc' LIKE 'a%' true
'abc' LIKE '_b_' true
'abc' LIKE 'c' false
LIKE
pattern matching always covers the entire string. Therefore, if
it's desired to match a sequence anywhere within a string, the pattern
must start and end with a percent sign.
To match a literal underscore or percent sign without matching other
characters, the respective character in pattern must be preceded by
the escape character. The default escape character is the backslash
but a different one can be selected by using the ESCAPE clause. To
match the escape character itself, write two escape characters. The
escape character cannot be one of the special pattern characters _
or %
.
Some examples where the escape character is changed to #
:
SELECT 'hawkeye' LIKE 'h%' ESCAPE '#' true
SELECT 'hawkeye' NOT LIKE 'h%' ESCAPE '#' false
SELECT 'h%' LIKE 'h#%' ESCAPE '#' true
SELECT 'h%' NOT LIKE 'h#%' ESCAPE '#' false
SELECT 'h%wkeye' LIKE 'h#%' ESCAPE '#' false
SELECT 'h%wkeye' NOT LIKE 'h#%' ESCAPE '#' true
SELECT 'h%wkeye' LIKE 'h#%%' ESCAPE '#' true
SELECT 'h%wkeye' NOT LIKE 'h#%%' ESCAPE '#' false
SELECT 'h%awkeye' LIKE 'h#%a%k%e' ESCAPE '#' true
When either argument or LIKE
, NOT LIKE
is NULL
, the result is NULL
.
ILIKE
string ILIKE
pattern
string NOT ILIKE
pattern
The ILIKE
expression performs a case-insensitive pattern match. If the pattern does not contain percent signs or underscores, then the pattern represents the string itself, and ILIKE
acts like the equals operator, ignoring character case. An underscore (_
) in the pattern matches any single character, while a percent sign (%
) matches any sequence of zero or more characters.
Some examples:
SELECT 'hawkeye' ILIKE 'h%' true
SELECT 'hawkeye' NOT ILIKE 'h%' false
SELECT 'hawkeye' ILIKE 'H%' true
SELECT 'hawkeye' NOT ILIKE 'H%' false
SELECT 'hawkeye' ILIKE 'H%Eye' true
SELECT 'hawkeye' NOT ILIKE 'H%Eye' false
SELECT 'Hawkeye' ILIKE 'h%' true
SELECT 'Hawkeye' NOT ILIKE 'h%' false
SELECT 'ABC' ILIKE '_b_' true
SELECT 'ABC' NOT ILIKE '_b_' false
When either argument or ILIKE
, NOT ILIKE
is NULL
, the result is NULL
.
POSIX regular expressions
Regular expressions are matched using the RLIKE
function. If either
argument of RLIKE
is NULL
, the result is also NULL
.
The description below is from the Postgres documentation, where credit is given to Henry Spencer.
POSIX regular expressions provide a more powerful means for pattern
matching than the LIKE
and SIMILAR TO
operators. Many Unix tools
such as egrep
, sed
, or awk
use a pattern matching language that
is similar to the one described here.
Currently our compiler does not support SIMILAR TO
regular
expressions.
A regular expression is a character sequence that is an abbreviated
definition of a set of strings (a regular set). A string is said to
match a regular expression if it is a member of the regular set
described by the regular expression. As with LIKE
, pattern
characters match string characters exactly unless they are special
characters in the regular expression language — but regular
expressions use different special characters than LIKE
does. Unlike
LIKE
patterns, a regular expression is allowed to match anywhere
within a string, unless the regular expression is explicitly anchored
to the beginning or end of the string.
A regular expression is defined as one or more branches, separated
by |
. It matches anything that matches one of the branches.
A branch is zero or more quantified atoms or constraints, concatenated. It matches a match for the first, followed by a match for the second, etc.; an empty branch matches the empty string.
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a match for the atom. With a quantifier, it can match some number of matches of the atom. An atom can be any of the possibilities shown in the Table below.
Atom | Description |
---|---|
(re) | where re is any regular expression: matches a match for re , with the match noted for possible reporting |
(?:re) | as above, but the match is not noted for reporting (a “non-capturing” set of parentheses) |
. | matches any single character |
[chars] | a bracket expression, matching any one of the chars (see below for more details) |
\k | where k is a non-alphanumeric character): matches that character taken as an ordinary character, e.g.,
\ matches a backslash character |
\c | where c is alphanumeric (possibly followed by other characters): is an escape, see below |
{ | when followed by a character other than a digit, matches the left-brace character { ;
when followed by a digit, it is the beginning of a bound (see below) |
x | where x is a single character with no other significance, matches that character |
The possible quantifiers and their meanings are shown the Table below.
Quantifier | Matches |
---|---|
* | a sequence of 0 or more matches of the atom |
+ | a sequence of 1 or more matches of the atom |
? | a sequence of 0 or 1 matches of the atom |
{ m} | a sequence of exactly m matches of the atom |
{ m,} | a sequence of m or more matches of the atom |
{ m, n} | a sequence of m through n (inclusive) matches of the atom; m cannot exceed n |
A constraint matches an empty string, but matches only when specific conditions are met. A constraint can be used where an atom could be used, except it cannot be followed by a quantifier. The simple constraints are shown in the Table below; some more constraints are described later.
Constraint | Description |
---|---|
^ | matches at the beginning of the string |
$ | matches at the end of the string |
Bracket Expressions
A bracket expression is a list of characters enclosed in []
. It
normally matches any single character from the list (but see
below). If the list begins with ^
, it matches any single character
not from the rest of the list. If two characters in the list are
separated by -
, this is shorthand for the full range of characters
between those two (inclusive) in the collating sequence, e.g., [0-9]
in ASCII matches any decimal digit. It is illegal for two ranges to
share an endpoint, e.g., a-c-e
. Ranges are very
collating-sequence-dependent, so portable programs should avoid
relying on them.
To include a literal ]
in the list, make it the first character
(after ^
, if that is used). To include a literal -
, make it the
first or last character, or the second endpoint of a range. To use a
literal -
as the first endpoint of a range, enclose it in [.
and
.]
. With the exception of these characters, some combinations using
[
, all other special characters lose their special significance
within a bracket expression. In particular, \
is not special.
Within a bracket expression, the name of a character class enclosed in
[:
and :]
stands for the list of all characters belonging to that
class. A character class cannot be used as an endpoint of a range. The
POSIX standard defines these character class names:
Class | Description |
---|---|
alnum | letters and numeric digits |
alpha | letters |
blank | space and tab |
cntrl | control characters |
digit | numeric digits |
graph | printable characters except space |
lower | lower-case letters |
print | printable characters including space |
punct | punctuation |
space | any white space |
upper | upper-case letters |
xdigit | hexadecimal digits |
Class-shorthand escapes provide shorthands for certain commonly-used character classes. They are shown in the table below.
Escape | Description |
---|---|
\d | matches any digit, like [[:digit:]] |
\s | matches any whitespace character, like [[:space:]] |
\w | matches any word character, like [[:word:]] |
\D | matches any non-digit, like [^[:digit:]] |
\S | matches any non-whitespace character, like [^[:space:]] |
\W | matches any non-word character, like [^[:word:]] |
The behavior of these standard character classes is generally consistent across platforms for characters in the 7-bit ASCII set. Whether a given non-ASCII character is considered to belong to one of these classes depends on the collation that is used for the regular-expression function or operator.
Regular Expression Escapes
Escapes are special sequences beginning with \
followed by an
alphanumeric character. Escapes come in several varieties: character
entry, class shorthands, constraint escapes, and back references. A
\
followed by an alphanumeric character but not constituting a valid
escape is illegal.
Character-entry escapes exist to make it easier to specify non-printing and other inconvenient characters in REs. They are shown in the Table below.
Escape | Description |
---|---|
\a | alert (bell) character, as in C |
\b | backspace, as in C |
\B | synonym for backslash () to help reduce the need for backslash doubling |
\c X | (where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero |
\e | the character whose collating-sequence name is ESC, or failing that, the character with octal value 033 |
\f | form feed, as in C |
\n | newline, as in C |
\r | carriage return, as in C |
\t | horizontal tab, as in C |
\u wxyz | (where wxyz is exactly four hexadecimal digits) the character whose hexadecimal value is 0xwxyz |
\v | vertical tab, as in C |
\x hhh | (where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used) |
\0 | the character whose value is 0 (the null byte) |
\ xy | (where xy is exactly two octal digits, and is not a back reference) the character whose octal value is 0xy |
\ xyz | (where xyz is exactly three octal digits, and is not a back reference) the character whose octal value is 0xyz |
Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
A constraint escape is a constraint, matching the empty string if specific conditions are met, written as an escape. They are shown in the Table below.
Escape | Description |
---|---|
\A | matches only at the beginning of the string (see Section 9.7.3.5 for how this differs from ^) |
\m | matches only at the beginning of a word |
\M | matches only at the end of a word |
\y | matches only at the beginning or end of a word |
\Y | matches only at a point that is not the beginning or end of a word |
\Z | matches only at the end of the string |
A back reference (\
n)
matches the same string matched by the
previous parenthesized subexpression specified by the number n (see
the Table below). For example, ([bc])\1
matches bb
or cc
but not
bc
or cb
. The subexpression must entirely precede the back
reference in the RE. Subexpressions are numbered in the order of their
leading parentheses. Non-capturing parentheses do not define
subexpressions. The back reference considers only the string
characters matched by the referenced subexpression, not any
constraints contained in it. For example, (^\d)\1
will match 22
.
Escape | Description |
---|---|
\ m | (where m is a nonzero digit) a back reference to the m'th subexpression |
\ mnn | (where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference to the mnn'th subexpression |
Capture groups
A common way to use regexes is with capture groups. That is, instead of just looking for matches of an entire regex, parentheses are used to create groups that represent part of the match.
For example, consider a string with multiple lines, and each line has
three whitespace delimited fields where the second field is expected
to be a number and the third field a boolean. This can be expressed
with the following regular expression, where the capture groups have
been labeled $0
to $4
.
(?m)^\s*(\S+)\s+([0-9]+)\s+(true|false)\s*$
^^^^ ^^^^^ ^^^^^^^^ ^^^^^^^^^^^^
$1 $2 $3 $4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
$0
Capture group 0 always corresponds to an implicit unnamed group that includes the entire match. If a match is found, this group is always present.
Subsequent groups may be named and are numbered, starting at 1, by the
order in which the opening parenthesis appears in the pattern. For
example, in the pattern (?<a>.(?<b>.))(?<c>.)
, a
, b
and c
correspond to capture groups $1
, $2
and $3
, respectively.
Regular expression functions
REGEXP_REPLACE(expr, pat[, repl])
If repl
is missing, it is assumed to be the empty string.
Replaces occurrences in the string expr
that match the regular
expression specified by the pattern pat
with the replacement string
repl
, and returns the resulting string. If any of expr
, pat
, or
repl
is NULL
, the return value is NULL
.
Replacement string syntax
All instances of $N
in the replacement string are replaced with the
substring corresponding to the capture group identified by N
.
N
may be an integer corresponding to the index of the capture group
(counted by order of opening parenthesis where 0 is the entire match)
or it can be a name (consisting of letters, digits or underscores)
corresponding to a named capture group.
If N
isn’t a valid capture group (whether the name doesn’t exist or
isn’t a valid index), then it is replaced with the empty string.
The longest possible name is used. For example, $1a
looks up the
capture group named 1a
and not the capture group at index 1
. To
exert more precise control over the name, use braces, e.g., ${1}a
.
To write a literal $
use $$
.
Examples:
select regexp_replace('1078910', '[^01]');
1010
select regexp_replace('deep fried', '(?<first>\w+)\s+(?<second>\w+)', '${first}_$second');
deep_fried
select regexp_replace('Springsteen, Bruce', '([^,\s]+),\s+(\S+)', '$2 $1');
Bruce Springsteen
select regexp_replace('Springsteen, Bruce', '(?<last>[^,\s]+),\s+(?<first>\S+)', '$first $last');
Bruce Springsteen
Note that using $2
instead of $first
or $1
instead of $last
would produce the same result.