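regex: Regular Expressions
C++11 added regular expression support to the standard library.
Introduction to regex
C++11 ships with support for six regular expression grammars: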
- ECMAScript
- basic
- extended
- awk
- grep
- egrep
C++11 uses the ECMAScript grammar by default, which is also the most powerful of the six. To use one of the other five, specify it when constructing the regex object:
regex e("^a.", regex_constants::grep);
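To make the effect of the grammar flag concrete, here is a minimal sketch that compiles the same pattern under different grammars. The helper name full_match_with is made up for this illustration. It relies on the fact that `+` is a quantifier in ECMAScript and extended, but an ordinary character in the POSIX basic grammar.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` matches `pattern` in full when the
// pattern is compiled with the given grammar flag.
bool full_match_with(const std::string& text, const std::string& pattern,
                     std::regex_constants::syntax_option_type grammar) {
    std::regex e(pattern, grammar);
    return std::regex_match(text, e);
}
```

With this helper, the pattern a+ matches "aa" under std::regex_constants::ECMAScript and std::regex_constants::extended, while under std::regex_constants::basic it only matches the literal text "a+".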
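In C++ there are three regex implementations to choose from:
- C++ regex
- C regex
- boost regex
When developing C++ on Windows, the latter two are not available out of the box, so if you want to get started quickly, C++ regex is clearly the more convenient choice.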
Using regex
Full match
regex_match checks whether the entire string matches the regular expression.
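#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    string a = "modao";
    regex e("m(.*)");
    // regex_match succeeds only if the pattern covers the whole string
    if (regex_match(a, e)) cout << "modao";
    return 0;
}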
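Search
The main difference between regex_search and regex_match is that regex_match must match the whole string, while regex_search looks for a matching substring anywhere inside it.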
#include <iostream>
#include <regex>
using namespace std;
int main() {
    regex e("abc*");
    bool m = regex_search("abccc", e);
    // prints yes
    cout << (m ? "yes" : "no") << endl;
}
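The difference between the two functions is easiest to see when the same pattern is run against text with extra characters around the match. The sketch below wraps each call in a small helper; the names matches_fully and matches_somewhere are made up for this illustration.

```cpp
#include <regex>
#include <string>

// True only if the whole of `text` matches `pattern`.
bool matches_fully(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// True if some substring of `text` matches `pattern`.
bool matches_somewhere(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

For the pattern abc*, the input "abccc" satisfies both helpers, whereas "xx abccc!" satisfies only matches_somewhere, because the surrounding characters are not covered by the pattern.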
Extracting a match
To extract the matched parts, use match_results:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    string str("Email a@bc.com abc");
    // smatch is an alias for match_results<string::const_iterator>
    smatch m;
    regex e("([[:w:]]+)@([[:w:]]+\\.com)");
    bool found = regex_search(str, m, e);
    // m.size=3: three results are stored
    cout << "m.size=" << m.size() << endl;
    for (size_t n = 0; n < m.size(); n++) {
        cout << "m[" << n << "]=" << m[n].str() << endl;
        // equivalent forms: m.str(n), *(m.begin()+n)
    }
    // text before and after the match
    cout << "m.prefix=" << m.prefix().str() << endl;
    cout << "m.suffix=" << m.suffix().str() << endl;
}
/*
Output:
m.size=3
m[0]=a@bc.com (the whole match)
m[1]=a (group 1)
m[2]=bc.com (group 2)
m.prefix=Email
m.suffix= abc
*/
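Besides the matched text, match_results also reports where each group sits in the input through position() and length(). The following sketch shows one way to use them; the helper name locate_group and its out-parameters are assumptions made for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: finds the first match of `e` in `text` and reports the
// offset and length of capture group `group` (0 means the whole match).
// Returns false if there is no match or the group did not take part in it.
bool locate_group(const std::string& text, const std::regex& e,
                  std::size_t group, std::size_t& offset, std::size_t& length) {
    std::smatch m;
    if (!std::regex_search(text, m, e)) return false;
    if (group >= m.size() || !m[group].matched) return false;
    offset = static_cast<std::size_t>(m.position(group));
    length = static_cast<std::size_t>(m.length(group));
    return true;
}
```

For the input "Email a@bc.com abc" and the email pattern used above, group 0 starts at offset 6 with length 8, and group 2 starts at offset 8 with length 6.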
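Extracting multiple matches
Iterating over results
If several substrings of the input match the regular expression and we want to find all of them, for example a string that contains several email addresses, we need regex_iterator: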
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    string str("a@bc.com, d@ef.com, aa@bb.com");
    regex e("([[:w:]]+)@([[:w:]]+\\.com)");
    // define the regex_iterator
    sregex_iterator pos(str.cbegin(), str.cend(), e);
    // C++ convention: a default-constructed iterator marks the end of the sequence
    sregex_iterator end;
    for (; pos != end; ++pos) {
        cout << "email=" << pos->str(0)
             << ", user=" << pos->str(1)
             << ", domain=" << pos->str(2)
             << endl;
    }
}
/*
email=a@bc.com, user=a, domain=bc.com
email=d@ef.com, user=d, domain=ef.com
email=aa@bb.com, user=aa, domain=bb.com
*/
As shown above, regex_iterator simply iterates over the match_results of every match of the regular expression in the string.
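A common use of regex_iterator is to collect every match into a container. The sketch below does that with a helper named find_all, a name chosen only for this illustration. Note that the regex object must outlive the iterator, which is why it is taken by reference rather than built as a temporary inside the iterator's constructor call.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the full text of every match of `e` in `text`.
std::vector<std::string> find_all(const std::string& text, const std::regex& e) {
    std::vector<std::string> out;
    // A default-constructed sregex_iterator serves as the end marker.
    for (std::sregex_iterator it(text.begin(), text.end(), e), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Called with "a@bc.com, d@ef.com, aa@bb.com" and the email pattern from the example, it returns the three addresses in order; an input without any match gives an empty vector.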
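Iterating over matched tokens
C++ also provides another iterator, regex_token_iterator. The difference is that regex_token_iterator iterates over a chosen sub-expression of every match, or over the substrings that did not match: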
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    string str("a@bc.com, d@ef.com, aa@bb.com");
    regex e("([[:w:]]+)@([[:w:]]+\\.com)");
    // define the regex_token_iterator
    sregex_token_iterator pos(str.cbegin(), str.cend(), e);
    sregex_token_iterator end; // end of sequence
    for (; pos != end; ++pos) {
        cout << "Matched: " << *pos << endl;
    }
}
/*
Matched: a@bc.com
Matched: d@ef.com
Matched: aa@bb.com
*/
We can change the definition of pos so that each iteration yields the second group of each match:
// the fourth argument selects which group to iterate over
sregex_token_iterator pos(str.cbegin(), str.cend(), e, 2);
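As a complete version of the one-line change above, the following sketch collects a single capture group from every match. The helper name extract_group is an assumption made for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns capture group `group` of every match of `e`.
std::vector<std::string> extract_group(const std::string& text,
                                       const std::regex& e, int group) {
    std::vector<std::string> out;
    // The fourth constructor argument selects the sub-match to iterate over.
    std::sregex_token_iterator it(text.begin(), text.end(), e, group), end;
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

With the email pattern used in this section, group 1 yields the user names (a, d, aa) and group 2 yields the domains (bc.com, ef.com, bb.com).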
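Note that if this argument is set to -1, the iterator instead yields every part of the string that does not match the regular expression, which amounts to splitting the string on the regex: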
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    string str("a bb cd");
    regex e("\\s+"); // matches whitespace
    // iterate over the parts that do not match the regex
    sregex_token_iterator pos(str.cbegin(), str.cend(), e, -1);
    sregex_token_iterator end;
    for (; pos != end; ++pos) {
        cout << "Matched: "
             << *pos << endl;
    }
}
/*
Matched: a
Matched: bb
Matched: cd
*/
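The -1 form is often wrapped in a reusable split function. Here is a minimal sketch; the name split and its signature are choices made for this example and are not part of the standard library.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on every match of `delimiter`.
std::vector<std::string> split(const std::string& text, const std::regex& delimiter) {
    std::vector<std::string> parts;
    // -1 asks the iterator for the text between matches instead of the matches.
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1), end;
    for (; it != end; ++it) {
        parts.push_back(it->str());
    }
    return parts;
}
```

For example, splitting "a bb   cd" on \s+ gives a, bb, cd, and splitting "k1=v1;k2=v2" on [;=] gives k1, v1, k2, v2. If the text starts with a delimiter, the first element is an empty string, so callers may want to filter empty parts.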
Replacing matches
Another common use of regular expressions is string replacement. In C++ this is done with regex_replace:
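#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    string str("a@bc.com, d@ef.com, aa@bb.com");
    regex e("([[:w:]]+)@([[:w:]]+\\.com)");
    // arguments: input string, pattern, replacement format
    cout << regex_replace(str, e, "$1 is on $2");
}
The output is
a is on bc.com, d is on ef.com, aa is on bb.com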
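regex_replace also accepts an optional flags argument, and the format string understands $& for the whole match in addition to $1, $2 and so on. The sketch below shows both; the helper names replace_first_domain and bracket_all are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces only the first match with its second group,
// wrapped in angle brackets. format_first_only stops after one replacement.
std::string replace_first_domain(const std::string& text, const std::regex& e) {
    return std::regex_replace(text, e, "<$2>",
                              std::regex_constants::format_first_only);
}

// Hypothetical helper: wraps every match in square brackets.
// $& in the format string stands for the whole matched text.
std::string bracket_all(const std::string& text, const std::regex& e) {
    return std::regex_replace(text, e, "[$&]");
}
```

With the email pattern from the example and the input "a@bc.com, d@ef.com", replace_first_domain returns "<bc.com>, d@ef.com" and bracket_all returns "[a@bc.com], [d@ef.com]".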
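regex syntax
Brackets in regular expressions:
() matches and captures a group
[] matches any single character from the set
{} specifies how many times the preceding item repeats
The rules of C++ regex are much the same as in other programming languages:
Special characters (for matching characters that are hard to write literally):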
characters | description | matches |
---|---|---|
. | not newline | any character except line terminators (LF, CR, LS, PS). |
\t | tab (HT) | a horizontal tab character (same as \u0009). |
\n | newline (LF) | a newline (line feed) character (same as \u000A). |
\v | vertical tab (VT) | a vertical tab character (same as \u000B). |
\f | form feed (FF) | a form feed character (same as \u000C). |
\r | carriage return (CR) | a carriage return character (same as \u000D). |
\cletter | control code | a control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. For example: \ca is the same as \u0001, \cb the same as \u0002, and so on... |
\xhh | ASCII character | a character whose code unit value has an hex value equivalent to the two hex digits hh. For example: \x4c is the same as L, or \x23 the same as #. |
\uhhhh | unicode character | a character whose code unit value has an hex value equivalent to the four hex digits hhhh. |
\0 | null | a null character (same as \u0000). |
\int | backreference | the result of the submatch whose opening parenthesis is the int-th (int shall begin by a digit other than 0). See groups below for more info. |
\d | digit | a decimal digit character |
\D | not digit | any character that is not a decimal digit character |
\s | whitespace | a whitespace character |
\S | not whitespace | any character that is not a whitespace character |
\w | word | an alphanumeric or underscore character |
\W | not word | any character that is not an alphanumeric or underscore character |
\character | character | the character character as it is, without interpreting its special meaning within a regex expression. Any character can be escaped except those which form any of the special character sequences above. Needed for: ^ $ \ . * + ? ( ) [ ] { } | |
[class] | character class | the target character is part of the class |
[^class] | negated character class | the target character is not part of the class |
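Note that in a C++ string literal the backslash (\) is itself an escape character, so every backslash in the pattern has to be written twice:
std::regex e1 ("\\d"); // \d -> matches a digit character
std::regex e2 ("\\\\"); // \\ -> matches a backslash character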
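Since C++11, raw string literals offer a way around the doubled backslashes: inside R"( ... )" the compiler leaves backslashes untouched. The sketch below builds the same pattern both ways; the function names escaped_pattern and raw_pattern are made up for this comparison.

```cpp
#include <regex>
#include <string>

// Both functions build the same regex: one or more digits, a literal
// backslash, then one or more digits (it matches text such as 12\34).

// Ordinary literal: each regex backslash is doubled for the compiler.
std::regex escaped_pattern() {
    return std::regex("\\d+\\\\\\d+");
}

// Raw literal: the text between R"( and )" reaches the regex unchanged.
std::regex raw_pattern() {
    return std::regex(R"(\d+\\\d+)");
}
```

The raw form reads the same as the regex would in a reference table, which makes longer patterns easier to check.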
Quantifiers:
characters | times | effects |
---|---|---|
* | 0 or more | The preceding atom is matched 0 or more times. |
+ | 1 or more | The preceding atom is matched 1 or more times. |
? | 0 or 1 | The preceding atom is optional (matched either 0 times or once). |
{int} | int | The preceding atom is matched exactly int times. |
{int,} | int or more | The preceding atom is matched int or more times. |
{min,max} | between min and max | The preceding atom is matched at least min times, but not more than max. |
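Note: quantifiers are greedy by default and consume as many characters as possible. Appending ? to a quantifier makes it lazy, so it consumes as few characters as possible.
- The pattern (a+).* matched against "aardvark" captures aa in group 1.
- The pattern (a+?).* matched against "aardvark" captures a in group 1.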
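The two cases in the note can be checked directly with a small helper that returns the first capture group of a full match. The name first_group is an assumption made for this sketch.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of a full match of `pattern`
// against `text`, or an empty string if there is no match or no group.
std::string first_group(const std::string& text, const std::string& pattern) {
    std::regex e(pattern);
    std::smatch m;
    if (std::regex_match(text, m, e) && m.size() > 1) {
        return m[1].str();
    }
    return std::string();
}
```

first_group("aardvark", "(a+).*") gives "aa" because the greedy a+ takes both leading letters, while first_group("aardvark", "(a+?).*") gives "a" because the lazy a+? stops after one and leaves the rest to .*.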
Groups (for treating several consecutive characters as one unit):
characters | description | effects |
---|---|---|
(subpattern) | Group | Creates a backreference. |
(?:subpattern) | Passive group | Does not create a backreference. |
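Note that the first form creates a backreference, which is what lets you extract the matched content. The second form does not, so it also avoids that overhead.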
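One way to see the difference between the two group forms is to count the capture groups of a compiled pattern with mark_count() and to look at how group numbers shift. The helper names capture_count and group_at are made up for this sketch.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: number of capturing groups in `pattern`.
unsigned capture_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Hypothetical helper: text of capture group `n` after a full match,
// or an empty string if the match fails or the group does not exist.
std::string group_at(const std::string& text, const std::string& pattern,
                     std::size_t n) {
    std::regex e(pattern);
    std::smatch m;
    if (std::regex_match(text, m, e) && n < m.size()) {
        return m[n].str();
    }
    return std::string();
}
```

The pattern (\d+)-(\d+) has two capturing groups, while (?:\d+)-(\d+) has only one, so in the second pattern the number after the dash becomes group 1 instead of group 2.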
Anchors and alternation:
characters | description | condition for match |
---|---|---|
^ | Beginning of line | Either it is the beginning of the target sequence, or follows a line terminator. |
$ | End of line | Either it is the end of the target sequence, or precedes a line terminator. |
| | Separator | Separates two alternative patterns or subpatterns. |
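As a small illustration of ^, $ and |, the sketch below compares an anchored alternation with an unanchored one. The helper names is_pet and mentions_pet, and the words used, are made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if `word` is exactly "cat" or "dog".
// ^ and $ pin the alternation to the start and end of the input.
bool is_pet(const std::string& word) {
    static const std::regex e("^(cat|dog)$");
    return std::regex_search(word, e);
}

// Hypothetical helper: true if "cat" or "dog" occurs anywhere in `text`.
bool mentions_pet(const std::string& text) {
    static const std::regex e("cat|dog");
    return std::regex_search(text, e);
}
```

The word "catalog" shows the difference: mentions_pet accepts it because it contains cat, but is_pet rejects it because the anchors require the alternative to span the whole input.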
Single characters:
- [abc] matches a, b, or c.
- [^xyz] matches any character other than x, y, or z.
Ranges:
- [a-z] matches any lowercase letter (a, b, c, ..., z).
- [abc1-5] matches a, b, c, or a digit from 1 to 5.
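The bracket expressions above can be tried out with a helper that checks whether every character of a string belongs to a given class. The name all_chars_match is an assumption made for this sketch; it simply appends + to the bracket expression and requires a full match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is non-empty and each of its characters
// is accepted by the bracket expression `bracket`, for example "[a-z]".
bool all_chars_match(const std::string& text, const std::string& bracket) {
    return std::regex_match(text, std::regex(bracket + "+"));
}
```

For instance, "cab" passes [abc] but "cabx" does not, "a1b5c3" passes [abc1-5] but "a6" does not, and "abc" passes [^xyz] while "abx" fails it.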
C++ regex also supports POSIX-like class names. They are written inside a bracket expression, for example [[:digit:]]:
class | description | equivalent (with regex_traits, default locale) |
---|---|---|
[:alnum:] | alpha-numerical character | isalnum |
[:alpha:] | alphabetic character | isalpha |
[:blank:] | blank character | isblank |
[:cntrl:] | control character | iscntrl |
[:digit:] | decimal digit character | isdigit |
[:graph:] | character with graphical representation | isgraph |
[:lower:] | lowercase letter | islower |
[:print:] | printable character | isprint |
[:punct:] | punctuation mark character | ispunct |
[:space:] | whitespace character | isspace |
[:upper:] | uppercase letter | isupper |
[:xdigit:] | hexadecimal digit character | isxdigit |
[:d:] | decimal digit character | isdigit |
[:w:] | word character | isalnum or '_' |
[:s:] | whitespace character | isspace |
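The class names in the table are only valid inside a bracket expression, so they appear with double brackets in a pattern. The sketch below uses a few of them; the helper names is_identifier and is_hex are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: a letter or underscore followed by letters, digits,
// or underscores. Class names can be mixed with ordinary characters.
bool is_identifier(const std::string& s) {
    static const std::regex e("[[:alpha:]_][[:alnum:]_]*");
    return std::regex_match(s, e);
}

// Hypothetical helper: one or more hexadecimal digits.
bool is_hex(const std::string& s) {
    static const std::regex e("[[:xdigit:]]+");
    return std::regex_match(s, e);
}
```

is_identifier accepts "_tmp1" and rejects "1abc"; is_hex accepts "DEADbeef01" and rejects "xyz".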
Email regex example
#include <iostream>
#include <cstdlib>
#include <string>
#include <regex> // regular expressions
using namespace std;
int main ( )
{
    string email_address;
    string user_name, domain_name;
    regex pattern("([0-9A-Za-z\\-_\\.]+)@([0-9a-z]+\\.[a-z]{2,3}(\\.[a-z]{2})?)");
    // Regular expression, matching rules:
    // Group 1 (the user name): one or more characters from
    // 0-9, A-Z, a-z, underscore, dot, and hyphen
    // Then a single "@" character
    // Group 2 (the domain name): one or more characters from 0-9 or a-z,
    // then a dot, then 2 to 3 characters from a-z (such as com or cn),
    // Nested group inside group 2: a dot followed by 2 characters from a-z (such as cn or fr);
    // the nested group as a whole occurs zero or one time
    // Enter the end-of-file character (Ctrl+Z on Windows, Ctrl+D on UNIX) to end the loop
    while ( cin >> email_address )
    {
        if ( regex_match( email_address, pattern ) )
        {
            cout << "The email address you entered is valid" << endl;
            // extract group 1
            user_name = regex_replace( email_address, pattern, string("$1") );
            // extract group 2
            domain_name = regex_replace( email_address, pattern, string("$2") );
            cout << "User name: " << user_name << endl;
            cout << "Domain name: " << domain_name << endl;
            cout << endl;
        }
        else
        {
            cout << "The email address you entered is invalid" << endl << endl;
        }
    }
    return EXIT_SUCCESS;
}
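The program above pulls out the groups by calling regex_replace twice. An alternative is to capture them with match_results in a single regex_match call, as sketched below. The helper name parse_email and its out-parameters are assumptions made for this example; the pattern is the same one used in the program.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: validates `address` against the pattern from the
// example above and, on success, stores group 1 in `user` and group 2 in
// `domain`. Returns false and leaves both untouched if the address is invalid.
bool parse_email(const std::string& address, std::string& user, std::string& domain) {
    static const std::regex pattern(
        "([0-9A-Za-z\\-_\\.]+)@([0-9a-z]+\\.[a-z]{2,3}(\\.[a-z]{2})?)");
    std::smatch m;
    if (!std::regex_match(address, m, pattern)) {
        return false;
    }
    user = m[1].str();    // text before the @
    domain = m[2].str();  // text after the @
    return true;
}
```

This matches the input only once and makes the validity check and the extraction a single step. Keep in mind that the pattern is a simplified illustration and does not cover every address that real mail systems accept.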