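regex: Regular Expressions

C++11 added regular expression support to the standard library, in the <regex> header.

Introduction to regex

C++11 ships with support for six regular expression grammars: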

ECMAScript
basic
extended
awk
grep
egrep

C++11 uses the ECMAScript grammar by default, which is also the most expressive of the six. To use one of the other five, specify it when constructing the regex object:

regex e("^a.", regex_constants::grep);
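To see why the grammar choice matters, here is a minimal sketch that compiles the same pattern text under different grammars. The helper name `matches_with` is hypothetical and not part of the standard library. It relies on `+` being a quantifier in ECMAScript but an ordinary character in the POSIX basic grammar.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: full-match `text` against `pattern`,
// with `pattern` compiled under the given grammar.
bool matches_with(const std::string& text, const std::string& pattern,
                  std::regex_constants::syntax_option_type grammar) {
    std::regex e(pattern, grammar);
    return std::regex_match(text, e);
}
```

With this helper, "aaa" matches the pattern a+ under `regex_constants::ECMAScript`, while under `regex_constants::basic` only the literal text "a+" matches.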

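In C++ there are three regex libraries to choose from:

C++ regex
C regex
boost regex

If you develop C++ on Windows, the latter two are not supported out of the box, so C++ regex is clearly the most convenient way to get started quickly.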

Using regex

Whole-string matching

regex_match tests whether the entire string matches the regular expression:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    string a = "modao";
    regex e("m(.*)");
    if (regex_match(a, e)) cout << "modao";
    return 0;
}

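Search matching

The main difference between regex_search and regex_match: regex_match requires the whole string to match, whereas regex_search looks for any substring that matches.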

#include <iostream>
#include <regex>
using namespace std;

int main() {
  regex e("abc*");
  bool m = regex_search("abccc", e);

  // prints yes
  cout << (m ? "yes" : "no") << endl;
}
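The difference is easiest to see when both functions are applied to the same input. The sketch below wraps the two calls in hypothetical helpers (`whole_match` and `contains_match` are illustrative names only).

```cpp
#include <regex>
#include <string>

// Hypothetical helpers that put the two calls side by side.
bool whole_match(const std::string& s, const std::regex& e) {
    return std::regex_match(s, e);   // the entire string must match
}

bool contains_match(const std::string& s, const std::regex& e) {
    return std::regex_search(s, e);  // any substring may match
}
```

For the pattern abc*, the input "xxabcccyy" fails `whole_match` but passes `contains_match`, while "abccc" passes both.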

Extracting matches

To extract the matched parts, use match_results:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str("Email a@bc.com abc");

  // same as match_results<string::const_iterator>
  smatch m;

  regex e("([[:w:]]+)@([[:w:]]+\\.com)");
  bool found = regex_search(str, m, e);
  if (!found) return 1;

  // m.size=3: the whole match plus 2 capture groups
  cout << "m.size=" << m.size() << endl;

  for (size_t n = 0; n < m.size(); n++) {
    cout << "m[" << n << "]=" << m[n].str() << endl;
    // equivalent forms: m.str(n), *(m.begin()+n)
  }
  // the text before and after the match
  cout << "m.prefix=" << m.prefix().str() << endl;
  cout << "m.suffix=" << m.suffix().str() << endl;
}
/*
Iterating over match_results prints
m[0]=a@bc.com (the whole match)
m[1]=a (group 1)
m[2]=bc.com (group 2)
m.prefix=Email 
m.suffix= abc
*/
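Besides the matched text, match_results also records where the match sits in the input, through its `position()` and `length()` members. A minimal sketch follows, in which the `MatchSpan` struct and the `locate` function are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type for this illustration.
struct MatchSpan {
    bool found;
    std::size_t offset;  // index of the first matched character
    std::size_t length;  // number of matched characters
};

// Find the first match of `e` in `text` and report where it is.
MatchSpan locate(const std::string& text, const std::regex& e) {
    std::smatch m;
    if (!std::regex_search(text, m, e)) return {false, 0, 0};
    return {true,
            static_cast<std::size_t>(m.position(0)),
            static_cast<std::size_t>(m.length(0))};
}
```

For the string "Email a@bc.com abc" and the email pattern used above, the match starts at offset 6 and is 8 characters long, consistent with the prefix "Email " shown in the output.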

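Extracting multiple matches

Iterating over results

When several substrings of a string match the regular expression and we want to find all of them, for example a string containing several email addresses, we need regex_iterator: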

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str("a@bc.com, d@ef.com, aa@bb.com");

  regex e("([[:w:]]+)@([[:w:]]+\\.com)");

  // define the regex_iterator
  sregex_iterator pos(str.cbegin(), str.cend(), e);
  // C++ convention: a default-constructed iterator marks the end of the sequence
  sregex_iterator end;

  for (; pos != end; ++pos) {
    cout << "email=" << pos->str(0)
      << ", user=" << pos->str(1)
      << ", domain=" << pos->str(2)
      << endl;
  }
}

/*
email=a@bc.com, user=a, domain=bc.com
email=d@ef.com, user=d, domain=ef.com
email=aa@bb.com, user=aa, domain=bb.com
*/

As shown above, regex_iterator simply iterates over the match_results of every match of the regular expression in the string.
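In practice the loop is often wrapped in a function that collects every match into a container. The following is a sketch under that assumption; `find_all` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the full text of every match of `e` in `text`.
std::vector<std::string> find_all(const std::string& text, const std::regex& e) {
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), e), end; it != end; ++it) {
        out.push_back(it->str());  // str() with no argument is the whole match
    }
    return out;
}
```

Note that the regex object must outlive the iterator, which is why the helper takes it by reference rather than constructing a temporary inside the iterator call.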

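Iterating over matched substrings

C++ also provides another iterator, regex_token_iterator.

The difference is that regex_token_iterator iterates over a chosen subexpression of every match, or over the substrings that did not match: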

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str("a@bc.com, d@ef.com, aa@bb.com");

  regex e("([[:w:]]+)@([[:w:]]+\\.com)");

  // define the regex_token_iterator
  sregex_token_iterator pos(str.cbegin(), str.cend(), e);
  sregex_token_iterator end; // end of the sequence

  for (; pos != end; ++pos) {
    cout << "Matched: " << *pos << endl;
  }
}
/*
Matched: a@bc.com
Matched: d@ef.com
Matched: aa@bb.com
*/

We can change the definition of pos so that each step yields the second group of the match_results:

// the 4th argument selects which group to iterate over
sregex_token_iterator pos(str.cbegin(), str.cend(), e, 2);
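As a sketch of how the group index can be used, the hypothetical helper `collect_group` below gathers one capture group from every match into a vector.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect capture group `group` from every match of `e`.
std::vector<std::string> collect_group(const std::string& text,
                                       const std::regex& e, int group) {
    std::vector<std::string> out;
    std::sregex_token_iterator it(text.begin(), text.end(), e, group), end;
    for (; it != end; ++it) {
        out.push_back(it->str());  // each element is a sub_match
    }
    return out;
}
```

With the email pattern and input from the example above, group 1 yields the user names a, d, aa and group 2 yields the domains bc.com, ef.com, bb.com.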

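Note that if the fourth argument is set to -1, the iterator instead visits every part of the string that does not match the regular expression, which amounts to splitting the string on the regular expression: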

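#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str("a bb   cd");

  regex e("\\s+"); // matches a run of whitespace

  // iterate over the parts that do not match the regular expression
  sregex_token_iterator pos(str.cbegin(), str.cend(), e, -1);
  sregex_token_iterator end;

  for (; pos != end; ++pos) {
    cout << "Matched: "
      << *pos << endl;
  }
}
/*
Matched: a
Matched: bb
Matched: cd
*/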
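The same idea can be packaged as a reusable split function. This is a minimal sketch; `split` is a hypothetical helper, since the standard library has no function of that name for strings.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on every match of `delim`.
std::vector<std::string> split(const std::string& text, const std::regex& delim) {
    std::vector<std::string> out;
    // -1 selects the stretches of text between matches
    std::sregex_token_iterator it(text.begin(), text.end(), delim, -1), end;
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Splitting "a bb   cd" on \s+ gives the three tokens a, bb and cd, the same as the loop above.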

Replacing matches

Another common use of regular expressions is string replacement. In C++ we can use regex_replace:

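#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
  string str("a@bc.com, d@ef.com, aa@bb.com");

  regex e("([[:w:]]+)@([[:w:]]+\\.com)");
  // arguments: source string, pattern, replacement format
  cout << regex_replace(str, e, "$1 is on $2");
}

Output: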

a is on bc.com, d is on ef.com, aa is on bb.com
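regex_replace also accepts an optional flags argument that changes how the output is assembled. The sketch below shows two of the standard flags, `format_first_only` and `format_no_copy`; the wrapper names `replace_first` and `matches_only` are hypothetical.

```cpp
#include <regex>
#include <string>

// Replace only the first match; the rest of the text is copied unchanged.
std::string replace_first(const std::string& text, const std::regex& e,
                          const std::string& fmt) {
    return std::regex_replace(text, e, fmt,
                              std::regex_constants::format_first_only);
}

// Emit only the formatted matches, dropping the text between them.
std::string matches_only(const std::string& text, const std::regex& e,
                         const std::string& fmt) {
    return std::regex_replace(text, e, fmt,
                              std::regex_constants::format_no_copy);
}
```

With the string and pattern from the example, `replace_first` with the format "$1 is on $2" gives "a is on bc.com, d@ef.com, aa@bb.com", and `matches_only` with the format "[$2]" gives "[bc.com][ef.com][bb.com]".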

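regex syntax

Brackets in regular expressions:
() matches and captures
[] matches any single character from the set
{} specifies how many times the preceding item repeats

The rules of C++ regex are much the same as in other programming languages:

Special characters (for characters that are hard to write literally):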

characters	description	matches
.	not newline	any character except line terminators (LF, CR, LS, PS).
\t	tab (HT)	a horizontal tab character (same as \u0009).
\n	newline (LF)	a newline (line feed) character (same as \u000A).
\v	vertical tab (VT)	a vertical tab character (same as \u000B).
\f	form feed (FF)	a form feed character (same as \u000C).
\r	carriage return (CR)	a carriage return character (same as \u000D).
\cletter	control code	a control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. For example: \ca is the same as \u0001, \cb the same as \u0002, and so on...
\xhh	ASCII character	a character whose code unit value has an hex value equivalent to the two hex digits hh. For example: \x4c is the same as L, or \x23 the same as #.
\uhhhh	unicode character	a character whose code unit value has an hex value equivalent to the four hex digits hhhh.
\0	null	a null character (same as \u0000).
\int	backreference	the result of the submatch whose opening parenthesis is the int-th (int shall begin by a digit other than 0). See groups below for more info.
\d	digit	a decimal digit character
\D	not digit	any character that is not a decimal digit character
\s	whitespace	a whitespace character
\S	not whitespace	any character that is not a whitespace character
\w	word	an alphanumeric or underscore character
\W	not word	any character that is not an alphanumeric or underscore character
\character	character	the character character as it is, without interpreting its special meaning within a regex expression. Any character can be escaped except those which form any of the special character sequences above. Needed for: ^ $ \ . * + ? ( ) [ ] { } |
[class]	character class	the target character is part of the class
[^class]	negated character class	the target character is not part of the class

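Note that in a C++ string literal the backslash (\) is itself an escape character, so every backslash in a pattern has to be written twice:

std::regex e1 ("\\d"); // \d -> matches a digit character
std::regex e2 ("\\\\"); // \\ -> matches a backslash character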
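Raw string literals, also introduced in C++11, avoid the doubling because backslashes inside R"(...)" are not processed by the compiler. A minimal sketch, with hypothetical function names, comparing the two spellings:

```cpp
#include <regex>
#include <string>

// Ordinary literal: each regex backslash is written as "\\".
bool is_digits_escaped(const std::string& s) {
    static const std::regex e("\\d+");
    return std::regex_match(s, e);
}

// Raw literal: the pattern text \d+ is written as-is.
bool is_digits_raw(const std::string& s) {
    static const std::regex e(R"(\d+)");
    return std::regex_match(s, e);
}

// Raw literal for a single backslash: the regex \\ needs only two characters.
bool is_backslash(const std::string& s) {
    static const std::regex e(R"(\\)");
    return std::regex_match(s, e);
}
```

Both digit checks accept "123" and reject "12a", so the two spellings describe the same pattern.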

Quantifiers:

characters	times	effects
*	0 or more	The preceding atom is matched 0 or more times.
+	1 or more	The preceding atom is matched 1 or more times.
?	0 or 1	The preceding atom is optional (matched either 0 times or once).
{int}	int	The preceding atom is matched exactly int times.
{int,}	int or more	The preceding atom is matched int or more times.
{min,max}	between min and max	The preceding atom is matched at least min times, but not more than max.

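Note: quantifiers are greedy by default and match as many characters as possible. Appending ? to a quantifier makes it lazy, so that it matches as few characters as possible:

Matching the pattern (a+).* against "aardvark" captures aa in group 1
Matching the pattern (a+?).* against "aardvark" captures a in group 1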
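The two cases can be checked directly by reading capture group 1 after a full match. In this sketch `first_group` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return capture group 1 of a full match,
// or an empty string when the text does not match.
std::string first_group(const std::string& text, const std::string& pattern) {
    std::regex e(pattern);
    std::smatch m;
    return std::regex_match(text, m, e) ? m.str(1) : std::string();
}
```

Calling `first_group("aardvark", "(a+).*")` returns "aa", and `first_group("aardvark", "(a+?).*")` returns "a".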

Groups (for treating a sequence of characters as a single unit):

characters	description	effects
(subpattern)	Group	Creates a backreference.
(?:subpattern)	Passive group	Does not create a backreference.

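Note that the first form creates a backreference, which is what lets you extract the matched content. The second form does not, and so avoids that overhead.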
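One way to observe the difference is `regex::mark_count()`, which reports how many capturing groups a pattern defines. The sketch below also uses a backreference (\1) to find a doubled character. Both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Number of capturing groups in `pattern`; (?:...) groups are not counted.
unsigned capture_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// True when some word character is immediately repeated, e.g. "ll" in "hello".
// \1 refers back to whatever group 1 captured.
bool has_doubled_char(const std::string& s) {
    static const std::regex e(R"((\w)\1)");
    return std::regex_search(s, e);
}
```

Here `capture_count("(\\w+)@(\\w+)")` is 2, while `capture_count("(?:\\w+)@(\\w+)")` is 1.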

Assertions and alternatives:

characters	description	condition for match
^	Beginning of line	Either it is the beginning of the target sequence, or follows a line terminator.
$	End of line	Either it is the end of the target sequence, or precedes a line terminator.
|	Separator	Separates two alternative patterns or subpatterns.
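A short sketch of these three constructs on single-line input, using hypothetical helper names:

```cpp
#include <regex>
#include <string>

// ^ anchors the match to the start of the input.
bool starts_with_dog(const std::string& s) {
    static const std::regex e("^dog");
    return std::regex_search(s, e);
}

// $ anchors the match to the end of the input.
bool ends_with_dog(const std::string& s) {
    static const std::regex e("dog$");
    return std::regex_search(s, e);
}

// | accepts either alternative; regex_match requires the whole string to be one of them.
bool is_cat_or_dog(const std::string& s) {
    static const std::regex e("cat|dog");
    return std::regex_match(s, e);
}
```

For example, "hotdog" ends with dog but does not start with it, and "catdog" is neither of the two alternatives on its own.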

Single characters

[abc] matches a, b or c. [^xyz] matches any character other than x, y and z.

Ranges: [a-z] matches any lowercase letter (a, b, c, ..., z). [abc1-5] matches a, b, c, or a digit from 1 to 5.
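The sets described above can be tried out one character at a time. In this sketch `in_set` and `not_xyz` are hypothetical helpers.

```cpp
#include <regex>
#include <string>

// True when `c` is a, b, c, or a digit from 1 to 5.
bool in_set(char c) {
    static const std::regex e("[abc1-5]");
    return std::regex_match(std::string(1, c), e);
}

// True when `c` is any character other than x, y or z.
bool not_xyz(char c) {
    static const std::regex e("[^xyz]");
    return std::regex_match(std::string(1, c), e);
}
```

So `in_set('3')` holds while `in_set('6')` does not, and `not_xyz('a')` holds while `not_xyz('y')` does not.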

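C++ regex also supports POSIX-style character class names, which are written inside a bracket expression, for example [[:alpha:]]: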

class	description	equivalent (with regex_traits, default locale)
[:alnum:]	alpha-numerical character	isalnum
[:alpha:]	alphabetic character	isalpha
[:blank:]	blank character	isblank
[:cntrl:]	control character	iscntrl
[:digit:]	decimal digit character	isdigit
[:graph:]	character with graphical representation	isgraph
[:lower:]	lowercase letter	islower
[:print:]	printable character	isprint
[:punct:]	punctuation mark character	ispunct
[:space:]	whitespace character	isspace
[:upper:]	uppercase letter	isupper
[:xdigit:]	hexadecimal digit character	isxdigit
[:d:]	decimal digit character	isdigit
[:w:]	word character	isalnum or '_'
[:s:]	whitespace character	isspace
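A small sketch of the class names in use. The helpers `is_identifier` and `is_hex` are hypothetical, and the identifier rule (a letter or underscore followed by word characters) is only an illustrative assumption.

```cpp
#include <regex>
#include <string>

// A letter or underscore, followed by any number of word characters.
bool is_identifier(const std::string& s) {
    static const std::regex e("[[:alpha:]_][[:w:]]*");
    return std::regex_match(s, e);
}

// One or more hexadecimal digits.
bool is_hex(const std::string& s) {
    static const std::regex e("[[:xdigit:]]+");
    return std::regex_match(s, e);
}
```

Note the doubled brackets: the outer pair is the bracket expression and the inner [:name:] selects the class.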

Email regex example

#include <iostream>
#include <cstdlib>
#include <string>
#include <regex>  // regular expressions

using namespace std;

int main ( )
{
    string email_address;
    string user_name, domain_name;

    regex pattern("([0-9A-Za-z\\-_\\.]+)@([0-9a-z]+\\.[a-z]{2,3}(\\.[a-z]{2})?)");
    // How the pattern is built:
    // Group 1 (the user name): any of the characters 0-9, A-Z, a-z, underscore,
    // dot or hyphen, repeated one or more times
    // In the middle, a single "@"
    // Group 2 (the domain name): any of 0-9 or a-z repeated one or more times,
    // then a dot, then any of a-z repeated 2 to 3 times (e.g. com or cn),
    // then an inner group: a dot followed by any of a-z repeated twice (e.g. cn or fr);
    // the inner group as a whole occurs zero or one time


    // End the loop with end-of-file (Ctrl+Z on Windows, Ctrl+D on UNIX)
    while ( cin >> email_address ) 
    {
        if ( regex_match( email_address, pattern ) )
        {
            cout << "The email address you entered is valid" << endl;

            // extract group 1
            user_name = regex_replace( email_address, pattern, string("$1") );

            // extract group 2
            domain_name = regex_replace( email_address, pattern, string("$2") );

            cout << "User name: " << user_name << endl;
            cout << "Domain name: " << domain_name << endl;
            cout << endl;
        }
        else
        {
            cout << "The email address you entered is invalid" << endl << endl;
        }
    }
    return EXIT_SUCCESS;
}
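The program above extracts the groups by running regex_replace twice after regex_match. As an alternative sketch, a single regex_match with an smatch can validate the address and capture both groups in one pass. The `EmailParts` struct and `parse_email` function are hypothetical names; the pattern is the one from the example.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this illustration.
struct EmailParts {
    bool ok;             // true when the address matched the pattern
    std::string user;    // capture group 1
    std::string domain;  // capture group 2
};

// Validate `address` and pull out both groups with a single regex_match call.
EmailParts parse_email(const std::string& address) {
    static const std::regex pattern(
        "([0-9A-Za-z\\-_\\.]+)@([0-9a-z]+\\.[a-z]{2,3}(\\.[a-z]{2})?)");
    std::smatch m;
    if (!std::regex_match(address, m, pattern)) return {false, "", ""};
    return {true, m.str(1), m.str(2)};
}
```

For "john.doe@example.com" this gives the user name john.doe and the domain example.com. Making the pattern a function-local static means it is compiled only once rather than on every call.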