Python Regular Expressions: Character Sets and Their Internal Ordering

1. Basics

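Square brackets [] define a character set in a regular expression. You customize what the set matches by listing characters inside the brackets.

    import re

    # Match any single character that is A, B, or C
    regex1 = re.compile('[ABC]')
    message1 = "Hello. This is ABC club. A man will serve you then."
    print(regex1.findall(message1))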

Output: ['A', 'B', 'C', 'A']

The regular expression matches any single character listed inside the brackets.
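As a small sketch of that idea: for single characters, the set [ABC] finds the same matches as the alternation A|B|C. The variable names below are only illustrative and do not come from the original script.

```python
import re

text = "Hello. This is ABC club. A man will serve you then."

# A character set matches exactly one character out of those listed.
by_set = re.findall(r"[ABC]", text)

# For single characters, an alternation finds the same matches.
by_alternation = re.findall(r"A|B|C", text)

print(by_set)                    # ['A', 'B', 'C', 'A']
print(by_set == by_alternation)  # True
```

The set form is usually easier to read once more than two or three characters are involved.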

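When there are many characters to match, for example every uppercase letter, you can write a range instead (this example is only meant to demonstrate character set syntax):

    # Match any single uppercase letter from A to Z
    regex2 = re.compile('[A-Z]')
    message2 = "It's my DUTY to save your SOUL."
    print(regex2.findall(message2))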

Output: ['I', 'D', 'U', 'T', 'Y', 'S', 'O', 'U', 'L']

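To match additional kinds of characters as well, for example digits together with uppercase letters:

    # Match any single digit or uppercase letter
    regex3 = re.compile('[0-9A-Z]')
    message3 = "There are 7 GOBLINS and 2 dwarves in my house! Help!"
    print(regex3.findall(message3))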

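Output: ['T', '7', 'G', 'O', 'B', 'L', 'I', 'N', 'S', '2', 'H']

The ranges 0-9 and A-Z are written back to back with no separator between them. To match a literal minus sign '-', use the escaped form '\-' instead.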
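The following sketch illustrates the hyphen rule on a made-up sample string. Besides the escaped form, a hyphen placed first (or last) in the set is also read literally, because it cannot be joining two endpoints there.

```python
import re

text = "Room 12-B, floor 3-4."

# An escaped hyphen is a literal character inside the set.
escaped = re.findall(r"[0-9\-]", text)

# A hyphen at the very start of the set is also taken literally.
leading = re.findall(r"[-0-9]", text)

print(escaped)             # ['1', '2', '-', '3', '-', '4']
print(escaped == leading)  # True
```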

2. Internal Ordering

While exploring character sets further, a question came up: is a range like [A-z] allowed?

The answer is yes. Something unexpected happens, though:

    # A range that runs from an uppercase letter to a lowercase letter
    regex4 = re.compile('[A-z]')
    message4 = "ABC Z [^] abc z"
    print(regex4.findall(message4))

    regex5 = re.compile('[M-m]')
    message5 = "KLMNO...klmno"
    print(regex5.findall(message5))

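Output: ['A', 'B', 'C', 'Z', '[', '^', ']', 'a', 'b', 'c', 'z']
['M', 'N', 'O', 'k', 'l', 'm']

Hmm... some symbols seem to have slipped into the first result.

This suggests a tentative conclusion: a range of the form [(c1)-(c2)] follows ASCII order, and it matches every character from c1 through c2, including c1 and c2 themselves.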
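To see which symbols sit between the uppercase and lowercase letters, here is a minimal sketch that lists the characters between two endpoints by code point. The helper chars_in_range is a hypothetical name introduced only for illustration; ord() returns the Unicode code point, which for these characters equals the ASCII value.

```python
import re


def chars_in_range(c1, c2):
    """Return every character from c1 through c2, inclusive, by code point."""
    return [chr(code) for code in range(ord(c1), ord(c2) + 1)]


# The characters from 'Z' (90) through 'a' (97)
between = chars_in_range("Z", "a")
print(between)  # ['Z', '[', '\\', ']', '^', '_', '`', 'a']

# Every one of them is accepted by the range [A-z].
upper_to_lower = re.compile(r"[A-z]")
print(all(upper_to_lower.fullmatch(ch) for ch in between))  # True
```

This is why [A-z] is not a shorthand for "any letter": it also accepts the six symbols between Z and a. Writing [A-Za-z] avoids them.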

Let's verify this:

    # A range whose endpoints are a symbol and an uppercase letter
    regex6 = re.compile('[#-E]')
    message6 = "Awesome! The last 5% M&M Candies (10 candies) are nice; I'll enjoy them!"
    print(regex6.findall(message6))

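Output: ['A', '5', '%', '&', 'C', '(', '1', '0', ')', ';', "'"] (the last element is a single quote)

Checking the ASCII table:
35 - #, 69 - E

33 - !, 37 - %, 38 - &, 39 - ', 40 - (, 41 - )
48 - 0, 49 - 1, 53 - 5
59 - ;
65 - A, 67 - C, 73 - I

Every matched character lies between 35 and 69, while ! (33) and I (73) fall outside that range and are not matched.

So we can be fairly sure: within a character set, characters joined by '-' follow ASCII order. That is, for [(c1)-(c2)], when the ASCII value of c1 <= that of c2, the range matches the characters from c1 through c2.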
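The same check can be done in code rather than by reading the table. This sketch, using illustrative variable names, selects characters with a plain code-point comparison and confirms that it agrees with the regular expression.

```python
import re

message = "Awesome! The last 5% M&M Candies (10 candies) are nice; I'll enjoy them!"

matched = re.findall(r"[#-E]", message)

# Endpoints of the range as code points: '#' is 35 and 'E' is 69.
low, high = ord("#"), ord("E")

# Select the same characters with a plain numeric comparison.
by_code_point = [ch for ch in message if low <= ord(ch) <= high]

print(matched == by_code_point)  # True
```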

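One more check: what happens if the order in [(c1)-(c2)] is reversed, so that the ASCII value of c1 is greater than that of c2?

    # A reversed range: 'a' (97) comes after 'Z' (90)
    regex7 = re.compile('[a-Z]')
    message7 = "Maybe it's wrong..."
    print(regex7.findall(message7))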

This produces a long error message:

Traceback (most recent call last):
  File "G:/work/examples/eg_7_2.py", line 27, in <module>
    regex7 = re.compile('[a-Z]')
  File "D:\Python37\lib\re.py", line 234, in compile
    return _compile(pattern, flags)
  File "D:\Python37\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "D:\Python37\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "D:\Python37\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "D:\Python37\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "D:\Python37\lib\sre_parse.py", line 580, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range a-Z at position 1

Process finished with exit code 1

The key line is the last one of the traceback: a-Z violates the ordering required inside a character set (a is 97 and Z is 90, so the ASCII value of a is greater than that of Z), and compiling the pattern fails.
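If a pattern comes from user input or configuration, the failure can be caught instead of ending the program. The sketch below wraps re.compile in a hypothetical helper, try_compile, which is not part of the original script; re.error is the exception the re module raises for an invalid pattern.

```python
import re


def try_compile(pattern):
    """Compile pattern, returning None (and reporting why) if it is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print(f"invalid pattern {pattern!r}: {exc}")
        return None


# Reversed range: 'a' (97) is greater than 'Z' (90), so compilation fails.
print(try_compile("[a-Z]") is None)      # True

# Correct order: 'Z' (90) is less than 'a' (97), so compilation succeeds.
print(try_compile("[Z-a]") is not None)  # True
```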

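Conclusion: within a character set, two characters joined by '-' must follow ASCII order, with the smaller value first and the larger value second.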
