在数据处理的过程中,使用正则的场景主要是查找、提取和替换,对基本语法进行了学习,并在常用sql环境(impala/hive/oracle)和Python中进行了应用。

背景

应用正则表达式来查找、提取和替换信息是常用操作,但之前都是现查现用,这次通过系统整理来刻意练习。

基本语法

用途 命令 释义 备注
元字符 . 任意单一字符  
元字符 ^ 匹配字符串开头  
元字符 $ 匹配字符串结尾  
元字符 * 匹配同一个字符或字符集合的零次或多次重复 懒惰型[]*?
元字符 + 匹配同一个字符或字符集合的一次或多次重复 懒惰型[]+?
元字符 ? 匹配同一个字符或字符集合的零次或一次出现  
元字符 {} 重复次数,可以为准确值、最小值或区间 懒惰型{}?
元字符 [] 字符集 例,字母数字集合[a-zA-Z0-9],[^charset]为字符集取反
元字符 \ 转义符  
元字符 |  
元字符 () 子表达式  
匹配 \d 任何一个数字字符 等价于[0-9]
匹配 \D 任何一个非数字字符 等价于[^0-9]
匹配 \s 任何一个空白字符 等价于[ \f\n\r\t\v]
匹配 \S 任何一个非空白字符 等价于[^ \f\n\r\t\v]
匹配 \w 任何一个字母数字字符或下划线字符 等价于[a-zA-Z0-9_]
匹配 \W 任何一个非字母数字字符或非下划线字符 等价于[^a-zA-Z0-9_]
特殊字符 \b 字边界,仅在单次的开头或结尾处匹配 零宽度断言
特殊字符 \B 仅在当前位置不在字边界时匹配 零宽度断言
特殊字符 \f 换页符  
特殊字符 \n 换行符  
特殊字符 \r 回车符  
特殊字符 \t 制表符  
特殊字符 \v 垂直制表符  

示例,电子邮件\w+[\w.]*@[\w.]+\.\w+
网址 https?://[\w./]+
子表达式(19|20)\d{2}
子表达式(\d{1,3}\.){3}\d{1,3}

应用

sql

语法基本相近,尤其是hive和impala,oracle稍有差异。

引擎 用途 语法 返回
impala 查找 regexp_like(STRING source_string, STRING pattern[, STRING match_parameter]) 条件判断的布尔值,用于where或case when
impala 提取 regexp_extract(STRING source_string, STRING pattern, INT index) STRING
impala 替换 regexp_replace(STRING source_string, STRING pattern, STRING replace_string) STRING
hive 查找 regexp(STRING source_string, STRING pattern) 条件判断的布尔值,用于where或case when
hive 提取 regexp_extract(STRING source_string, STRING pattern, INT index) STRING
hive 替换 regexp_replace(STRING source_string, STRING pattern, STRING replace_string) STRING
oracle 查找 regexp_like(STRING source_string, STRING pattern[, STRING match_parameter]) 条件判断的布尔值,用于where或case when
oracle 查找 regexp_instr(STRING source_string, STRING pattern[, INT position=1, INT occurrence=1, INT return_option=0, STRING match_parameter]) INT
oracle 提取 regexp_substr(STRING source_string, STRING pattern[, INT position, INT occurrence, STRING match_parameter]) STRING
oracle 替换 regexp_replace(STING source_string, STRING pattern[, STRING replace_string, INT position, INT occurrence, STRING match_parameter]) STRING

释义

index:
0: matches the full pattern string, including the portion outside any () group
1: matches just the contents inside the first () group in the pattern string

match_parameter:
'i' specifies case-insensitive matching, even if the determined collation of the condition is case-sensitive.
'c' specifies case-sensitive and accent-sensitive matching, even if the determined collation of the condition is case-insensitive or accent-insensitive.

Python

对象 方法 用途 备注
regular expression objects match() 确定正则是否从字符串的开头匹配。  
regular expression objects search() 扫描字符串,查找此正则匹配的任何位置。  
regular expression objects findall() 找到正则匹配的所有子字符串,并将它们作为列表返回。  
regular expression objects finditer() 找到正则匹配的所有子字符串,并将它们返回为一个 iterator。  
regular expression objects split() 将字符串拆分为一个列表,在正则匹配的任何地方将其拆分  
regular expression objects sub() 找到正则匹配的所有子字符串,并用不同的字符串替换它们  
regular expression objects subn() 与 sub() 相同,但返回新字符串和替换次数  
match objects group() 返回正则匹配的字符串  
match objects start() 返回匹配的开始位置  
match objects end() 返回匹配的结束位置  
match objects span() 返回包含匹配 (start, end) 位置的元组  
# regular expression object
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)
regex
re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE|re.UNICODE)
# regular expression object search method, find first match
# return match object 
m = regex.search(text)
print(m)
text[m.start():m.end()]
<re.Match object; span=(5, 20), match='[email protected]'>





'[email protected]'
# generate match object iterator
for m in regex.finditer(text):
    print(m)
<re.Match object; span=(5, 20), match='[email protected]'>
<re.Match object; span=(27, 42), match='[email protected]'>
<re.Match object; span=(47, 60), match='[email protected]'>
<re.Match object; span=(66, 80), match='[email protected]'>
# find all
regex.findall(text)
# regular expression object match method
# return match object 
print(regex.match(text))
None
# split
print(regex.split(text))
['Dave ', '\nSteve ', '\nRob ', '\nRyan ', '']
# replace 
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
regex = re.compile(r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})", flags=re.IGNORECASE)
m = regex.match('[email protected]')
print('match.groups: ', m.groups())
print('match.group(): ', m.group())
print('match.group(0): ', m.group(0))
print('match.group(3, 2, 1): ', m.group(3, 2, 1))
match.groups:  ('wesm', 'bright', 'net')
match.group():  [email protected]
match.group(0):  [email protected]
match.group(3, 2, 1):  ('net', 'bright', 'wesm')
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r"Username: \1, Domin: \2, Suffix: \3", text))
Dave Username: dave, Domin: google, Suffix: com
Steve Username: steve, Domin: gmail, Suffix: com
Rob Username: rob, Domin: gmail, Suffix: com
Ryan Username: ryan, Domin: yahoo, Suffix: com
text = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""
entries = re.split('\n+', text)
entries
['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']
[re.split(':? ', entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
text = "He was carefully disguised but captured quickly by police."

for m in re.finditer(r"\w+ly\b", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

参考

  1. Forta, Ben. Sams Teach Yourself Regular Expressions in 10 Minutes. Sams, 2004.
  2. impala官方文档
  3. oracle官方文档

附件

oracle示例

SELECT first_name, last_name
FROM employees
WHERE REGEXP_LIKE (first_name, '^Ste(v|ph)en$')
ORDER BY first_name, last_name;

FIRST_NAME           LAST_NAME
-------------------- -------------------------
Steven               King
Steven               Markle
Stephen              Stiles

SELECT
  REGEXP_INSTR('500 Oracle Parkway, Redwood Shores, CA',
               '[s|r|p][[:alpha:]]{6}', 3, 2, 1, 'i') "REGEXP_INSTR"
  FROM DUAL;

REGEXP_INSTR
------------
          28

SELECT
  REGEXP_SUBSTR('http://www.example.com/products',
                'http://([[:alnum:]]+\.?){3,4}/?') "REGEXP_SUBSTR"
  FROM DUAL;

REGEXP_SUBSTR
----------------------
http://www.example.com/

SELECT
  REGEXP_REPLACE('500   Oracle     Parkway,    Redwood  Shores, CA',
                 '( ){2,}', ' ') "REGEXP_REPLACE"
  FROM DUAL;

REGEXP_REPLACE
--------------------------------------
500 Oracle Parkway, Redwood Shores, CA

版本记录

2024-05-06,初稿;
2024-10-25,增加Python正则;