正则表达式基本语法及在sql和Python中的应用
在数据处理的过程中,使用正则的场景主要是查找、提取和替换,对基本语法进行了学习,并在常用sql环境(impala/hive/oracle)和Python中进行了应用。
背景
应用正则表达式来查找、提取和替换信息是常用操作,但之前都是现查现用,这次通过系统整理来刻意练习。
基本语法
用途 | 命令 | 释义 | 备注 |
---|---|---|---|
元字符 | . | 任意单一字符 | |
元字符 | ^ | 匹配字符串开头 | |
元字符 | $ | 匹配字符串结尾 | |
元字符 | * | 匹配同一个字符或字符集合的零次或多次重复 | 懒惰型[]*? |
元字符 | + | 匹配同一个字符或字符集合的一次或多次重复 | 懒惰型[]+? |
元字符 | ? | 匹配同一个字符或字符集合的零次或一次出现 | |
元字符 | {} | 重复次数,可以为准确值、最小值或区间 | 懒惰型{}? |
元字符 | [] | 字符集 | 例,字母数字集合[a-zA-Z0-9],[^charset]为字符集取反 |
元字符 | \ | 转义符 | |
元字符 | | | 或 | |
元字符 | () | 子表达式 | |
匹配 | \d | 任何一个数字字符 | 等价于[0-9] |
匹配 | \D | 任何一个非数字字符 | 等价于[^0-9] |
匹配 | \s | 任何一个空白字符 | 等价于[ \f\n\r\t\v] |
匹配 | \S | 任何一个非空白字符 | 等价于[^ \f\n\r\t\v] |
匹配 | \w | 任何一个字母数字字符或下划线字符 | 等价于[a-zA-Z0-9_] |
匹配 | \W | 任何一个非字母数字字符或非下划线字符 | 等价于[^a-zA-Z0-9_] |
特殊字符 | \b | 字边界,仅在单次的开头或结尾处匹配 | 零宽度断言 |
特殊字符 | \B | 仅在当前位置不在字边界时匹配 | 零宽度断言 |
特殊字符 | \f | 换页符 | |
特殊字符 | \n | 换行符 | |
特殊字符 | \r | 回车符 | |
特殊字符 | \t | 制表符 | |
特殊字符 | \v | 垂直制表符 |
示例,电子邮件\w+[\w.]*@[\w.]+\.\w+
网址 https?://[\w./]+
子表达式(19|20)\d{2}
子表达式(\d{1,3}\.){3}\d{1,3}
应用
sql
语法基本相近,尤其是hive和impala,oracle稍有差异。
引擎 | 用途 | 语法 | 返回 |
---|---|---|---|
impala | 查找 | regexp_like(STRING source_string, STRING pattern[, STRING match_parameter]) | 条件判断的布尔值,用于where或case when |
impala | 提取 | regexp_extract(STRING source_string, STRING pattern, INT index) | STRING |
impala | 替换 | regexp_replace(STRING source_string, STRING pattern, STRING replace_string) | STRING |
hive | 查找 | regexp(STRING source_string, STRING pattern) | 条件判断的布尔值,用于where或case when |
hive | 提取 | regexp_extract(STRING source_string, STRING pattern, INT index) | STRING |
hive | 替换 | regexp_replace(STRING source_string, STRING pattern, STRING replace_string) | STRING |
oracle | 查找 | regexp_like(STRING source_string, STRING pattern[, STRING match_parameter]) | 条件判断的布尔值,用于where或case when |
oracle | 查找 | regexp_instr(STRING source_string, STRING pattern[, INT position=1, INT occurrence=1, INT return_option=0, STRING match_parameter]) | INT |
oracle | 提取 | regexp_substr(STRING source_string, STRING pattern[, INT position, INT occurrence, STRING match_parameter]) | STRING |
oracle | 替换 | regexp_replace(STING source_string, STRING pattern[, STRING replace_string, INT position, INT occurrence, STRING match_parameter]) | STRING |
释义
index:
0: matches the full pattern string, including the portion outside any () group
1: matches just the contents inside the first () group in the pattern string
match_parameter:
'i' specifies case-insensitive matching, even if the determined collation of the condition is case-sensitive.
'c' specifies case-sensitive and accent-sensitive matching, even if the determined collation of the condition is case-insensitive or accent-insensitive.
Python
对象 | 方法 | 用途 | 备注 |
---|---|---|---|
regular expression objects | match() | 确定正则是否从字符串的开头匹配。 | |
regular expression objects | search() | 扫描字符串,查找此正则匹配的任何位置。 | |
regular expression objects | findall() | 找到正则匹配的所有子字符串,并将它们作为列表返回。 | |
regular expression objects | finditer() | 找到正则匹配的所有子字符串,并将它们返回为一个 iterator。 | |
regular expression objects | split() | 将字符串拆分为一个列表,在正则匹配的任何地方将其拆分 | |
regular expression objects | sub() | 找到正则匹配的所有子字符串,并用不同的字符串替换它们 | |
regular expression objects | subn() | 与 sub() 相同,但返回新字符串和替换次数 | |
match objects | group() | 返回正则匹配的字符串 | |
match objects | start() | 返回匹配的开始位置 | |
match objects | end() | 返回匹配的结束位置 | |
match objects | span() | 返回包含匹配 (start, end) 位置的元组 |
import re
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]"""
# regular expression object
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)
regex
re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE|re.UNICODE)
# regular expression object search method, find first match
# return match object
m = regex.search(text)
print(m)
text[m.start():m.end()]
<re.Match object; span=(5, 20), match='[email protected]'>
'[email protected]'
# generate match object iterator
for m in regex.finditer(text):
print(m)
<re.Match object; span=(5, 20), match='[email protected]'>
<re.Match object; span=(27, 42), match='[email protected]'>
<re.Match object; span=(47, 60), match='[email protected]'>
<re.Match object; span=(66, 80), match='[email protected]'>
# find all
regex.findall(text)
# regular expression object match method
# return match object
print(regex.match(text))
None
# split
print(regex.split(text))
['Dave ', '\nSteve ', '\nRob ', '\nRyan ', '']
# replace
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
regex = re.compile(r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})", flags=re.IGNORECASE)
m = regex.match('[email protected]')
print('match.groups: ', m.groups())
print('match.group(): ', m.group())
print('match.group(0): ', m.group(0))
print('match.group(3, 2, 1): ', m.group(3, 2, 1))
match.groups: ('wesm', 'bright', 'net')
match.group(): [email protected]
match.group(0): [email protected]
match.group(3, 2, 1): ('net', 'bright', 'wesm')
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
print(regex.sub(r"Username: \1, Domin: \2, Suffix: \3", text))
Dave Username: dave, Domin: google, Suffix: com
Steve Username: steve, Domin: gmail, Suffix: com
Rob Username: rob, Domin: gmail, Suffix: com
Ryan Username: ryan, Domin: yahoo, Suffix: com
text = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place"""
entries = re.split('\n+', text)
entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']
[re.split(':? ', entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly\b", text):
print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
参考
- Forta, Ben. Sams Teach Yourself Regular Expressions in 10 Minutes. Sams, 2004.
- impala官方文档
- oracle官方文档
附件
oracle示例
SELECT first_name, last_name
FROM employees
WHERE REGEXP_LIKE (first_name, '^Ste(v|ph)en$')
ORDER BY first_name, last_name;
FIRST_NAME LAST_NAME
-------------------- -------------------------
Steven King
Steven Markle
Stephen Stiles
SELECT
REGEXP_INSTR('500 Oracle Parkway, Redwood Shores, CA',
'[s|r|p][[:alpha:]]{6}', 3, 2, 1, 'i') "REGEXP_INSTR"
FROM DUAL;
REGEXP_INSTR
------------
28
SELECT
REGEXP_SUBSTR('http://www.example.com/products',
'http://([[:alnum:]]+\.?){3,4}/?') "REGEXP_SUBSTR"
FROM DUAL;
REGEXP_SUBSTR
----------------------
http://www.example.com/
SELECT
REGEXP_REPLACE('500 Oracle Parkway, Redwood Shores, CA',
'( ){2,}', ' ') "REGEXP_REPLACE"
FROM DUAL;
REGEXP_REPLACE
--------------------------------------
500 Oracle Parkway, Redwood Shores, CA
版本记录
2024-05-06,初稿;
2024-10-25,增加Python正则;