3 回答

TA贡献1789条经验 获得超8个赞
这种方法使用 pyparsing 来实际解析 SQL 语句(在示例中显示的子集语法中):
import pyparsing as pp
ppc = pp.pyparsing_common
ident = ppc.identifier
CREATE, INSERT, INTO, TABLE, AS, SELECT, FROM, JOIN = \
map(pp.CaselessKeyword, "CREATE INSERT INTO TABLE AS SELECT FROM JOIN".split())
select_stmt = (SELECT
+ (pp.delimitedList(ident) | '*')("columns")
+ FROM
+ ((ident + JOIN.suppress() + ident) | ident)("tables"))
src_target_stmt = ((CREATE + TABLE | INSERT + INTO)("action")
+ ident("target")
+ AS
+ pp.Group(select_stmt)("source"))
tests = """
CREATE TABLE TAR_TAB1 AS SELECT * FROM SRC_TAB1 JOIN SRC_TAB2
CREATE TABLE TAR_TAB2 AS SELECT * FROM SRC_TAB3 JOIN SRC_TAB4
INSERT INTO TAR_TAB3 AS SELECT COL1,COL2 FROM SRC_TAB5 JOIN SRC_TAB6
"""
# useful for debugging
#src_target_stmt.runTests(tests)
# dump parsed values out as CSV output
for t in tests.splitlines():
if not t.strip():
continue
result = src_target_stmt.parseString(t)
target = result.target
action = result.action[0]
for src in result.source.tables:
print("{},{},{}".format(action, target, src))
印刷:
CREATE,TAR_TAB1,SRC_TAB1
CREATE,TAR_TAB1,SRC_TAB2
CREATE,TAR_TAB2,SRC_TAB3
CREATE,TAR_TAB2,SRC_TAB4
INSERT,TAR_TAB3,SRC_TAB5
INSERT,TAR_TAB3,SRC_TAB6
随着您继续您的项目并发现新的需求(需要提取 SQL 操作,如本例所示,或者您发现您需要在解析中理解的 SQL 结构的其他变体),扩展解析器将更容易和更易于维护比扩展正则表达式。

TA贡献1863条经验 获得超2个赞
这是一个解决方案:
targets = re.findall(r'(?:CREATE\s+TABLE|INSERT\s+INTO)\s+([a-z0-9A-Z_]+)\s+AS', text)
sources = re.findall(r'SELECT\s+\*\s+FROM\s([a-z0-9A-Z_]+)\s+JOIN\s+([a-z0-9A-Z_]+)', text)
# each target has multiple sources, so repeat each target n times per
# number of sources.
lens = [len(src) for src in sources]
targets = np.repeat(targets, lens)
# 'flatten' the list of sources from [(t1, t2), (t3, t4)] to
# [t1, t2, t3, t4]
sources = [tab for exp in sources for tab in exp]
pd.DataFrame({"src": sources, "tgt": targets})
结果:
src tgt
0 SRC_TAB1 TAR_TAB1
1 SRC_TAB2 TAR_TAB1
2 SRC_TAB3 TAR_TAB2
3 SRC_TAB4 TAR_TAB2
4 SRC_TAB5 TAR_TAB3
5 SRC_TAB6 TAR_TAB3

TA贡献1844条经验 获得超8个赞
| : A|B,其中 A 和 B 可以是任意 RE,创建一个匹配 A 或 B 的正则表达式。
import re
import pandas as pd
Q1 = 'CREATE TABLE TAR_TAB1 AS SELECT * FROM SRC_TAB1 JOIN SRC_TAB2'
Q2 = 'CREATE TABLE TAR_TAB2 AS SELECT * FROM SRC_TAB3 JOIN SRC_TAB4'
Q3 = 'CREATE TABLE TAR_TAB3 AS SELECT * FROM SRC_TAB5 JOIN SRC_TAB6'
requests = [Q1, Q2, Q3]
target_filter = r'\w+(?=\s+AS)'
source_filter1 = r'(?<=FROM )\S+'
source_filter2 = r'(?<=JOIN )\S+'
regex_filter = target_filter + '|' + source_filter1 + '|' + source_filter2
results = [re.findall(regex_filter, Q) for Q in requests]
print(results)
# [['TAR_TAB1', 'SRC_TAB1', 'SRC_TAB2'], ['TAR_TAB2', 'SRC_TAB3', 'SRC_TAB4'], ['TAR_TAB3', 'SRC_TAB5', 'SRC_TAB6']]
要转换数据框中的结果列表:
df = pd.DataFrame(results, columns =['target', 'source1', 'source2'])
target source1 source2
0 TAR_TAB1 SRC_TAB1 SRC_TAB2
1 TAR_TAB2 SRC_TAB3 SRC_TAB4
2 TAR_TAB3 SRC_TAB5 SRC_TAB6
或者,如果您想以不同的方式呈现它:
l = []
for res in results:
l.append([res[0], res[1]])
l.append([res[0], res[2]])
df = pd.DataFrame(l, columns =['target', 'source'])
接着 :
target source
0 TAR_TAB1 SRC_TAB1
1 TAR_TAB1 SRC_TAB2
2 TAR_TAB2 SRC_TAB3
3 TAR_TAB2 SRC_TAB4
4 TAR_TAB3 SRC_TAB5
5 TAR_TAB3 SRC_TAB6
添加回答
举报