3 Answers
Contributed 1921 experience points · received more than 9 likes
A solution using re:
import re

# Named "lines" rather than "input" so the built-in input() is not shadowed.
lines = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

def extract(s):
    # Group 1 is "X=<number>"; group 2 is the text after the next "]".
    match = re.search(r"(X=\d+(?:\.\d*)?).*?\](.*?)$", s)
    return match.groups()

output = [extract(item) for item in lines]
print(output)
Output:
[
('X=250.44', 'DECEMBER 31,'),
('X=307.5', 'respectively. The net decrease in the revenue'),
('X=49.5', '(US$ in millions)'),
]
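Explanation:
\d ... a digit
\d+ ... one or more digits
(?:...) ... non-capturing ("normal") parentheses
\.\d* ... a dot followed by zero or more digits
(?:\.\d*)? ... an optional (zero or one) "fractional part"
(X=\d+(?:\.\d*)?) ... first group, X=number
.*? ... zero or more of any character (non-greedy)
\] ... the ] symbol
$ ... end of the string
\](.*?)$ ... second group, anything between the ] and the end of the string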
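The breakdown above can also be written straight into the pattern. The following is a minimal sketch, assuming the same regex as in this answer, that uses re.VERBOSE so each piece carries its own comment; the name X_AND_TEXT is made up for this illustration.

```python
import re

# Same pattern as in the answer, laid out with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
# X_AND_TEXT is a hypothetical name chosen for this sketch.
X_AND_TEXT = re.compile(
    r"""
    (X=\d+(?:\.\d*)?)   # group 1: X= plus a number with an optional fractional part
    .*?                 # as few characters as possible ...
    \]                  # ... up to the next closing bracket
    (.*?)$              # group 2: everything from there to the end of the string
    """,
    re.VERBOSE,
)

line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
match = X_AND_TEXT.search(line)
print(match.groups())  # ('X=250.44', 'DECEMBER 31,')
```

Compiling the pattern once also avoids re-parsing it for every line when many lines are processed.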
Contributed 1827 experience points · received more than 8 likes
Try this:
(X=[^,]*)(?:.*])(.*)
import re
source = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)""".split('\n')
pattern = r"(X=[^,]*)(?:.*])(.*)"
for line in source:
    print(re.search(pattern, line).groups())
Output:
('X=250.44', 'DECEMBER 31,')
('X=307.5', 'respectively. The net decrease in the revenue')
('X=49.5', '(US$ in millions)')
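Your expected output had X= at the front of every captured value, so I simply included it in the capture group. If that matters, feel free to move it into a non-capturing group instead.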
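If only the number is wanted, one possible variant (a sketch, not part of the original answer) leaves X= outside the capture group. Writing it as a plain prefix or as (?:X=) is equivalent here; the names x_value and text are chosen for illustration.

```python
import re

# Variant of the pattern above: "X=" is matched but not captured,
# so group 1 holds only the number.
pattern = r"X=([^,]*)(?:.*])(.*)"

line = "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue"

# x_value and text are hypothetical names for the two captured groups.
x_value, text = re.search(pattern, line).groups()
print(x_value, "|", text)
```

Note that the greedy .*] runs to the last ] on the line, so this pattern assumes the trailing text itself contains no ].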
Contributed 1868 experience points · received more than 4 likes
Use a regex with named groups to capture the relevant bits:
>>> import re
>>> line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
>>> m = re.search(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$', line)
>>> m.groups()
('250.44', 'DECEMBER 31,')
>>> m['x_coord']
'250.44'
>>> m['text']
'DECEMBER 31,'
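To apply the named-group pattern to many lines, something like the following sketch may help. It assumes the same regex as above; the helper parse_line, the conversion of x_coord to float, and the handling of non-matching lines are additions made for this illustration.

```python
import re

# Pattern from the answer above, compiled once for reuse.
LINE_RE = re.compile(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$')

def parse_line(line):
    """Return {'x_coord': float, 'text': str}, or None if the line does not match.

    parse_line is a hypothetical helper; converting x_coord to float is an
    assumption about how the value will be used.
    """
    m = LINE_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["x_coord"] = float(fields["x_coord"])
    return fields

lines = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
    "a line without coordinates",
]

parsed = [parse_line(line) for line in lines]
for item in parsed:
    print(item)
```

Returning None for lines that do not match avoids the AttributeError that calling .groups() on a failed search would raise.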