
Getting a variable inside a JavaScript function with BeautifulSoup, Python, and regex

Python

噜噜哒 2022-06-22 16:22:15

A JavaScript function defines an array named images, which needs to be extracted from the page source as a string and converted into a Python list object. Python with BeautifulSoup is used for the parsing.

var images = [
    {
        src: "http://example.com/bar/001.jpg",
        title: "FooBar One"
    },
    {
        src: "http://example.com/bar/002.jpg",
        title: "FooBar Two"
    },
]
;

Question: Why does my code below fail to capture this images array, and how can it be fixed? Thanks!

Desired output: a Python list object.

[
    {
        src: "http://example.com/bar/001.jpg",
        title: "FooBar One"
    },
    {
        src: "http://example.com/bar/002.jpg",
        title: "FooBar Two"
    },
]

Actual code

import re
from bs4 import BeautifulSoup

# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

pattern = re.compile('var images = (.*?);')

soup = BeautifulSoup(html, 'lxml')
scripts = soup.find_all('script')  # successfully captures the <script> element

for script in scripts:
    data = pattern.match(str(script.string))  # NOT extracting the array!!
    if data:
        print('Found:', data.groups()[0])  # NOT being printed


4 Answers

白衣非少年

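Contributed 1155 experience points, received 0+ likes

You can use a shorter lazy regular expression together with the hjson library, which handles the unquoted keys.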

import re, hjson

html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
'''

# Lazy match up to the first semicolon; DOTALL lets `.` span the line breaks
p = re.compile(r'var images = (.*?);', re.DOTALL)
# hjson accepts the unquoted keys and trailing commas that json would reject
data = hjson.loads(p.findall(html)[0])
print(data)
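If installing hjson is not an option, the same conversion can be approximated with the standard library alone. The following is only a sketch under assumptions: `js_literal_to_python` is a hypothetical helper name, and its two substitutions would misfire on string values that themselves contain text such as `, word:` or a comma directly before a closing bracket. It is a swapped-in technique, not what the hjson answer does internally.

```python
import json
import re

# The array text as it would be captured by the lazy regex above.
js_array = '''[
    {
        src: "http://example.com/bar/001.jpg",
        title: "FooBar One"
    },
    {
        src: "http://example.com/bar/002.jpg",
        title: "FooBar Two"
    },
]'''


def js_literal_to_python(text):
    """Hypothetical helper: turn a simple JS array literal into Python data."""
    # Quote bare keys: an identifier that follows `{` or `,` and precedes `:`.
    quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
    # Drop trailing commas that sit right before a closing `]` or `}`.
    no_trailing = re.sub(r',\s*([\]}])', r'\1', quoted)
    return json.loads(no_trailing)


images = js_literal_to_python(js_array)
print(images)
```

For anything beyond flat literals like this one, a tolerant parser such as hjson is likely the safer choice.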

Answered 2022-06-22

桃花长相依

Contributed 1860 experience points, received 8+ likes

Method 1

Perhaps

\bvar\s+images\s*=\s*(\[[^\]]*\])

might work to some extent.

Test

import re
from bs4 import BeautifulSoup

# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')  # successfully captures the <script> element

for script in scripts:
    # script.string is None for empty or external scripts, hence the fallback
    data = re.findall(
        r'\bvar\s+images\s*=\s*(\[[^\]]*\])', script.string or '', re.DOTALL)
    if data:  # skip scripts that do not define images
        print(data[0])

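Output

[
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]

If you want to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.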
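One limitation of the Method 1 pattern is that `[^\]]*` stops at the first `]`, so it would cut the capture short if the array held a nested array or a `]` inside a string. A minimal sketch of a bracket-counting alternative follows. `extract_array` is a hypothetical helper, not part of any library, and it assumes the script contains no JavaScript comments, regex literals, or template literals with brackets in them.

```python
def extract_array(source, marker="var images ="):
    """Return the text of the first [...] literal after `marker`, or None."""
    start = source.find(marker)
    if start == -1:
        return None
    open_pos = source.find("[", start + len(marker))
    if open_pos == -1:
        return None

    depth = 0
    in_string = None   # holds the active quote character while inside a string
    escaped = False    # True right after a backslash inside a string
    for i in range(open_pos, len(source)):
        ch = source[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == in_string:
                in_string = None
        elif ch in "\"'":
            in_string = ch
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                return source[open_pos:i + 1]
    return None  # brackets never balanced


# Made-up input with a nested array and a `]` inside a string value.
script = 'var images = [ { src: "a]b.jpg", tags: ["x", "y"] }, ]; var other = [1];'
print(extract_array(script))
```

The returned text still has unquoted keys and a trailing comma, so it needs the same post-processing as the regex capture before it becomes a Python list.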

Method 2

Another option is:

import re

string = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

# Capture each src/title pair directly instead of extracting the whole array
expression = r'src:\s*"([^"]*)"\s*,\s*title:\s*"([^"]*)"'
matches = re.findall(expression, string, re.DOTALL)

output = []
for match in matches:
    output.append({"src": match[0], "title": match[1]})

print(output)

Output

[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]

Answered 2022-06-22

慕容708150

Contributed 1831 experience points, received 4+ likes

Here is one way to get there without regex, or even BeautifulSoup: just plain Python string operations, in 4 simple steps :)

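# html is the page source string from the question
step_1 = html.split('var images = [')
step_2 = " ".join(step_1[1].split())  # collapse all whitespace runs to single spaces
step_3 = step_2.split('] ; var other_data = ')
step_4 = step_3[0].replace('}, {', '}xxx{').split('xxx')
print(step_4)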

Output:

['{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }',
'{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ']
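The four steps above depend on the exact text that follows the array (`var other_data`). A slightly less brittle variant of the same string-only idea can be sketched with `str.partition`. `text_between` is a hypothetical helper, the sample string is a shortened stand-in for the page source, and the approach still assumes the objects contain no nested `]` or `}`.

```python
def text_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    body, found_end, _ = rest.partition(end)
    return body if found_end else None


# Shortened stand-in for the script content in the question.
html = '''
var images = [
    { src: "http://example.com/bar/001.jpg", title: "FooBar One" },
    { src: "http://example.com/bar/002.jpg", title: "FooBar Two" },
]
;
var other_data = [{"name": "Tom"}];
'''

body = text_between(html, "var images = [", "]")
# Collapse whitespace, then rebuild one fragment per object.
flat = " ".join(body.split())
items = [part.strip(" ,") + "}" for part in flat.split("}") if part.strip(" ,")]
print(items)
```

As in the original answer, the result is a list of strings rather than dictionaries, so a further parsing step is needed to reach the desired output.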

Answered 2022-06-22

RISEBY

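Contributed 1856 experience points, received 5+ likes

re.match only matches from the beginning of the string, so your regular expression has to account for the entire string. Use

pattern = re.compile('.*var images = (.*?);.*', re.DOTALL)

The captured string is still not in a valid Python list format. You have to do some manipulation first before you can apply ast.literal_eval: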

import ast

for script in scripts:
    data = pattern.match(str(script.string))
    if data:
        list_str = data.groups()[0]
        # Remove last comma (the trailing one after the final object)
        last_comma_index = list_str.rfind(',')
        list_str = list_str[:last_comma_index] + list_str[last_comma_index+1:]
        # Modify src to "src" and title to "title"
        list_str = re.sub(r'\s([a-z]+):', r'"\1":', list_str)
        # Strip
        list_str = list_str.strip()
        final_list = ast.literal_eval(list_str)
        print(final_list)

Output

[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
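The root cause described in this answer can be reproduced in isolation. The sketch below uses a made-up, shortened script string and shows that `re.search` is an alternative to padding the pattern with `.*` on both sides. In either case `re.DOTALL` is needed, because the array spans several lines and `.` does not match a newline by default.

```python
import re

# Shortened stand-in for script.string; note that it starts with a newline.
script_text = '''
$(document).ready(function(){
    var images = [ { src: "a.jpg" }, ];
    var other_data = [];
'''

lazy = re.compile(r'var images = (.*?);', re.DOTALL)

# match() is anchored at position 0, where the text starts with a newline,
# so the pattern cannot match there.
print(lazy.match(script_text))  # None

# search() scans forward to the first position where the pattern fits.
found = lazy.search(script_text)
print(found.group(1))  # [ { src: "a.jpg" }, ]
```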

Answered 2022-06-22
