Unable to fetch a dynamically populated number from a webpage after following some steps

Python

天涯尽头无女友 2023-11-09 10:14:23

I've written a script using the requests module and the BeautifulSoup library to fetch some tabular content from a webpage. To generate that table, you have to follow manually the steps shown in the attached image. The code pasted below works, but the main problem I'm trying to solve is getting the title number programmatically (628086906 in this case), which is appended to the table_link that I've hardcoded here.

After clicking the tools button in step 6, hovering the cursor over the map shows a Multiple option; clicking it takes you to the URL containing the title number.

Home page. These are exactly the steps the script follows.

0030278592 is the LINC number that has to be entered in the input box in step 6.

What I've tried (it works only because I used the hardcoded title number in table_link):

    import requests
    from bs4 import BeautifulSoup

    link = 'https://alta.registries.gov.ab.ca/spinii/logon.aspx'
    lnotice = 'https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx'
    search_page = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'
    map_page = 'http://alta.registries.gov.ab.ca/SpinII/mapindex.aspx'
    map_find = 'http://alta.registries.gov.ab.ca/SpinII/mapfinds.aspx'
    table_link = 'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title=628086906'

    def get_content(s,link):
        r = s.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
        payload['uctrlLogon:cmdLogonGuest.x'] = '80'
        payload['uctrlLogon:cmdLogonGuest.y'] = '20'

        r = s.post(link,data=payload)
        soup = BeautifulSoup(r.text,"lxml")
        payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
        payload['cmdYES.x'] = '52'
        payload['cmdYES.y'] = '8'
        s.post(lnotice,data=payload)

        s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/spinii/welcomeguest.aspx'
        s.get(search_page)
        s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'

How can I get the title number from the URL? Or how can I get all the LINC numbers from that site so that I don't need to use the map at all?

The only problem with this site is that it is unavailable in the daytime for maintenance.


2 Answers

拉风的咖菲猫

Contributed 1995 experience points, received more than 2 upvotes

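The data is fetched from:

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

The content is encoded in a custom format before it is consumed by the OpenLayers library. All of the decoding is located in this JS file. If you beautify it, you can look for WayTo.Wtb.Format.WTB, the OpenLayers.Class that does the decoding. The binary is decoded byte by byte, as in this excerpt from the JS: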

    switch(elementType){
        case 1:
            var lineColor = new WayTo.Wtb.Element.LineColor();
            byteOffset = lineColor.parse(dataReader, byteOffset);
            outputElement = lineColor;
            break;
        case 2:
            var lineStyle = new WayTo.Wtb.Element.LineStyle();
            byteOffset = lineStyle.parse(dataReader, byteOffset);
            outputElement = lineStyle;
            break;
        case 3:
            var ellipse = new WayTo.Wtb.Element.Ellipse();
            byteOffset = ellipse.parse(dataReader, byteOffset);
            outputElement = ellipse;
            break;
        ........
    }

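We have to reproduce this decoding algorithm to get the raw data. We don't need to decode every object; we only need to advance the offset correctly and extract the strings properly. Here is a Python script for the decoding part, which decodes the data from a file (the output of curl):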

    with open("wtb.bin", mode='rb') as file:
        encodedData = file.read()

    offset = 0
    objects = []
    while offset < len(encodedData):
        # every element starts with a 1-byte size and a 1-byte type
        elementSize = encodedData[offset]
        offset+=1
        elementType = encodedData[offset]
        offset+=1
        if elementType == 0:
            break
        curElemSize = elementSize
        curElemType = elementType
        if elementType== 114:
            # large element: the real size (4 bytes) and type (2 bytes) follow
            largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
            offset+=4
            largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            curElemSize = largeElementSize
            curElemType = largeElementType
        print(f"type {curElemType} | size {curElemSize}")
        offsetInit = offset
        if curElemType == 1:
            offset+=4
        elif curElemType == 2:
            offset+=2
        elif curElemType == 3:
            offset+=20
        elif curElemType == 4:
            offset+=28
        elif curElemType == 5:
            offset+=12
        elif curElemType == 6:
            # text element: position, rotation, then 2 bytes per character
            textLength = curElemSize - 3
            objects.append({
                "type": "Text",
                "x_position": int.from_bytes(encodedData[offset:offset+2], "little"),
                "y_position": int.from_bytes(encodedData[offset+2:offset+4], "little"),
                "rotation": int.from_bytes(encodedData[offset+4:offset+6], "little"),
                "text": encodedData[offset+6:offset+6+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=6+(textLength*2)
        elif curElemType == 7:
            numPoint = curElemSize // 2
            offset+=4*numPoint
        elif curElemType == 27:
            numPoint = curElemSize // 4
            offset+=8*numPoint
        elif curElemType == 8:
            numPoint = curElemSize // 2
            offset+=4*numPoint
        elif curElemType == 28:
            numPoint = curElemSize // 4
            offset+=8*numPoint
        elif curElemType == 13:
            offset+=4
        elif curElemType == 14:
            offset+=2
        elif curElemType == 15:
            offset+=2
        elif curElemType == 100:
            pass
        elif curElemType == 101:
            offset+=20
        elif curElemType == 102:
            offset+=2
        elif curElemType == 103:
            pass
        elif curElemType == 104:
            highShort = int.from_bytes(encodedData[offset+2:offset+4], "little")
            lowShort = int.from_bytes(encodedData[offset+4:offset+6], "little")
            objects.append({
                "type": "StartNumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence": (highShort << 16) + lowShort
            })
            offset+=6
        elif curElemType == 105:
            #end cell
            pass
        elif curElemType == 109:
            textLength = curElemSize - 1
            objects.append({
                "type": "StartAlphanumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence":encodedData[offset+2:offset+2+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=2+(textLength*2)
        elif curElemType == 111:
            offset+=40
        elif curElemType == 112:
            objects.append({
                "type": "CoordinatePlane",
                "projection_code": encodedData[offset+48:offset+52].decode("utf-8").replace('\x00','')
            })
            offset+=52
        elif curElemType == 113:
            offset+=24
        elif curElemType == 256:
            # large polygon: 16-byte header, optional name, then the points
            nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
            objects.append({
                "type": "LargePolygon",
                "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little")
            })
            if nameLength > 0:
                offset+= 16 + nameLength
                if encodedData[offset] == 0:
                    offset+=1
            else:
                offset+= 16
            numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            offset+=numberOfPoints*8
        elif curElemType == 257:
            pass
        else:
            # unknown element: skip it using its declared size
            offset+= curElemSize*2
        print(f"offset diff {offset-offsetInit}")
        print("--------------------------------")

    print(objects)
    print(len(encodedData))
    print(offset)
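The script above reads each string field as two bytes per character, decodes it as UTF-8, and strips the NUL bytes, which works for ASCII labels. Below is a minimal sketch of an alternative, assuming the fields are UTF-16LE. That assumption is only suggested by the two-bytes-per-character lengths and is not confirmed anywhere in this answer. The helper name decode_wtb_text is hypothetical.

```python
def decode_wtb_text(raw: bytes) -> str:
    # Assumption: the field is UTF-16LE (2 bytes per character).
    # For pure ASCII content this matches decode("utf-8") followed by
    # removing the NUL bytes, as done in the script above.
    return raw.decode("utf-16-le", errors="replace").rstrip("\x00")


# Sample bytes built by hand for illustration, not taken from a real response.
sample = "8722524;1;162".encode("utf-16-le")
original_style = sample.decode("utf-8").replace("\x00", "")

print(decode_wtb_text(sample))
print(decode_wtb_text(sample) == original_style)
```

If the labels ever contain non-ASCII characters, the two approaches would diverge, so it is worth comparing them on real data before switching.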

(Side note: the element size, i.e. the 4-byte size of a large element, is big-endian; all other values are little-endian.)
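To make the byte order concrete, here is a small sketch of just the header-reading step from the script above, pulled out into a function. The name read_header and the sample byte strings are made up for illustration; the layout (1-byte size, 1-byte type, and for type 114 a 4-byte big-endian size followed by a 2-byte little-endian type) is the one the script uses.

```python
def read_header(data: bytes, offset: int):
    """Return (element_type, element_size, new_offset) for the element at offset."""
    size = data[offset]            # 1-byte size
    elem_type = data[offset + 1]   # 1-byte type
    offset += 2
    if elem_type == 114:           # marker for a "large" element
        size = int.from_bytes(data[offset:offset + 4], "big")
        elem_type = int.from_bytes(data[offset + 4:offset + 6], "little")
        offset += 6
    return elem_type, size, offset


# Hand-built headers, not real server data.
small = bytes([2, 1])
large = bytes([0, 114]) + (300).to_bytes(4, "big") + (256).to_bytes(2, "little")

print(read_header(small, 0))   # (1, 2, 2)
print(read_header(large, 0))   # (256, 300, 8)
```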

Run this repl.it to see how it decodes the file.

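From there we can build the steps to scrape the data. For clarity, I'll describe all of the steps (even the ones you've already figured out):

Legal notice

The legal notice call is not required to get the map values, but it is required to get the item information (the last step in your post).

GET https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx

Scrape the names/values of the input tags, set cmdYES.x and cmdYES.y, then call:

POST https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx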
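The "scrape the inputs, add the button coordinates" step can be sketched without a network call. The version below swaps BeautifulSoup for the standard library's html.parser so that it runs with no third-party packages. The names InputCollector and build_payload, and the sample HTML, are made up for illustration; the cmdYES.x and cmdYES.y values are the ones used in the question.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from every <input> tag that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("name"):
            # a missing or valueless "value" attribute becomes an empty string
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_payload(html: str, extra: dict) -> dict:
    """Form fields found in html, plus the extra click coordinates."""
    collector = InputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(extra)
    return payload


# Hypothetical page fragment, not copied from the real site.
sample_html = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc">'
    '<input type="image" name="cmdYES"><input type="text"></form>'
)
payload = build_payload(sample_html, {"cmdYES.x": "52", "cmdYES.y": "8"})
print(payload)
```

The resulting dictionary is what would be passed as data= to the POST call on the session.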

Map data

Call the map server API:

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

with the following data:

    {
        "mt":"titleresults",
        "qt":"lincNo",
        "LINCNumber": lincNumber,
        "rights": "B", #not required
        "cx": 1920, #screen definition
        "cy": 1080,
    }

cx/cy are the canvas dimensions.
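As a rough sketch, the form body for this call can be built and inspected before sending it. The function name map_request_data is hypothetical, and the optional rights field is left out because the answer marks it as not required.

```python
from urllib.parse import urlencode

MAP_SERVER = "http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx"


def map_request_data(linc_number: str, width: int = 1920, height: int = 1080) -> dict:
    """Form fields for the map server call; cx/cy are the canvas size."""
    return {
        "mt": "titleresults",
        "qt": "lincNo",
        "LINCNumber": linc_number,
        "cx": width,
        "cy": height,
    }


data = map_request_data("0030278592")
print(urlencode(data))
```

With a logged-in requests session s, the call would then be s.post(MAP_SERVER, data=data), and the binary response body is what gets handed to the decoder.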

Decode the encoded data using the method above. You will get:

[{'type': 'LargePolygon', 'name': '0010495134 8722524;1;162', 'entity': 23, 'occurence': 628079167, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170859 8022146;8;99', 'entity': 23, 'occurence': 628048595, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010691822 8722524;1;163', 'entity': 23, 'occurence': 628222354, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169736 8022146;8;89', 'entity': 23, 'occurence': 628021327, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694454 8722524;1;179', 'entity': 23, 'occurence': 628191678, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694362 8722524;1;178', 'entity': 23, 'occurence': 628307403, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010433381 8722524;1;177', 'entity': 23, 'occurence': 628209696, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169710 8022146;8;88A', 'entity': 23, 'occurence': 628021328, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694355 8722524;1;176', 
'entity': 23, 'occurence': 628315826, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170866 8022146;8;100', 'entity': 23, 'occurence': 628163431, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694347 8722524;1;175', 'entity': 23, 'occurence': 628132810, 'line_color_green': 0, 'line_color_red': 129,

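Extracting the information

If you want to target a specific lincNumber, you need to look at the style of the polygon, because for "Multiple" values (i.e. those with several items) the response carries no id mentioning the lincNumber, only a link reference. The following gets the selected item:

    selectedZone = [
        t
        for t in objects
        if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
    ][0]

    print(selectedZone)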
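Indexing with [0] raises IndexError when nothing is highlighted. A slightly more defensive sketch of the same filter follows; find_selected is a hypothetical name, the first sample entry is taken from the decoded output above, and the second one is invented so that something matches the filter.

```python
def find_selected(objects):
    """Return the first highlighted LargePolygon, or None if there is none."""
    return next(
        (
            o for o in objects
            if o.get("fill_color_green", 255) < 255 and o.get("line_color_red") == 255
        ),
        None,
    )


sample_objects = [
    # From the decoded output above (not highlighted).
    {"type": "LargePolygon", "name": "0010495134 8722524;1;162",
     "occurence": 628079167, "line_color_red": 129, "fill_color_green": 255},
    # Hypothetical highlighted entry; the colour values are made up.
    {"type": "LargePolygon", "name": "Multiple",
     "occurence": 628086906, "line_color_red": 255, "fill_color_green": 0},
]

selected = find_selected(sample_objects)
print(selected["occurence"] if selected else "nothing selected")
```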

Call the URL you mentioned in your post to get the data and extract the table:

GET https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}
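Since the question also asks how to read the title number back out of such a URL, here is a small sketch using only urllib.parse. The helper names title_url and title_from_url are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE = "https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx"


def title_url(occurrence: int) -> str:
    """Build the popup URL for a decoded occurrence value."""
    return f"{BASE}?{urlencode({'title': occurrence})}"


def title_from_url(url: str) -> str:
    """Read the title query parameter back out of a popup URL."""
    return parse_qs(urlparse(url).query)["title"][0]


url = title_url(628086906)
print(url)
print(title_from_url(url))
```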

Full code:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    lincNumber = "0030278592"
    #lincNumber = "0010661156"

    s = requests.Session()

    # 1) login
    r = s.get("https://alta.registries.gov.ab.ca/spinii/logon.aspx")
    soup = BeautifulSoup(r.text, "html.parser")
    # keep only inputs that have a name attribute
    payload = {
        t["name"]: t.get("value", "")
        for t in soup.find_all("input")
        if t.get("name")
    }
    payload["uctrlLogon:cmdLogonGuest.x"] = 76
    payload["uctrlLogon:cmdLogonGuest.y"] = 25
    s.post("https://alta.registries.gov.ab.ca/spinii/logon.aspx",data=payload)

    # 2) legal notice
    r = s.get("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx")
    soup = BeautifulSoup(r.text, "html.parser")
    payload = {
        t["name"]: t.get("value", "")
        for t in soup.find_all("input")
        if t.get("name")
    }
    payload["cmdYES.x"] = 82
    payload["cmdYES.y"] = 3
    s.post("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx", data = payload)

    # 3) map data
    r = s.post("http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx",
        data= {
            "mt":"titleresults",
            "qt":"lincNo",
            "LINCNumber": lincNumber,
            "rights": "B", #not required
            "cx": 1920, #screen definition
            "cy": 1080,
        })

    def decodeWtb(encodedData):
        # same walk as the standalone script, but only LargePolygon is kept
        offset = 0
        objects = []
        while offset < len(encodedData):
            elementSize = encodedData[offset]
            offset+=1
            elementType = encodedData[offset]
            offset+=1
            if elementType == 0:
                break
            curElemSize = elementSize
            curElemType = elementType
            if elementType== 114:
                largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
                offset+=4
                largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
                offset+=2
                curElemSize = largeElementSize
                curElemType = largeElementType
            if curElemType == 1:
                offset+=4
            elif curElemType == 2:
                offset+=2
            elif curElemType == 3:
                offset+=20
            elif curElemType == 4:
                offset+=28
            elif curElemType == 5:
                offset+=12
            elif curElemType == 6:
                textLength = curElemSize - 3
                offset+=6+(textLength*2)
            elif curElemType == 7:
                numPoint = curElemSize // 2
                offset+=4*numPoint
            elif curElemType == 27:
                numPoint = curElemSize // 4
                offset+=8*numPoint
            elif curElemType == 8:
                numPoint = curElemSize // 2
                offset+=4*numPoint
            elif curElemType == 28:
                numPoint = curElemSize // 4
                offset+=8*numPoint
            elif curElemType == 13:
                offset+=4
            elif curElemType == 14:
                offset+=2
            elif curElemType == 15:
                offset+=2
            elif curElemType == 100:
                pass
            elif curElemType == 101:
                offset+=20
            elif curElemType == 102:
                offset+=2
            elif curElemType == 103:
                pass
            elif curElemType == 104:
                offset+=6
            elif curElemType == 105:
                pass
            elif curElemType == 109:
                textLength = curElemSize - 1
                offset+=2+(textLength*2)
            elif curElemType == 111:
                offset+=40
            elif curElemType == 112:
                offset+=52
            elif curElemType == 113:
                offset+=24
            elif curElemType == 256:
                nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
                objects.append({
                    "type": "LargePolygon",
                    "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                    "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                    "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little"),
                    "line_color_green": encodedData[offset + 8],
                    "line_color_red": encodedData[offset + 7],
                    "line_color_blue": encodedData[offset + 9],
                    "fill_color_green": encodedData[offset + 10],
                    "fill_color_red": encodedData[offset + 11],
                    "fill_color_blue": encodedData[offset + 13]
                })
                if nameLength > 0:
                    offset+= 16 + nameLength
                    if encodedData[offset] == 0:
                        offset+=1
                else:
                    offset+= 16
                numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
                offset+=2
                offset+=numberOfPoints*8
            elif curElemType == 257:
                pass
            else:
                offset+= curElemSize*2
        return objects

    # 4) decode custom format
    objects = decodeWtb(r.content)

    # 5) get the selected area
    selectedZone = [
        t
        for t in objects
        if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
    ][0]
    print(selectedZone)

    # 6) get the info about item
    r = s.get(f'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}')
    df = pd.read_html(r.content, attrs = {'class': 'bodyText'}, header =0)[0]
    del df['Add to Cart']
    del df['View']
    print(df[:-1])

Run this on repl.it

Output

       Title Number           Type  LINC Number  Short Legal   Rights  Registration Date  Change/Cancel Date
    0     052400228  Current Title   0030278592   0420091;16  Surface         19/09/2005          13/11/2019
    1     072294084  Current Title   0030278551   0420091;12  Surface         22/05/2007          21/08/2007
    2     072400529  Current Title   0030278469    0420091;3  Surface         05/07/2007          28/08/2007
    3     072498228  Current Title   0030278501    0420091;7  Surface         18/08/2007          08/02/2008
    4     072508699  Current Title   0030278535   0420091;10  Surface         23/08/2007          13/12/2007
    5     072559500  Current Title   0030278477    0420091;4  Surface         17/09/2007          19/11/2007
    6     072559508  Current Title   0030278576   0420091;14  Surface         17/09/2007          09/01/2009
    7     072559521  Current Title   0030278519    0420091;8  Surface         17/09/2007          07/11/2007
    8     072559530  Current Title   0030278493    0420091;6  Surface         17/09/2007          25/08/2008
    9     072559605  Current Title   0030278485    0420091;5  Surface         17/09/2007          23/12/2008

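You can inspect the objects list if you want more entries. If you want more information about the items, such as their coordinates, you can extend the decoder.

It is also possible to match the other lincNumbers around the target by looking at the name field, which contains the lincNumber, unless it holds a "Multiple" name.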
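A minimal sketch of that matching follows. It assumes every usable name looks like the ones in the decoded output above: a 10-digit LINC number, a space, then the short legal description. The function name index_by_linc is hypothetical, and the "Multiple" entry is invented to show that non-matching names are skipped.

```python
import re

# Assumed name layout: "<10-digit LINC> <short legal>", as in the sample output.
NAME_RE = re.compile(r"^(\d{10})\s+(\S.*)$")


def index_by_linc(objects):
    """Map each LINC number found in a name field to its polygon details."""
    index = {}
    for obj in objects:
        match = NAME_RE.match(obj.get("name", ""))
        if match:
            index[match.group(1)] = {
                "short_legal": match.group(2),
                "occurence": obj["occurence"],
            }
    return index


sample_polygons = [
    {"name": "0010495134 8722524;1;162", "occurence": 628079167},
    {"name": "0012170859 8022146;8;99", "occurence": 628048595},
    {"name": "Multiple", "occurence": 0},  # hypothetical entry, skipped
]

index = index_by_linc(sample_polygons)
print(index["0012170859"])
```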

2023-11-09

白猪掌柜的

Contributed 1893 experience points, received more than 10 upvotes

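There are two options for getting the information you're looking for. One of them is Selenium, which you probably already know about.

Open the Network tab and watch whether the browser sends a request to the server when you hover over the map. With requests and BS4, your best bet is that the data is already loaded with the page, in which case the solution below may work:

    import re
    print(re.findall(r'628086906', r.text))

If it prints the number, the data is available as JSON and loaded with the page, so you can either load the JSON or find the value with a regular expression. Otherwise your only option is Selenium.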
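The check can be made a little more general. This is only a sketch: find_title_numbers is a hypothetical helper, the sample HTML is invented, and the pattern assumes the number would appear in the page as part of a popupTitleSearch.aspx?title=... link, which may not be how the real page embeds it.

```python
import re
from typing import List, Optional


def find_title_numbers(html: str, known: Optional[str] = None) -> List[str]:
    """Look for a known number, or for any title=<digits> link, in page text."""
    if known is not None:
        return re.findall(re.escape(known), html)
    return re.findall(r"popupTitleSearch\.aspx\?title=(\d+)", html)


# Hypothetical page fragment, not copied from the real site.
sample_page = '<a href="popupTitleSearch.aspx?title=628086906">Multiple</a>'

print(find_title_numbers(sample_page, "628086906"))
print(find_title_numbers(sample_page))
print(find_title_numbers("<p>nothing here</p>"))
```

An empty list from both calls on the real response text would indicate that the number is not delivered with the page.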

2023-11-09
