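If you inspect the site's network calls, you will see that it makes an AJAX request which returns the names of all the files available for download.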
import requests
res = requests.get("https://ped.uspto.gov/api/")
data = res.json()
print(data)
Output:
{'message': None,
'helpText': '{}',
'xmlDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 01:30:57-0400',
'sizeInBytes': 10429068701,
'fileName': 'pairbulk-delta-20200815-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:10-0400',
'sizeInBytes': 100685778,
'fileName': '1900-1919-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:14-0400',
'sizeInBytes': 13877,
'fileName': '1920-1939-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
'sizeInBytes': 93016,
'fileName': '1940-1959-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:15-0400',
'sizeInBytes': 82353484,
'fileName': '1960-1979-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:16-0400',
'sizeInBytes': 5019098918,
'fileName': '1980-1999-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:20:46-0400',
'sizeInBytes': 33231977060,
'fileName': '2000-2019-pairbulk-full-20200809-xml',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 10:23:23-0400',
'sizeInBytes': 24313575,
'fileName': '2020-2020-pairbulk-full-20200809-xml',
'updatedFile': False}],
'jsonDownloadMetadata': [{'lastUpdated': 'Sat 15 Aug 2020 03:08:00-0400',
'sizeInBytes': 5957650088,
'fileName': 'pairbulk-delta-20200815-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:23-0400',
'sizeInBytes': 66467976,
'fileName': '1900-1919-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:25-0400',
'sizeInBytes': 10100,
'fileName': '1920-1939-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:27-0400',
'sizeInBytes': 69891,
'fileName': '1940-1959-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:29-0400',
'sizeInBytes': 54076774,
'fileName': '1960-1979-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:31-0400',
'sizeInBytes': 3009216952,
'fileName': '1980-1999-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:18:46-0400',
'sizeInBytes': 18853619536,
'fileName': '2000-2019-pairbulk-full-20200809-json',
'updatedFile': False},
{'lastUpdated': 'Sun 09 Aug 2020 15:20:30-0400',
'sizeInBytes': 17518389,
'fileName': '2020-2020-pairbulk-full-20200809-json',
'updatedFile': False}],
'links': [{'rel': 'swagger-api-docs', 'href': '/api-docs'}]}
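Parse the JSON and use these entries to download the files you need. Note that some of the files are very large (the largest is over 33 GB), so it is best to stream the download with requests rather than load a whole file into memory.
The entry you are looking for is the first element of data["jsonDownloadMetadata"].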
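As a small illustration of picking that first element, here is a sketch that works on a trimmed copy of the response shown above, so it runs without a network call. The variable names `first` and `size_gb` are only for illustration.

```python
# Trimmed copy of the API response shown above (two entries only)
data = {
    "jsonDownloadMetadata": [
        {"fileName": "pairbulk-delta-20200815-json", "sizeInBytes": 5957650088},
        {"fileName": "1900-1919-pairbulk-full-20200809-json", "sizeInBytes": 66467976},
    ]
}

# The first element is the most recent delta file
first = data["jsonDownloadMetadata"][0]

# Convert the size to gigabytes (decimal) to see how large the download is
size_gb = first["sizeInBytes"] / 1e9
print(first["fileName"], f"{size_gb:.2f} GB")
```

Checking the size first is a cheap way to decide whether a file is worth downloading at all.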
To build the downloadable links, parse the JSON:
data = res.json()
# Build one download URL per file listed in the JSON metadata
for links in data["jsonDownloadMetadata"]:
    print(f"https://ped.uspto.gov/api/full-download?fileName={links['fileName']}")
Output:
https://ped.uspto.gov/api/full-download?fileName=pairbulk-delta-20200815-json
https://ped.uspto.gov/api/full-download?fileName=1900-1919-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1920-1939-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1940-1959-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1960-1979-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=1980-1999-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2000-2019-pairbulk-full-20200809-json
https://ped.uspto.gov/api/full-download?fileName=2020-2020-pairbulk-full-20200809-json
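Since the files are large, a streamed download could look roughly like the sketch below. It assumes the full-download URL pattern shown above still works. The helper names `build_url` and `download`, the destination path, the 1 MiB chunk size and the timeout are my own choices, not something the API prescribes.

```python
import requests

BASE_URL = "https://ped.uspto.gov/api/full-download"


def build_url(file_name):
    """Return the download URL for a fileName taken from the metadata."""
    return f"{BASE_URL}?fileName={file_name}"


def download(file_name, dest_path, chunk_size=1024 * 1024):
    """Stream one file to disk in chunks instead of loading it into memory."""
    # stream=True defers fetching the body until iter_content is consumed
    with requests.get(build_url(file_name), stream=True, timeout=60) as res:
        res.raise_for_status()  # stop early on HTTP errors
        with open(dest_path, "wb") as fh:
            for chunk in res.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
    return dest_path


# Example call (hypothetical destination path, not executed here):
# download("2020-2020-pairbulk-full-20200809-json", "pairbulk-2020.download")
```

Writing chunk by chunk keeps memory use roughly constant regardless of file size, which matters for the multi-gigabyte archives in the list.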