I wrote a Python script to extract the href value from every link on a given web page:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://kteq.in/services")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

When I run the code above I get the following output, which contains both internal and external links:

index
index
#
solutions#internet-of-things
solutions#online-billing-and-payment-solutions
solutions#customer-relationship-management
solutions#enterprise-mobility
solutions#enterprise-content-management
solutions#artificial-intelligence
solutions#b2b-and-b2c-web-portals
solutions#robotics
solutions#augement-reality-virtual-reality
solutions#azure
solutions#omnichannel-commerce
solutions#document-management
solutions#enterprise-extranets-and-intranets
solutions#business-intelligence
solutions#enterprise-resource-planning
services
clients
contact
#
#
#
https://www.facebook.com/KTeqSolutions/
#
#
#
#
#
contactform
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
index
services
#
contact
#
iOSDevelopmentServices
AndroidAppDevelopment
WindowsAppDevelopment
HybridSoftwareSolutions
CloudServices
HTML5Development
iPadAppDevelopment
services
services
services
services
services
services
contact
contact
contact
contact
contact
None
https://www.facebook.com/KTeqSolutions/
#
#
#
#

I want to drop the external links that have a full URL, such as https://www.facebook.com/KTeqSolutions/, while keeping links like solutions#internet-of-things. How can I do this efficiently?
2 Answers
慕神8447489
If I understand you correctly, you can try the following:
l = []
for link in soup.findAll('a'):
    print link.get('href')
    l.append(link.get('href'))

# some hrefs come back as None, so guard against that before the substring test
l = [x for x in l if x and "www" not in x]  # or check for 'https' instead
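A note on the substring check: filtering on "www" also throws away any internal link that happens to contain that string. A slightly stricter sketch (my own variation, not part of this answer) keeps only hrefs that have no URL scheme, using Python 2's standard urlparse module to stay consistent with the question's code:

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

internal_links = []
for link in soup.findAll('a'):
    href = link.get('href')
    if href is None:
        continue  # skip <a> tags that have no href attribute
    # absolute URLs carry a scheme such as 'http' or 'https';
    # relative links like 'solutions#internet-of-things' do not
    if not urlparse(href).scheme:
        internal_links.append(href)

print internal_links

This keeps entries such as solutions#internet-of-things and contactform while dropping https://www.facebook.com/KTeqSolutions/.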
智慧大石
You can use parse_url from the requests module.
import requests
url = 'https://www.facebook.com/KTeqSolutions/'
requests.urllib3.util.parse_url(url)
which gives you:
Url(scheme='https', auth=None, host='www.facebook.com', port=None, path='/KTeqSolutions/', query=None, fragment=None)
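To turn that into the filter asked for in the question, one option is to keep only the hrefs whose parsed scheme is None: relative links such as solutions#internet-of-things have no scheme, while https://www.facebook.com/KTeqSolutions/ parses with scheme='https'. A rough sketch along those lines, reusing the soup object from the question's code (untested against the live page):

import requests

internal_links = []
for link in soup.findAll('a'):
    href = link.get('href')
    if href is None:
        continue  # skip <a> tags that have no href attribute
    # full URLs parse with a scheme; relative links come back with scheme=None
    if requests.urllib3.util.parse_url(href).scheme is None:
        internal_links.append(href)

print internal_links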