4 Answers
TA contributed 2080 experience points · received 4+ upvotes
You can collect the 404 URLs in a separate set (this works well when there are fewer 404 URLs than valid ones) and then take the set difference:

from urllib.request import urlopen
from urllib.error import HTTPError

exclude_urls = set()
for url in all_urls:
    try:
        urlopen(url)
    except HTTPError as err:
        if err.code == 404:
            exclude_urls.add(url)

valid_urls = set(all_urls) - exclude_urls
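One caveat with the set-difference step: converting `all_urls` to a set discards the original ordering. If the order of the CSV matters, a list comprehension keeps it (a minimal sketch; the example URLs are placeholders, not from the original question):

```python
all_urls = ["http://a.example", "http://b.example", "http://c.example"]
exclude_urls = {"http://b.example"}  # URLs that returned 404

# Filter with a list comprehension to preserve the original order
valid_urls = [u for u in all_urls if u not in exclude_urls]
print(valid_urls)  # ['http://a.example', 'http://c.example']
```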
TA contributed 1789 experience points · received 10+ upvotes
You can do something like this:

from urllib.request import urlopen
from urllib.error import HTTPError

def load_data(csv_name):
    ...

def save_data(data, csv_name):
    ...

links = load_data(csv_name)
new_links = set()
for i in links:
    try:
        urlopen(i)
    except HTTPError as err:
        if err.code == 404:
            print('invalid')
    else:
        new_links.add(i)

save_data(list(new_links), csv_name)
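The `load_data` / `save_data` stubs above are left for you to fill in. One possible implementation with the standard `csv` module (the one-URL-per-row layout is an assumption, matching the question):

```python
import csv

def load_data(csv_name):
    # Read one URL per row from the first column, skipping empty rows
    with open(csv_name, newline='') as f:
        return [row[0] for row in csv.reader(f) if row]

def save_data(data, csv_name):
    # Write each URL back as its own row
    with open(csv_name, 'w', newline='') as f:
        writer = csv.writer(f)
        for item in data:
            writer.writerow([item])
```

Note that `save_data(list(new_links), csv_name)` overwrites the original file, so keep a backup if you want to preserve the raw list.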
TA contributed 1783 experience points · received 4+ upvotes
Try something like this:

import csv
from urllib.request import urlopen
from urllib.error import HTTPError

# 1. Load the CSV file into a list
with open('urls.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    urls = [row[0] for row in reader]  # Assuming each row has one URL

# 2. Check each URL for validity
valid_urls = []
for url in urls:
    try:
        urlopen(url)
        valid_urls.append(url)
    except HTTPError as err:
        if err.code == 404:
            print(f'Invalid URL: {url}')
        else:
            raise  # If it's another type of error, raise it so you're aware

# 3. Write the cleaned list back to the CSV file
with open('cleaned_urls.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for url in valid_urls:
        writer.writerow([url])
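One gap shared by all the answers: they only catch `HTTPError`, so a DNS failure, a refused connection, or a timeout raises `urllib.error.URLError` and crashes the loop. A hedged sketch of a checker that treats both cases as unreachable (the function name and timeout are my own choices, not from the question):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def is_url_reachable(url, timeout=10):
    """Return True if the URL opens successfully, False on HTTP or network errors."""
    try:
        urlopen(url, timeout=timeout)
        return True
    except HTTPError as err:          # server answered with an error status (404, 500, ...)
        print(f'HTTP error {err.code} for {url}')
        return False
    except URLError as err:           # DNS failure, refused connection, timeout, ...
        print(f'Unreachable {url}: {err.reason}')
        return False
```

Note `HTTPError` must be caught before `URLError`, since it is a subclass of `URLError`.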