Python 3: remove duplicate web links that differ by an extra trailing character (rstrip)

I'm using Python 3. I'm trying to pull the unique links from a website, and I seem to have the code working, except that a few links have a / at the end.

Example: the program will include both a link and the same link with a trailing /.

I'd like to make sure my program removes that last character so that no duplicates are returned. I have researched rstrip() but can't seem to get it to work. Here is my code:

    import bs4 as bs
    import urllib.request
    import urllib.parse

    source = urllib.request.urlopen('').read()
    soup = bs.BeautifulSoup(source, 'lxml')

    filename = "uniqueweblinks.csv"
    f = open(filename, "w")
    headers = "weblinks\n"
    f.write(headers)

    all_links = soup.find_all('a')

    url_set = set()

    for link in all_links:
        web_links = link.get("href")
        ab_url = urllib.parse.urljoin('', web_links)
        print(ab_url)
        if ab_url and ab_url not in url_set:
            f.write(str(ab_url) + "\n")
            url_set.add(ab_url)

I'd keep it simple and be explicit about how you're cleaning the URLs. For example, strip the last character if it's a slash (/) or a hash (#) (a URL that ends with a hash is the same as one that doesn't). After glancing at the data, I'd also remove any blank URLs, because that's probably not what you're looking for.

    base_url = ''

    all_links = soup.find_all('a')

    def clean_links(tags, base_url):
        cleaned_links = set()
        for tag in tags:
            link = tag.get('href')
            # Skip <a> tags that have no href attribute.
            if link is None:
                continue
            # Drop a single trailing slash or hash.
            if link.endswith('/') or link.endswith('#'):
                link = link[:-1]
            full_url = urllib.parse.urljoin(base_url, link)
            cleaned_links.add(full_url)
        return cleaned_links

    cleaned_links = clean_links(all_links, base_url)

    # The set already holds each cleaned URL only once.
    for link in cleaned_links:
        f.write(str(link) + '\n')
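Since the question asks about rstrip() specifically, here is a minimal sketch of the same idea using it. The usual stumbling block is that strings are immutable, so rstrip() returns a new string and the result has to be assigned or returned. The helper names normalize_url and unique_links are made up for illustration, and https://example.com/ in the usage note is only a placeholder base URL, not the site from the question.

```python
from urllib.parse import urljoin


def normalize_url(href, base_url):
    """Resolve href against base_url, then drop trailing '/' and '#' characters."""
    full_url = urljoin(base_url, href)
    # rstrip() does not modify full_url in place; it returns a new string.
    return full_url.rstrip('/#')


def unique_links(hrefs, base_url):
    """Return the normalized URLs in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    for href in hrefs:
        # Skip missing (None) and blank href values.
        if not href:
            continue
        url = normalize_url(href, base_url)
        if url and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

With the placeholder base URL, unique_links(['/about', '/about/', '/about#'], 'https://example.com/') yields a single entry for the about page. Note that rstrip('/#') removes every trailing slash and hash, not just one, so it is slightly more aggressive than slicing off the last character.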

