Python 3: remove duplicate web links with an extra trailing character (rstrip)
I'm using Python 3 and trying to pull unique links from a website. I seem to have the code working, except that a few links have a / at the end.
For example, the program includes both http://www.google.com and http://www.google.com/.
I'd like to make sure the program removes the last character so that no duplicates are returned. I have researched rstrip() but
can't seem to get it to work. Here is my code:
    import bs4 as bs
    import urllib.request
    import urllib.parse

    source = urllib.request.urlopen('https://www.census.gov/data/tables/2016/demo/popest/state-total.html').read()
    soup = bs.BeautifulSoup(source, 'lxml')

    filename = "uniqueweblinks.csv"
    f = open(filename, "w")
    headers = "weblinks\n"
    f.write(headers)

    all_links = soup.find_all('a')
    url_set = set()

    for link in all_links:
        web_links = link.get("href")
        ab_url = urllib.parse.urljoin('https://www.census.gov/data/tables/2016/demo/popest/state-total.html', web_links)
        print(ab_url)
        if ab_url and ab_url not in url_set:
            f.write(str(ab_url) + "\n")
            url_set.add(ab_url)
I'd keep it simple and be explicit about how you're cleaning the URLs. For example, strip the last character if it's a slash (/)
or a hash (#): if a URL ends with a hash, it points to the same page as the URL without it.
After glancing at the data, I'd also skip blank hrefs, because those aren't what you're looking for.
    base_url = 'https://www.census.gov/data/tables/2016/demo/popest/state-total.html'
    all_links = soup.find_all('a')

    def clean_links(tags, base_url):
        cleaned_links = set()
        for tag in tags:
            link = tag.get('href')
            if link is None:  # skip <a> tags that have no href at all
                continue
            if link.endswith('/') or link.endswith('#'):
                link = link[:-1]  # drop the trailing slash or hash
            full_url = urllib.parse.urljoin(base_url, link)
            cleaned_links.add(full_url)
        return cleaned_links

    cleaned_links = clean_links(all_links, base_url)
    for link in cleaned_links:
        f.write(str(link) + '\n')
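Since the question specifically mentions rstrip(), note that it would work here too; just be aware that str.rstrip() removes every trailing occurrence of the given characters, not only the last one, which is usually what you want for this case. A minimal sketch (the sample URLs are just illustrations):

    # all three variants collapse to the same cleaned URL
    for url in ['http://www.google.com', 'http://www.google.com/', 'http://www.google.com//#']:
        print(url.rstrip('/#'))  # prints http://www.google.com each time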
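As a small aside (a suggestion about the surrounding code, not part of the dedup fix): opening the output file with a with block guarantees it gets flushed and closed, and sorting the set gives the CSV a stable order. Something like:

    with open("uniqueweblinks.csv", "w") as f:
        f.write("weblinks\n")
        for link in sorted(cleaned_links):
            f.write(link + '\n')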