Python 3: remove duplicate web links with an extra trailing character (rstrip)
I'm using Python 3 and trying to pull unique links from a website. I seem to have the code working, except that a few links have a / at the end.
For example, the program includes both http://www.google.com and http://www.google.com/.
I'd like to make sure the program removes the last character so that no duplicates are returned. I have researched rstrip() but
can't seem to get it to work. Here is my code:
    import bs4 as bs
    import urllib.request
    import urllib.parse

    source = urllib.request.urlopen('https://www.census.gov/data/tables/2016/demo/popest/state-total.html').read()
    soup = bs.BeautifulSoup(source, 'lxml')

    filename = "uniqueweblinks.csv"
    f = open(filename, "w")
    headers = "weblinks\n"
    f.write(headers)

    all_links = soup.find_all('a')
    url_set = set()

    for link in all_links:
        web_links = link.get("href")
        ab_url = urllib.parse.urljoin('https://www.census.gov/data/tables/2016/demo/popest/state-total.html', web_links)
        print(ab_url)
        if ab_url and ab_url not in url_set:
            f.write(str(ab_url) + "\n")
            url_set.add(ab_url)
I'd keep it simple and be explicit about how you're cleaning the URLs. For example, strip the last character if it's a slash (/)
or a hash (#): if a URL ends with a hash, it points to the same page as the URL without it.
After glancing at the data, I'd also skip blank hrefs, because those aren't what you're looking for.
    base_url = 'https://www.census.gov/data/tables/2016/demo/popest/state-total.html'
    all_links = soup.find_all('a')

    def clean_links(tags, base_url):
        cleaned_links = set()
        for tag in tags:
            link = tag.get('href')
            if link is None:  # skip <a> tags that have no href at all
                continue
            if link.endswith('/') or link.endswith('#'):
                link = link[:-1]  # drop the trailing slash or hash
            full_url = urllib.parse.urljoin(base_url, link)
            cleaned_links.add(full_url)
        return cleaned_links

    cleaned_links = clean_links(all_links, base_url)
    for link in cleaned_links:
        f.write(str(link) + '\n')
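Since the question specifically mentions rstrip(), note that it would work here too; just be aware that str.rstrip() removes every trailing occurrence of the given characters, not only the last one, which is usually what you want for this case. A minimal sketch (the sample URLs are just illustrations):

    # all three variants collapse to the same cleaned URL
    for url in ['http://www.google.com', 'http://www.google.com/', 'http://www.google.com//#']:
        print(url.rstrip('/#'))  # prints http://www.google.com each time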
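As a small aside (a suggestion about the surrounding code, not part of the dedup fix): opening the output file with a with block guarantees it gets flushed and closed, and sorting the set gives the CSV a stable order. Something like:

    with open("uniqueweblinks.csv", "w") as f:
        f.write("weblinks\n")
        for link in sorted(cleaned_links):
            f.write(link + '\n')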