beautifulsoup - Python with Beautiful Soup not looping through pages
Hey guys, first post. I'm a marketer (ewww), new to Python, please don't shoot me.
I'm learning through trial and error, hacking scripts into one.
Can anyone tell me how to loop through the pages of a website and print the info from each URL?
url = "http://example.com" urls = [url] # stack of urls scrape visited = [url] # record of scraped urls htmltext = urllib.urlopen(urls[0]).read() # while stack of urls greater 0, keep scraping links while len(urls) > 0: try: htmltext = urllib.urlopen(urls[0]).read() # except visited urls except: print urls[0] # , print information soup = beautifulsoup(htmltext, "lxml") urls.pop(0) info = soup.findall(['title', 'h1', 'h2', 'p']) script in soup("script"): soup.script.extract() print info # number of urls in stack print len(urls) # append incomplete tags tag in soup.findall('a',href=true): tag['href'] = urlparse.urljoin(url,tag['href']) if url in tag['href'] , tag['href'] not in visited: urls.append(tag['href']) visited.append(tag['href'])
Comments:
- I see visited mentioned twice. In one instance it's used as a list of urls, in the other as a list of html elements. Naughty.
- In the while loop each url is solicited from the site and read into htmltext, barring an exception. Notice though that, each time through the loop, the previous contents of htmltext are overwritten and lost. BeautifulSoup must be called each time htmltext is made available, and the soup processed, before soup is created again.
The following is the style in which I would write the bare bones of the code.
import requests
import bs4

urls = ['url_1', 'url_2', 'url_3', 'url_4', 'url_5',
        'url_6', 'url_7', 'url_8', 'url_9', 'url_10']

while urls:
    url = urls.pop(0)
    print(url)
    try:
        htmltext = requests.get(url).content
    except:
        print('*** attempt to open ' + url + ' failed')
        continue
    soup = bs4.BeautifulSoup(htmltext, 'lxml')
    title = soup.find('title')
    print(title)
- I use the requests library rather than urllib because it makes life easier.
- Since I use the pop method of the urls list to remove items, I don't need to keep a record of visited urls. They're eliminated as the urls list becomes shorter, until it becomes empty. while urls: simply asks whether urls is empty or not.
- The main point of the code is that it shows how to interrogate the remote site and call BeautifulSoup inside the same loop that spins through the list of urls. A sketch extending this idea back to the original crawling question follows below.
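To tie this back to the original goal of walking a site rather than a fixed list, here is a minimal sketch in the same style, not the answerer's code. It assumes Python 3 with requests, bs4, and lxml installed; start_url is a hypothetical placeholder for wherever you want to begin. Note that once you start appending newly discovered links, a record of what has already been queued becomes necessary again, so a seen set returns.

import urllib.parse
import requests
import bs4

start_url = 'http://example.com'  # hypothetical: replace with the site to walk
urls = [start_url]                # queue of pages still to fetch
seen = {start_url}                # everything ever queued, so pages are not refetched

while urls:
    url = urls.pop(0)
    print(url)
    try:
        htmltext = requests.get(url, timeout=10).content
    except requests.RequestException:
        print('*** attempt to open ' + url + ' failed')
        continue
    soup = bs4.BeautifulSoup(htmltext, 'lxml')
    # print the pieces the question asked for
    for tag in soup.find_all(['title', 'h1', 'h2', 'p']):
        print(tag.get_text(strip=True))
    # queue any same-site links found on this page
    for a in soup.find_all('a', href=True):
        link = urllib.parse.urljoin(url, a['href'])
        if link.startswith(start_url) and link not in seen:
            seen.add(link)
            urls.append(link)

Popping from the front of the list makes this a breadth-first walk of the site. A real crawler would also want politeness delays and robots.txt handling, which are left out of this sketch.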