beautifulsoup - Python with Beautiful Soup not looping through pages

April 15, 2010

Hey guys, first post. I'm a marketer (ewww) and new to Python, so please don't shoot me.

I'm learning through trial and error, hacking on scripts one at a time.

Can anyone tell me how to loop through all the pages of a website and print the info from each URL?

    import urllib
    import urlparse
    from bs4 import BeautifulSoup

    url = "http://example.com"

    urls = [url]     # stack of urls to scrape
    visited = [url]  # record of scraped urls
    htmltext = urllib.urlopen(urls[0]).read()

    # while the stack of urls is greater than 0, keep scraping for links
    while len(urls) > 0:
        try:
            htmltext = urllib.urlopen(urls[0]).read()
        # except for visited urls
        except:
            print urls[0]

    # get and print information
    soup = BeautifulSoup(htmltext, "lxml")
    urls.pop(0)
    info = soup.findAll(['title', 'h1', 'h2', 'p'])
    for script in soup("script"):
        soup.script.extract()

    print info

    # number of urls in stack
    print len(urls)

    # append incomplete tags
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            urls.append(tag['href'])
            visited.append(tag['href'])

comments:

I see visited mentioned twice. In one instance it's used as a list of URLs; in the other, as a list of HTML elements. Naughty.
In the while loop, each URL is requested from the site and read into htmltext, barring an exception. Notice, though, that each time through the loop the previous contents of htmltext are overwritten and lost. BeautifulSoup must be called each time htmltext is made available, and the soup that BeautifulSoup returns must be processed before soup is created again.

Following your style, I would write the bare bones of your code in this way.

    import requests
    import bs4

    urls = ['url_1', 'url_2', 'url_3', 'url_4', 'url_5',
            'url_6', 'url_7', 'url_8', 'url_9', 'url_10']

    while urls:
        # take the next url off the front of the list
        url = urls.pop(0)
        print(url)
        try:
            htmltext = requests.get(url).content
        except requests.RequestException:
            print('*** attempt to open ' + url + ' failed')
            continue
        # parse and use the page inside the same iteration that fetched it
        soup = bs4.BeautifulSoup(htmltext, 'lxml')
        title = soup.find('title')
        print(title)

I use the requests library rather than urllib because it makes life easier.
Since I use the pop method of the urls list to remove items, I don't need to keep a record of visited URLs. As they're eliminated, the urls list becomes shorter and eventually becomes empty.
while urls asks whether urls is empty or not.
The main point of this code is that it shows how interrogating the remote site and calling BeautifulSoup both happen in the same loop that spins through the list of URLs.
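To get back to the original goal of following links from page to page, the same loop can be extended roughly as below. This is only a sketch under a few assumptions. crawl, get_links and max_pages are names made up for this illustration; get_links stands for whatever function fetches a page and returns the href values found on it. Once newly discovered links are appended to the queue, a record of visited URLs is needed again, otherwise pages that link to each other would be queued forever.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def crawl(start_url, get_links, max_pages=50):
    """Scrape pages breadth-first from start_url, staying under that URL.

    get_links is a hypothetical callable supplied by the caller: given a
    URL, it returns the raw href strings found on that page.
    """
    queue = deque([start_url])  # URLs waiting to be scraped
    visited = {start_url}       # every URL ever queued, so none is queued twice
    scraped = []                # URLs that were fetched successfully, in order

    while queue and len(scraped) < max_pages:
        url = queue.popleft()
        try:
            hrefs = get_links(url)
        except Exception as exc:
            # A page that fails should not stop the whole crawl.
            print('*** attempt to open ' + url + ' failed: ' + str(exc))
            continue
        scraped.append(url)

        for href in hrefs:
            # Resolve relative links against the page they appeared on,
            # and drop any #fragment so one page is not queued twice.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(start_url) and absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)
    return scraped
```

With requests and bs4, get_links could be a small function that calls requests.get(url), builds the soup, and returns tag['href'] for every tag in soup.find_all('a', href=True). A deque is used here only because popleft() is cheaper than list.pop(0) on long queues; a plain list would behave the same way.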
