beautifulsoup - Python with Beautiful Soup not looping through pages


Hey guys, first post. Marketer (ewww), new to Python, please don't shoot me.

I'm learning through trial and error, hacking scripts one at a time.

Can anyone tell me how to loop through the pages of a website and print the info from each URL?

url = "http://example.com"  urls = [url] # stack of urls scrape visited = [url] # record of scraped urls htmltext = urllib.urlopen(urls[0]).read()   # while stack of urls greater 0, keep scraping links while len(urls) > 0: try:     htmltext = urllib.urlopen(urls[0]).read()  # except visited urls except:     print urls[0]  # , print information soup = beautifulsoup(htmltext, "lxml") urls.pop(0) info = soup.findall(['title', 'h1', 'h2', 'p']) script in soup("script"): soup.script.extract()  print info  # number of urls in stack print len(urls)  # append incomplete tags tag in soup.findall('a',href=true):     tag['href'] = urlparse.urljoin(url,tag['href'])     if url in tag['href'] , tag['href'] not in visited:         urls.append(tag['href'])         visited.append(tag['href']) 

Comments:

  • I see visited mentioned twice. In one instance it's used as a list of URLs, in the other as a list of HTML elements. Naughty.
  • In the while loop each URL is solicited from the site and read into htmltext, barring an exception. Notice, though, that each time through the loop the previous contents of htmltext are overwritten and lost. BeautifulSoup must be called each time htmltext is made available, and the soup processed, before the soup is created again.
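Putting those two comments into practice, the asker's loop could be rearranged along these lines. This is only a sketch, kept in the question's Python 2 / urllib style: the except branch now skips the failed URL instead of falling through and re-parsing the stale htmltext, a fresh soup is built for every page, and visited holds nothing but URL strings.

    import urllib
    import urlparse
    from bs4 import BeautifulSoup

    url = "http://example.com"
    urls = [url]       # stack of urls still to scrape
    visited = [url]    # URL strings only

    while urls:
        current = urls.pop(0)
        try:
            htmltext = urllib.urlopen(current).read()
        except IOError:
            print current                        # report the failure...
            continue                             # ...and move on; don't reuse old htmltext

        soup = BeautifulSoup(htmltext, "lxml")   # fresh soup for every fresh page
        for script in soup("script"):
            script.extract()                     # strip script tags before printing
        print soup.findAll(['title', 'h1', 'h2', 'p'])

        for tag in soup.findAll('a', href=True):
            link = urlparse.urljoin(url, tag['href'])
            if url in link and link not in visited:
                urls.append(link)
                visited.append(link)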

Following this style, I would write the bare bones of the code in this way.

    import requests
    import bs4

    urls = ['url_1', 'url_2', 'url_3', 'url_4', 'url_5',
            'url_6', 'url_7', 'url_8', 'url_9', 'url_10']

    while urls:
        url = urls.pop(0)
        print(url)
        try:
            htmltext = requests.get(url).content
        except:
            print('*** attempt to open ' + url + ' failed')
            continue
        soup = bs4.BeautifulSoup(htmltext, 'lxml')
        title = soup.find('title')
        print(title)
  • I use the requests library rather than urllib because it makes life easier.
  • Since I use the pop method of the urls list to remove items, I don't need to keep a record of visited URLs. As they're eliminated, the urls list becomes shorter until it becomes empty.
  • while urls asks whether urls is empty or not.
  • The main point of the code is that it shows how interrogating the remote site and calling BeautifulSoup happen inside the same loop that spins through the list of URLs.
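One caveat on the second bullet: dropping the visited record only works while urls is a fixed list. If the bare bones are extended to push newly discovered links back onto urls, which is what the original question does, some record of what has already been queued is needed again, otherwise two pages that link to each other will be crawled forever. A rough sketch of that extension (start_url, the visited set, and the urljoin step are my additions, not part of the answer above):

    import urllib.parse                      # Python 3 home of urljoin
    import requests
    import bs4

    start_url = 'http://example.com'         # hypothetical starting page
    urls = [start_url]
    visited = {start_url}                    # needed again once links get appended

    while urls:
        url = urls.pop(0)
        print(url)
        try:
            htmltext = requests.get(url).content
        except requests.RequestException:
            print('*** attempt to open ' + url + ' failed')
            continue
        soup = bs4.BeautifulSoup(htmltext, 'lxml')
        print(soup.find('title'))

        # queue further pages of the same site
        for tag in soup.find_all('a', href=True):
            link = urllib.parse.urljoin(url, tag['href'])    # resolve relative hrefs
            if start_url in link and link not in visited:    # stay on the site, skip repeats
                visited.add(link)
                urls.append(link)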
