beautifulsoup - Python with Beautiful Soup not looping through pages
Hey guys, first post. Marketer (ewww), new to Python, please don't shoot me.
I'm learning through trial and error, hacking at scripts like this one.
Can anyone tell me how to loop through the pages of a website and print the info for each URL?
url = "http://example.com"  urls = [url] # stack of urls scrape visited = [url] # record of scraped urls htmltext = urllib.urlopen(urls[0]).read()   # while stack of urls greater 0, keep scraping links while len(urls) > 0: try:     htmltext = urllib.urlopen(urls[0]).read()  # except visited urls except:     print urls[0]  # , print information soup = beautifulsoup(htmltext, "lxml") urls.pop(0) info = soup.findall(['title', 'h1', 'h2', 'p']) script in soup("script"): soup.script.extract()  print info  # number of urls in stack print len(urls)  # append incomplete tags tag in soup.findall('a',href=true):     tag['href'] = urlparse.urljoin(url,tag['href'])     if url in tag['href'] , tag['href'] not in visited:         urls.append(tag['href'])         visited.append(tag['href'])      
Comments:
- I see visited mentioned twice. In one instance it's used as a list of URLs, in the other as a list of HTML elements. Naughty.
- In the while loop each URL is requested from the site and read into htmltext, barring an exception. Notice, though, that each time through the loop the previous contents of htmltext are overwritten and lost. BeautifulSoup must be called each time htmltext is made available, and that soup processed, before soup is created again.
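Put together, those two fixes mean keeping visited as a plain collection of URLs and pairing every read of htmltext with its own BeautifulSoup call inside the loop. A rough sketch of that shape, before moving on to the leaner version below, reusing the question's example.com placeholder and its base-URL check, and using requests/bs4 simply because that's what the rest of this answer uses:

import requests
import bs4
try:                                      # urljoin moved between Python 2 and 3
    from urllib.parse import urljoin
except ImportError:
    from urlparse import urljoin

base = "http://example.com"               # placeholder start page from the question
urls = [base]                             # stack of urls still to scrape
visited = {base}                          # visited holds URLs only, nothing else

while urls:
    url = urls.pop(0)
    try:
        htmltext = requests.get(url).content
    except requests.RequestException:
        print('could not fetch ' + url)
        continue

    # htmltext was just (re)filled, so build and use the soup now,
    # before the next trip through the loop overwrites it
    soup = bs4.BeautifulSoup(htmltext, "lxml")

    # queue same-site links that haven't been seen before
    for tag in soup.find_all('a', href=True):
        href = urljoin(base, tag['href'])
        if base in href and href not in visited:
            visited.add(href)
            urls.append(href)

The set membership test is what stops the loop from revisiting pages, and because the soup is rebuilt from the freshly read htmltext on every pass, nothing is ever parsed from a stale page.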
The following is the style in which I would write the bare bones of the code.
import requests
import bs4

urls = ['url_1', 'url_2', 'url_3', 'url_4', 'url_5',
        'url_6', 'url_7', 'url_8', 'url_9', 'url_10']

while urls:
    url = urls.pop(0)
    print(url)
    try:
        htmltext = requests.get(url).content
    except:
        print('*** attempt to open ' + url + ' failed')
        continue
    soup = bs4.BeautifulSoup(htmltext, 'lxml')
    title = soup.find('title')
    print(title)

- I use the requests library rather than urllib because it makes life easier.
- Since I use the pop method of the urls list to remove items, I don't need to keep a record of visited URLs. They're eliminated as the urls list becomes shorter and eventually becomes empty. while urls asks whether urls is empty or not.
- The main point of this code is that it shows how interrogating the remote site and calling BeautifulSoup happen in the same loop that spins through the list of urls.
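If you still want the original goal of printing each page's title, headings and paragraphs, the parsing half of that loop can be pushed into a small helper. This is only a sketch: page_info is a made-up name, and the tag list is the one from the question.

import bs4

def page_info(htmltext):
    # return the readable text of one page: title, headings and paragraphs,
    # with script tags stripped out first
    soup = bs4.BeautifulSoup(htmltext, 'lxml')
    for script in soup.find_all('script'):
        script.extract()
    return [el.get_text(strip=True) for el in soup.find_all(['title', 'h1', 'h2', 'p'])]

Calling print(page_info(htmltext)) in place of the title lines above keeps the fetch and the parse in the same pass through the loop, which is exactly the point the bare-bones version is making.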