Identical HTML code read differently using Python


Excuse the amateurish code; I'm sure it's painful to look at for anyone with experience.

I'm trying to write code that can scrape the data behind the following link: http://pq.gov.mt/pqweb.nsf/bysitting?openview, and save it in a searchable CSV file.
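By "searchable" I just mean something I can filter afterwards, roughly like this sketch (the file name and column names are placeholders for whatever the scraper ends up writing):

import csv

# Rough sketch of the "searchable" part: filter the saved CSV by keyword.
# 'questions.csv' and the column names are placeholders, not fixed names.
def search(path, keyword):
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            if keyword.lower() in row['question'].lower():
                yield row

for hit in search('questions.csv', 'edukazzjoni'):  # placeholder keyword
    print(hit['question_number'], hit['title'])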

The code I have written seems to work fine, in that it manages to save the information I need in different columns of a CSV file. It breaks down when it reaches one particular question, number 412 on this page: http://pq.gov.mt/pqweb.nsf/bysitting!openview&start=1&count=20&expand=9#9, where it fails to register the last entry for some reason (marked with an arrow <<<<<----- in the code).

As far as I can tell, the HTML of that page is identical to the rest, which all seem to work fine, so I can't understand how or why it is different.

I'm not sure how well I've explained the problem, so I'm happy to elaborate if necessary.

Thanks in advance. Code below:

import urllib.request
from bs4 import BeautifulSoup

# html_search_1 and the lists (legislature, category, ...) are set up
# earlier from the listing page (not shown here).
for item in html_search_1:
    x = item.find_all('a', href=True)
    for t in x:
        store = []
        y = t.get('href')
        new_url = "http://pq.gov.mt" + y
        page_2 = urllib.request.urlopen(new_url).read()
        soup_2 = BeautifulSoup(page_2, 'html.parser')
        html_search_3 = soup_2.find_all("div", class_="col-md-10")
        for ccc in html_search_3:
            html_search_4 = ccc.find_all("div", class_="row")
            for haga in html_search_4:
                z = haga.find_all("div", class_=["col-md-2", "col-md-4", "col-md-12", "col-md-10"])
                for new_item in z:
                    store.append(new_item.text)

        var0 = 1
        var1 = 3
        var2 = 5
        var3 = 7
        var4 = 9
        var5 = 13
        var6 = 14
        var7 = 15
        var8 = 17
        count = 1
        for o in range(1):
            try:
                legislature.append(store[var0])
                category.append(store[var1])
                question_number.append(store[var2])
                date.append(store[var3])
                sitting.append(store[var4])
                title.append(store[var5])
                mps.append(store[var6])
                question.append(store[var7])
                print(store[var7])
                answer.append(store[var8])
                print(store[var8])  # <<<<<<<<<<<<<<<<<<<<--------------------
                var0 = var0 + 19
                var1 = var1 + 19
                var2 = var2 + 19
                var3 = var3 + 19
                var4 = var4 + 19
                var5 = var5 + 19
                var6 = var6 + 19
                var7 = var7 + 19
                var8 = var8 + 19
            except:
                pass
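One thing I noticed while debugging (not sure it addresses the root cause): the bare except: pass swallows any IndexError, so a page whose divs don't line up with the expected layout just loses entries silently. Here is a sketch of a guard I could add, reusing the same store list and the same offsets with the 19-field stride:

# Sketch: surface layout mismatches instead of silently skipping them.
# Assumes the same `store` list and offsets (1, 3, 5, ...) as above.
STRIDE = 19

if len(store) % STRIDE != 0:
    # a page whose divs do not match the expected layout lands here
    print("unexpected layout: %d fields in %s" % (len(store), new_url))

for base in range(0, len(store) - STRIDE + 1, STRIDE):
    legislature.append(store[base + 1])
    category.append(store[base + 3])
    question_number.append(store[base + 5])
    date.append(store[base + 7])
    sitting.append(store[base + 9])
    title.append(store[base + 13])
    mps.append(store[base + 14])
    question.append(store[base + 15])
    answer.append(store[base + 17])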

That is an interesting question. It took me a while to grasp the intent, as my knowledge of the Maltese language is limited :-), but I think I've got a solution. Never mind the amateur code; you tried, and you thought about the problem. My code may not win an award either, but it gets past question 412 without problems.

import requests
from bs4 import BeautifulSoup

div_index = {'legislature': (0, 'md4'),
             'category': (1, 'md4'),
             'qnumber': (2, 'md4'),
             'qdate': (3, 'md4'),
             'sitting': (4, 'md4'),
             'title': (3, 'md10'),
             'mps': (0, 'md12'),
             'question': (1, 'md12'),
             'answer': (3, 'md12')}

def process_file(urlstring):
    r = requests.get("http://pq.gov.mt/" + urlstring)
    data = r.content
    doc = BeautifulSoup(data, 'html.parser')
    divs = {}
    divs['md2'] = doc.find_all("div", class_=["col-md-2"])
    divs['md4'] = doc.find_all("div", class_=["col-md-4"])
    divs['md10'] = doc.find_all("div", class_=["col-md-10"])
    divs['md12'] = doc.find_all("div", class_=["col-md-12"])

    result = {}
    for key, index in div_index.items():
        result[key] = divs[index[1]][index[0]].text

    return result


def main():
    r = requests.get('http://pq.gov.mt/pqweb.nsf/bysitting!openview&start=1&count=20&expand=9#9')
    data = r.content
    doc = BeautifulSoup(data, 'html.parser')
    links = doc.find_all("a")
    for link in links:
        if 'href' in link.attrs and link.attrs['href'].find('!opendocument') > 0:
            result = process_file(link.attrs['href'])
            print(result)

if __name__ == '__main__':
    main()
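If you would rather keep the results around than just print them, main() could collect the dictionaries into a list; a small variation, not tested against the live site:

def main():
    r = requests.get('http://pq.gov.mt/pqweb.nsf/bysitting!openview&start=1&count=20&expand=9#9')
    doc = BeautifulSoup(r.content, 'html.parser')
    results = []
    for link in doc.find_all("a"):
        # same filter as above: only follow the detail ('!opendocument') links
        if 'href' in link.attrs and link.attrs['href'].find('!opendocument') > 0:
            results.append(process_file(link.attrs['href']))
    return results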

What I do first is store the index of each field in a dictionary, along with the class of the div I'm interested in. After I had grokked the structure of the page, I found it easier to parse one list per class. That way it's easy to find the data (web scraping being what it is, you have to adjust manually if they change the page structure, but that didn't concern me as much as having more readable code did).

The advantage of iterating over the dict's keys and values is that it removes more than two dozen lines from your prototype code. That way the code loops over every entry in the index dict and retrieves the text for each.

I created a function for fetching each detail URL and cleaned up the code a little, returning a dictionary for convenience. If you want to split the names of the MPs into individual lines, you still have to use .split("\n") on result['mps']. Hope this helps ...
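And to close the loop on the original CSV goal, a minimal sketch that writes the collected dictionaries out, with the div_index keys as columns:

import csv

def save_csv(results, path='questions.csv'):
    # one row per question; columns follow the div_index keys
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(div_index.keys()))
        writer.writeheader()
        writer.writerows(results)

With the list-returning main() above, save_csv(main()) would then produce the file the question asked for.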

