web scraping - Getting a URL from hyperlinks using BeautifulSoup in Python


I scraped a webpage using BeautifulSoup and assigned the result to 'soup'. I can get the text 'aberdeen' by adding .text onto the end of 'site_url'.

What I want is the complete URL as a string, e.g. "http://www.somewebsite.com/networks/site-info?site_id=abd"

>>> site_link = soup.find_all('a', string='aberdeen')[0]
>>> site_row = site_link.find_parent('td').find_parent('tr')
>>> site_column = site_row.find_all('td')
>>> site_url = site_column[0].contents[0]
>>> print(site_url)
<a href="../networks/site-info?site_id=abd">aberdeen</a>

I have not had any luck so far and do not know what else to try. How can I get the URL?

You can use a regular expression to find the links, then use urljoin to turn the relative hrefs into absolute URLs.

import requests
import re

try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3

from bs4 import BeautifulSoup

url = 'https://uk-air.defra.gov.uk/latest/currentlevels'
r = requests.get(url, headers={'user-agent': 'not blank'})
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# soup(...) is shorthand for soup.find_all(...); keep only links whose href contains 'site_id'
for elem in soup('a', href=re.compile(r'site_id')):
    print(elem.text)
    # Resolve the relative href against the URL of the page it came from
    print(urljoin(url, elem['href']))

Outputs:

auchencorth moss
https://uk-air.defra.gov.uk/networks/site-info?site_id=acth
bush estate
https://uk-air.defra.gov.uk/networks/site-info?site_id=bush
dumbarton roadside
https://uk-air.defra.gov.uk/networks/site-info?site_id=dumb
edinburgh st leonards
https://uk-air.defra.gov.uk/networks/site-info?site_id=ed3
glasgow great western road
https://uk-air.defra.gov.uk/networks/site-info?site_id=ggwr
glasgow high street
https://uk-air.defra.gov.uk/networks/site-info?site_id=ghsr
...
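To see what urljoin does on its own, here is a minimal sketch using only the standard library (Python 3). It takes the page URL from the code above and the relative href from the question; the variable names are just for illustration.

```python
from urllib.parse import urljoin  # Python 3 standard library

# URL of the page the link was scraped from
base_url = "https://uk-air.defra.gov.uk/latest/currentlevels"

# Relative href exactly as it appears in the question's <a> tag
relative_href = "../networks/site-info?site_id=abd"

# '..' steps up from /latest/ to the site root before appending the rest
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
```

The base should be the URL of the page the href was found on, since that is what a relative link is resolved against.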

If you only want aberdeen, use:

for elem in soup('a', href=re.compile(r'site_id'), string='aberdeen'):

Instead of:

for elem in soup('a', href=re.compile(r'site_id')):

Outputs:

aberdeen
https://uk-air.defra.gov.uk/networks/site-info?site_id=abd
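If you already have the tag in hand, as in the question, the href can be read straight off it like a dictionary key. A rough sketch, assuming a small inline HTML snippet as a stand-in for the real page (the `html` string and variable names here are made up for the example):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page: one table row holding the link
html = (
    '<table><tr><td>'
    '<a href="../networks/site-info?site_id=abd">aberdeen</a>'
    '</td></tr></table>'
)
page_url = "https://uk-air.defra.gov.uk/latest/currentlevels"

soup = BeautifulSoup(html, "html.parser")
site_link = soup.find("a", string="aberdeen")

# Tag attributes are accessed like dictionary keys;
# site_link.get("href") would return None instead of raising if it were missing
href = site_link["href"]

# Resolve the relative href against the page it was found on
site_url = urljoin(page_url, href)
print(site_link.text)
print(site_url)
```

So in the question's code, site_url['href'] should give the relative link, and urljoin turns it into the full URL.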
