Web scraping - Getting a URL from hyperlinks using BeautifulSoup in Python
I scraped a webpage using BeautifulSoup and assigned the result to 'soup'. I can get the text 'aberdeen' by adding .text onto the end of 'site_url'.
What I want is the complete URL in a string, e.g. "http://www.somewebsite.com/networks/site-info?site_id=abd"
    >>> site_link = soup.find_all('a', string='aberdeen')[0]
    >>> site_row = site_link.find_parent('td').find_parent('tr')
    >>> site_column = site_row.find_all('td')
    >>> site_url = site_column[0].contents[0]
    >>> print(site_url)
    <a href="../networks/site-info?site_id=abd">aberdeen</a>
I have not had any luck so far and do not know what else to try. How can I get the URL?
You can use a regular expression to find the links, then use urljoin to turn each relative href into a complete URL.
    import re

    import requests
    from bs4 import BeautifulSoup

    try:
        from urlparse import urljoin  # Python 2
    except ImportError:
        from urllib.parse import urljoin  # Python 3

    url = 'https://uk-air.defra.gov.uk/latest/currentlevels'
    # Send a non-empty User-Agent header with the request
    r = requests.get(url, headers={'user-agent': 'not blank'})
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')

    # Every <a> tag whose href contains 'site_id'
    for elem in soup('a', href=re.compile(r'site_id')):
        print(elem.text)
        # Resolve the relative href against the page URL
        print(urljoin(url, elem['href']))
Outputs:
    auchencorth moss
    https://uk-air.defra.gov.uk/networks/site-info?site_id=acth
    bush estate
    https://uk-air.defra.gov.uk/networks/site-info?site_id=bush
    dumbarton roadside
    https://uk-air.defra.gov.uk/networks/site-info?site_id=dumb
    edinburgh st leonards
    https://uk-air.defra.gov.uk/networks/site-info?site_id=ed3
    glasgow great western road
    https://uk-air.defra.gov.uk/networks/site-info?site_id=ggwr
    glasgow high street
    https://uk-air.defra.gov.uk/networks/site-info?site_id=ghsr
    ...
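To see why urljoin produces those URLs, here is a minimal sketch that uses only the standard library (Python 3 import shown). The base is the page URL from the code above and the relative path is the href from the question. The names base, relative and absolute are just for illustration.

```python
from urllib.parse import urljoin  # Python 3

# The page that was scraped, and the relative href found in its <a> tag
base = 'https://uk-air.defra.gov.uk/latest/currentlevels'
relative = '../networks/site-info?site_id=abd'

# '..' steps up from '/latest/' to '/', then 'networks/site-info' is appended
absolute = urljoin(base, relative)
print(absolute)
# https://uk-air.defra.gov.uk/networks/site-info?site_id=abd
```

The href is resolved the same way a browser would resolve it, so the query string is kept and the '..' segment is collapsed.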
If you only want Aberdeen, use:
    for elem in soup('a', href=re.compile(r'site_id'), string='aberdeen'):
instead of:
    for elem in soup('a', href=re.compile(r'site_id')):
Outputs:
    aberdeen
    https://uk-air.defra.gov.uk/networks/site-info?site_id=abd
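If you already have the tag from the question, the href can also be read straight off it with site_link['href'], with no regular expression needed. Below is a rough sketch that runs offline: the html string is a hypothetical stand-in for one table row of the real page, and page_url is assumed to be the address the page was fetched from.

```python
from urllib.parse import urljoin  # Python 3

from bs4 import BeautifulSoup

# Hypothetical one-row table standing in for the scraped page
html = (
    '<table><tr>'
    '<td><a href="../networks/site-info?site_id=abd">aberdeen</a></td>'
    '</tr></table>'
)
# Assumed to be the URL the page was downloaded from
page_url = 'https://uk-air.defra.gov.uk/latest/currentlevels'

soup = BeautifulSoup(html, 'html.parser')

# Find the link by its text, as in the question
site_link = soup.find('a', string='aberdeen')

# Tag attributes are read like dictionary keys; urljoin makes the URL absolute
site_url = urljoin(page_url, site_link['href'])

print(site_link.text)
print(site_url)
```

Note that find returns None when nothing matches, so on a real page it is worth checking site_link before indexing into it.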