Python can be used for many things, from web development with Django to data visualisation with Matplotlib, but one interesting application of Python is web scraping: crawling webpages in order to analyse them. For example, if you want to find out how many links a webpage contains, Python can do it.
Before jumping straight into how to use Python for getting the links inside a webpage, let's first understand what links in a webpage are. Webpages are made up of HTML, which consists of different tags like p, a, div and many more. Out of all these HTML tags, the anchor tag, written as <a></a>, is used for denoting links in a webpage. The anchor tag has an href attribute whose value tells you where the link is pointing to.
For example, <a href="google.com">Google</a> is a link inside a webpage pointing to google.com.
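You can see this in Python itself (using the BeautifulSoup module, which is installed later in this guide) by parsing that example anchor tag and reading its href attribute:

```python
from bs4 import BeautifulSoup

# Parse the example anchor tag from above
soup = BeautifulSoup('<a href="google.com">Google</a>', "html.parser")

tag = soup.find("a")      # locate the anchor tag
print(tag.text)           # the link text shown to the user: Google
print(tag.get("href"))    # where the link points: google.com
```

The text between the opening and closing tags is what the user sees, while the href attribute holds the actual destination.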
So, to extract links out of a webpage using Python:
• First, extract all anchor tags out of the webpage's HTML
• Then, extract the value of the href attribute of each of these anchor tags
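These two steps can be sketched on a small in-memory HTML snippet (no network access needed; the HTML here is just an illustration):

```python
from bs4 import BeautifulSoup

html = """
<p>Some text with <a href="https://example.com">one link</a>
and <a href="https://example.org/about">another</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: extract all anchor tags out of the HTML
anchors = soup.find_all("a")

# Step 2: extract the href value of each anchor tag
links = [a.get("href") for a in anchors]
print(links)
```

The same two steps apply to a real webpage; the only difference is that the HTML comes from the network instead of a string.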
Now that you know what needs to be done to get all of the links out of a webpage, let's see how it can be done using the Python programming language.
Follow the step-by-step procedure below for getting all links from a webpage using Python.
- Install the BeautifulSoup module by running python3 -m pip install beautifulsoup4 in a terminal
- From the bs4 package, import BeautifulSoup using the from bs4 import BeautifulSoup statement
- Import the Request and urlopen functions from the urllib.request module using the from urllib.request import Request, urlopen statement
- Pass the URL to the Request function, which returns a Request object describing the page to fetch
- Pass the Request object to the urlopen function, which fetches the page and returns its HTML
- Pass the HTML returned by urlopen to BeautifulSoup, which parses it into a searchable HTML object
- Use find_all('a') on that object to look for all anchor tags <a></a>
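Before combining everything, it helps to see what the Request step alone produces. Building a Request object does not fetch anything yet; it only records the URL and the HTTP method, and the network request happens later, when the object is passed to urlopen:

```python
from urllib.request import Request

# Build a Request object for a URL; no data is fetched
# until urlopen is called on this object
req = Request("https://computersciencehub.io/python/os-path-module-in-python/")

print(type(req).__name__)   # the class of the object: Request
print(req.full_url)         # the URL the request will fetch
print(req.get_method())     # the HTTP method, GET by default
```

Separating "describe the request" from "send the request" is also what lets you attach headers to a Request object before sending it, though the basic usage above is all this guide needs.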
Let's put all of the above steps together as Python code. Below is the Python code for extracting links from a webpage and saving them into a file.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("https://computersciencehub.io/python/os-path-module-in-python/")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "html.parser")

f = open("links.txt", "w")
for link in soup.find_all('a'):
    one_link = link.get('href')
    if one_link:  # Checking whether href is empty
        f.write(one_link)
        f.write("\n")
f.close()
The output file links.txt generated by the above code will look like this: