Extracting URLs from a Webpage using Python

Python can be used for many things, whether web development with Django or data visualisation with Matplotlib. One interesting application is web scraping: crawling webpages in order to analyse them. For example, if you want to find out how many links a webpage contains, Python can be used to do it.
Before jumping straight into how to use Python to get the links inside a webpage, let's first understand what links in a webpage are. Webpages are made up of HTML, which consists of different tags such as p, a, div and many more. Of all these HTML tags, the anchor tag, written as <a></a>, is the one used to denote links in a webpage. The anchor tag has an href attribute whose value is the address the link points to.
For example, <a href="https://google.com">Google</a> is a link inside a webpage that points to https://google.com

So, to extract the links from a webpage using Python:
• First, extract all anchor tags from the HTML of the webpage
• Then, extract the value of the href attribute of each of these anchor tags
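
The two steps above can be sketched on a small in-memory HTML snippet. This is only an illustration under a few assumptions: it uses html.parser from Python's standard library rather than the BeautifulSoup package used in the rest of this article, and the class name AnchorCollector, the helper extract_links and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: remembers the attributes of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []  # one dict of attributes per anchor tag

    def handle_starttag(self, tag, attrs):
        # Step 1: keep only anchor tags
        if tag == "a":
            self.anchors.append(dict(attrs))


def extract_links(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    # Step 2: read the href value of each anchor, skipping missing or empty ones
    return [a["href"] for a in collector.anchors if a.get("href")]


# Sample HTML invented for this example
sample_html = """
<p>Some text</p>
<a href="https://google.com">Google</a>
<a>No href here</a>
<div><a href="/about">About</a></div>
"""

links = extract_links(sample_html)
print(links)
```

The anchor tag without an href is collected in step 1 but dropped in step 2, so only two links are printed.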

Now that you know what needs to be done to get all the links from a webpage, let's see how it can be done in Python.

Follow the step-by-step procedure below to get all links from a webpage using Python.

  • Install the BeautifulSoup module by running python3 -m pip install bs4 in a terminal
  • Import the BeautifulSoup class from the bs4 package using the statement from bs4 import BeautifulSoup
  • Import Request and urlopen from the urllib.request module using the statement from urllib.request import Request, urlopen
  • Pass the URL to Request, which builds a request object for the webpage (nothing is downloaded yet)
  • Pass that request object to urlopen, which fetches the page and returns a response object containing its raw HTML
  • Pass the response returned by urlopen to BeautifulSoup, which parses the HTML into a searchable object
  • Use find_all('a') (findAll('a') is an older alias of the same method) to look up the anchor tags <a></a> in that object, then call get('href') on each tag to read its link
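
The parsing and lookup steps from the list above can be tried without touching the network by handing BeautifulSoup an HTML string directly. This is a minimal sketch, assuming bs4 is installed; the sample HTML is made up for the example and simply stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Sample HTML invented for this example (stands in for a downloaded page)
sample_html = """
<html><body>
<p>Intro paragraph</p>
<a href="https://google.com">Google</a>
<a name="top">Anchor without href</a>
<a href="/contact">Contact</a>
</body></html>
"""

# Parse the text into a searchable object
soup = BeautifulSoup(sample_html, "html.parser")

# Look up every anchor tag
anchors = soup.find_all("a")

# get('href') returns None when the attribute is missing
hrefs = [a.get("href") for a in anchors]
print(len(anchors))
print(hrefs)
```

The second anchor has no href, so its entry is None. That is why a check for missing href values is needed before the links are used.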

Let's put all of the above steps together as Python code.
Below is the Python code for extracting the links from a webpage and saving them to a file.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Build the request and fetch the page
req = Request("https://computersciencehub.io/python/os-path-module-in-python/")
html_page = urlopen(req)

# Parse the downloaded HTML
soup = BeautifulSoup(html_page, "html.parser")

# Write every non-empty href to links.txt, one per line
with open("links.txt", "w") as f:
    for link in soup.find_all('a'):
        one_link = link.get('href')
        if one_link:  # Skip anchor tags whose href is missing or empty
            f.write(one_link)
            f.write("\n")

The output file links.txt generated by the above code will contain the extracted links, one per line.
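
Because the script writes one link per line, the question from the introduction (how many links a webpage has) can be answered by counting the non-empty lines of that file. This is a small sketch; the function name count_links is made up for the example, and it assumes links.txt was already created by the script above.

```python
import os


def count_links(path):
    """Hypothetical helper: count the non-empty lines in a links file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())


# Only report a count if the script above has already produced links.txt
if os.path.exists("links.txt"):
    print(count_links("links.txt"))
```

Note that this counts every link the page contains, so a URL that appears in several anchor tags is counted once per occurrence.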

Gagan

Hi there, I'm the founder of ComputerScienceHub (started to bring useful Computer Science information together in one place). I've been doing JavaScript and Python development since 2015, have worked on a couple of web development projects, and have done some data science work using Python. Nowadays I primarily work as a freelance JavaScript developer (web developer) while managing a team of Computer Science specialists at ComputerScienceHub.io
