Python Program to Find All Links on Webpage

In this article, I will discuss how the Python programming language can be used to extract all of the links on a webpage.
The program uses two Python modules: urllib.request and re.
Using the urlopen function from urllib.request, we can retrieve a webpage from a server as an HTTP response object. To get the HTML out of this object, call its read() method, which returns the page's HTML as a bytes object. Before searching for URLs, these bytes need to be converted to plain text, which is done by calling decode('utf-8') on the bytes object. The decoded text can then be searched for links with the re module, using '"((http|ftp)s?://.*?)"' as the pattern.
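Note that because the pattern contains two capturing groups, re.findall returns a tuple for each match rather than a plain string, with the full URL in the first position. A small sketch on a hypothetical HTML snippet shows this:

```python
import re

# A hypothetical HTML snippet containing two quoted URLs
html = '<a href="https://example.com/page">Page</a> <img src="http://example.com/img.png">'

# Each match is a tuple: (full URL, scheme captured by the inner group)
links = re.findall('"((http|ftp)s?://.*?)"', html)
print(links)
# -> [('https://example.com/page', 'http'), ('http://example.com/img.png', 'http')]
```

This is why the program below prints i[0] for each match: only the first element of each tuple is the complete URL.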

Steps to Find All Links on a Webpage Using Python

#importing the required modules
from urllib.request import urlopen
import re

# Connecting to a URL
webpage = urlopen("https://computersciencehub.io")

# Reading html code of Webpage
html = webpage.read().decode('utf-8')

# Using the re module to extract all links in the webpage
links = re.findall('"((http|ftp)s?://.*?)"', html)

# Printing the list of links found on the webpage
for link in links:
    print(link[0])
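The steps above can also be wrapped in a small reusable function. This is a sketch, not part of the original program: the function name find_links is hypothetical, and it adds basic error handling so an unreachable URL returns an empty list instead of raising an exception.

```python
from urllib.request import urlopen
from urllib.error import URLError
import re

def find_links(url):
    """Return every http/https/ftp/ftps URL quoted in the page's HTML.

    Hypothetical helper wrapping the steps above; returns an empty
    list if the page cannot be fetched.
    """
    try:
        html = urlopen(url).read().decode('utf-8')
    except URLError:
        return []
    # re.findall returns (url, scheme) tuples; keep only the full URL
    return [match[0] for match in re.findall('"((http|ftp)s?://.*?)"', html)]
```

For example, find_links("https://computersciencehub.io") would return the same list of URLs the program above prints, one per element.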

