In this article, I will discuss how the Python programming language can be used to extract all the links on a webpage.
This program uses two Python modules: urllib.request and re.
Using the urlopen function from urllib.request, we can retrieve a webpage from a server as an HTTP response object. To get the HTML code out of this object, call its read() method, which returns the HTML as a bytes object. In order to search for URLs inside this bytes object, it first needs to be decoded into plain text, which can be done by calling decode('utf-8') on it. The URLs in the resulting text can then be extracted with the re module, using '"((http|ftp)s?://.*?)"' as the pattern-matching string.
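The decode-and-search step can be sketched on a small inline HTML fragment, with no network access needed (the sample bytes below are made up purely for illustration):

```python
import re

# A made-up HTML fragment, as raw bytes (the kind of data read() returns)
raw = b'<a href="https://example.com/page">Page</a> <a href="ftp://files.example.com/a.txt">File</a>'

# Decode the bytes into a utf-8 text string
html = raw.decode('utf-8')

# findall returns (url, scheme) tuples here, because the
# pattern contains two capturing groups
links = re.findall('"((http|ftp)s?://.*?)"', html)
print(links)
```

Note that because the pattern has two sets of parentheses, re.findall returns a list of tuples; the first element of each tuple is the full URL.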
Steps to Find All Links on a Webpage Using Python
# Importing the required modules
from urllib.request import urlopen
import re

# Connecting to a URL
webpage = urlopen("https://computersciencehub.io")

# Reading the HTML code of the webpage
html = webpage.read().decode('utf-8')

# Using the re module to extract all links in the webpage
links = re.findall('"((http|ftp)s?://.*?)"', html)

# Printing the list of links in the webpage
# (findall returns (url, scheme) tuples because the pattern has two groups,
# so we print only the first element, the full URL)
for link in links:
    print(link[0])
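A page often repeats the same link several times, so it can be useful to deduplicate the results. A minimal sketch of this, using a hard-coded HTML string instead of a live request (the sample string is an assumption for illustration):

```python
import re

# A made-up HTML snippet containing the same link twice
html = '<a href="https://computersciencehub.io">Home</a> <a href="https://computersciencehub.io">Logo</a>'

# Extract (url, scheme) tuples, then keep only the unique full URLs
links = re.findall('"((http|ftp)s?://.*?)"', html)
unique_urls = set(url for url, scheme in links)
print(unique_urls)
```

Wrapping the results in a set is a small extension to the original program; it discards duplicates but does not preserve the order in which links appeared.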