2 Ways to Extract Text From HTML Using Python

Python is a simple yet powerful programming language that can be applied to many areas, such as Scientific Computing and Natural Language Processing. One area of application that I find particularly fascinating is Web Scraping.
In this article, I’ll discuss how to extract text from an HTML file or Webpage using the Python programming language. But first, let’s see why it can be useful to extract text from a Webpage and where that text can be used.
Most people extract text from a Webpage in order to analyze it. For example, if you’re developing a text-processing Machine Learning algorithm and need text data for training, scraping Webpages and using their text as a training set can be quite handy. Others extract text from a Webpage for SEO analysis, to check why their competitor’s website is performing well in Google Search results.

Anyway, I’m not sure why you searched for “Extract Text from HTML” on Google and came to this page, but please let me know in the comments. 😊 😊 That would be quite interesting to know. Let’s get into the 2 ways to extract text from an HTML Webpage or file using Python.

  • Using BeautifulSoup for Extracting text out of HTML
  • Using html2text Python Package for Extracting text out of HTML

Let’s see how each of these methods can be used to take text out of HTML.

Extracting text out of HTML using BeautifulSoup Package

  1. Install the BeautifulSoup package by running python3 -m pip install bs4 in a terminal (bs4 is a thin wrapper that installs the beautifulsoup4 package)
  2. Import the BeautifulSoup class from the bs4 package using from bs4 import BeautifulSoup
  3. Import Request and urlopen from the urllib.request module using from urllib.request import Request, urlopen
  4. Pass the URL to Request, which builds a Request object describing the HTTP request (nothing is downloaded yet)
  5. Pass that Request object to urlopen, which sends the request and returns a file-like response object containing the page’s HTML
  6. Pass the response returned by urlopen to BeautifulSoup, which parses the HTML into a BeautifulSoup object
  7. Call get_text() on the BeautifulSoup object to get the page’s text as a single string
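
Steps 4 and 5 are easy to mix up, so here is a minimal sketch of what each one does. It only builds the request and inspects it; nothing is downloaded unless fetch_html() is called. The fetch_html helper, the User-Agent string, and the UTF-8 fallback are assumptions for illustration and are not part of the steps above.

```python
from urllib.request import Request, urlopen

URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Step 4: Request only describes the HTTP request; it does not contact the server.
# The User-Agent value is a made-up example; some sites may reject requests without one.
req = Request(URL, headers={"User-Agent": "text-extraction-demo/0.1"})

print(req.full_url)        # the URL the request points at
print(req.get_method())    # the HTTP method; GET when no data is attached


def fetch_html(request):
    """Step 5 (hypothetical helper): send the request and return the HTML as a string."""
    with urlopen(request) as response:
        # Use the charset declared by the server, falling back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html(req) would perform the actual download, and the returned string can be passed to BeautifulSoup in the same way as the response object in step 6.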

Let’s put all of the above 7 steps together as Python code. Let’s scrape the text of Python’s Wikipedia page and save it as the html_text.txt file.

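from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Build the request, then send it; urlopen returns a file-like response
req = Request("https://en.wikipedia.org/wiki/Python_(programming_language)")
html_page = urlopen(req)

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html_page, "html.parser")

# get_text() returns all the text in the page as one string
html_text = soup.get_text()

# Write the text to html_text.txt; the with block closes the file automatically
with open("html_text.txt", "w", encoding="utf-8") as f:
    f.write(html_text)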

Below is an image of the text file created by the above code (html_text.txt):

[Image: a Wikipedia page scraped using the BeautifulSoup package and the Python programming language]
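
The text returned by get_text() usually contains many blank lines, and depending on the BeautifulSoup version it may also include the contents of <script> and <style> tags. Below is a minimal sketch of one way to tidy it up, run on a small inline HTML string instead of a live page. The sample HTML and the extract_clean_text helper are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample page, used so the example runs without a network connection
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample Page</title>
    <style>p { color: red; }</style>
    <script>console.log("not visible text");</script>
  </head>
  <body>
    <h1>Hello</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def extract_clean_text(html):
    """Hypothetical helper: return the readable text, one piece of text per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags whose contents are code rather than readable text
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # The separator puts each piece of text on its own line;
    # strip=True trims surrounding whitespace and drops empty pieces
    return soup.get_text(separator="\n", strip=True)


print(extract_clean_text(SAMPLE_HTML))
```

The same helper could be applied to the HTML of a downloaded page before writing the result to a file.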

Text Extracting out of HTML page using Python’s html2text Package

If you have not read the BeautifulSoup section above, please do so first, as the html2text approach builds on the same steps.

  1. Install the html2text package by running python3 -m pip install html2text in a terminal
  2. Import the package using import html2text and create a converter object with html2text.HTML2Text()
  3. Set the ignore_links attribute of the HTML2Text object to True so that link URLs (the href in <a href="computersciencehub.io"></a>) are left out of the output
  4. Call handle() on the HTML2Text object, passing the HTML as a string; it returns the extracted text, formatted as Markdown

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import html2text

req = Request("https://en.wikipedia.org/wiki/Python_(programming_language)")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "html.parser")

# soup is a BeautifulSoup object that contains the parsed HTML

h = html2text.HTML2Text()
h.ignore_links = True                  # Leave link URLs out of the output

# handle() only accepts a string, so convert the soup object with str(soup)
text = h.handle(str(soup))

# Write the text to html_text.txt; the with block closes the file automatically
with open("html_text.txt", "w", encoding="utf-8") as f:
    f.write(text)

Below is an image of the text file created by the above code (html_text.txt):

[Image: getting text out of HTML using the html2text package and the Python programming language]

Final Thoughts

Personally, for extracting text out of an HTML Webpage I would use the first approach (BeautifulSoup) rather than the second one (html2text), because the second approach as written above needs both packages, BeautifulSoup and html2text, to be installed.
So it’s better to install just one package, BeautifulSoup, and extract the text from the Webpage with it.

Gagan

Hi there, I'm the founder of ComputerScienceHub (started to bring useful Computer Science information together in one place). I've been doing JavaScript and Python development since 2015, have worked on a couple of Web Development projects, and have done some Data Science work using Python. Nowadays I primarily work as a freelance JavaScript developer (Web Developer) while managing a team of Computer Science specialists at ComputerScienceHub.io
