Scraping Wikipedia Using Python

Python is a simple yet powerful programming language that is used in many areas, such as scientific computing and natural language processing. One application of Python that I find particularly fascinating is web scraping.
In this article, I'll discuss how to extract text from a Wikipedia page using Python. Let's first discuss why you might need a Wikipedia page's text. There can be many reasons. For example, in the 4th semester of my Bachelor's, I did a machine learning project in which three other group members and I trained a query processing algorithm on the text of bird-related Wikipedia pages. After training, the algorithm could be asked general questions about birds. For this training we used almost 600 MB of text files, all extracted from bird-related pages on Wikipedia.
Beyond this use case, there are many other applications where having the text of Wikipedia pages is helpful.
Let's now look at the Python code for extracting the text of a Wikipedia page.

Text can be extracted from a Wikipedia page using Python's BeautifulSoup package. Below are 7 steps for setting up BeautifulSoup and extracting the text of a Wikipedia page.

  1. Install the BeautifulSoup package by running python3 -m pip install bs4 in a terminal
  2. Import the BeautifulSoup class from the bs4 package using from bs4 import BeautifulSoup
  3. Import Request and urlopen from the urllib.request module using from urllib.request import Request, urlopen
  4. Pass the URL to Request, which creates a request object for the web page
  5. Pass the request object to urlopen, which sends the request and returns an HTTP response object containing the page's HTML
  6. Pass the response object to BeautifulSoup together with a parser name such as "html.parser", which parses the HTML into a BeautifulSoup object
  7. Call get_text() on the BeautifulSoup object to get the page's text as a single string
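Steps 6 and 7 can be tried without any network access. The following is a minimal sketch that parses a small, made-up HTML string (sample_html is an assumption standing in for a downloaded page) and shows that get_text() drops the tags and keeps only the text.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a downloaded page
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Birds</h1>
    <p>Birds are <b>warm-blooded</b> vertebrates.</p>
  </body>
</html>
"""

# Step 6: parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(sample_html, "html.parser")

# Step 7: get_text() returns the text of the whole document, tags removed
page_text = soup.get_text()

# get_text() also works on a single element, here the first <p> tag
first_paragraph = soup.find("p").get_text()

print(first_paragraph)
```

Note that page_text also includes the contents of the title tag, because get_text() called on the whole document collects text from every element.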

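Example 1 => Extracting Text from the Computer Programming Wikipedia Page

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Wikimedia asks clients to identify themselves, and the default urllib
# User-Agent may be rejected. The value below is a placeholder: use your own.
headers = {"User-Agent": "wiki-text-example/0.1 (you@example.com)"}
req = Request("https://en.wikipedia.org/wiki/Computer_programming", headers=headers)

# urlopen sends the request and returns an HTTP response object
with urlopen(req) as html_page:
    soup = BeautifulSoup(html_page, "html.parser")

# get_text() returns all the text in the page as a single string
html_text = soup.get_text()

# Create wikipedia_text_1.txt, using UTF-8 so non-ASCII characters are preserved
with open("wikipedia_text_1.txt", "w", encoding="utf-8") as f:
    f.write(html_text)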

The code above generates the text file wikipedia_text_1.txt, which contains the text of the Computer Programming Wikipedia page.

[Image: the generated wikipedia_text_1.txt file]
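The string returned by get_text() typically contains many blank and whitespace-only lines left over from the page layout. As a rough sketch, a small helper like the one below could tidy the text before it is written to a file. The name clean_text and the sample string are assumptions made for illustration, not part of BeautifulSoup.

```python
def clean_text(raw_text):
    """Strip each line and drop the lines that end up empty."""
    stripped_lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in stripped_lines if line)


# Made-up sample standing in for the string returned by soup.get_text()
sample = "\n\n  Computer programming  \n\n\n\tFrom Wikipedia\n   \n"

cleaned = clean_text(sample)
print(cleaned)
```

In the example above, html_text could be passed through clean_text before the call to f.write.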

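Example 2 => Extracting Text from the Data Science Wikipedia Page

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Wikimedia asks clients to identify themselves, and the default urllib
# User-Agent may be rejected. The value below is a placeholder: use your own.
headers = {"User-Agent": "wiki-text-example/0.1 (you@example.com)"}
req = Request("https://en.wikipedia.org/wiki/Data_science", headers=headers)

# urlopen sends the request and returns an HTTP response object
with urlopen(req) as html_page:
    soup = BeautifulSoup(html_page, "html.parser")

# get_text() returns all the text in the page as a single string
html_text = soup.get_text()

# Create wikipedia_text_2.txt, using UTF-8 so non-ASCII characters are preserved
with open("wikipedia_text_2.txt", "w", encoding="utf-8") as f:
    f.write(html_text)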

The code above generates the text file wikipedia_text_2.txt, which contains the text of the Data Science Wikipedia page.

[Image: the generated wikipedia_text_2.txt file]
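The examples differ only in the page URL and the output file name, so the URL can be built from the page title instead of being typed by hand. Here is a minimal sketch, assuming the usual https://en.wikipedia.org/wiki/Page_title pattern seen in the URLs used in this article. The function name wikipedia_url and its lang parameter are made up for illustration.

```python
from urllib.parse import quote


def wikipedia_url(title, lang="en"):
    """Build a Wikipedia page URL from a human-readable page title."""
    # Wikipedia URLs use underscores where the title has spaces
    page_name = title.strip().replace(" ", "_")
    # quote() percent-encodes characters that are not safe in a URL path
    return f"https://{lang}.wikipedia.org/wiki/{quote(page_name)}"


for page_title in ["Computer programming", "Data science", "United States"]:
    print(wikipedia_url(page_title))
```

The resulting string can be passed to Request in place of the hard-coded URLs.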

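Example 3 => Extracting Text from the United States Wikipedia Page

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Wikimedia asks clients to identify themselves, and the default urllib
# User-Agent may be rejected. The value below is a placeholder: use your own.
headers = {"User-Agent": "wiki-text-example/0.1 (you@example.com)"}
req = Request("https://en.wikipedia.org/wiki/United_States", headers=headers)

# urlopen sends the request and returns an HTTP response object
with urlopen(req) as html_page:
    soup = BeautifulSoup(html_page, "html.parser")

# get_text() returns all the text in the page as a single string
html_text = soup.get_text()

# Create wikipedia_text_3.txt, using UTF-8 so non-ASCII characters are preserved
with open("wikipedia_text_3.txt", "w", encoding="utf-8") as f:
    f.write(html_text)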

The code above generates the text file wikipedia_text_3.txt, which contains the text of the United States Wikipedia page.

[Image: the generated wikipedia_text_3.txt file]
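For a project that needs the text of many pages, such as the bird-related pages mentioned at the start, the same steps can be run in a loop. The sketch below is one possible way to organise that, not the code used in that project. The names save_pages and fake_fetch are hypothetical, and fake_fetch is a stand-in so the sketch runs offline; in practice it would be replaced by a function that downloads a page and returns its text.

```python
import tempfile
import time
from pathlib import Path


def save_pages(titles, fetch_text, out_dir, delay_seconds=1.0):
    """Fetch each title with fetch_text and write it to out_dir/<Title>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved_paths = []
    for index, title in enumerate(titles):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to be polite
        text = fetch_text(title)
        path = out_dir / (title.replace(" ", "_") + ".txt")
        path.write_text(text, encoding="utf-8")
        saved_paths.append(path)
    return saved_paths


# Stand-in fetcher so the sketch runs offline; swap in real download code
def fake_fetch(title):
    return f"Text of the {title} page"


with tempfile.TemporaryDirectory() as tmp_dir:
    paths = save_pages(["Data science", "United States"], fake_fetch,
                       tmp_dir, delay_seconds=0)
    file_names = [path.name for path in paths]
    first_text = paths[0].read_text(encoding="utf-8")

print(file_names)
```

Passing the fetcher in as an argument keeps the file-writing loop separate from the download code, which makes it easy to test without touching the network.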

Gagan

Hi there, I'm the founder of ComputerScienceHub (started to bring useful Computer Science information together in one place). I've been doing JavaScript and Python development since 2015. I've worked on a couple of web development projects and done some data science work using Python. These days I work primarily as a freelance JavaScript developer (web developer) while also managing a team of Computer Science specialists at ComputerScienceHub.io.
