पायथन के साथ एचटीएमएल टेबल लाने के लिए एचटीएमएल पेजों को कैसे पार्स करें?

समस्या

आपको वेब पेज से HTML टेबल निकालने की जरूरत है।

परिचय

इंटरनेट और वर्ल्ड वाइड वेब (WWW), आज सूचना का सबसे प्रमुख स्रोत है। वहाँ बहुत सारी जानकारी है, इतने सारे विकल्पों में से सामग्री को चुनना बहुत कठिन है। उस जानकारी में से अधिकांश HTTP के माध्यम से पुनर्प्राप्त करने योग्य है।

लेकिन हम स्वचालित रूप से जानकारी पुनर्प्राप्त करने और संसाधित करने के लिए प्रोग्रामेटिक रूप से इन कार्यों को भी निष्पादित कर सकते हैं।

पायथन हमें इसकी मानक लाइब्रेरी एक HTTP क्लाइंट का उपयोग करके ऐसा करने की अनुमति देता है, लेकिन अनुरोध मॉड्यूल वेब पेजों की जानकारी बहुत आसान प्राप्त करने में मदद करता है।

इस पोस्ट में, हम देखेंगे कि पेजों में एम्बेडेड एचटीएमएल टेबल निकालने के लिए एचटीएमएल पेजों के माध्यम से पार्स कैसे करें।

इसे कैसे करें..

1. हम अनुरोध, पांडा, सुंदरसूप4 और सारणीबद्ध पैकेजों का उपयोग करेंगे। कृपया उन्हें अपने सिस्टम पर स्थापित करें यदि वे गायब हैं। यदि आप सुनिश्चित नहीं हैं, तो कृपया पुष्टि करने के लिए पिप फ़्रीज़ का उपयोग करें।

import requests
import pandas as pd
from tabulate import tabulate

2. हम पृष्ठ के माध्यम से प्रशंसा करने के लिए https://www.tutorialspoint.com/python/python_basic_operators.htm का उपयोग करेंगे और उनके अंदर एम्बेडेड सभी HTML पृष्ठों का प्रिंट आउट लेंगे।

# set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

3.हम सर्वर से अनुरोध करेंगे और प्रतिक्रिया देखेंगे।

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")

4. खैर, प्रतिक्रिया कोड 200 - सर्वर से प्रतिक्रिया का प्रतिनिधित्व करता है सफलता है। इसलिए अब हम अनुरोध शीर्षलेख, प्रतिक्रिया शीर्षलेख और साथ ही सर्वर द्वारा लौटाए गए पहले 100 पाठों की जांच करेंगे।

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the request headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

आउटपुट

*** Printing the request headers -
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the request headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>

5. अब हम HTML के माध्यम से पार्स करने के लिए BeautifulSoup का उपयोग करेंगे।

# Parse the HTML pages

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

6. खैर, अधिकांश तालिकाओं में या तो h2, h3, h4, h5 या h6 टैग में शीर्षक परिभाषित होंगे। हम पहले इन टैगों की पहचान करते हैं, फिर हम पहचाने गए टैग के बगल में html तालिका को उठाते हैं। इस तर्क के लिए, हम नीचे बताए अनुसार फाइंड, सिबलिंग और फाइंड_नेक्स्ट_सिब्लिंग्स का उपयोग करेंगे।

# Find all the h3 elements
print(f"{tutorialpoints_page.find_all('h2')}")
tags = tutorialpoints_page.find(lambda elm: elm.name == "h2" or elm.name == "h3" or elm.name == "h4" or elm.name == "h5" or elm.name == "h6")
for sibling in tags.find_next_siblings():
if sibling.name == "table":
my_table = sibling
df = pd.read_html(str(my_table))
print(tabulate(df[0], headers='keys', tablefmt='psql'))

पूर्ण कोड

7. अब सब कुछ एक साथ रखना।

# STEP1 : Download the page required
import requests
import pandas as pd


# set the site url
site_url = "https://www.tutorialspoint.com/python/python_basic_operators.htm"

# Make a request to the server
response = requests.get(site_url)

# Check the response
print(f"*** The response for {site_url} is {response.status_code}")

# Check the request headers
print(f"*** Printing the request headers - \n {response.request.headers} ")

# Check the response headers
print(f"*** Printing the request headers - \n {response.headers} ")

# check the content of the results
print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

# Parse the HTML pages

from bs4 import BeautifulSoup
tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
print(f"*** The title of the page is - {tutorialpoints_page.title}")

# You can extract the page title as string as well
print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

# Find all the h3 elements
# print(f"{tutorialpoints_page.find_all('h2')}")
tags = tutorialpoints_page.find(lambda elm: elm.name == "h2" or elm.name == "h3" or elm.name == "h4" or elm.name == "h5" or elm.name == "h6")
for sibling in tags.find_next_siblings():
if sibling.name == "table":
my_table = sibling
df = pd.read_html(str(my_table))
print(df)

आउटपुट

*** The response for https://www.tutorialspoint.com/python/python_basic_operators.htm is 200
*** Printing the request headers -
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
*** Printing the request headers -
{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
*** Accessing the first 100/37624 characters -

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - <title>Python - Basic Operators - Tutorialspoint</title>
*** The title of the page is - Python - Basic Operators - Tutorialspoint
[<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
[ Operator Description \
0 + Addition Adds values on either side of the operator.
1 - Subtraction Subtracts right hand operand from left hand op...
2 * Multiplication Multiplies values on either side of the operator
3 / Division Divides left hand operand by right hand operand
4 % Modulus Divides left hand operand by right hand operan...
5 ** Exponent Performs exponential (power) calculation on op...
6 // Floor Division - The division of operands wher...

उदाहरण

0 a + b = 30
1 a – b = -10
2 a * b = 200
3 b / a = 2
4 b % a = 0
5 a**b =10 to the power 20
6 9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11.... ]
[ Operator Description \
0 == If the values of two operands are equal, then ...
1 != If values of two operands are not equal, then ...
2 <> If values of two operands are not equal, then ...
3 > If the value of left operand is greater than t...
4 < If the value of left operand is less than the ...
5 >= If the value of left operand is greater than o...
6 <= If the value of left operand is less than or e...

उदाहरण

0 (a == b) is not true.
1 (a != b) is true.
2 (a <> b) is true. This is similar to != operator.
3 (a > b) is not true.
4 (a < b) is true.
5 (a >= b) is not true.
6 (a <= b) is true. ]
[ Operator Description \
0 = Assigns values from right side operands to lef...
1 += Add AND It adds right operand to the left operand and ...
2 -= Subtract AND It subtracts right operand from the left opera...
3 *= Multiply AND It multiplies right operand with the left oper...
4 /= Divide AND It divides left operand with the right operand...
5 %= Modulus AND It takes modulus using two operands and assign...
6 **= Exponent AND Performs exponential (power) calculation on op...
7 //= Floor Division It performs floor division on operators and as...

उदाहरण

0 c = a + b assigns value of a + b into c
1 c += a is equivalent to c = c + a
2 c -= a is equivalent to c = c - a
3 c *= a is equivalent to c = c * a
4 c /= a is equivalent to c = c / a
5 c %= a is equivalent to c = c % a
6 c **= a is equivalent to c = c ** a
7 c //= a is equivalent to c = c // a ]
[ Operator \
0 & Binary AND
1 | Binary OR
2 ^ Binary XOR
3 ~ Binary Ones Complement
4 << Binary Left Shift
5 >> Binary Right Shift

Description \
0 Operator copies a bit to the result if it exis...
1 It copies a bit if it exists in either operand.
2 It copies the bit if it is set in one operand ...
3 It is unary and has the effect of 'flipping' b...
4 The left operands value is moved left by the n...
5 The left operands value is moved right by the ...

उदाहरण

0 (a & b) (means 0000 1100)
1 (a | b) = 61 (means 0011 1101)
2 (a ^ b) = 49 (means 0011 0001)
3 (~a ) = -61 (means 1100 0011 in 2's complement...
4 a << 2 = 240 (means 1111 0000)
5 a >> 2 = 15 (means 0000 1111) ]
[ Operator Description \
0 and Logical AND If both the operands are true then condition b...
1 or Logical OR If any of the two operands are non-zero then c...
2 not Logical NOT Used to reverse the logical state of its operand.

Example
0 (a and b) is true.
1 (a or b) is true.
2 Not(a and b) is false. ]
[ Operator Description \
0 in Evaluates to true if it finds a variable in th...
1 not in Evaluates to true if it does not finds a varia...

उदाहरण

0 x in y, here in results in a 1 if x is a membe...
1 x not in y, here not in results in a 1 if x is... ]
[ Operator Description \
0 is Evaluates to true if the variables on either s...
1 is not Evaluates to false if the variables on either ...

उदाहरण

0 x is y, here is results in 1 if id(x) equals i...
1 x is not y, here is not results in 1 if id(x) ... ]
[ Sr.No. Operator & Description
0 1 ** Exponentiation (raise to the power)
1 2 ~ + - Complement, unary plus and minus (method...
2 3 * / % // Multiply, divide, modulo and floor di...
3 4 + - Addition and subtraction
4 5 >> << Right and left bitwise shift
5 6 & Bitwise 'AND'
6 7 ^ | Bitwise exclusive `OR' and regular `OR'
7 8 <= < > >= Comparison operators
8 9 <> == != Equality operators
9 10 = %= /= //= -= += *= **= Assignment operators
10 11 is is not Identity operators
11 12 in not in]