How can the Beautiful Soup package be used to parse data from a webpage in Python?

BeautifulSoup is a third-party Python library used to parse data from web pages. It parses HTML and XML documents and helps with web scraping, which is the process of extracting, using, and manipulating data from various sources.

Web scraping can also be used to extract data for research purposes, to understand or compare market trends, to perform SEO monitoring, and so on.

To install Beautiful Soup on Windows, the following command can be run:

pip install beautifulsoup4
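
After installation, a quick check can confirm that the package is importable. The following is a minimal sketch, assuming the install succeeded in the active Python environment; the inline HTML string is invented for illustration. Note that the pip package is named beautifulsoup4, while the module is imported as bs4.

```python
# Confirm that the package installed by pip can be imported.
# The pip package is named beautifulsoup4, but the import name is bs4.
import bs4
from bs4 import BeautifulSoup

print("Beautiful Soup version:", bs4.__version__)

# Parse a tiny inline document as a smoke test (no network access needed)
soup = BeautifulSoup("<p>Hello</p>", features="html.parser")
greeting = soup.p.get_text()
print(greeting)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.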

Let us look at an example:

Example

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Algorithm'

# Download the raw HTML of the page
print("Reading the webpage...")
html = urlopen(url).read()

# Build a parse tree using Python's built-in HTML parser
print("Parsing the webpage...")
soup = BeautifulSoup(html, features="html.parser")

# Remove <script> and <style> elements so their contents are not treated as text
for element in soup(["script", "style"]):
    element.extract()

print("Extracting text from the webpage...")
text = soup.get_text()

print("Data cleaning...")
# Strip leading and trailing whitespace from every line
lines = (line.strip() for line in text.splitlines())
# Split each line on double spaces so run-together phrases land on separate lines
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# Drop empty chunks and join the rest, one per line
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)

Output

Reading the webpage...
Parsing the webpage...
Extracting text from the webpage...
Data cleaning...
Recursive C implementation of Euclid's algorithm from the above flowchart
Recursion
A recursive algorithm is one that invokes (makes reference to) itself repeatedly until a certain condition (also known as termination condition) matches, which is a method common to functional programming….
…..
Developers
Statistics
Cookie statement
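
The output above ends with footer entries such as "Cookie statement" because get_text() returns the text of the entire page. When only the body paragraphs are wanted, one option is to select the <p> elements with find_all. The following is a minimal sketch on an inline HTML string; the sample markup and the variable names sample_html and paragraphs are invented for illustration and do not reflect the real structure of the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for a downloaded page
sample_html = """
<html><body>
  <div id="content">
    <p>An algorithm is a finite sequence of instructions.</p>
    <p>Recursion is one common technique.</p>
  </div>
  <div id="footer"><a href="#">Cookie statement</a></div>
</body></html>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

# Keep only the text of <p> elements, ignoring navigation and footer links
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
for paragraph in paragraphs:
    print(paragraph)
```

Here the footer link is skipped because it is not inside a <p> element. On a real page, the right selector depends on that page's markup.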

Explanation

The required packages are imported.
The URL of the website is defined.
The URL is opened, the page is parsed, and the 'script' and 'style' tags are removed.
The 'get_text' function is used to extract the text from the webpage data.
Extra whitespace and empty lines are removed.
The text is printed on the console.
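
The steps listed above can also be tried without network access by running them on an inline HTML string. The following is a minimal sketch, not a definitive implementation: the function name extract_clean_text and the sample markup are hypothetical, and the cleaning step assumes that phrases are separated by runs of two spaces.

```python
from bs4 import BeautifulSoup


def extract_clean_text(html):
    """Return the visible text of an HTML document, one phrase per line."""
    soup = BeautifulSoup(html, features="html.parser")
    # Remove elements whose contents are code or styling, not readable text
    for element in soup(["script", "style"]):
        element.extract()
    text = soup.get_text()
    # Trim each line, split on double spaces, and drop empty pieces
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Invented sample markup used only for illustration
sample_html = """
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <p>  First paragraph.  </p>
    <script>console.log("hidden");</script>
    <p>Second  paragraph.</p>
  </body>
</html>
"""

cleaned = extract_clean_text(sample_html)
print(cleaned)
```

In this sample, the contents of the style and script elements do not appear in the result, and "Second  paragraph." is split into two lines because its words are separated by a double space.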