Python Newspaper Module for Article Scraping and Curation

Extracting content from web pages is useful in many domains, such as data mining and information retrieval. To extract information from newspaper and magazine websites, we are going to use the newspaper library.

The main purpose of this library is to extract and curate articles from newspapers and similar websites.

Installation:

To install the newspaper library, run the following in your terminal:

$ pip install newspaper3k

For the lxml dependency, run the command below in your terminal:

$ pip install lxml

To install PIL (provided by the Pillow package), run:

$ pip install Pillow

Finally, download the NLP corpora:

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python
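
Before moving on, it can help to confirm that the packages installed above are importable. The following is a minimal sketch that uses only the standard library; the helper name `check_modules` is hypothetical and not part of newspaper. Note that the import names differ from the pip names: `newspaper3k` is imported as `newspaper` and `Pillow` as `PIL`. The `nltk` entry is included because newspaper3k pulls it in as a dependency.

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names (not pip package names) of the dependencies installed above.
    for name, found in check_modules(["newspaper", "lxml", "PIL", "nltk"]).items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

Any module reported as missing can be installed with the corresponding pip command from this section.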

The Python newspaper library is used to collect information associated with articles. This includes the authors' names, the top image in the article, the publication date, videos present in the article, keywords describing the article, and a summary of the article.

# Import the required class
>>> from newspaper import Article
# URL of the article you want to extract
>>> url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"
# Download the article
>>> article = Article(url)
>>> article.download()
# Parse the article and print the authors' names
>>> article.parse()
>>> print(article.authors)

Output:

['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']
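
The author list above contains entries such as 'Kristina.Peterson Wsj.Com', which look like remnants of e-mail addresses rather than names. The sketch below shows one possible way to filter them out. The helper `clean_authors` is hypothetical (it is not part of newspaper), and the rule it uses, dropping any entry with a dot between two word characters, is only a heuristic: it would also drop a legitimate name written like "J.K. Rowling".

```python
import re

# A dot between two word characters suggests a mangled e-mail address or domain.
_INTERIOR_DOT = re.compile(r"\w\.\w")


def clean_authors(authors):
    """Drop entries that look like e-mail remnants and remove duplicates, keeping order."""
    seen = set()
    cleaned = []
    for name in authors:
        if _INTERIOR_DOT.search(name):
            continue  # heuristic: probably not a personal name
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            cleaned.append(name)
    return cleaned


authors = [
    "Kristina Peterson", "Andrew Duehren", "Natalie Andrews",
    "Kristina.Peterson Wsj.Com", "Andrew.Duehren Wsj.Com", "Natalie.Andrews Wsj.Com",
]
print(clean_authors(authors))
# ['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews']
```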

# Extract Publication date
>>> print("Article Publication Date:")
>>> print(article.publish_date)
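
The publication date cannot always be detected; for the article used here, the full output further down shows `None`. A small sketch of handling that case follows, assuming `publish_date` is either `None` or a `datetime` object. The helper `format_publish_date` and the sample date are made up for illustration.

```python
from datetime import datetime


def format_publish_date(publish_date, fallback="unknown"):
    """Return an ISO date string, or the fallback when no date was detected."""
    if publish_date is None:
        return fallback
    return publish_date.date().isoformat()


print(format_publish_date(None))                      # unknown
print(format_publish_date(datetime(2020, 1, 15, 9)))  # 2020-01-15 (made-up date)
```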

# Extract the URL of the top image
>>> print(article.top_image)

Output:

https://images.wsj.net/im-51122/social

# Run NLP first; keywords and summary are only populated after nlp()
>>> article.nlp()

# Extract keywords
>>> print("Keywords in the article:", article.keywords)

# Extract a summary of the article
>>> print("Article Summary:", article.summary)
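
As a rough intuition for what keyword extraction involves, here is a toy frequency-based sketch. It is not newspaper's implementation: the function `top_keywords`, the tiny stopword set, and the sample sentence are all assumptions made for illustration, and real NLP corpora are far larger.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (assumption; not the corpus newspaper uses).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "for", "it", "is", "that"}


def top_keywords(text, n=5):
    """Return the n most frequent words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]


sample = "Border security talks: the border deal funds border security and a wall."
print(top_keywords(sample, n=2))
# ['border', 'security']
```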

The complete program is below:

from newspaper import Article

# URL of the article to extract
url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"

# Download and parse the article
article = Article(url)
article.download()
article.parse()

# Metadata available after parse()
print(article.authors)
print("Article Publication Date:")
print(article.publish_date)
print("Major Image in the article:")
print(article.top_image)

# Keywords and summary are available only after nlp()
article.nlp()
print("Keywords in the article")
print(article.keywords)
print("Article Summary")
print(article.summary)

Output:

['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']
Article Publication Date:
None
Major Image in the article:
https://images.wsj.net/im-51122/social
Keywords in the article
['state', 'spending', 'sweeping', 'southern', 'security', 'border', 'principle', 'lawmakers', 'avoid', 'shutdown', 'reach', 'weekendthe', 'fund', 'trump', 'union', 'agreement', 'wall']
Article Summary
President Trump made the case in his State of the Union address for the construction of a wall along the southern U.S. border, calling it a “moral issue."
Photo: GettyWASHINGTON—Senior lawmakers said Monday night they had reached an agreement in principle on a sweeping deal to end a monthslong fight over border security and avoid a partial government shutdown this weekend.
The top four lawmakers on the House and Senate Appropriations Committees emerged after three closed-door meetings Monday and announced that they had agreed to a framework for all seven spending bills whose funding expires at 12:01 a.m. Saturday.
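
For reuse, the fields printed above can be gathered into a plain dictionary. The sketch below is one way to do that; the helper names `article_to_dict` and `fetch_article` are hypothetical, and the stand-in object with made-up values exists only so the example runs without network access.

```python
from types import SimpleNamespace


def article_to_dict(article):
    """Collect the fields printed above into a plain dict."""
    return {
        "authors": list(article.authors),
        "publish_date": article.publish_date,
        "top_image": article.top_image,
        "keywords": list(article.keywords),
        "summary": article.summary,
    }


def fetch_article(url):
    """Download, parse, and run NLP on one article (needs network access)."""
    from newspaper import Article  # imported lazily so article_to_dict works without it

    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    return article


# Offline demonstration with a stand-in object (all values are made up).
stub = SimpleNamespace(
    authors=["Jane Doe"],
    publish_date=None,
    top_image="https://example.com/image.jpg",
    keywords=["border", "security"],
    summary="A short summary.",
)
info = article_to_dict(stub)
print(info["authors"])  # ['Jane Doe']
```

With network access, `article_to_dict(fetch_article(url))` would be expected to return the same structure for a real article.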