Part-of-speech tagging with stop words using NLTK in Python

The main idea behind natural language processing is that a machine can carry out at least some level of analysis or processing without human intervention, such as understanding part of a text or what it is trying to say.

When processing text, a computer needs to filter out useless or less important data (words). In NLTK, these low-value words are referred to as stop words.

Installing the required library

First you need the nltk library. Run the following command in your terminal:

$ pip install nltk

The goal is to remove these stop words so that they do not take up space in our database or consume valuable processing time.

You can build your own list of words that you want to treat as stop words. By default, NLTK ships with a set of words that it considers stop words, which you can access through the NLTK corpus:

>>> import nltk
>>> from nltk.corpus import stopwords

Here is the list of NLTK stop words for English:

>>> set(stopwords.words('english'))
{'not', 'other', 'shan', "hadn't", 'she', 'did', 'through', 'and', 'does', "that'll", "weren't", 'your', "should've", "hasn't", 'myself', 'should', 'because', 'wasn', 'what', 'to', 'this', 'was', 'more', 'y', 'again', "needn't", 'into', 'above', 'themselves', 'd', "won't", 'during', 'haven', 'both', "shan't", 'their', 'on', 'hadn', 'up', 'once', 'its', 'against', 'before', 't', 'while', 'needn', 'doing', "don't", 'yourselves', 'until', 'is', 'all', 's', 'will', "you've", 'being', 'under', 'they', 'ours', 'wouldn', 'of', 'didn', 'below', 'just', 'ma', 'yours', "you'll", 'mightn', 'where', 'are', 'that', 'those', 'most', 'them', 'if', 'you', "shouldn't", 'off', 'for', 'her', 'such', 'now', 'than', 're', 'no', 'm', 'or', "aren't", 'further', 'here', "wasn't", 'after', "haven't", 'my', 'himself', 'at', 'had', 'yourself', 'by', 'weren', 'only', 'have', 'we', 'do', 'same', "isn't", 'herself', 'll', 'down', 'then', 'why', 'own', 'him', 'so', 'having', 'nor', 'isn', 'few', 'how', 'each', 'there', 'with', 'couldn', 'about', 'very', 'am', 'me', "didn't", "doesn't", 'which', "she's", 'doesn', 'were', 'he', 'in', "mightn't", 'when', 'our', 'who', 'his', "couldn't", 'the', "you'd", 'be', 'hers', 'hasn', 'between', 'it', 'mustn', 'but', 'out', 'can', "wouldn't", 'ourselves', 'whom', 'been', 'these', 'aren', 'over', 'itself', 'a', 'i', 'too', 'theirs', 'some', "you're", 'as', 'won', "it's", 'from', 'o', 'don', 'any', 've', 'ain', 'has', 'an', "mustn't", 'shouldn'}
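
As a minimal sketch of building your own stop word list, the snippet below extends a base set with domain-specific words and removes a word you want to keep. The tiny `base_stop_words` set and the helper name `build_stop_words` are assumptions made for illustration so that the example runs without NLTK data; in practice you might start from `set(stopwords.words('english'))` instead.

```python
def build_stop_words(base, extra=(), keep=()):
    """Return a new stop word set: base plus `extra`, minus `keep`."""
    return (set(base) | set(extra)) - set(keep)


# Hypothetical base list standing in for set(stopwords.words('english')).
base_stop_words = {"a", "is", "the", "not", "to"}

# Treat "python" as noise for this corpus, but keep "not" because it
# can carry meaning in tasks such as sentiment analysis.
custom_stop_words = build_stop_words(
    base_stop_words, extra={"python"}, keep={"not"}
)

print(sorted(custom_stop_words))
```

Returning a new set rather than mutating the base keeps the default list reusable for other texts.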

Below is a complete program that shows how to use stopwords to remove stop words from your text:

Example code

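# One-time setup: the stop word list and the tokenizer models must be
# downloaded first, e.g. nltk.download('stopwords') and nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize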

example_sent = "Python is a powerful high-level, object-oriented programming language created by Guido van Rossum."\
"It has simple easy-to-use syntax, making it the perfect language for someone trying to learn computer programming for the first time."\
"This is a comprehensive guide on how to get started in Python, why you should learn it and how you can learn it. However, if you knowledge "\
"of other programming languages and want to quickly get started with Python."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# The same filter written as an explicit loop.
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

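Output

Text output: without the filter (stop words included). Tokens such as 'Rossum.It' and 'time.This' appear because the example string joins its sentences without a space after the period.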

['Python', 'is', 'a', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'by', 'Guido', 'van', 'Rossum.It', 'has', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'it', 'the', 'perfect', 'language', 'for', 'someone', 'trying', 'to', 'learn', 'computer', 'programming', 'for', 'the', 'first', 'time.This', 'is', 'a', 'comprehensive', 'guide', 'on', 'how', 'to', 'get', 'started', 'in', 'Python', ',', 'why', 'you', 'should', 'learn', 'it', 'and', 'how', 'you', 'can', 'learn', 'it', '.', 'However', ',', 'if', 'you', 'knowledge', 'of', 'other', 'programming', 'languages', 'and', 'want', 'to', 'quickly', 'get', 'started', 'with', 'Python', '.']

Text output: with the filter (stop words removed)

['Python', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'Guido', 'van', 'Rossum.It', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'perfect', 'language', 'someone', 'trying', 'learn', 'computer', 'programming', 'first', 'time.This', 'comprehensive', 'guide', 'get', 'started', 'Python', ',', 'learn', 'learn', '.', 'However', ',', 'knowledge', 'programming', 'languages', 'want', 'quickly', 'get', 'started', 'Python', '.']
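
The filtered list above still contains punctuation tokens such as ',' and '.'. In addition, the test `w not in stop_words` is case-sensitive and NLTK's English list is lowercase, so a capitalized token such as 'The' would not be removed. One possible refinement, sketched here under assumptions, lowercases each token before the lookup and drops tokens made up only of punctuation. The helper name `filter_tokens`, the small stop word set, and the pre-tokenized sentence are all hypothetical, chosen so that the snippet runs without NLTK.

```python
import string


def filter_tokens(tokens, stop_words):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, regardless of capitalization
        if all(ch in string.punctuation for ch in token):
            continue  # token such as "," or "."
        kept.append(token)
    return kept


# Hypothetical inputs: a tiny stop word set and an already tokenized sentence.
stop_words = {"the", "is", "a", "it", "for"}
tokens = ["The", "syntax", "is", "simple", ",", "making", "it", "a", "good", "fit", "."]

print(filter_tokens(tokens, stop_words))
```

Hyphenated tokens such as 'high-level' are kept, because only tokens consisting entirely of punctuation characters are discarded.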