How can the Iliad dataset be prepared for training using Python?

TensorFlow is an open-source machine learning framework provided by Google. It is used with Python to implement algorithms, deep learning applications, and much more, for both research and production purposes.

The 'tensorflow' package can be installed on Windows using the following command:

pip install tensorflow

A tensor is the data structure used in TensorFlow. Tensors are the values that flow along the edges of a flow diagram known as the 'data flow graph', connecting one operation to the next. A tensor is nothing more than a multidimensional array or list.
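To make the "multidimensional array" idea concrete, the sketch below uses plain nested Python lists as a stand-in for tensors and works out their rank and shape. The helper name `shape_of` is hypothetical and is not part of TensorFlow; it assumes the nesting is regular, meaning every sub-list at a given depth has the same length.

```python
def shape_of(value):
    """Return the shape of a regularly nested list as a tuple."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break  # an empty list has no first element to descend into
        value = value[0]
    return tuple(shape)


scalar = 7                        # rank 0: a single value
vector = [1, 2, 3]                # rank 1: a list of values
matrix = [[1, 2, 3], [4, 5, 6]]   # rank 2: a list of equal-length lists

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    shape = shape_of(value)
    print(name, "rank:", len(shape), "shape:", shape)
```

In TensorFlow itself, wrapping the same nested list with `tf.constant` produces a tensor whose `shape` attribute reports these dimensions.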

We will use the Iliad dataset, which contains text data from three translations of the work, by William Cowper, Edward (Earl of Derby), and Samuel Butler. The model is trained to identify the translator when given a single line of text. The text files used have already been preprocessed, which includes removing the document header and footer, line numbers, and chapter titles.
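Since the task is to identify the translator of a line, each line has to be paired with a label before training. The following is a minimal plain-Python sketch of that pairing, not the tutorial's own loading code. The integer encoding (0, 1, 2), the helper name `label_lines`, and the sample strings are all assumptions made for illustration; the strings are placeholders, not real lines from the translations.

```python
import random

# Translator names come from the text; the integer labels are an assumed encoding.
TRANSLATORS = ["Cowper", "Derby", "Butler"]


def label_lines(lines_by_translator):
    """Pair every line with the index of the translator it came from."""
    labeled = []
    for label, name in enumerate(TRANSLATORS):
        for line in lines_by_translator.get(name, []):
            labeled.append((line, label))
    return labeled


# Placeholder strings, not actual lines from the translations.
sample = {
    "Cowper": ["placeholder line A", "placeholder line B"],
    "Derby": ["placeholder line C"],
    "Butler": ["placeholder line D"],
}

labeled = label_lines(sample)
# Shuffle with a fixed seed so the translators are mixed reproducibly.
random.Random(0).shuffle(labeled)
for text, label in labeled:
    print(label, TRANSLATORS[label], text)
```

Shuffling matters here because the lines would otherwise stay grouped by translator, and a model trained on such ordered data would see one class at a time.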

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

Example

Following is the code snippet:

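# tf_text comes from the separately installed tensorflow-text package.
import tensorflow_text as tf_text

# `all_labeled_data` is assumed to be the tf.data.Dataset of (text, label)
# pairs built earlier from the three translation files.
print("Prepare the dataset for training")
tokenizer = tf_text.UnicodeScriptTokenizer()

print("Defining a function named 'tokenize' to tokenize the text data")
def tokenize(text, unused_label):
    # Case-fold (lowercase) the line, then split it into tokens.
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)

# Apply the tokenizer to every (text, label) pair; the label is dropped.
tokenized_ds = all_labeled_data.map(tokenize)

print("Iterate over the dataset and print a few samples")
for text_batch in tokenized_ds.take(6):
    print("Tokens: ", text_batch.numpy())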

Code credit: https://www.tensorflow.org/tutorials/load_data/text

Output

Prepare the dataset for training
Defining a function named 'tokenize' to tokenize the text data
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: batch_gather (from
tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
Iterate over the dataset and print a few samples
Tokens: [b'but' b'i' b'have' b'now' b'both' b'tasted' b'food' b',' b'and' b'given']
Tokens: [b'all' b'these' b'shall' b'now' b'be' b'thine' b':' b'but' b'if' b'the'
b'gods']
Tokens: [b'their' b'spiry' b'summits' b'waved' b'.' b'there' b',' b'unperceived']
Tokens: [b'"' b'i' b'pray' b'you' b',' b'would' b'you' b'show' b'your' b'love'
b',' b'dear' b'friends' b',']
Tokens: [b'entering' b'beneath' b'the' b'clavicle' b'the' b'point']
Tokens: [b'but' b'grief' b',' b'his' b'father' b'lost' b',' b'awaits' b'him'
b'now' b',']

Explanation

A 'tokenize' function is defined that case-folds each line to lowercase and splits it into word and punctuation tokens, discarding the whitespace between them.
This function is applied to the entire dataset with 'map'.
A sample of the tokenized dataset is displayed on the console.
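For readers without TensorFlow at hand, the effect of the 'tokenize' step can be approximated in plain Python. This is only a rough sketch using a regular expression, not how `UnicodeScriptTokenizer` is implemented, and it does not reproduce its Unicode script-boundary behaviour exactly. The name `approx_tokenize` is hypothetical, and the capitalisation of the input line is an assumption, since the output above only shows the case-folded tokens.

```python
import re

# Match either a run of word characters or a run of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def approx_tokenize(text):
    """Case-fold the text, then split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.casefold())


# Assumed original casing of the first sample line shown in the output.
print(approx_tokenize("But I have now both tasted food, and given"))
# → ['but', 'i', 'have', 'now', 'both', 'tasted', 'food', ',', 'and', 'given']
```

As in the TensorFlow output, the comma becomes its own token and the whitespace disappears, while the plain-Python version returns ordinary strings rather than byte strings.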