🧠 UNIT 6 – NATURAL LANGUAGE PROCESSING (NLP)
🌟 Test Yourself (Objective Questions with Answers)
1. What is the primary challenge faced by computers in understanding human languages?
A) Complexity of human languages
B) Lack of computational power
C) Incompatibility with numerical data
D) Limited vocabulary
✅ Answer: A) Complexity of human languages
💬 Explanation: Human languages have grammar, tone, slang, and multiple meanings that make it hard for computers to interpret correctly.
2. How do voice assistants utilize NLP?
A) To analyze visual data
B) To process numerical data
C) To understand natural language
D) To execute tasks based on computer code
✅ Answer: C) To understand natural language
💬 Explanation: Voice assistants like Siri and Alexa use NLP to process human speech and respond meaningfully.
3. Which of the following is NOT a step in Text Normalisation?
A) Tokenization
B) Lemmatization
C) Punctuation removal
D) Document summarization
✅ Answer: D) Document summarization
💬 Explanation: Text normalization cleans text for processing; summarization is a later step, not part of normalization.
4. In text processing, what is the purpose of tokenisation?
A) To convert text into numerical data
B) To segment sentences into smaller units
C) To translate text into multiple languages
D) To summarize documents for analysis
✅ Answer: B) To segment sentences into smaller units
💬 Explanation: Tokenization splits sentences into words or phrases (tokens) so computers can analyze them easily.
5. What distinguishes lemmatization from stemming?
A) Lemmatization produces meaningful words after affix removal, while stemming does not.
B) Lemmatization is faster than stemming.
C) Stemming ensures the accuracy of the final word.
D) Stemming generates shorter words compared to lemmatization.
✅ Answer: A) Lemmatization produces meaningful words after affix removal, while stemming does not.
💬 Explanation: Lemmatization gives dictionary-based words, whereas stemming may give incomplete word forms.
6. What is the primary purpose of the Bag of Words model in NLP?
A) To translate text into multiple languages
B) To extract features from text for machine learning algorithms
C) To summarize documents for analysis
D) To remove punctuation marks from text
✅ Answer: B) To extract features from text for machine learning algorithms
💬 Explanation: Bag of Words converts text into numerical word frequencies, helping machines analyze word importance.
7. In text processing, what are stop words?
A) Words with frequent occurrence in the corpus
B) Words with negligible value that are often removed during preprocessing
C) Words with the lowest occurrence in the corpus
D) Words with the most value added to the corpus
✅ Answer: B) Words with negligible value that are often removed during preprocessing
💬 Explanation: Words like “is”, “the”, “an”, and “and” carry little meaning on their own, so they are removed to focus on key terms.
8. What is the characteristic of rare or valuable words in the described plot?
A) They have the highest occurrence in the corpus
B) They are often considered stop words
C) They occur the least but add the most value to the corpus
D) They are typically removed during preprocessing
✅ Answer: C) They occur the least but add the most value to the corpus
💬 Explanation: Rare words like earthquake or democracy carry unique information in a text.
9. What information does the document vector table provide?
A) The frequency of each word across all documents
B) The frequency of each word in a single document
C) The total number of words in the corpus
D) The average word length in the corpus
✅ Answer: A) The frequency of each word across all documents
💬 Explanation: It records how often each word appears in every document of the corpus, which makes documents easy to compare.
10. What is the primary purpose of TF-IDF in text processing?
A) To identify the presence of stop words in documents
B) To remove punctuation marks from text
C) To identify the value of each word in a document
D) To translate text into multiple languages
✅ Answer: C) To identify the value of each word in a document
💬 Explanation: TF-IDF shows how important a word is in a document compared to all other documents.
11. Assertion–Reasoning Question
Assertion: Pragmatic analysis in NLP involves assessing sentences for real-world meaning.
Reasoning: It looks at the context and intention behind words, not just their dictionary meaning.
✅ Answer: A) Both Assertion and Reasoning are true, and Reasoning correctly explains Assertion.
12. Assertion–Reasoning Question
Assertion: Converting text into lowercase is an important preprocessing step.
Reasoning: It prevents the computer from treating “Apple” and “apple” as different words.
✅ Answer: A) Both Assertion and Reasoning are true, and Reasoning correctly explains Assertion.
💬 Reflection Time
1. Mention a few features of natural languages.
✅ Answer:
Natural languages (like English, Hindi) are:
- Complex – have grammar rules and sentence structures.
- Ambiguous – words can have more than one meaning.
- Context-dependent – meaning changes with situation.
- Evolving – new words and slang are added often.
💬 Example: The word “bark” can mean a dog’s sound or the outer layer of a tree.
2. What is the significance of NLP?
✅ Answer:
NLP helps computers understand, interpret, and respond to human language.
💬 Importance:
- Improves human–computer interaction.
- Used in chatbots, translators, and voice assistants.
- Helps analyze opinions, emails, and documents automatically.
3. What do you mean by lexical analysis in NLP?
✅ Answer:
Lexical analysis means breaking sentences into smaller parts (tokens) to analyze words and their meaning.
💬 Example: “I love coding” → [I], [love], [coding].
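A minimal sketch of this in Python (a plain whitespace split is used here; real tokenizers such as NLTK's also handle punctuation):

```python
# Lexical analysis sketch: break a sentence into tokens.
sentence = "I love coding"

# A simple whitespace split is enough for clean text like this;
# libraries such as NLTK handle punctuation and contractions better.
tokens = sentence.split()

print(tokens)  # ['I', 'love', 'coding']
```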
4. What is a chatbot?
✅ Answer:
A chatbot is a computer program that talks to users using natural language.
💬 Example: ChatGPT, customer support bots, and Alexa.
They understand text or voice commands and respond automatically.
5. What does the term “Bag of Words” refer to in Natural Language Processing (NLP)?
✅ Answer:
The Bag of Words (BoW) model is a simple way to represent text data so that machines can understand it.
It treats each document as a collection (bag) of words and records how many times each word appears — but it does not care about grammar or word order.
💬 Detailed Explanation:
- The Bag of Words model helps convert text into numerical form by counting word frequency.
- Each unique word becomes a feature (column), and each document becomes a row with the number of times that word occurs.
- This way, we can compare documents or train models on them.
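As a rough sketch, this is how the same idea looks with scikit-learn's CountVectorizer (assuming scikit-learn is installed; the two documents below are illustrative):

```python
# Bag of Words sketch: each unique word becomes a column,
# each document becomes a row of word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "akash ajay best friend",
    "akash like play football ajay prefer play online game",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)   # learn vocabulary and count words

print(vectorizer.get_feature_names_out())  # the vocabulary (columns)
print(matrix.toarray())                    # the document vectors (rows)
```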
6. Describe two practical uses of NLP in real-world scenarios.
✅ Answer:
- Voice Assistants: NLP helps Siri or Alexa understand speech.
- Email Filtering: NLP detects and filters spam emails.
💬 Other uses: Chatbots, translation tools, and sentiment analysis.
7. Explain stemming and lemmatization with an example.
✅ Answer:
Both stemming and lemmatization are text preprocessing steps used in NLP to reduce words to their base or root form so that similar words can be treated as one.
💬 Detailed Explanation:
| Process | Definition (in simple words) | Example | Result | Note |
| --- | --- | --- | --- | --- |
| Stemming | Removes the end of a word (like -ing, -ed, -s) to get a base form. It simply cuts off letters without checking grammar. | Playing, Played, Plays | play | Fast, but sometimes gives incomplete words. |
| Lemmatization | Converts a word into its dictionary form using language rules, considering grammar and meaning. | Running → run, Better → good | run, good | More accurate, but slower. |
📘 Example Sentence:
“The children are playing and laughed loudly.”
After stemming → [children, play, laugh, loudli] (a stemmer cannot reduce the irregular plural “children” and mangles “loudly”)
After lemmatization → [child, play, laugh, loudly]
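A small sketch of both with NLTK's PorterStemmer and WordNetLemmatizer (assumes NLTK is installed and the WordNet data can be downloaded; stemmer output varies by algorithm):

```python
# Compare stemming (crude suffix stripping) with lemmatization (dictionary forms).
import nltk
nltk.download("wordnet", quiet=True)  # one-time download for the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["playing", "played", "plays", "loudly"]:
    print(f"{word}: stem={stemmer.stem(word)}, "
          f"lemma={lemmatizer.lemmatize(word, pos='v')}")
```

Running this shows the contrast from the table: the stemmer turns “loudly” into the non-word “loudli”, while the lemmatizer returns real dictionary forms.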
8. Describe any four applications of TF-IDF.
✅ Answer:
- Search Engines: To rank results based on keyword importance.
- Spam Detection: Identifies keywords used in spam mails.
- Text Summarization: Helps find important terms in large documents.
- Sentiment Analysis: Detects emotional or opinion words in text.
9. Samiksha got stuck while performing text normalisation. Help her normalise the segmented sentences given below:
Document 1: Akash and Ajay are best friends.
Document 2: Akash likes to play football but Ajay prefers to play online games.
✅ Answer:
Step 1 — Sentence segmentation
(The documents are already single sentences.)
- Doc1: Akash and Ajay are best friends.
- Doc2: Akash likes to play football but Ajay prefers to play online games.
Step 2 — Lowercasing
Convert all characters to lowercase for uniformity.
- Doc1 → akash and ajay are best friends.
- Doc2 → akash likes to play football but ajay prefers to play online games.
Step 3 — Remove punctuation
Remove punctuation marks (periods, commas, question marks, etc.).
- Doc1 → akash and ajay are best friends
- Doc2 → akash likes to play football but ajay prefers to play online games
Step 4 — Tokenization
Split each sentence into words (tokens).
- Doc1 tokens → [akash, and, ajay, are, best, friends]
- Doc2 tokens → [akash, likes, to, play, football, but, ajay, prefers, to, play, online, games]
Step 5 — Remove stop words
Remove common words that add little meaning (e.g., and, are, to, but).
- Doc1 → [akash, ajay, best, friends]
- Doc2 → [akash, likes, play, football, ajay, prefers, play, online, games]
Step 6 — Lemmatization (preferred)
Convert words to their dictionary/base form (lemmas). This keeps real words (better than simple stemming).
- friends → friend
- likes → like
- prefers → prefer
- games → game
- (other words remain the same)
After lemmatization:
- Doc1 → [akash, ajay, best, friend]
- Doc2 → [akash, like, play, football, ajay, prefer, play, online, game]
Step 7 — Reconstruct normalised text
Join the tokens back together (space-separated) to get the final normalised form used for further processing.
- Final Normalised Doc1: akash ajay best friend
- Final Normalised Doc2: akash like play football ajay prefer play online game
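The whole pipeline can be sketched in plain Python; the stop word list and lemma dictionary below are simplified stand-ins for this example (a real pipeline would use NLTK or spaCy):

```python
import string

# Simplified stop word list and lemma dictionary, just for this example.
STOP_WORDS = {"and", "are", "to", "but"}
LEMMAS = {"friends": "friend", "likes": "like", "prefers": "prefer", "games": "game"}

def normalise(text):
    text = text.lower()                                               # Step 2: lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Step 3: strip punctuation
    tokens = text.split()                                             # Step 4: tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]               # Step 5: drop stop words
    tokens = [LEMMAS.get(t, t) for t in tokens]                       # Step 6: lemmatize
    return " ".join(tokens)                                           # Step 7: reconstruct

print(normalise("Akash and Ajay are best friends."))
# -> akash ajay best friend
```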
🧮 10. Calculate TF–IDF for the given corpus, step by step:
Document 1: Johny Johny Yes Papa
Document 2: Eating sugar? No Papa
Document 3: Telling lies? No Papa
Document 4: Open your mouth, Ha! Ha! Ha!
✅ Answer:
Step 1 – Clean the Text
| Document | Cleaned Text |
| --- | --- |
| Document 1 | johny johny yes papa |
| Document 2 | eating sugar no papa |
| Document 3 | telling lies no papa |
| Document 4 | open your mouth ha ha ha |
Step 2 – Dictionary (Vocabulary Words)
johny | yes | papa | eating | sugar | no | telling | lies | open | your | mouth | ha
Step 3 – Document Vector Table (Term Frequency – TF)
Each number shows how many times a word appears in that document.
| Document | johny | yes | papa | eating | sugar | no | telling | lies | open | your | mouth | ha |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Doc 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc 2 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Doc 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 3 |
🟢 These values represent the Term Frequency (TF) — the frequency of each word in one document.
Step 4 – Document Frequency (DF)
Number of documents in which each word appears (no matter how many times).
| Word | johny | yes | papa | eating | sugar | no | telling | lies | open | your | mouth | ha |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DF | 1 | 1 | 3 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
Step 5 – Inverse Document Frequency (IDF)
The formula for each word is:

IDF(word) = log(Total number of documents ÷ Number of documents containing the word) = log(4 / DF)

| Word | johny | yes | papa | eating | sugar | no | telling | lies | open | your | mouth | ha |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IDF = log(4/DF) | log(4/1) | log(4/1) | log(4/3) | log(4/1) | log(4/1) | log(4/2) | log(4/1) | log(4/1) | log(4/1) | log(4/1) | log(4/1) | log(4/1) |
| Value (log base 10) | 0.602 | 0.602 | 0.125 | 0.602 | 0.602 | 0.301 | 0.602 | 0.602 | 0.602 | 0.602 | 0.602 | 0.602 |
Step 6 – TF–IDF Table (TF × IDF)
Multiply each term frequency by its corresponding IDF (log formula) for each document.
| Document | johny | yes | papa | eating | sugar | no | telling | lies | open | your | mouth | ha |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Doc 1 | 2×log(4/1) | 1×log(4/1) | 1×log(4/3) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc 2 | 0 | 0 | 1×log(4/3) | 1×log(4/1) | 1×log(4/1) | 1×log(4/2) | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc 3 | 0 | 0 | 1×log(4/3) | 0 | 0 | 1×log(4/2) | 1×log(4/1) | 1×log(4/1) | 0 | 0 | 0 | 0 |
| Doc 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1×log(4/1) | 1×log(4/1) | 1×log(4/1) | 3×log(4/1) |
🟢 Each cell gives the final TF–IDF weight: a word scores highest when it appears often in one document but rarely across the corpus.
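For checking the arithmetic, here is a short sketch that reproduces these tables in plain Python, using log base 10 as in the formula above:

```python
import math
from collections import Counter

# Step 1: the cleaned corpus.
docs = [
    "johny johny yes papa",
    "eating sugar no papa",
    "telling lies no papa",
    "open your mouth ha ha ha",
]

token_lists = [doc.split() for doc in docs]
vocab = sorted(set(word for tokens in token_lists for word in tokens))  # Step 2
n_docs = len(docs)

# Step 4: document frequency, i.e. how many documents contain each word.
df = {w: sum(w in tokens for tokens in token_lists) for w in vocab}
# Step 5: inverse document frequency, log10(N / DF).
idf = {w: math.log10(n_docs / df[w]) for w in vocab}

# Steps 3 and 6: term frequency per document, then TF × IDF.
for i, tokens in enumerate(token_lists, start=1):
    tf = Counter(tokens)
    row = {w: round(tf[w] * idf[w], 3) for w in vocab if tf[w]}
    print(f"Doc {i}:", row)
```

Running this prints, for example, Doc 1 as {'johny': 1.204, 'papa': 0.125, 'yes': 0.602}, matching the table values 2×log(4/1), 1×log(4/3), and 1×log(4/1).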