Notebook Four | Repository

Linguistic features

Andrea Leone
University of Trento
February 2022


Load the "classic" English nlp pre-trained language processing pipeline, optimised for CPU. Like in the first notebook, the pipeline ingests a raw string, i.e. the talk transcript, and returns a spaCy.Doc object that comprises the result of several different steps.
[Figure: the SpaCy NLP pipeline]

Features

Statistical models

While some of SpaCy’s features work independently, others require trained pipelines to be loaded, enabling SpaCy to predict linguistic annotations. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data, as in the case of en_core_web_lg, which includes the following components: tok2vec, tagger, parser, attribute_ruler, lemmatizer and ner.


Load the pipeline.
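A minimal sketch of this cell, assuming the en_core_web_lg package has already been downloaded (e.g. via python -m spacy download en_core_web_lg):

    import spacy

    # Load the large English pipeline trained on written web text (CPU-optimised)
    nlp = spacy.load("en_core_web_lg")

    # List the components the pipeline runs, in order
    print(nlp.pipe_names)
    # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']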


Process all transcripts.
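The original cell is not shown; a sketch along these lines would do the job, assuming the transcripts are available as a list of strings named transcripts (a hypothetical name here):

    # nlp.pipe streams texts through the pipeline in batches,
    # which is considerably faster than calling nlp() once per transcript
    docs = list(nlp.pipe(transcripts, batch_size=50))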


Sample the pipeline on an excerpt of Dave Isay's talk.
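A sketch of the sampling step; the sentence below is a stand-in for the actual excerpt used in the notebook:

    # Process a single excerpt and keep the resulting Doc around for inspection
    sample = nlp("Everybody around you has a story the world needs to hear.")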


Tokenisation

The first step in the process is tokenising the text: SpaCy segments it into words, punctuation and so on. This is done by applying rules specific to each language. First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
  2. Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

If there is a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, SpaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
[Figure: SpaCy tokenization]
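To inspect the tokenizer's output we can simply iterate over the Doc; a minimal sketch using the sample defined above:

    # A Doc behaves like a sequence of Token objects
    print([token.text for token in sample])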


Linguistic annotations

SpaCy provides a variety of linguistic annotations to give insights into a text’s grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if we are analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.

Extract and show annotations for each token in the sample
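A sketch of such a cell, printing the most common token-level annotations SpaCy exposes as attributes:

    # Text, lemma, coarse- and fine-grained POS tags, dependency label,
    # orthographic shape, and two boolean flags
    for token in sample:
        print(token.text, token.lemma_, token.pos_, token.tag_,
              token.dep_, token.shape_, token.is_alpha, token.is_stop)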


Morphological Features

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech.
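Each token's morphological features are exposed through the token.morph attribute; a small illustration (the example sentence is ours, not from the talks):

    doc = nlp("I was reading the paper.")
    token = doc[0]  # "I"

    # MorphAnalysis object, e.g. Case=Nom|Number=Sing|Person=1|PronType=Prs
    print(token.morph)

    # Individual features can be queried by name
    print(token.morph.get("PronType"))  # e.g. ['Prs']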


Linguistic dependencies

Visualizing a dependency parse or named entities in a text is not only a fun NLP demo – it can also be incredibly helpful in speeding up development and debugging your code and training process. displaCy can either spin up a simple web server or, as we do here, render the result straight in the notebook.
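A sketch of the visualisation cells, using displaCy's inline rendering on the sample from above:

    from spacy import displacy

    # Dependency parse, rendered directly in the notebook output
    displacy.render(sample, style="dep", jupyter=True)

    # Named entities highlighted in the running text
    displacy.render(sample, style="ent", jupyter=True)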



Vectors and Semantic Similarity

Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.


Get similarity scores from word tuples
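A sketch of this cell; the word pairs below are illustrative, not necessarily the ones used in the original notebook:

    # Illustrative word pairs (hypothetical examples)
    pairs = [("cat", "dog"), ("cat", "banana"), ("king", "queen")]

    for w1, w2 in pairs:
        t1, t2 = nlp(w1)[0], nlp(w2)[0]
        print(f"{w1} ~ {w2}: {t1.similarity(t2):.3f}")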


Under the hood, we compute the cosine similarity between the two word vectors.
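This can be verified by hand with NumPy: the score is the dot product of the two vectors divided by the product of their norms.

    import numpy as np

    t1, t2 = nlp("cat")[0], nlp("dog")[0]

    # Cosine similarity: dot product over the product of Euclidean norms
    cosine = t1.vector @ t2.vector / (np.linalg.norm(t1.vector) * np.linalg.norm(t2.vector))

    # Should match Token.similarity up to floating-point precision
    print(cosine, t1.similarity(t2))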

Computing similarity scores can be helpful in many situations, but it’s also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single “similarity” score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose. Here are some important considerations to keep in mind:

  1. There’s no objective definition of similarity. Whether “I like burgers” and “I like pasta” is similar depends on your application. Both talk about food preferences, which makes them very similar – but if you’re analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
  2. The similarity of Doc and Span objects defaults to the average of the token vectors. This means that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”.
  3. Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
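As a closing illustration, the burger/pasta example can be run directly; since Doc.similarity averages the token vectors, the two sentences score as highly similar even though they mention different foods:

    doc1 = nlp("I like burgers")
    doc2 = nlp("I like pasta")

    # High score: the sentences share most of their words and both concern food
    print(doc1.similarity(doc2))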