
Transcript vectorisation

Andrea Leone
University of Trento
January 2022

Load the pre-trained pipelines for word embeddings:

Both pipelines are trained on the WordNet 3.0 lexical database of English, the ClearNLP Constituent-to-Dependency Conversion, and the OntoNotes 5 corpus.
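A minimal sketch of the loading step, assuming the two pipelines are spaCy's English models. This is an assumption: the text names only the training data, and the package names `en_core_web_lg` and `en_core_web_trf` are illustrative defaults rather than confirmed choices. The variable names `nlp` and `trf` follow the ones used later in this notebook.

```python
def load_pipelines(static_name="en_core_web_lg", transformer_name="en_core_web_trf"):
    """Load a static-vector pipeline and a transformer pipeline.

    Both package names are assumed defaults; each must already be
    installed (for example with `python -m spacy download <name>`).
    """
    # Imported here so that defining the helper has no side effects.
    import spacy

    nlp = spacy.load(static_name)       # static word vectors
    trf = spacy.load(transformer_name)  # transformer-based pipeline
    return nlp, trf
```

Calling `nlp, trf = load_pipelines()` would then yield the two pipelines referred to in the sections that follow.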


Static model


Query the records that still have no vectorised transcript
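The query can be sketched as follows. The storage backend is not stated here, so this example assumes an SQLite database with a hypothetical `talks` table whose `vector` column is NULL until a transcript has been vectorised; the table and column names are illustrative only.

```python
import sqlite3

# Hypothetical schema: one row per record, `vector` is NULL until vectorised.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE talks (id INTEGER PRIMARY KEY, transcript TEXT, vector BLOB)"
)
conn.executemany(
    "INSERT INTO talks (id, transcript, vector) VALUES (?, ?, ?)",
    [
        (1, "First transcript", None),     # still to be vectorised
        (2, "Second transcript", b"\x00"),  # already has a stored vector
        (3, "Third transcript", None),     # still to be vectorised
    ],
)

# Select only the records whose vector is still missing.
pending = conn.execute(
    "SELECT id, transcript FROM talks WHERE vector IS NULL ORDER BY id"
).fetchall()
print(pending)
```

Filtering on the missing vector keeps the step idempotent: re-running it only picks up records that have not been processed yet.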


For each record retrieved, get the transcript and pass it through the nlp pipeline, which vectorises the entire document token by token. Then extract the document vector, converting its numerical values to float64.
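To illustrate what the document vector amounts to: with static vectors, spaCy's `Doc.vector` defaults to the average of the token vectors. The sketch below reproduces that pooling and the float64 conversion on a toy matrix with NumPy; the helper name `document_vector` and the sample values are hypothetical, and a real pipeline would supply the token vectors.

```python
import numpy as np

def document_vector(token_vectors):
    """Average token vectors into one document vector, cast to float64."""
    return np.asarray(token_vectors).mean(axis=0).astype(np.float64)

# Toy stand-in for three 2-dimensional token vectors stored as float32.
token_vectors = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float32)
vector = document_vector(token_vectors)
print(vector.dtype, vector.tolist())
```

Casting to float64 after pooling, rather than before, keeps the intermediate arrays small; the result has the same dtype either way.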


Transformer model


Query the records that still have no vectorised transcript


Transformer-based pretrained models work at the tensor level, so their output must be re-aligned to the tokens before word, span, or document vectors can be extracted.
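The idea of re-alignment can be sketched in a simplified form. A transformer emits one vector per word piece, and a token may span several word pieces, so a token vector can be obtained by averaging the rows aligned to it. This is an illustration under assumptions, not the custom task used in this notebook: the function name `align_to_tokens`, the list-of-lists alignment format, and the sample values are all hypothetical.

```python
import numpy as np

def align_to_tokens(wordpiece_vectors, alignment):
    """Average the word-piece vectors aligned to each token.

    wordpiece_vectors: array of shape (n_wordpieces, width)
    alignment: for each token, the list of word-piece row indices
    """
    width = wordpiece_vectors.shape[1]
    token_vectors = np.zeros((len(alignment), width), dtype=wordpiece_vectors.dtype)
    for i, rows in enumerate(alignment):
        if rows:  # a token with no aligned word pieces keeps a zero vector
            token_vectors[i] = wordpiece_vectors[rows].mean(axis=0)
    return token_vectors

# Toy example: token 0 spans word pieces 0 and 1, token 1 spans word piece 2.
wordpieces = np.array([[1.0, 1.0], [3.0, 5.0], [2.0, 2.0]], dtype=np.float32)
alignment = [[0, 1], [2]]

token_vectors = align_to_tokens(wordpieces, alignment)
doc_vector = token_vectors.mean(axis=0).astype(np.float64)
print(token_vectors.tolist(), doc_vector.tolist())
```

Mean pooling is only one possible choice; other strategies, such as taking the first word piece of each token, would fit the same interface.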


For each record retrieved, get the transcript and pass it through the trf pipeline, which vectorises the entire document using the transformer. The custom task then aligns the tensors with the tokens. Finally, extract the document vector, converting its numerical values to float64.
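The per-record loop might look like the sketch below. It assumes that records arrive as `(id, transcript)` pairs and that the pipeline returns an object exposing a `.vector` attribute once alignment has been applied. The helper `vectorise_records` and the `stub_pipeline` stand-in are hypothetical; the stub only makes the example runnable without a transformer model.

```python
from types import SimpleNamespace

def vectorise_records(records, pipeline):
    """Map each record id to its document vector as 64-bit floats."""
    vectors = {}
    for record_id, transcript in records:
        doc = pipeline(transcript)
        # Python floats are IEEE 754 doubles, i.e. float64 values.
        vectors[record_id] = [float(x) for x in doc.vector]
    return vectors

def stub_pipeline(text):
    """Stand-in pipeline: a 2-value 'vector' of word and character counts."""
    return SimpleNamespace(vector=(len(text.split()), len(text)))

records = [(1, "hello world"), (2, "a talk about ideas")]
result = vectorise_records(records, stub_pipeline)
print(result)
```

Passing the pipeline in as a parameter means the same loop can serve both the static and the transformer model.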