Getting Started

Photo by Viktor Talashuk on Unsplash

Introduction

Document similarity is one of the most important problems in NLP. Finding similar documents is useful in several domains, such as recommending similar books and articles, identifying plagiarised documents, and finding related legal documents.

We can call two documents similar if they are semantically close and describe the same concept, or if they are duplicates.

For a machine to figure out the similarity between documents, we need a way to measure similarity mathematically, and the measure must be comparable so that the machine can tell us which documents are most similar and which are least. …


Image by author.

Introduction

Hypothesis testing, or A/B testing, is a crucial step before integrating a data science solution or any feature update into a product. It is a statistical way of measuring the impact of the feature we are adding or updating.

According to Wikipedia,

A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and determining which of the two variants is more effective. A/B testing includes an application of statistical hypothesis testing.
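To make the statistical side of this definition concrete, here is a minimal sketch of one common way to compare two variants: a two-sided two-proportion z-test on conversion counts. The function name `two_proportion_z_test`, its parameters, and the counts in the usage lines are hypothetical and chosen only for illustration; this is one possible test, not necessarily the one used later in the article.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a, conv_b: number of conversions observed in variants A and B.
    n_a, n_b: number of subjects shown variants A and B.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200 of 1000 users convert on A, 250 of 1000 on B.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value suggests the observed difference between A and B is unlikely under the assumption that both variants perform the same. The normal approximation used here assumes reasonably large samples.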

Since A/B testing is a kind of test, it should have three…


Photo by Artem Kniaz on Unsplash

What is cosine similarity?

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between the two vectors.

Cosine similarity is one of the most widely used and powerful similarity measures in data science. It is used in many applications, such as finding similar documents in NLP, information retrieval, finding DNA sequences similar to a given sequence in bioinformatics, detecting plagiarism, and many more.

Cosine similarity is calculated as follows,
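The formula image is missing from this excerpt. The standard definition is cos(θ) = (A · B) / (‖A‖ ‖B‖), that is, the dot product of the two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the function name `cosine_similarity` and the example vectors are illustrative assumptions, not taken from the article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative vectors: the second is the first scaled by 2,
# so they point in the same direction.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Perpendicular vectors share no direction.
print(cosine_similarity([1, 0], [0, 1]))
```

Because the result depends only on the angle, scaling a vector does not change its similarity to another vector, which is why the measure works well for documents of different lengths.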


Photo by Markus Winkler on Unsplash

Introduction

tf-idf, which stands for term frequency-inverse document frequency, is used to calculate a quantitative digest of a document, which can then be used for tasks such as finding similar documents and classifying documents.

This article explains tf-idf, its variations, and the impact of these variations on the model output.

tf-idf, which stands for term frequency-inverse document frequency, is similar to Bag of Words (BoW), where documents are treated as a bag or collection of words/terms and converted to numerical form by counting the occurrences of every term. …
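As a rough illustration of how term counts become tf-idf weights, the sketch below uses one common variant: term frequency as count divided by document length, and inverse document frequency as log(N / df). The function name `tf_idf`, the whitespace tokenisation, and the sample documents are assumptions made for this example; the article discusses several variations, and libraries often use smoothed formulas instead.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    # Naive tokenisation: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical two-document corpus.
docs = ["the cat sat", "the dog barked"]
for doc, doc_weights in zip(docs, tf_idf(docs)):
    print(doc, doc_weights)
```

With this variant, a term that appears in every document (here "the") gets a weight of zero, while terms unique to one document get a positive weight.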

Varun

Principal Data Scientist | Machine Learning | Deep Learning | NLP | www.linkedin.com/in/varun21290
