Jason Milliken

FAISS

How FAISS makes search better

Uncovering Valuable Data Hidden in the Notes Field of Line of Business Applications

Every line-of-business application seems to have a notes field. Critical information is often squirreled away there, which makes the field a valuable source of business data. It can also be a black hole: free-form text is hard to query, so the information you need is difficult to find. For example, how do you quickly find all the notes that say a customer needs to be contacted?

One approach is to write SQL queries that search for notes containing specific keywords like "email," "call," or "contact." But this method is error-prone because it only finds the words you thought to list. Notes that use other words, such as "text," "sms," or "message," are missed.
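To make the limitation concrete, here is a minimal sketch of a keyword filter applied to the sample notes used later in this article. It uses a Python regular expression as a stand-in for a SQL `LIKE` query; the keyword list is the one from the paragraph above, and the variable names are illustrative only.

```python
import re

notes = [
    "Call client back",
    "Client requested notification",
    "Follow up with client",
    "Discount all invoices by 15%",
    "Shipping is free for this client",
    "Do not call after 3PM",
    "Contact customer after order is complete",
]

# Stand-in for: WHERE note LIKE '%email%' OR note LIKE '%call%' OR note LIKE '%contact%'
keywords = ["email", "call", "contact"]
pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

keyword_hits = [note for note in notes if pattern.search(note)]
keyword_misses = [note for note in notes if not pattern.search(note)]

print("Matched:", keyword_hits)
print("Missed: ", keyword_misses)
```

The filter matches the three notes that contain "call" or "contact," but "Client requested notification" and "Follow up with client" land in the missed list even though both are about reaching out to a customer.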

Another approach is to ask users to prepend specific keywords like "COMMS" or "CONTACT" to notes that require communication. However, this method can be challenging to enforce and relies on users remembering to add the keywords.

What businesses really need is a semantic search that can find notes related to communication without relying on specific keywords. Fortunately, FAISS can do just that.

FAISS (Facebook AI Similarity Search) is a library for fast similarity search over dense vectors, such as text embeddings. Paired with an embedding model, it lets you search for notes that are semantically similar to a given search term, even if they don't contain the same keywords. The walkthrough below shows how it works.
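The core idea can be sketched without any libraries. In the toy example below, each note is represented by a made-up three-dimensional vector (real embeddings come from a model and have hundreds of dimensions), and notes are ranked by cosine similarity to a query vector. The vectors and the `cosine_similarity` helper are assumptions for illustration, not FAISS code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only; a real model would produce these.
toy_embeddings = {
    "Call client back": [0.9, 0.1, 0.0],
    "Follow up with client": [0.8, 0.3, 0.1],
    "Discount all invoices by 15%": [0.1, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of "contact customer"

# Rank notes from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda note: cosine_similarity(query, toy_embeddings[note]),
    reverse=True,
)
print(ranked)
```

A vector index does the same kind of nearest-neighbor ranking, but over real embeddings and at a scale where comparing vectors one by one in Python would be too slow.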

Unlocking the value of the notes field

Let's say we want to find, among the recently updated notes below, the ones that require some form of customer communication:

  • Call client back
  • Client requested notification
  • Follow up with client
  • Discount all invoices by 15%
  • Shipping is free for this client
  • Do not call after 3PM
  • Contact customer after order is complete

Step 1: Pull the recent notes

import pandas as pd

# Sample notes; in a real application these would come from the database.
data = [
    ['Call client back'],
    ['Client requested notification'],
    ['Follow up with client'],
    ['Discount all invoices by 15%'],
    ['Shipping is free for this client'],
    ['Do not call after 3PM'],
    ['Contact customer after order is complete'],
]
df = pd.DataFrame(data, columns=['text'])

Step 2: Get embeddings from SBERT

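from sentence_transformers import SentenceTransformer

# Encode the notes into a NumPy array with one embedding row per note.
text = df['text'].tolist()
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
embeddings = encoder.encode(text)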

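Step 3: Build the FAISS index

import faiss

# IndexFlatL2 is an exact (brute-force) index that compares vectors by L2 distance.
embeddings_dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(embeddings_dimension)
# Normalize in place to unit length so the L2 ranking matches a cosine-similarity ranking.
faiss.normalize_L2(embeddings)
index.add(embeddings)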
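Why normalize before adding to an L2 index? For unit-length vectors, squared L2 distance and cosine similarity are tied together by the identity d² = 2 − 2·cos, so sorting by one is equivalent to sorting by the other. `IndexFlatL2` reports squared distances, so the identity applies to its output directly. The sketch below checks the identity in plain Python on two arbitrary vectors; the helper names are illustrative and are not part of FAISS.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    """Recover cosine similarity from a squared L2 distance between unit vectors."""
    return 1.0 - d2 / 2.0

# Two arbitrary example vectors, normalized to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

distance = squared_l2(a, b)
cosine = dot(a, b)  # for unit vectors, the dot product is the cosine similarity

# The identity d^2 = 2 - 2*cos holds up to floating-point error.
assert math.isclose(distance, 2 - 2 * cosine)
assert math.isclose(cosine_from_squared_l2(distance), cosine)
```

One consequence is that distances returned for normalized vectors fall between 0 (same direction) and 4 (opposite direction).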

Step 4: Search for semantically similar data

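import numpy as np

search_text = 'contact customer'
search_embeddings = encoder.encode(search_text)
# FAISS expects a 2-D array of queries, so wrap the single vector in a batch of one.
_embeddings = np.array([search_embeddings])
faiss.normalize_L2(_embeddings)
k = index.ntotal  # return every note, ranked from closest to farthest
# distances: squared L2 distances (smaller = more similar); ann: row positions of the notes in df
distances, ann = index.search(_embeddings, k=k)
results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})
merge = pd.merge(results, df, left_on="ann", right_index=True)
print(merge)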

Output:

   distances  ann                                      text
0   0.498994    6  Contact customer after order is complete
1   1.062409    1             Client requested notification
2   1.117589    0                          Call client back
3   1.195408    2                     Follow up with client
4   1.349133    5                     Do not call after 3PM
5   1.465960    4          Shipping is free for this client
6   1.624266    3              Discount all invoices by 15%

In this example, FAISS has ranked the notes by their distance from the search term "contact customer," where a smaller distance means a closer semantic match. Notes such as "Client requested notification" and "Follow up with client" contain neither search word, yet they rank above the billing and shipping notes because they are about customer communication.
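A ranking still leaves the question of where to stop reading. One option, sketched below, is to keep only the notes under a distance cutoff. The distances are copied from the output above; the cutoff of 1.2 and the `notes_within` helper are hypothetical choices for this sample, and a real cutoff would need tuning against your own notes.

```python
# (distance, note) pairs copied from the search output above.
ranked_notes = [
    (0.498994, "Contact customer after order is complete"),
    (1.062409, "Client requested notification"),
    (1.117589, "Call client back"),
    (1.195408, "Follow up with client"),
    (1.349133, "Do not call after 3PM"),
    (1.465960, "Shipping is free for this client"),
    (1.624266, "Discount all invoices by 15%"),
]

MAX_DISTANCE = 1.2  # hypothetical cutoff, chosen by eye for this sample

def notes_within(ranked, max_distance):
    """Keep the notes whose distance to the query is at or below the cutoff."""
    return [note for distance, note in ranked if distance <= max_distance]

communication_notes = notes_within(ranked_notes, MAX_DISTANCE)
for note in communication_notes:
    print(note)
```

With this cutoff the four closest notes are kept. "Do not call after 3PM" falls just outside it, which shows why the threshold deserves a review against real data before anyone relies on it.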

By using FAISS, businesses can unlock valuable data hidden in the notes field of their line of business applications. They can quickly find relevant notes without relying on specific keywords and get a more complete picture of their unstructured data.

Resources

FAISS Tutorial

FAISS Docs