This idea didn’t spontaneously come to you — it’s been in your mind for years. Years of watching and noticing something missing in the market. A unique way to solve a problem you’ve seen before.
If this year has shown us anything, it's that we need to just start. Things can change at any time, and that perfect moment you've been waiting for to launch your shiny new idea might never come.
So that’s what you do. You start working on your idea. In the end, you need to learn how to make the best out of every situation.
Just like you, many innovators have found themselves in that spot, myself included.
We started developing iSearch roughly one year ago, after the idea had been in my mind for several years. We weren't expecting a pandemic, but we decided to push through it. Today, iSearch is a reality.
I’ve already talked extensively about all the ways iSearch can help innovators validate their ideas, find their competitors, and innovate faster. But, today, I want to talk about how we made that happen. Specifically, about some of the tools we used to build iSearch.
It's only by sharing what we know that we can make our ideas stronger and better. It's only by sharing that we can encourage others to keep innovating!
To build iSearch.ai, our agile competitive intelligence tool, we used many different fine-tuned models and embeddings. Here's how we used them.
Before trying any tool, we needed to have a super clear idea of what we were building. Although ideas evolve when you start working on them, we were sure about three things:
Our end goal is to help innovators smoke out their competitors and iterate faster! We envision a future where you can use iSearch to fly past your competitors simply because you have the right tool by your side.
From the beginning, we knew we’d use AI and NLP (natural language processing) to accomplish this. But, as you already know, building something of this nature takes more than just mentioning AI and NLP.
The most important part of iSearch is its ability to understand and analyze ideas. To use the tool, innovators describe their ideas in natural language, just as if they were sharing them with a friend over a cup of coffee. Then, iSearch compares that idea with others out there, based on patents and a database of companies. Sounds easy enough!
But, of course, we quickly ran into a universal NLP problem: semantic textual similarity.
How similar is similar? And, what does similar even mean in the context we’re talking about?
This was a big problem because the core of our software was precisely to be able to compare ideas and decide how similar they were to other existing ones.
To solve this, we’d need to use word embedding and sentence embedding, which we talk about here.
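As a rough illustration of the idea (the toy word vectors below are made up; real systems use trained embedding models with hundreds of dimensions), a sentence embedding can be as simple as averaging the vectors of the words in the sentence:

```python
import numpy as np

# Toy word vectors for illustration only. A real system would load
# trained embeddings (e.g. 300-dimensional vectors from a language model).
word_vecs = {
    "patent": np.array([1.0, 0.0, 0.2]),
    "search": np.array([0.8, 0.1, 0.3]),
    "tool":   np.array([0.5, 0.5, 0.0]),
}

def sentence_embedding(sentence: str) -> np.ndarray:
    """Average the vectors of known words: the simplest sentence embedding."""
    vecs = [word_vecs[w] for w in sentence.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0)

emb = sentence_embedding("patent search tool")
```

Once every idea, company description, and patent is turned into a vector like `emb`, comparing them becomes a geometry problem.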
But, if you've been paying attention, we didn't just need to compare ideas; we needed to score ideas based on that comparison. Word and sentence embeddings helped us with the first part. How did we solve the second?
Once we had the embeddings, we started with cosine similarity to measure how similar an idea is to company and product descriptions, and also to patents.
We decided on cosine similarity because it measures the angle between vectors rather than their magnitude, so it isn't sensitive to document length. That makes it an ideal way to compare a short document, like an innovator's idea, to much longer documents, like patents.
I personally really like this tutorial on how to calculate and use cosine similarity; it can really help you if you're in the same situation!
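To make this concrete, here's a minimal NumPy sketch (the vectors are toy stand-ins for real embeddings), showing that scaling a vector up doesn't change its cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: a short "idea" vector and a longer document's vector.
idea = np.array([0.2, 0.8, 0.1])
patent = np.array([2.0, 8.0, 1.0])       # same direction, 10x the magnitude
unrelated = np.array([0.9, -0.1, 0.3])   # points a different way

print(cosine_similarity(idea, patent))     # magnitude doesn't matter: maximal similarity
print(cosine_similarity(idea, unrelated))  # much lower score
```

Because only the direction of the vectors matters, a two-sentence idea and a thirty-page patent can be compared on an equal footing.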
We didn't just use cosine similarity. We also used FAISS (Facebook AI Similarity Search), a great library that proved super helpful. It offers multiple distance measurements and, as its name suggests, it quickly searches through a pile of documents to find the one(s) most similar to an input document.
In other words, it helps to quickly locate that most similar competitor or product description, or patent, based on your idea description.
Another problem we considered was how to cluster the information we were working with, like company descriptions, descriptions of products, and patents.
We did this by applying embedding models, followed by distance measurements.
One great method we used is the GMM (Gaussian Mixture Model). We’re partial to the scikit-learn GMM.
There are other great clustering algorithms, like the Louvain method (sometimes called the Louvain algorithm) and, of course, k-means clustering.
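As a sketch of the clustering step (with synthetic 2-D points standing in for real, higher-dimensional document embeddings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic groups, standing in for document embeddings.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(embeddings)
labels = gmm.predict(embeddings)       # hard cluster assignment per document
probs = gmm.predict_proba(embeddings)  # soft memberships, useful for borderline cases
```

One nice property of GMM over k-means is those soft memberships: a document that sits between two clusters gets a split probability instead of being forced into one bucket.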
After we'd decided how to cluster the documents we had, we also needed to decide what to do with that information, especially how it would be used to give a score to an idea.
One method we used post-clustering is term frequency-inverse document frequency (TF-IDF). TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It down-weights words that are common across many documents and favors words that are rare in the corpus overall but appear often enough in a particular document.
This was super important, remember we were looking for a way to score how unique an idea is.
Not everything was perfect, though. When we applied TF-IDF to the company descriptions dataset, we ran into a potential flaw of this method: with such a large vocabulary, the highlighted words generalized poorly across similar documents.
To make it more robust, TF-IDF values can be aggregated over groups of similar sentences.
That's why we applied a clustering algorithm. For every cluster, we chose the words with the highest aggregated TF-IDF values as representatives of that cluster.
After applying this process, we had a tool that finds the company descriptions, product descriptions, or patents most similar to an input idea.
What’s exciting about building software is that there’s always room for improvement. Now that innovators can use the tool, they can help us improve iSearch!
If you want to test the tool, go ahead and register here. I'll be happy to hear back from you about your experience using it!