
Web-Based Database-Powered Chatbot Project – Module 1: Approximate String Matching – Towards Data Science

A project spanning web programming, math, AI, and more.
I have always been fascinated by chatbots, and since my teenage years I have created dozens of them while learning to program. However, I never had any concrete goal for my chatbots to achieve, so my only motivation was curiosity, and my bots were limited to very basic chit-chat. Of course, I would quickly drop each project.
But now, with a professional presence online for my main job, side hustles, and hobbies, I had the chance to create a chatbot with a purpose: answering the questions of my website’s visitors and guiding them through it. The power of JavaScript programming allowed me to do this quite easily. Moreover, I could equip my chatbot with two mechanisms to reply: one based on a large dataset of question-answer pairs assisted by approximate string matching functions, and another based on question-dependent fine-tuning of GPT-3. In this article I discuss the first track, which capitalizes on open datasets of question-answer pairs and stopwords, and on procedures to quantify the similarity of a pair of strings. The GPT-3 track will come up in a future article and is important for breaking free from the limitations of closed question-answer databases, so stay tuned!
As you saw in the summary above, I’ve always been interested in chatbots. It turns out that I’ve now built a complete chatbot (which I actually keep extending every day), all web-based, that pops up on different pages of my website, guiding its visitors and answering their questions. I wrote this chatbot as an HTML/JS component, so I can easily incorporate it into any page with just a script tag. By using the browser’s localStorage feature, my chatbot can maintain fluent conversations when the user opens multiple pages of my site, even in different tabs or windows. By using CSS, the chatbot can easily adjust to smartphone or computer screens. And I let the bot write the conversations to a file on my site, which I can then inspect to learn what visitors usually ask and how they usually deviate from the core knowledge database, so that I can improve the chatbot accordingly.
But I will talk about the design and features in some future article. Here I want to describe the first track of its main module, the module that allows the bot to directly respond to questions of my visitors and also engage in some basic chit-chat extracted from a database parsed through approximate string matching. In an upcoming article I will describe the chatbot’s second track, which extends its capabilities enormously by combining a knowledge base with text generation by GPT-3.
There are many ways to classify chatbots, but one very important distinction is whether they only provide answers taken literally from a database or they can actually make up text that stands as a reasonable reply to the question. The latter is much more challenging from the programming point of view, requiring some form of high-end natural language processing protocol to transform questions into reasonable answers. The main advantage of this AI-based approach is that, if well done, it is very generalizable and can provide correct answers to questions asked in many different ways; besides, it will be natively tolerant to typos and grammar errors. But it is not easy to build such kinds of AI programs; and most importantly, you always risk that the bot might make up text that includes incorrect information or even inappropriate content, sensitive information, or just unreadable text. High-end natural language processing programs such as GPT-3 can be a good solution, but even these can make up text with errors or inappropriate content.
I will turn to the GPT-3 module of my chatbot soon; here I will expand on the module that works through question/answer matching. In case you’re very curious and can’t wait for my article describing the bot’s GPT-3 module, let me share here some pieces I wrote about GPT-3, where you’ll find some negative and positive points plus quite a bit of code and instructions for you to run your own tests with this technology:
- Build custom-informed GPT-3-based chatbots for your website with very simple code
- Devising tests to measure GPT-3’s knowledge of the basic sciences
- Testing GPT-3 on Elementary Physics Unveils Some Important Problems
- GPT-3-like models with extended training could be the future 24/7 tutors for biology students
And be sure that I will come back to GPT-3 soon, as it powers the second track of my website’s chatbot’s brain.
Here you have a screenshot of my chatbot using its question/answer matching track to chat with a human user who happened to ask for jokes:
Let’s see how this question/answer matching module works. It is an alternative to complex AI models, and safer, because it only replies with what’s coded in the database. Essentially, it consists in "simply" searching for the human’s inputs inside a database of questions and answers, and then providing the corresponding pre-compiled answer(s). If the database is sufficiently large and the human is warned that the bot is limited to certain topics only, the overall experience might be just good enough, at least within its intended usage.
There are a couple of important points to address, though: where to get a good database, and how to deal with typos, errors, and variability in the human’s input.
The database problem has no easy solution. As I detail below, for my chatbot I took an open-source (MIT license) database of question-answer pairs from Microsoft’s GitHub account, and extended it with content specific to me and the projects I work on, because that’s what the chatbot is supposed to answer about. Meanwhile, the difficulty introduced by typos, errors, and input variability can be tackled by searching the database not for exact questions but for questions that resemble each of the entries in the database. This requires the use of string comparison metrics rather than perfect matching, and quite some cleanup of the human’s input before doing the search.
Let’s see all these points one by one.
I coded into my chatbot a function that can clean up different aspects of human input. Depending on the arguments, the function will extend contractions, remove symbols, remove numbers in multiple forms, or remove stopwords. Notice that this means the database should preferably not have any symbols, numerical information, or contractions, as they will lower the match scores upon search. For example, all instances of "website’s" in the database are expanded to "website is".
The operations involved in cleaning seem trivial, but again they are limited by the availability of databases, for example of stopwords. I compiled quite a long list from several resources, which you can now just borrow, but please acknowledge me, just as my code acknowledges my sources!
Here’s the full function, including the lists of stopwords, symbols, etc.: http://lucianoabriata.altervista.org/chatbotallweb/cleaningfunction.txt
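The linked function is long; as a minimal sketch, with illustrative (much shorter) lists of contractions and stopwords and hypothetical names, the cleaning logic could look like this:

```javascript
// Illustrative sketch of an input-cleaning function. The real lists of
// contractions and stopwords are much longer (see the link above), and the
// names used here are mine, not those of the actual code.
const CONTRACTIONS = { "what's": "what is", "i'm": "i am", "website's": "website is" };
const STOPWORDS = ["the", "a", "an", "of", "to", "is", "and"];

function cleanInput(text, { expandContractions = true, removeSymbols = true,
                            removeNumbers = true, removeStopwords = false } = {}) {
  let s = text.toLowerCase().trim();
  if (expandContractions) {
    for (const [short, long] of Object.entries(CONTRACTIONS)) {
      s = s.split(short).join(long); // replace every occurrence
    }
  }
  if (removeSymbols) s = s.replace(/[^a-z0-9\s']/g, " ");
  if (removeNumbers) s = s.replace(/\b\d+\b/g, " ");
  if (removeStopwords) {
    s = s.split(/\s+/).filter(w => w && !STOPWORDS.includes(w)).join(" ");
  }
  return s.replace(/\s+/g, " ").trim(); // collapse leftover whitespace
}
```

Depending on the options passed, the same input can be reduced to different degrees, which is exactly what the different parts of the chatbot need.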
Notice that when calling the function one can choose what to clean. Some parts of my code request full cleanup, while others request cleanup of symbols and numbers but not stopwords. Besides, my code also cleans up other potential sources of problems right before creating the search query for the knowledge database. For example, it replaces occurrences of "he", "him", and "his" with "luciano", because I assume that anybody asking my website’s bot about a third person refers to me, and it is coded like this in the database. Of course, this will not work properly if the visitor is actually asking about another person… Anyway, the database has "Luciano" everywhere in its answers, so it will be clear that the answers refer to me even if the human might be thinking about somebody else. Likewise, part of the cleaning procedure is taking all inputs to lowercase and having all questions in the database in lowercase too (while all answers are properly capitalized). Besides, all inputs and questions are trimmed of any leading or trailing spaces.
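That extra normalization step right before querying the knowledge base can be sketched as follows (the function name is mine; the pronoun mapping is the one just described):

```javascript
// Illustrative pre-query normalization: lowercase, trim, and map
// third-person pronouns to "luciano", as the database expects.
function normalizeForQuery(text) {
  return text
    .toLowerCase()
    .trim()
    .replace(/\b(he|him|his)\b/g, "luciano"); // \b avoids touching e.g. "the"
}
```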
For my chatbot I took the English version of Microsoft’s personality chit-chat database for chatbots, and started adding content specific to me and the projects I work on. Why? Well, because the whole point of my bot is to guide the visitors of my websites and answer their questions about me and my projects, and of course Microsoft doesn’t know anything about me and my projects! In fact, when the visitor lands on my website, the chatbot already explains that its knowledge is quite limited to certain topics (which I introduced manually into the database) and basic chit-chat (from Microsoft’s database plus some custom additions and edits).
Here’s the database I used from Microsoft. As you can see there are different languages and personality styles supported:
botframework-cli/packages/qnamaker/docs/qnaFormat at main · microsoft/botframework-cli
I actually reshaped this file to have question-answer pairs in the same line, which makes it much easier to then add more entries. And for many questions, I give multiple possible answers so that when a visitor repeats a question or asks two very similar questions, the bot doesn’t always repeat itself.
This is an example entry from the knowledge base:
Each line is separated by ## delimiters into 4 fields: the first field contains all the possible ways to ask a question (well, here just some ways to say hello; there are more in another line), separated by ||. The last field is a list of possible answers, again separated by ||, here with two different options.
The second field contains a kind of "disclaimer" that the chatbot will use if it finds only a partial match to one of the questions, before giving the possible answer, to produce a more natural conversation. For example, if the user asks "what’s your name?" with some typos, then the bot will answer "Asking for my name?" followed by one of the preset answers. Notice that, the way I coded my bot, this will not trigger if the typo is very small. For example, here I ask first with a single typo (which results in a direct answer) and then with multiple typos (where the answer is preceded by a small disclaiming sentence):
The third field contains keywords representing the main topic of the question-answer pair, useful to help maintain at least some minimal context in a conversation. For example, here the human asks about moleculARweb (a website I developed together with a colleague at work) and then asks another question referring to it simply as "it"… and the chatbot gets it:
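A line in this format can be parsed with a few splits. Here is an illustrative sketch, where the field names and the example line are mine, merely following the described format:

```javascript
// Parse one knowledge-base line: four ##-separated fields, with
// ||-separated question variants and answer variants.
function parseEntry(line) {
  const [questions, disclaimer, keywords, answers] = line.split("##");
  return {
    questions: questions.split("||").map(q => q.trim()),
    disclaimer: disclaimer.trim(),
    keywords: keywords.trim(),
    answers: answers.split("||").map(a => a.trim()),
  };
}

// Example line, made up to match the described format:
const entry = parseEntry(
  "hello||hi there||good morning##Saying hello?##greetings##Hello! How can I help?||Hi! Ask me anything."
);
```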
Of course the fastest option to search questions is to simply match the input text typed by the human to each possible input listed in each possible line. And the chatbot does this, as the first thing it attempts. If a match is found, the code randomly chooses one of the listed answers and displays it. Try for example asking my chatbot for some jokes:
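This exact-match path, with the random pick among the listed answers, can be sketched like this (the names and the sample entries are illustrative, not my actual code):

```javascript
// Fastest lookup: exact match of the input against each listed question,
// then a random pick among that entry's possible answers.
function exactMatchReply(input, entries) {
  for (const entry of entries) {
    if (entry.questions.includes(input)) {
      return entry.answers[Math.floor(Math.random() * entry.answers.length)];
    }
  }
  return null; // no exact match found
}

// Example knowledge base with one entry and two possible answers:
const kb = [{
  questions: ["tell me a joke"],
  answers: ["Why did the chicken cross the road?", "Knock, knock!"],
}];
```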
The chatbot also runs this perfect-match search after removing all stopwords from the human’s input and from all possible inputs. Again, if there’s a perfect match it will display an answer from the list. But before that, the chatbot attempts to find the question as typed by the human inside the database, this time allowing for typos, grammar mistakes, and even swapped words. To achieve this it uses two functions that measure string similarity:
The Jaro-Winkler similarity, which measures how similar two character sequences are based on their matching characters and transpositions, with a bonus for a shared prefix. It ranges from 0 to 1, with 1 meaning a perfect match. See here for a Wikipedia entry, and here for the original papers by Jaro and Winkler.
The text cosine similarity, which measures how well the occurrence counts of each word match between the two strings, computed as the cosine of the angle formed by the n-dimensional vectors built from the frequencies of all n words appearing in both strings. It also ranges from 0 to 1, with 1 meaning a perfect match. It is one specific application of the more general cosine similarity, about which you can read in this very nice article at TDS by Ben Chamblee.
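For reference, here are compact JavaScript sketches of both metrics, following the standard algorithms. These are my own illustrative implementations, not the exact functions used in the chatbot (see below for their source):

```javascript
// Jaro-Winkler similarity: matching characters within a sliding window,
// transposition count, and a bonus for up to 4 shared leading characters.
function jaroWinkler(s1, s2) {
  if (s1 === s2) return 1;
  const len1 = s1.length, len2 = s2.length;
  if (len1 === 0 || len2 === 0) return 0;
  const window = Math.max(Math.floor(Math.max(len1, len2) / 2) - 1, 0);
  const m1 = new Array(len1).fill(false);
  const m2 = new Array(len2).fill(false);
  let matches = 0;
  for (let i = 0; i < len1; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(len2 - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (!m2[j] && s1[i] === s2[j]) {
        m1[i] = m2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  let transpositions = 0, k = 0; // count matched characters out of order
  for (let i = 0; i < len1; i++) {
    if (!m1[i]) continue;
    while (!m2[k]) k++;
    if (s1[i] !== s2[k]) transpositions++;
    k++;
  }
  const jaro = (matches / len1 + matches / len2 +
                (matches - transpositions / 2) / matches) / 3;
  let prefix = 0; // Winkler bonus with the usual 0.1 scaling factor
  while (prefix < Math.min(4, len1, len2) && s1[prefix] === s2[prefix]) prefix++;
  return jaro + prefix * 0.1 * (1 - jaro);
}

// Text cosine similarity: word-frequency vectors over the union of the
// words in both strings, then the cosine of the angle between them.
function textCosineSimilarity(a, b) {
  const wordCounts = (s) => {
    const counts = {};
    for (const w of s.toLowerCase().split(/\s+/).filter(Boolean)) {
      counts[w] = (counts[w] || 0) + 1;
    }
    return counts;
  };
  const ca = wordCounts(a), cb = wordCounts(b);
  const vocabulary = new Set([...Object.keys(ca), ...Object.keys(cb)]);
  let dot = 0, normA = 0, normB = 0;
  for (const w of vocabulary) {
    const x = ca[w] || 0, y = cb[w] || 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```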
Notice that by construction, the Jaro-Winkler similarity will score pairs of strings with typos and alternative spellings as highly similar: say, for example, super and suter, or modeling and modelling. On the other hand, text cosine similarity will score single words with typos very poorly, because each will be counted as a different word in the frequency calculations. But conversely, and unlike the Jaro-Winkler similarity, the text cosine similarity metric will score pairs of sentences made up of the same words in different orders as matching perfectly. These two string similarity metrics are thus clearly complementary.
That’s why I incorporated both into my chatbot: if either one of them scores above the similarity threshold against a question of the knowledge base, that question is taken as the actual input the user typed (or might have wanted to type).
The thresholding actually works at three levels: when the similarity between the user’s text and one of the questions of the knowledge base is above 0.95, it is treated as a perfect match, so the answer from the base is given right away. If the score is between 0.88 and 0.95, the program gives the same answer but preceded by a variable sentence of the kind "Did you mean this?". If the score is between 0.8 and 0.88, the chatbot clarifies that it’s not sure about the question, then shows a candidate question from the base and its corresponding answer.
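The tiered logic just described can be sketched as follows (the function name and the exact wording of the canned sentences are illustrative):

```javascript
// Map a similarity score against a candidate question to a reply, using
// the three bands described above: >= 0.95, 0.88-0.95, and 0.8-0.88.
function replyForScore(score, question, answer) {
  if (score >= 0.95) return answer; // treated as a perfect match
  if (score >= 0.88) return "Did you mean this? " + answer;
  if (score >= 0.8)
    return "I am not sure I got your question. Maybe you asked '" +
           question + "'? If so: " + answer;
  return null; // below 0.8: no acceptable match in the knowledge base
}
```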
But how exactly are the Jaro-Winkler and text cosine similarities computed?
I took these functions from this excellent article by Suman Kunwar, which provides them already in JavaScript:
String Similarity Comparision in JS with Examples
In fact, that article describes (and gives code for) 4 string comparison functions, but I took only Jaro-Winkler and text cosine for the reasons given above. They turned out to work quite well for me, although no, they aren’t infallible.
As one last note, it is important that the inputs and all possible questions are all lowercase, trimmed, and cleared of all symbols and numbers, but not of stopwords, which can often help define the overall meaning of a sentence.
You can try my chatbot here:
http://lucianoabriata.altervista.org/chatbotallweb/chatbottest.html
If you provide a GPT-3 API key (which you can get for free here), you can use the bot’s more advanced module; but this is still under development (I will put up an article when it’s fully functional).