Semantic search in Elasticsearch: Step-by-step

Traditional search is great for matching exact terms and even goes further with synonyms, fuzzy matching, and other techniques that aim to find relevant results for a user’s query. However, these techniques fall short when a user expresses a query in natural language or is looking for conceptually related content. This blog showcases an implementation of semantic search using Elasticsearch and e5-small, an NLP model developed by Microsoft researchers that generates embeddings capturing the underlying meaning of a text.

Understanding Semantic Search

Traditional keyword search has a few shortcomings and implies some management complexities, such as:

  • It misses conceptually related results.
  • Managing synonyms becomes increasingly difficult.
  • Multi-language support requires a very complex implementation.
  • Context can get lost in the search process.

The limitations of traditional text search become apparent when, for instance, we search for “car maintenance tips”. A traditional search engine will match documents containing the terms “car”, “maintenance”, and/or “tips”. However, it will completely miss highly relevant documents containing the phrase “automobile repair guides”, even though they are conceptually the same. We can mitigate this specific scenario using synonyms, but in general this approach becomes increasingly hard to maintain, and we would still miss the intent of the user because a traditional search engine, in the end, is just matching terms.

Other interesting examples come from the contrast between technical and casual language; for example, consider:

  • “Internal combustion engine maintenance” vs. “how to take care of your car’s motor”
  • “Brake system inspection” vs. “checking your brakes”
  • “Transmission fluid replacement” vs. “changing gear oil”

Semantic search, on the other hand, completely changes this paradigm by extracting the meaning of a text rather than trying to get exact or partially exact matches for the query. Instead of indexing terms and mapping synonyms, semantic search is based on multidimensional vector representations of the text that capture its meaning. These vectors enable vector operations, such as k-nearest-neighbor (kNN) search, to find the vectors closest to the one generated from the query in the same multidimensional space.
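To make the idea concrete, here is a minimal, self-contained sketch of nearest-neighbor ranking by cosine similarity. The three-dimensional “embeddings” below are invented stand-ins; a real model such as e5-small produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real models produce far larger vectors.
documents = {
    "car maintenance tips":     [0.90, 0.10, 0.00],
    "automobile repair guides": [0.85, 0.20, 0.10],
    "chocolate cake recipe":    [0.00, 0.10, 0.95],
}

query_vector = [0.88, 0.15, 0.05]  # pretend embedding of the user's query

# Rank documents by similarity to the query vector (the core of kNN search)
ranked = sorted(documents,
                key=lambda d: cosine_similarity(query_vector, documents[d]),
                reverse=True)
print(ranked)
```

Note how both car-related documents score far above the recipe, even though “automobile repair guides” shares no terms with the query.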

Semantic Search Implementation with Elasticsearch

This project uses an NLP model to generate vectors from Wikipedia articles and user queries, and Elasticsearch as the vector database and search engine to retrieve relevant results.

System Architecture

  • NLP Model: The chosen model, multilingual-e5-small (https://huggingface.co/intfloat/multilingual-e5-small), can generate vectors from text in multiple languages. 
  • Vector Database: Elasticsearch is one of the most versatile search engines. It supports vector storage and operations natively, making the entire process easier. Its RESTful API simplifies indexing documents, posting queries, and performing semantic searches.
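As an illustration of how such an index could be defined (this mapping is a hedged sketch, not taken from the project), the field names below are hypothetical, and the 384 dimensions match what multilingual-e5-small produces, assuming an Elasticsearch 8.x dense_vector field:

```python
import json

# Hypothetical index mapping: a text field plus a dense_vector field sized for
# multilingual-e5-small, which produces 384-dimensional embeddings.
mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "content_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```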

Dávila, A. (n.d.). Semantic search implementation architecture

Implementation Process

  • Data Ingestion: The e5 model, running inside a machine learning node in Elasticsearch, generates vectors from Wikipedia articles through an ingest pipeline that also runs within Elasticsearch. This approach reduces the complexity of the implementation by eliminating the need for an external script to calculate the vectors. As a result, when a new document is received, its vector is automatically calculated and stored.
  • Query Vectorization: The user’s query is vectorized with the same model, so it is projected into the same vector space as the data and can be compared directly.
  • Search Execution: The vector from the query is then compared to the data vectors using a K-Nearest-Neighbor algorithm to find the most similar ones. 
  • Semantic Search Results: The system returns results based on conceptual relevance, not just keyword presence.
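The steps above can be sketched as a search request body. The following builds a hypothetical Elasticsearch 8.x kNN search body with Python’s standard library; the field name and the query vector are placeholders, not values from the project:

```python
import json

# Hypothetical kNN search body for Elasticsearch 8.x. The "query_vector" would
# normally come from the same e5 model that embedded the documents; here it is
# a short placeholder (real e5 vectors have 384 dimensions).
knn_search = {
    "knn": {
        "field": "content_vector",            # assumed dense_vector field name
        "query_vector": [0.12, -0.07, 0.33],  # placeholder vector
        "k": 10,                              # nearest neighbours to return
        "num_candidates": 100,                # candidates examined per shard
    },
    "_source": ["title", "content"],
}

print(json.dumps(knn_search))
```

This body would be POSTed to the index’s `_search` endpoint; Elasticsearch then returns the `k` documents whose vectors are closest to the query vector.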

Key Advantages

  • Context is everything: Semantic search understands the meaning behind queries, not just the words.
  • No need for synonyms: It finds relevant content even when exact terms aren’t used.
  • Relevance reimagined: Results are based on conceptual similarity, significantly enhancing accuracy.
  • Language barriers? What barriers?: It works effectively across multiple languages.

Example Queries

Dávila, A. (n.d.). Semantic search results 1

Dávila, A. (n.d.). Semantic search results 2

Dávila, A. (n.d.). Semantic search results 3

Conclusion

This approach offers a very interesting alternative to traditional search. By understanding the context of the data, it can deliver better matches for complex user queries without requiring exact terms to be present.

With Elasticsearch, the process can be streamlined and enables more complex use cases, such as Retrieval-Augmented Generation (RAG) applications integrating semantic search with an LLM, or hybrid queries that provide a context-aware search engine. 

Personally, I recommend considering semantic search as part of modern search engines, given that it can deliver better results than traditional text matching by understanding the intent behind the query. It becomes even more powerful when combined with LLMs and AI agents, enabling things like conversational search and making it a foundational piece of next-generation search solutions.

Bibliography

intfloat. (n.d.). multilingual-e5-small. Hugging Face. https://huggingface.co/intfloat/multilingual-e5-small

Elastic. (n.d.). Semantic search. Elasticsearch Guide. https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-search.html

Written by:

Édgar Alexander Dávila
Elasticsearch Engineer
Country: Ecuador

Leveraging Large Language Models

Even if you’re not closely following the advancements in AI and Large Language Models (LLMs), it’s highly likely that you’ve encountered ChatGPT, whether in the news or through a friend. Instead of hearing, “Hey, let’s Google that”, you may now hear, “Hey, let’s ask ChatGPT”.

Let’s explore this technology and how it has been rapidly integrating into our daily lives, particularly in the workplace and across various activities within companies of all sizes. We’ll conclude with tips and advice on how to use it responsibly while maintaining realistic expectations. Your own workplace may already be taking initiatives to adopt tools or applications powered by LLMs.

First, let’s define what an LLM is: “It is a subset of Generative AI that refers to artificial intelligence systems capable of understanding and generating human-like language” (Chen, 2024, p.5). These generation capabilities are achieved through training on massive and diverse datasets over extended periods using deep learning — a branch of machine learning.

By leveraging layers of neural networks, which employ probability and other techniques to establish associations within unstructured data (e.g. text, images, videos, audio), LLMs generate the most likely response to a given query. Deep learning enables these models to mimic certain aspects of human cognition, processing vast amounts of information and making decisions in a way loosely analogous to how our brain maps memories, knowledge, and electric signals across billions of neurons.

Building on the concept of unstructured data, these models rely on it as the primary input and output for the services and tools increasingly integrated into our daily tasks. Text serves as the main interface for interacting with LLM applications. For instance, these models can summarize transcripts of virtual meetings, detailing what each participant shared, classifying the topics discussed, and even suggesting action items, features that are particularly helpful for those of us who often forget to take notes.

In broader, organization-wide use cases, LLMs assist with question answering by combining search functionality to process queries efficiently. This often involves techniques like Retrieval-Augmented Generation (RAG), which equips the LLM with relevant context from an organization’s internal data at query time. Such systems search through indexed company data in vector databases, delivering more accurate and tailored results.
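The retrieve-then-generate flow behind RAG can be sketched in a few lines. Everything below is illustrative: real systems retrieve with vector search over embeddings, while this toy version scores documents by naive keyword overlap, and the “LLM call” is left as the assembled prompt string.

```python
# Invented internal documents for the sketch
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Support tickets are answered within one business day.",
]

def retrieve(query, docs, top_k=1):
    # Score each document by how many query words it contains (toy retrieval)
    words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query, docs):
    # Put the retrieved context in front of the question for the LLM
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("what is the refund policy", documents)
print(prompt)
```

The key design point is that the model answers from retrieved context rather than from its fixed training data, which is what makes the results organization-specific.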

The multi-modal capabilities of LLMs have recently improved, allowing the integration of more complex data and interactions. These models can now follow prompts and instructions at a higher level of programmability, enabling outputs such as customized image or video animations. Chatbots powered by LLMs can also generate highly realistic voice responses, making it increasingly difficult to distinguish between a bot and a real person.

All of this sounds fantastic: revolutionary, disruptive, and incredibly innovative. The hype surrounding LLMs remains high, fueled by the rapid pace of advancements in model performance, framework availability, and the emergence of new businesses and technologies. And then there’s the ambitious concept of Artificial General Intelligence (AGI), a goal that continues to captivate imaginations, though it carries significant risks and demands cautious exploration.

However, there are important factors to consider when adopting this technology. As end users, we must exercise caution when assigning tasks to any LLM-powered service or application. One major limitation is the phenomenon of “hallucinations,” where models generate inaccurate or entirely fabricated responses. Even with well-tuned models, results can sometimes be unreliable. Therefore, it’s crucial not to trust these outputs blindly. Much like cross-checking information from multiple sources on Google, we should scrutinize LLM-generated responses and rely on our own domain knowledge and instincts.

To get the best results, turn interactions with LLMs into ongoing conversations rather than relying on single, context-free exchanges (known as zero-shot prompts). Providing clear, specific instructions increases the likelihood of achieving accurate and useful outputs.

What Lies Ahead?
As mentioned earlier, the technological landscape around LLMs is evolving rapidly. Speaking of AGI, while it may be the ultimate aspiration for some, it’s a goal that requires careful and deliberate steps. When leveraging LLMs, we must prioritize quality, security, privacy, and dependability in our professional and organizational activities. LLMs are powerful tools, but responsible use is key to unlocking their full potential.

Written by:

Alejandro Castillo
FullStack Engineer
Country: Costa Rica

Introduction to Elasticsearch

Elasticsearch stands as a powerful search engine enriched with analytical capabilities, all rooted in Lucene. This versatile platform seamlessly integrates three key solutions: Observability, Security, and Enterprise Search. Moreover, it offers the flexibility for users to craft ad hoc applications leveraging its robust search, machine learning, and analytics functionalities. Whether deployed on-premises or through the managed Elastic Cloud service, Elasticsearch empowers businesses with unparalleled search capabilities and data insights.

Key Features of Elasticsearch:

  • Full Text Search: Elasticsearch offers robust full-text search capabilities, including customizable analyzers tailored to suit specific use cases. 
  • Distributed Architecture and Scalability: Its distributed architecture allows Elasticsearch to scale horizontally, facilitating efficient data management and lifecycle processes. This scalability ensures high availability, making data resilient to major outages. 
  • Fast Response Times: Elasticsearch boasts impressively fast response times, making it ideal for customer-facing search applications. This attribute has led to its widespread adoption by online retailers worldwide.
  • Machine Learning Capabilities: Elasticsearch features dedicated machine learning nodes, providing access to pre-built models and the ability to upload and execute custom models. This opens up avenues for advanced natural language processing (NLP), clustering, and other machine-learning applications.

Main Concepts

1. Kibana: Kibana serves as a vital component within the Elastic ecosystem, offering a web interface for Elasticsearch. Positioned as the visualization and UI layer of the stack, Kibana empowers users with dashboards, maps, and a monitoring interface, facilitating the overall usability of the stack.

2. Elasticsearch Node: An Elasticsearch node represents an individual instance within the Elasticsearch infrastructure. Each node may fulfill one or more roles, such as data storage, master management, or machine learning capabilities.

2.1 Cluster: A cluster comprises one or more Elasticsearch nodes, with a minimum of three recommended to achieve high availability. Within an Elasticsearch cluster, data, processing, and management are shared, ensuring robustness and high availability.

3. Index: An index serves as a mechanism for organizing documents with similar characteristics within Elasticsearch. Each index has settings and mappings that dictate how data is stored and retrieved.

4. Shard: Shards are subdivisions of an index designed to be distributed across data nodes, thereby facilitating scalability and fault tolerance. Replicas are copies of shards maintained on different nodes to ensure data availability in the event of node failures. Additionally, having replicas facilitates distributed query processing, leading to faster response times.
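The shard and replica arithmetic can be made concrete with a small, illustrative index-settings body (the numbers below are examples, not recommendations):

```python
import json

# Illustrative index settings: 3 primary shards, each with 2 replicas, so every
# shard's data lives on three nodes (one primary copy + two replica copies).
index_settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
    }
}

# Total shard copies the cluster will try to allocate:
total_copies = index_settings["settings"]["number_of_shards"] * (
    1 + index_settings["settings"]["number_of_replicas"]
)
print(json.dumps(index_settings), total_copies)
```

With these settings the cluster allocates nine shard copies in total, and the index can survive the loss of any single data node without losing data.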

Basic Architecture for an Elastic Deployment

The simplest architecture ensuring high availability and stability typically consists of three data nodes, each fulfilling both data and master roles. Among these nodes, one is designated as the master node. With this configuration, up to two replicas can be maintained, distributing data across all nodes for redundancy.

Access is facilitated through a dedicated Kibana node, establishing a connection to the Elasticsearch nodes. Via Kibana, users can execute queries, construct visualizations, and manage the cluster, including configuration adjustments within Elasticsearch.

Alternatively, data access can be achieved by sending requests to the RESTful API provided by  Elasticsearch. This approach enables performing tasks similar to those accomplished through Kibana programmatically. A common scenario involves generating a search request based on user input, forwarding it to Elasticsearch, and presenting the results on the frontend.

Going further, we can have much more complex architectures, with multiple Kibana nodes; dedicated coordinating, master, and machine learning Elasticsearch nodes; and even data tiers.

Elasticsearch emerges as an invaluable tool catering to a spectrum of real-time use cases, ranging from its comprehensive full-text search functionality to machine learning-powered forecasting. With a robust architecture that ensures high availability, and the option to use it as a managed service, Elasticsearch can be used in production environments with confidence. In my experience, Elasticsearch is a very useful tool that enables a wide range of use cases and adapts very well to any of the client’s needs. It is useful for building search engines, recommendation systems, and observability and security platforms alike.

Written by:

Alexander Dávila
Software Engineer – Elastic Certified Engineer & Elastic Certified Analyst
Country: Ecuador

Introduction to Machine Learning: Breaking Down the Basics

In 2023, “AI” was declared the word of the year by the Collins Dictionary. According to the publishers, the use of this term has quadrupled. It can be asserted that 2023 will be remembered as the year that ushered in a new era of digital technology.

Wherever we turn, the presence of AI is evident in our daily lives – whether it’s in the creation of personal photos, video dubbing, the latest versions of company chatbots, or even in the new Beatles song playing on radio and music streaming platforms. This leads us to a question posed long ago by the mathematician and computer scientist, Alan Turing:

Can machines think?

This query forms part of a technical exercise proposed by the scientist in his 1950 article, famously dubbed the imitation game. In this game, a human judge engages with both a machine and a human without knowing which is which. If the judge cannot reliably distinguish between them based on their responses, the machine is deemed to have passed the Turing Test, showcasing a degree of artificial intelligence. The objective is to evaluate a machine’s capability for human-like conversation and behaviour.

This test serves as a potential origin for what we now recognize as machine learning. The prospect of encoding thoughts on computers, akin to those of living beings, marked a significant milestone for humanity. Presently, this concept finds application in diverse areas, with certain tasks exhibiting superior performance compared to those carried out by humans.

Decoding the Jargon

Here is my selection of terms that often cause confusion:

  1. Artificial Intelligence (AI): The expansive field aiming to develop intelligent machines capable of emulating human cognition.
  2. Machine Learning (ML): A branch of AI that concentrates on algorithms and statistical models, empowering systems to discern patterns and make decisions without explicit programming.
  3. Deep Learning: A specialized variant of machine learning that utilizes neural networks with multiple layers to extract high-level features from data.
  4. Statistical Learning: The broader concept encompassing machine learning, emphasizing the utilization of statistical methods to formulate predictions or decisions.

Machine learning

Tom Mitchell once stated, “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” This might sound complex, but let’s simplify it.

Imagine creating a program to predict the accumulated precipitation in the next hour based on past data. The task (T) here is to estimate the precipitation accumulation for the upcoming hour, with the performance measure (P) being some error metric, such as the difference between the predicted and observed values. The experience (E) involves the various attempts to make the forecast. The program learns as its predictions approach the observed values during these experiences. The process by which the program learns is shaped by a predefined set of configurations known as hyperparameters.
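Mitchell’s T/P/E framing can be sketched with a few invented numbers: the task T is predicting rainfall, the performance measure P is mean absolute error, and experience E is each round of forecasts. The forecasts and observations below are illustrative, not real data.

```python
def mean_absolute_error(predicted, observed):
    # Average absolute difference between predictions and reality (our P)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

observed = [5.0, 3.2, 0.0, 7.1]          # accumulated precipitation (mm)
early_forecast = [9.0, 1.0, 2.5, 3.0]    # predictions with little experience
later_forecast = [5.5, 3.0, 0.2, 6.8]    # predictions after more experience

p_early = mean_absolute_error(early_forecast, observed)
p_later = mean_absolute_error(later_forecast, observed)

# Performance P improved with experience E, so by Mitchell's definition the
# program "learned" the task T.
print(p_early, p_later)
```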

Types of Machine Learning

In general, there are three types of machine learning:

Supervised Learning

In this paradigm, the model is provided with a dataset and already knows what the correct output should resemble; in other words, each given example has an associated label or target. A model based on supervised learning endeavours to identify the mapping from input to output, allowing it to offer precise predictions when presented with new, unseen data. This is particularly applicable in image recognition, speech recognition, and spam filtering scenarios.

Supervised learning algorithms can be categorized into regression and classification problems. In regression tasks, the model aims to fit a function that best maps the input data to a continuous output. In classification tasks, it seeks a function that best separates the data into a set of categories.

Let’s consider a scenario where a botanist collects measurements associated with iris flowers, including the length and width of the petals and the length and width of the sepals, all measured in centimeters. These iris flowers have been previously identified by an expert botanist as belonging to the species setosa, versicolor, or virginica. If we want to build a machine learning model that can learn from the measurements of these irises, whose species is known, so that we can predict the species for a new iris, we are dealing with a classification problem. This is because we aim to categorize new irises based on a labeled dataset.
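A toy 1-nearest-neighbour classifier makes the iris scenario concrete. The measurements below are invented examples in the style of the iris dataset, not real data: (sepal length, sepal width, petal length, petal width) in centimetres.

```python
import math

# One invented labeled example per species
labeled = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
]

def classify(measurements):
    # Predict the label of the closest known flower (Euclidean distance)
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(labeled, key=lambda item: distance(item[0], measurements))
    return label

print(classify((5.0, 3.4, 1.5, 0.2)))
```

A new iris is simply assigned the species of the labeled flower it most resembles, which is the essence of classification from a labeled dataset.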

Now, imagine that we want to create an algorithm that predicts the price of a house based on its size and location in the real estate market. Price as a function of size and location is a continuous output, so this is a regression problem.
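The regression case can be sketched with ordinary least squares on invented data, fitting price as a linear function of size alone (location is omitted here to keep the example one-dimensional):

```python
# Invented training data
sizes = [50.0, 80.0, 100.0, 120.0]       # square metres
prices = [110.0, 170.0, 210.0, 250.0]    # thousands of dollars

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# slope = covariance(x, y) / variance(x); intercept follows from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
        sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

predicted = intercept + slope * 90.0     # price estimate for a 90 m² house
print(slope, intercept, predicted)
```

The fitted function produces a continuous price for any size, which is exactly what distinguishes regression from classification.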

Unsupervised Learning

In contrast, unsupervised learning is a technique that tackles problems with little or no prior knowledge of what our results should resemble, using unlabeled data.

So, imagine you have a basket of various fruits, but you don’t know which fruits belong to which category. Through unsupervised learning, the algorithm might group the fruits based on similarities in features like shape, color, and size. The algorithm, without any prior knowledge of specific fruit names, autonomously identifies clusters, revealing, for instance, that apples, oranges, and bananas share certain characteristics.
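The fruit-basket idea can be sketched with a minimal k-means clustering loop. Each fruit below is an invented (weight in grams, diameter in cm) pair; the algorithm groups them without ever seeing a fruit name.

```python
def assign(points, centers):
    # Assign each point to the index of its nearest center (squared distance)
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centers)), key=lambda i: d2(p, centers[i]))
            for p in points]

def update(points, labels, k):
    # Move each center to the mean of the points assigned to it
    centers = []
    for i in range(k):
        cluster = [p for p, l in zip(points, labels) if l == i]
        centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
    return centers

# Invented (weight g, diameter cm) measurements: berry-like vs. melon-like
fruits = [(5, 1.0), (7, 1.2), (6, 1.1), (1200, 20.0), (1100, 19.0), (1300, 21.0)]
centers = [fruits[0], fruits[3]]       # naive initialisation
for _ in range(5):                     # a few Lloyd iterations
    labels = assign(fruits, centers)
    centers = update(fruits, labels, 2)

print(labels)
```

The algorithm discovers two groups purely from similarity of features, which is what unsupervised learning means in practice.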

Reinforcement Learning

This subset of machine learning enables an AI to acquire knowledge through experimentation and feedback from its actions. This feedback can be either negative or positive to maximize cumulative reward. 

In a certain sense, we can say that RL shares similarities with supervised learning when it involves mapping between input and output. However, in RL, the agent autonomously decides what actions to take to accomplish a task correctly. 

This approach finds significant application in games like chess, where an agent refines its strategy based on accumulated experiences over time. Consider another example: suppose we want to develop an algorithm that guides a robot to explore and clean a room. It receives positive reinforcement when it successfully cleans a dirty area and experiences negative reinforcement when encountering obstacles or failing to clean certain areas.  Through this feedback loop, the robotic vacuum learns to navigate efficiently, avoiding obstacles and optimizing its cleaning strategy over time.
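The trial-and-error loop described above can be sketched with tabular Q-learning in a toy environment: an agent in a five-cell corridor is rewarded only for reaching the rightmost cell and gradually learns to always move right. The environment and all hyperparameters are illustrative choices, not part of the article.

```python
import random

random.seed(0)
n_states = 5
actions = [-1, +1]                     # move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(200):                   # training episodes
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:
            a = random.choice(actions)                 # explore
        else:
            a = max(actions, key=lambda x: Q[(s, x)])  # exploit
        s2 = min(max(s + a, 0), n_states - 1)          # deterministic move
        reward = 1.0 if s2 == n_states - 1 else 0.0    # positive feedback at goal
        # Q-learning update: nudge Q toward reward + discounted best future value
        Q[(s, a)] += alpha * (reward + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2

# Greedy action learned for each non-terminal state
policy = [max(actions, key=lambda x: Q[(s, x)]) for s in range(n_states - 1)]
print(policy)
```

After training, the greedy policy moves right in the states near the goal, mirroring how the robotic vacuum refines its behaviour from accumulated positive and negative feedback.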

Conclusion

In conclusion, delving into the realm of AI is akin to embarking on a journey of continual adjustments and twists. Changes don’t happen in the blink of an eye; they’re more like a slow burn. Yet, many individuals overlook these shifts. The trick? It’s all about hitting the books, maintaining a vigilant eye on the everyday grind, and giving things thoughtful consideration. These skills aren’t just useful; they’re the secret sauce for staying on the AI adaptation rollercoaster. No quick fixes here; it’s an ongoing commitment. So, let’s keep our learning hats on, stay curious, and ride the waves of AI’s ever-evolving journey!

Warning: This article was written with AI help 😉

Joyce Araujo

Sr. Software Engineer

References:

Mitchell, T. M. (1997). Machine Learning (1st ed.). McGraw-Hill.

Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460. https://academic.oup.com/mind/article/LIX/236/433/986238

BBC News. (2023). AI named word of the year by Collins Dictionary. https://www.bbc.com/news/entertainment-arts-67271252

Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media.

York University. (n.d.). What is reinforcement learning? https://online.york.ac.uk/what-is-reinforcement-learning/



Practical Introduction to Data Science in Python

Data science has emerged as one of the fastest-growing and most exciting fields in the world of technology today. With the increasing amount of information generated across every aspect of our lives (from our cell phones, social media, online banking, and more), data scientists have become critical to the success of businesses around the globe because they understand the underlying business problems and can translate them into actionable recommendations for decision makers. In the past, data analysis was a tedious and time-consuming process, but with the rise of advanced tools and techniques, data scientists can now quickly and accurately analyze and interpret data.

Python is one of the most widely used programming languages in data science, thanks to its user-friendly syntax and extensive libraries that make analysis and visualization easier and more efficient. Python offers a range of powerful tools and libraries that make dataset manipulation, analysis, and visualization straightforward and efficient.

In this article, we’ll briefly introduce you to some of the essential tools for data science in Python, including Jupyter Notebooks, Pandas, Matplotlib, and scikit-learn. We’ll provide examples of usage for each library.

Jupyter Notebooks

Jupyter Notebooks are an essential tool for data scientists and Python programmers alike. They provide an interactive environment for writing and executing code, as well as visualizing and sharing data. They also have many features that make them valuable tools for data scientists. For example, you can include markdown text in your notebook, which allows you to add notes, explanations, and visualizations alongside your code. You can also add visualizations and charts using Python’s Matplotlib or other libraries.

Pandas

Pandas is a popular Python library for data manipulation and analysis. It offers data structures and functions that facilitate both.

One of the most important data structures in Pandas is the DataFrame. A DataFrame is a 2-dimensional labeled data structure with columns (like a table) of potentially different types. 
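For example, a DataFrame can be built from a plain dictionary; the data below is invented for illustration:

```python
import pandas as pd

# A small DataFrame with columns of different types (strings and integers)
df = pd.DataFrame({
    "Name": ["Ana", "Luis", "Marta"],
    "Age": [34, 28, 45],
    "City": ["Quito", "Lima", "Bogotá"],
})

print(df.shape)   # rows and columns of the table
print(df.head())  # first rows, displayed like a table
```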

Now, let’s say we want to group the data by the Gender column and calculate the mean age for each group. We can use the groupby method to achieve this:
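A sketch of that grouping, using invented data with the Gender and Age columns the example assumes:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ana", "Luis", "Marta", "Juan"],
    "Age": [30, 40, 50, 20],
    "Gender": ["F", "M", "F", "M"],
})

# Group rows by Gender and compute the mean Age of each group
mean_age = df.groupby("Gender")["Age"].mean()
print(mean_age)
```

Here the result is a Series indexed by Gender, with one mean Age per group.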

In some cases, our data may contain missing values (NaN). We can drop these values using the dropna method:
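A minimal sketch of dropna on invented data (the `None` entry becomes NaN inside the DataFrame):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ana", "Luis", "Marta"],
    "Age": [34.0, None, 45.0],   # the missing Age becomes NaN
})

clean = df.dropna()   # drops any row containing a missing value
print(len(df), len(clean))
```

Note that `dropna` returns a new DataFrame by default; the original is left untouched.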

These are just a few examples of what you can do with Pandas. The library offers many more tools and methods for manipulating and analyzing data, including filtering, merging, and transforming data. 

Matplotlib

Matplotlib is a popular data visualization library for Python that provides a variety of tools for creating high-quality visualizations. With Matplotlib, you can create a wide range of charts, plots, and graphs, including scatter plots, line plots, bar charts, and more.

Some examples of different plots include:

  • Scatter Plot: A scatter plot is a great way to visualize the relationship between two variables.
  • Bar Chart: A bar chart is a great way to visualize categorical data.
  • Histogram: A histogram is a great way to visualize the distribution of a dataset.
  • Line Plot: A line plot is a great way to visualize the trend of a dataset.
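As a small sketch of the first of these, the following builds a scatter plot from invented data and saves it to a file (the non-interactive Agg backend keeps it runnable as a script):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt

# Invented data: two related variables
sizes = [50, 80, 100, 120]
prices = [110, 170, 210, 250]

fig, ax = plt.subplots()
ax.scatter(sizes, prices)                 # one point per (size, price) pair
ax.set_xlabel("House size (m²)")
ax.set_ylabel("Price (k$)")
ax.set_title("Scatter plot")
fig.savefig("scatter.png")                # writes the chart to an image file
```

In a Jupyter Notebook the figure would render inline instead of being saved to disk.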

Scikit-learn

Scikit-learn is a powerful machine-learning library for Python that provides a wide range of tools for data mining, analysis, and modeling. It is built on top of other popular scientific Python libraries, including NumPy, SciPy, and Matplotlib, and provides an easy-to-use interface for building machine learning models.

Scikit-learn includes a variety of machine learning algorithms, including regression, classification, clustering, and dimensionality reduction. It also provides tools for feature extraction and selection, data preprocessing, and model evaluation. With Scikit-learn, you can build and train machine learning models on your data, evaluate their performance, and use them to make predictions.
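A minimal end-to-end sketch of that workflow, using the iris dataset bundled with scikit-learn and a k-nearest-neighbours classifier (the model choice and parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the bundled iris dataset and hold out a test split for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a classifier and evaluate it on unseen data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}")
```

The same fit/predict/score pattern applies across scikit-learn’s estimators, which is what makes the library’s interface easy to learn.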

Conclusion

Python is a versatile programming language that offers a range of powerful tools for data science. We introduced you to some of the essential libraries and tools for data analysis, manipulation, and visualization in Python. By mastering these tools, you’ll be well on your way to becoming a proficient data scientist in Python. 

As technology evolves, we can expect to see more powerful and sophisticated algorithms that can analyze and interpret vast amounts of data. Additionally, we may see increased adoption of machine learning and AI technologies in various fields, such as healthcare, finance, and transportation, to name a few. With these advancements, we can expect data science to play an even more crucial role in decision-making processes, innovation, and problem-solving across industries.

Édgar Alexander Dávila

Software Engineer

References:

Altintas, I., & Porter, L. (2022). Python for Data Science (UCSanDiegoX DSE200x) [MOOC]. edX. https://learning.edx.org/course/course-v1:UCSanDiegoX+DSE200x+3T2022/home
