Data Lifecycle in Elasticsearch

Introduction: Understanding Data Management in Elasticsearch

When using Elasticsearch, we inevitably deal with large volumes of data that require an effective management strategy. Such a strategy is crucial for maintaining performance and scalability. To address this challenge, Elasticsearch implements data lifecycle management through data tiers. These tiers dictate where data is stored and which hardware resources are used, which in turn influences deployment costs.

Understanding the Data Lifecycle

The data lifecycle encompasses the stages through which data progresses based on its age or access requirements. Data is allocated to hardware with varying “temperature” levels: Hot, Warm, Cold, and Frozen. Each level has distinct characteristics and storage considerations.

Data Tiers and Their Significance

Hot Tier

  • Designed for data that is frequently queried and written, such as customer-facing search or real-time logging.
  • Uses high-performance storage, such as SSDs, to improve read/write speeds.
  • Recommended memory-to-storage ratio of 1:30 (1 GB of RAM per 30 GB of storage).

Warm Tier

  • Suitable for data that is accessed less frequently but still used regularly, such as for historical analysis or reporting.
  • The recommended memory-to-storage ratio is 1:160 (the ratio Elasticsearch Service uses on Elastic Cloud).
  • Examples include older logs and user activity data.
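
To make the hot and warm ratios concrete, the following minimal Python sketch computes how much disk a node pairs with in each tier. It reads the ratios as 1 GB of RAM per 30 GB (hot) or 160 GB (warm) of storage. The 64 GB node size and the function name are illustrative assumptions, not Elastic recommendations.

```python
# Memory-to-storage ratios quoted in the text (GB of disk per 1 GB of RAM).
STORAGE_PER_GB_RAM = {"hot": 30, "warm": 160}


def storage_for_node(tier: str, ram_gb: int) -> int:
    """Return the disk capacity (GB) that a node with `ram_gb` of RAM pairs with in `tier`."""
    return ram_gb * STORAGE_PER_GB_RAM[tier]


if __name__ == "__main__":
    ram_gb = 64  # hypothetical node size
    for tier in STORAGE_PER_GB_RAM:
        print(f"{tier}: {storage_for_node(tier, ram_gb)} GB of storage for {ram_gb} GB of RAM")
```

With the same amount of RAM, a warm node in this sketch holds more than five times the data of a hot node, which is where the cost difference between the tiers comes from.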

Cold Tier

  • Reserved for data that is rarely accessed or updated and is kept for long-term retention.
  • Focuses on optimizing storage and access costs for old data.
  • Examples include data retention for compliance purposes and archived logs. 

Frozen Tier

  • Dedicated to read-only data, which can still be searched but is no longer modified.
  • Optimized to minimize data storage costs.
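
In practice, an index is steered to a tier through the index.routing.allocation.include._tier_preference index setting, which lists tiers in order of preference. The sketch below builds that value for the hot, warm, and cold tiers, falling back to hotter tiers when the preferred one has no nodes. The helper names are hypothetical, and the frozen tier is left out because it is normally reached through searchable snapshots rather than by setting this value by hand.

```python
# Tiers ordered from hottest to coldest; frozen is deliberately not covered here.
TIERS = ["hot", "warm", "cold"]


def tier_preference(tier: str) -> str:
    """Build a fallback chain such as 'data_warm,data_hot' for the given tier."""
    if tier not in TIERS:
        raise ValueError(f"unsupported tier: {tier!r}")
    position = TIERS.index(tier)
    # Prefer the requested tier, then fall back to each hotter tier in turn.
    chain = reversed(TIERS[: position + 1])
    return ",".join(f"data_{name}" for name in chain)


def tier_settings_body(tier: str) -> dict:
    """Body for an update-index-settings request (PUT /<index>/_settings)."""
    return {"index.routing.allocation.include._tier_preference": tier_preference(tier)}
```

When ILM manages an index, it usually sets this value itself, so a manual update like this is mostly useful for indices that are not under a lifecycle policy.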

Index Lifecycle Management (ILM)

ILM is an Elasticsearch feature designed to manage the data lifecycle effectively. It enables users to create policies that automate the process of transitioning data from one tier to the next.

An ILM policy is defined as a series of steps triggered by criteria such as index age or size. Beyond allocating data to the nodes of the corresponding data tier, these steps can include actions such as rolling over, deleting, or freezing indices.
ILM ensures that data is managed consistently according to the predefined policies, which improves deployment stability, performance, and cost efficiency.
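
As an illustration, a policy of this kind can be expressed as a JSON body and sent with PUT _ilm/policy/<policy_name>. The Python sketch below only builds that body. The function name, the rollover thresholds (7 days or 50 GB per primary shard), and the priority values are assumptions chosen for the example and should be adapted to the actual workload.

```python
import json


def build_ilm_policy(warm_after_days: int, cold_after_days: int, delete_after_days: int) -> dict:
    """Return a request body for PUT _ilm/policy/<policy_name>."""
    if not 0 < warm_after_days < cold_after_days < delete_after_days:
        raise ValueError("phases must be ordered: warm < cold < delete")
    return {
        "policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    "actions": {
                        # Roll over to a new index based on age or primary shard size.
                        "rollover": {"max_age": "7d", "max_primary_shard_size": "50gb"}
                    },
                },
                "warm": {
                    "min_age": f"{warm_after_days}d",
                    # Lower recovery priority than indices still in the hot phase.
                    "actions": {"set_priority": {"priority": 50}},
                },
                "cold": {
                    "min_age": f"{cold_after_days}d",
                    "actions": {"set_priority": {"priority": 0}},
                },
                "delete": {
                    "min_age": f"{delete_after_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(7, 30, 90), indent=2))
```

The warm and cold phases here contain no explicit allocation step: in recent Elasticsearch versions, ILM is expected to move the index to the matching data tier on its own when it enters those phases, which is worth verifying against the version in use.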

Best Practices and Considerations

When designing an application, it’s essential to treat data lifecycle management as a fundamental part of deployment sizing. To start this process, the following questions are useful:

  • How much data are we indexing per day?
  • For how many days will this data be needed?
  • What is the rate of indexing (documents per second)?
  • What is the rate of querying (queries per second)?
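
The first two questions translate directly into a rough storage estimate. The sketch below is a back-of-the-envelope calculation, not an official sizing formula: the replica count, the 25% overhead margin, and the per-node capacity are all assumptions to be replaced with real figures. Indexing and query rates mainly affect CPU and memory sizing and are not modeled here.

```python
import math


def total_storage_gb(daily_gb: float, retention_days: int,
                     replicas: int = 1, overhead: float = 0.25) -> float:
    """Estimate the disk needed to keep `retention_days` of data, replicas included."""
    data_gb = daily_gb * retention_days * (1 + replicas)
    # Extra headroom for disk watermarks, merges, and growth (assumed margin).
    return data_gb * (1 + overhead)


def nodes_needed(storage_gb: float, storage_per_node_gb: float) -> int:
    """Number of data nodes required to hold `storage_gb`, rounded up."""
    return math.ceil(storage_gb / storage_per_node_gb)


if __name__ == "__main__":
    storage = total_storage_gb(daily_gb=100, retention_days=30)
    # 1920 GB per node is a hypothetical capacity, e.g. 64 GB of RAM at a 1:30 ratio.
    print(f"{storage} GB total, {nodes_needed(storage, 1920)} nodes")
```

Running the same calculation per tier, with each tier's own retention window and per-node capacity, gives a first approximation of how a deployment splits across hot, warm, and cold nodes.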

Another important concept related to data management is snapshots. Snapshots provide an option for long-term data storage outside of Elasticsearch. Elasticsearch supports snapshot repositories on cloud platforms such as AWS (S3), Azure, and GCP. Alternatively, an S3-compatible repository can be created on-premises with MinIO, for example.
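
A snapshot repository is registered with PUT _snapshot/<repository_name> and a small JSON body. The sketch below builds such a body for an S3 repository. The bucket name, base path, and helper name are placeholders. For an S3-compatible service such as MinIO, the custom endpoint is typically configured on the S3 client in the Elasticsearch node settings rather than in this body, so the example only references the client by name.

```python
import json


def s3_repository_body(bucket: str, base_path: str = "", client: str = "default") -> dict:
    """Return a request body for PUT _snapshot/<repository_name> using the s3 type."""
    settings = {"bucket": bucket, "client": client}
    if base_path:
        # Optional folder inside the bucket where snapshot data is stored.
        settings["base_path"] = base_path
    return {"type": "s3", "settings": settings}


if __name__ == "__main__":
    # "my-snapshots" and "elasticsearch/prod" are hypothetical values.
    print(json.dumps(s3_repository_body("my-snapshots", "elasticsearch/prod"), indent=2))
```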

Conclusion 

Effective data management is integral to Elasticsearch and should be carefully planned during deployment design. Data tiers play a crucial role in this process, determining the hardware used on the respective Elasticsearch nodes. Index Lifecycle Management (ILM) moves data between these tiers, using policies that dictate when an index transitions to another tier or is deleted.

Snapshots are another essential tool for Elasticsearch data management. They allow data to be stored on various cloud services, potentially reducing the need for additional Elasticsearch nodes. Taken together, these tools show why a comprehensive data management strategy matters when optimizing Elasticsearch deployments.

Bibliography
Register a snapshot repository | Elasticsearch Guide [8.13] | Elastic. (n.d.). Elastic. https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-register-repository.html 
Elastic. (2020, March 1). Clarification about recommended memory-disk ratio of 1:30. Discuss the Elastic Stack. https://discuss.elastic.co/t/clarification-about-recommended-memory-disk-ratio-of-1-30/221547
Humphrey, P., & Gupta, A. (2020, November 12). Free Elasticsearch Service upgrade: 60% more storage for the same price with our improved hot-warm template. Elastic Blog. https://www.elastic.co/blog/free-elasticsearch-service-hot-warm-upgrade

Written by:

Alexander Dávila
Software Engineer – Elastic Certified Engineer & Elastic Certified Analyst
Country: Ecuador