
Validating and Making sense of the Retail Data through ANAI

Managing and utilizing data collected from various sources has always been a complicated endeavor. An abundance of data is generated every second, with every transaction made, order placed, ticket booked, and service renewed. Occasionally this data is structured and clean (which rarely happens), but most often it is highly unstructured, waiting to be decoded and used for key decision-making.

The retail sector is one such field with a humongous availability of data: numerous units across the globe generate records for every item sold, payment made, location reached, delivery time, and number of units shipped, all of which can serve as fuel for analytical models. Once the data is documented and cleaned, it can be used to forecast future sales and demand, replenish inventories, analyze frequently sold items, and solve many other use cases that ultimately support optimal product decisions and help generate profits.

But the most challenging part is data cleansing during Data Engineering, and it largely determines the success of your services and products. The quality of the data fed to analytical and ML tools matters, and smaller aspects of the data, such as anomalies and outliers, missing and duplicate values, and biases, need to be addressed beforehand. Retail data contains valuable information and patterns hidden within its features that can define customer behavior, inform a new pricing strategy, match revenue with expenses, optimize trading, and help businesses achieve a competitive edge within their domain. But if the data is not usable and is not validated properly, it can cause errors in predictions and analysis, leading to wrong business decisions and a loss of potential revenue.

How does ANAI help with this?

ANAI automates the usually tedious and cumbersome parts of data engineering, such as data cleansing and error removal, through inbuilt features like ML-based Data Wrangling, and extracts the right features from the data using our proprietary Automated Feature Engineering.

Advantages of using ANAI:

Advanced Data Engineering

This step aims to transform data into an easily interpretable form for the ML model, thus making it easy for the model to make further predictions. But for the data to be ML-ready, it needs to satisfy certain conditions and meet requirements. Hence, the data undergoes a few pre-processing steps like data analysis, wrangling, transformation, encoding, etc.

ANAI’s Data Engineering pipeline automates data pre-processing. It provides 100-plus data ingestion methods for flexibility while importing data, conducts a thorough pre-analysis to understand the data’s characteristics and distribution, and then proceeds with ML-based data wrangling to handle missing values through imputation and to remove duplicates. Finally, the data is summarized and given a health score that indicates its quality. For feature transformation, the platform applies automated feature engineering to detect the features that affect predictions the most and transforms features that are not suitable for analysis and ML training.
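ANAI’s scoring logic is proprietary, but the kind of pre-analysis summary described above (missing-cell percentage, duplicate count, and a quality score) can be sketched with pandas. The column names and the naive scoring formula below are illustrative assumptions, not ANAI’s actual schema or algorithm:

```python
import numpy as np
import pandas as pd

# Toy records standing in for the joined Olist order/payment tables
# (column names are illustrative, not ANAI's actual schema).
df = pd.DataFrame({
    "order_id": ["a1", "a2", "a2", "a3", "a4"],
    "payment_value": [120.5, 89.9, 89.9, np.nan, 45.0],
    "payment_type": ["credit_card", "boleto", "boleto", "voucher", None],
})

missing_pct = df.isna().sum().sum() / df.size * 100   # share of missing cells
duplicate_rows = int(df.duplicated().sum())           # exact duplicate records

# A naive stand-in for a health score: start from 100 and penalize
# missingness and duplication.
health_score = 100 - missing_pct - (duplicate_rows / len(df) * 100)

print(f"missing: {missing_pct:.1f}%  duplicates: {duplicate_rows}  "
      f"health: {health_score:.1f}")
```

The point is not the exact formula but that such a summary gives a quick first read on whether the data is fit for modeling.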

To demonstrate running retail data through ANAI, we used the Brazilian E-Commerce Public Dataset by Olist, selected the customer, order, and payment tables, and joined them to create a data set for the Retail Data Validation use case.

ANAI performs automated data wrangling on the data to make it easier to process and interpret for further analysis and model building procedures. More specifically, ANAI cleans and transforms the data from one form to another to make it favorable for drawing valuable insights.

Methods deployed by ANAI for handling the raw data:

ANAI deploys certain data wrangling techniques onto the data, which include:

Handling missing data: If certain segments of data are missing, then the model built later will generate predictions that are biased or skewed for a particular class, causing misleading decisions that can lead to inaccurate analysis. Hence, ANAI uses a method to impute the missing values from the existing feature values so that the data becomes appropriate for training purposes.
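ANAI’s ML-based imputer is proprietary; as a simple stand-in, scikit-learn’s SimpleImputer illustrates the idea of filling missing entries from the existing values of a feature (the payment values below are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Payment values with a deliberately missing entry, mirroring the NaNs
# introduced into the Olist data (values are illustrative).
payments = np.array([[120.0], [80.0], [np.nan], [100.0]])

# Mean imputation: ANAI's ML-based imputer is more elaborate, but the
# principle of deriving the fill value from the existing data is the same.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(payments)

print(filled.ravel())  # the NaN is replaced by the column mean, 100.0
```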

Categorical encoding: ML algorithms work and understand better when the data is in a numerical format rather than in a categorical or non-numeric format, hence ANAI automatically converts all the categorical data into integer type, to make it readable to the model.
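A minimal illustration of this conversion, using pandas’ factorize as a stand-in for ANAI’s automatic encoder:

```python
import pandas as pd

# Non-numeric payment types, as found in the Olist payments table.
payment_type = pd.Series(["credit_card", "boleto", "credit_card", "voucher"])

# factorize maps each distinct category to an integer label,
# in order of first appearance.
codes, categories = pd.factorize(payment_type)

print(codes.tolist())    # [0, 1, 0, 2]
print(list(categories))  # ['credit_card', 'boleto', 'voucher']
```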

Handling imbalanced data: Imbalanced data refers to the data sets that have an uneven distribution of the classes. For a model to run optimally and predict accurately, we need the data to be balanced and not skewed. Skewed data hinders the ability of predictions by introducing biases. To counter this skewness, ANAI brings in normalization and scaling, which helps reduce the skewness of the data.
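The scaling step can be sketched with scikit-learn’s StandardScaler as a stand-in for ANAI’s normalization (the payment values are synthetic, with one large outlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A payment column dominated by one large order.
x = np.array([[10.0], [12.0], [11.0], [500.0]])

# Standardization rescales the feature to zero mean and unit variance,
# so no single feature dominates training by sheer magnitude.
scaled = StandardScaler().fit_transform(x)

print(round(scaled.mean(), 6), round(scaled.std(), 6))  # ~0.0 and ~1.0
```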

As mentioned before, getting the data right is the most important part for retail. The data needs to be perfectly normalized and free of anomalies. Skewness and biases in the data can affect profit predictions and customer operations, leading to unfitting business decisions. When deployed, ANAI helps reduce imbalances within the data and regulate the skewness of certain features.

Let’s see how ANAI handles the data:

Users can start by importing the data into the ANAI platform within a few clicks. Preliminary results generated on the data include statistics describing its essential characteristics, such as the missing-cell percentage, duplicate count, anomalies, and a health score indicating data quality.

The stats table shows the missing cells to be 0.3%, meaning that 0.3% of all cells in the data are missing or not available. We purposely introduced these missing cells into the data to show how ANAI deals with such values. They are handled by ANAI’s data imputing mechanism, built into the Data Engineering pipeline, which removes the ‘NaN’ values from the data set without the user having to worry about it.

The skewness of each feature in the data set can be inspected, confirming the presence of biases and outliers.

Each skewed feature undergoes normalization, and ANAI displays the standard deviation before and after the transformation so the improvement can be compared. The image below shows how ANAI lowered the skewness of the feature ‘payment_value’ so that it can be used for analysis and predictions without producing abnormal or inaccurate results.

The charts below show how ANAI normalized the same feature ‘payment_value’, with the first two charts showing the skewness present before, and the other charts showing the feature after normalization.

Skewness present within the feature ‘payment_value’

The feature distribution is normalized and skewness is removed

After normalizing the data, ANAI provides a detailed before-and-after list of results from the transformation of each feature, along with its mean and standard deviation.
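This kind of before-and-after comparison can be sketched with plain NumPy, using a log transform as a stand-in for ANAI’s skew-reducing normalization (the payment_value sample is synthetic; the real Olist column has a similar long right tail):

```python
import numpy as np

def sample_skew(a):
    # Fisher-Pearson coefficient of skewness.
    d = a - a.mean()
    return (d ** 3).mean() / a.std() ** 3

# Synthetic right-skewed payment values: many small orders, a few large ones.
rng = np.random.default_rng(0)
payment_value = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

before = sample_skew(payment_value)
transformed = np.log1p(payment_value)  # log transform compresses the long tail
after = sample_skew(transformed)

print(f"skew before: {before:.2f}  after: {after:.2f}")
print(f"mean/std before: {payment_value.mean():.1f} / {payment_value.std():.1f}")
print(f"mean/std after:  {transformed.mean():.2f} / {transformed.std():.2f}")
```

The skewness drops close to zero after the transform, which is the effect shown in ANAI’s before/after charts.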

Automated Feature engineering

Feature engineering is one of the most important aspects of any ML project. ANAI’s feature engineering pipeline selects the best possible features and transforms the others into more optimal ones, automating the process through methods deployed within the platform itself. Some of these methods are:

  1. Recursive feature selection –

    In this method, the features that most affect the target variable are selected. Starting with all the features, the method eliminates them one by one and retains those that give the best performance. RFE prioritizes features using the model’s “coef_” or “feature_importances_” attributes, then recursively strips a small number of features per loop, removing existing dependencies and collinearities.

    Recursive Feature Elimination reduces the number of features, leading to an increase in model efficiency. The process repeats until no more features are left to delete or the specified number of features is reached.

  2. Expectation maximization –

    This method estimates the missing data of latent variables in the expectation step, using the data set’s available observable data, and then uses those estimates to update the parameter values in the maximization step. In a statistical or mathematical model, it can be used to find the local maximum likelihood estimate (MLE) or maximum a posteriori (MAP) parameters for latent variables. Expectation maximization probabilistically assigns each data point to a cluster: for each point, we calculate the likelihood that it came from each cluster, then update each cluster’s parameters based on the points assigned to it.

    The major impact on the model’s performance comes from the features used and their relationship with the target variable. ANAI performs feature selection to pick optimal features, so that predictions are accurate for whichever target feature is chosen.

    For this, we only need the features that provide the most information for predictions. As noted before, slight errors in predictions can lead to major drawbacks, which is where feature selection comes into the picture.
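The expectation-maximization clustering described in point 2 can be sketched with scikit-learn’s GaussianMixture, which fits its cluster parameters by EM. The two payment clusters below are synthetic and only stand in for, say, low- versus high-value orders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic clusters of payment behavior.
rng = np.random.default_rng(42)
low = rng.normal(loc=20.0, scale=5.0, size=(200, 1))
high = rng.normal(loc=200.0, scale=20.0, size=(200, 1))
X = np.vstack([low, high])

# GaussianMixture runs EM internally: the E-step computes each point's
# membership probabilities, the M-step re-estimates means and covariances.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

probs = gm.predict_proba(X)          # soft (probabilistic) assignments
means = sorted(gm.means_.ravel())
print([round(m, 1) for m in means])  # recovers roughly 20 and 200
```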

The above table shows the feature summary of the retail data set and illustrates the power ANAI brings. It lists the different numbers of features used to train a model, the R2 score each such model achieved, and the features selected in each case.

We can clearly see that the highest R2 score, 3.2938, was obtained when all of the features were selected for training. Often the optimal number of features for the best predictions is smaller than the total number of features, but here training on all the features led to the best score, so all of them can be passed on to the model.

Another benefit that feature selection brings is reduced computational time during training, because the model now has fewer features to train on. As the number of features increases, the dimensional space the model has to work through grows exponentially, a problem known as the curse of dimensionality. Reducing the feature space therefore saves a lot of time and energy during training.
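The recursive feature elimination described earlier maps directly onto scikit-learn’s RFE. Here is a small sketch on synthetic regression data; ANAI’s internal implementation may differ:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the retail features: 8 columns, 3 informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

# Recursively drop one feature per loop, ranked by the model's coef_,
# until the requested number of features remains.
selector = RFE(LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the retained features
print(selector.ranking_)   # 1 = selected; higher = eliminated earlier
```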

Root Cause Analysis

With ANAI, businesses can also perform Root Cause Analysis on the data set to find anomalies and rectify such issues before they lead to further problems. The RCA tool comes built into ANAI and can be used to decode problems that occur while dealing with the data.

Detecting Data Drift

ANAI can also help detect data drift within retail data. Retail is a volatile, constantly changing industry that requires regular retraining of previously built predictive and forecasting models. The data varies over a single year and fluctuates with recent trends, the availability of new products, fashion fads, global pandemics, and geopolitical issues. To stay current with market trends and keep models performing at their best, it is necessary to detect data drift before competitors do, which helps generate accurate analytical results and well-trained ML models.
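ANAI’s drift detector is not documented here; as a generic sketch, a two-sample Kolmogorov-Smirnov test comparing a training-time reference window against recent data is a common way to flag drift. All the values below are simulated:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Reference window: payment values the model was trained on.
reference = rng.normal(loc=100.0, scale=15.0, size=500)
# Current window: prices have shifted upward (simulated drift).
current = rng.normal(loc=130.0, scale=15.0, size=500)

# Two-sample KS test: a small p-value means the two windows are unlikely
# to come from the same distribution, so the model may need retraining.
stat, p_value = ks_2samp(reference, current)
drift_detected = p_value < 0.01

print(f"KS statistic: {stat:.3f}  p-value: {p_value:.2e}  "
      f"drift: {drift_detected}")
```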


This case study focused on data validation and its importance in the retail sector, and on how data quality can be improved using ANAI. Implementing data engineering and feature selection on the retail data set via ANAI shows that ANAI is self-sufficient in dealing with all of these data-related problems, without the user writing a single line of code.

The retail sector, a domain where efficient, high-quality data holds important business value, requires processing and modifying a lot of data, which easily increases total expenditure. Using ANAI, entire data management can be done at a fraction of the cost, without a dedicated team of data engineers and data scientists, and in a fraction of the time required by conventional data engineering and management.

To implement such solutions or to get a personalized solution for your niche use case, contact us at or visit



Want to get started?

Connect with us to get a free demo