In this paper we compare two alternative machine-learning techniques from the Apache Mahout stable: the Apache Spark based spark-itemsimilarity job and its Apache Hadoop MapReduce counterpart. We compare them both qualitatively and quantitatively in the context of two e-commerce stores with different behavior, to determine which is more effective and efficient in a given context.
Data Gathering and Setup
Relevant clickstream data was collected for both subjects. It captures user behavior, namely view and buy events. Based on this data, predictive analytics for item similarity was run using Apache Spark and Apache Hadoop MapReduce, with log-likelihood as the similarity measure in both cases. The subjects were observed for one week to gather both quantitative and qualitative results.
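For reference, a log-likelihood ratio score between two items is typically computed from a 2x2 contingency table of co-occurrence counts. The sketch below shows the commonly used entropy-based form of that score. It is an illustration only and not necessarily the exact code path of either Mahout job; the function names and the count names k11, k12, k21 and k22 are assumptions introduced here.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of non-negative counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    Hypothetical naming for two items A and B:
      k11: users who interacted with both A and B
      k12: users who interacted with A but not B
      k21: users who interacted with B but not A
      k22: users who interacted with neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Rounding can push the value slightly below zero, so clamp it.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))
```

A score near zero means the two items co-occur about as often as chance would predict, while a large score indicates an association that is unlikely to be accidental.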
We gathered data for both stores and plotted the following data points hourly for a one-week period. The hourly granularity explains the peaks and troughs: activity drops at night and peaks during the day.
- Total products viewed (blue)
- Recommendations available from Apache Hadoop MapReduce log-likelihood (LLR) (red)
- Recommendations available from Apache Spark (Spark) (grey)
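The three hourly series listed above could be derived from the clickstream roughly as follows. This is a minimal sketch under stated assumptions: view events are taken to be (hour, item_id) pairs, each model's output is reduced to the set of item ids that have at least one recommendation, and products are counted once per hour. The function name `hourly_coverage` and these data shapes are hypothetical and do not come from the stores' actual pipeline.

```python
from collections import defaultdict


def hourly_coverage(view_events, llr_items, spark_items):
    """Count, per hour, the distinct products viewed and how many of them
    have recommendations available from each model.

    view_events: iterable of (hour, item_id) pairs (assumed format)
    llr_items:   set of item ids with at least one LLR recommendation
    spark_items: set of item ids with at least one Spark recommendation
    Returns a list of (hour, viewed, with_llr, with_spark) tuples.
    """
    viewed = defaultdict(set)
    for hour, item_id in view_events:
        viewed[hour].add(item_id)

    rows = []
    for hour in sorted(viewed):
        items = viewed[hour]
        rows.append(
            (hour, len(items), len(items & llr_items), len(items & spark_items))
        )
    return rows
```

When the second and third counts track the first closely, recommendations exist for almost every product that is viewed; a persistent gap means the model covers only part of the viewed catalog.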
Sample Store 1
In Sample store 2, which has fewer transactions and fewer visitors, Spark yields far fewer results (i.e. recommendations) than in Sample store 1, which has a large number of transactions and more traffic. In Sample store 1, the total product views, the products for which LLR has recommendations, and the products for which Spark has recommendations are almost identical, which shows that both Spark and LLR provide recommendations for almost all products that are viewed. In Sample store 2, the total product views and the products for which LLR has recommendations are almost identical, but the recommendations from Spark lag behind significantly.
Inference: We conclude that, quantitatively, when there is a large number of transactions, Spark and LLR are almost equivalent in the number of recommendations they yield.
Qualitative Analysis
We gathered data for both stores and plotted the following data points hourly for a one-week period:
- Total products bought (purple)
- Products recommended by Apache Hadoop MapReduce log-likelihood (LLR) that were bought (blue)
- Products recommended by Apache Spark (Spark) that were bought (grey)
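The qualitative series listed above depend on what counts as a "recommended" purchase, which the text does not define precisely. The sketch below adopts one possible definition as an explicit assumption: a purchase counts as recommended by a model if the bought item appears in that model's recommendation list for any product the same user viewed earlier. The event format (hour, user, action, item), the function name `hourly_recommended_buys`, and the shape of `recs_by_model` are all hypothetical.

```python
from collections import defaultdict


def hourly_recommended_buys(events, recs_by_model):
    """Count purchases per hour, and how many of them were recommended.

    events:        time-ordered iterable of (hour, user, action, item),
                   where action is "view" or "buy" (assumed format)
    recs_by_model: {model_name: {item_id: [recommended item ids]}}
    Returns {hour: {"bought": n, model_name: n_recommended, ...}}.
    """
    seen = defaultdict(set)  # user -> items viewed so far
    counts = defaultdict(lambda: defaultdict(int))

    for hour, user, action, item in events:
        if action == "view":
            seen[user].add(item)
        elif action == "buy":
            counts[hour]["bought"] += 1
            for model, recs in recs_by_model.items():
                # Assumption: the purchase is "recommended" if any product
                # this user viewed earlier lists the bought item.
                if any(item in recs.get(v, ()) for v in seen[user]):
                    counts[hour][model] += 1

    return {hour: dict(series) for hour, series in sorted(counts.items())}
```

Under this definition, a model whose count stays close to the "bought" count in every hour is recommending most of what customers go on to purchase.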
Sample Store 1
In Sample store 1, the total products bought and the products that were recommended by Spark and then bought are almost equal, which suggests that most purchases were of products recommended by Spark. However, the products recommended by LLR that were bought lag behind significantly.
Sample Store 2
In Sample store 2, the total products bought and the products that were recommended by Spark or LLR and then bought are further apart than in Sample store 1, which suggests that most purchases were of products not recommended by either Spark or LLR. Spark still does marginally better than LLR, but the two are comparable, and both deviate from the products that were actually bought.
We conclude that, qualitatively, when there is a large number of transactions, Spark is significantly better than LLR: almost all products that were bought had been recommended by Spark, while LLR lags behind significantly. When there are fewer transactions, Spark is still marginally better than LLR, but the products that are actually bought differ from the ones recommended by both Spark and LLR.