An Approach for Recommending Similar Fashion Products

This project is about a fashion recommendation system. It differs from a typical recommender, which usually suggests products related to a single product or a single query. Here, a user who is searching for one product may also be interested in the secondary products worn by the model in the product image, so the system recommends those secondary products. The architecture and design components are inspired by a paper:

Task: Build an end-to-end fashion recommendation system that recommends the secondary products worn by the model.


  1. Business problem
  2. Mapping to Deep Learning Problems and Breaking the Problem into Parts or Modules
  3. Data Acquisition and Analysis
  4. Explaining my Approach to this problem
  5. Experimentation Results
  6. Deployment using Streamlit
  7. Conclusion
  8. Future works
  9. Links to GitHub and profile
  10. References

1. Business Problem:

In short, the business problem is to recommend all the secondary clothes worn by the model, using deep learning models. In this project, we first detect whether the image is a full-front-pose image, then detect each clothing article and where it is located, and finally retrieve images similar to the secondary articles.

2. Mapping to Deep Learning Problems and Breaking the Problem into Parts or Modules

Before mapping the task to deep learning models, let us discuss the architecture of the solution. My architecture is inspired by the figure below.

credits: From this paper

As the figure above shows, the problem is divided into four modules.

Module 1 : (Pose Estimation)

In this module, we detect whether the image is a full-front-pose image or not, so this is a binary classifier (Yes/No).

Module 2 : (Localization and Article Detection)

In this module, we detect all the articles (clothes) and where each article is located. This is both a classification and a regression problem: classification for article detection, and regression for localization (bounding box coordinates).

Module 3 : (Image_embeddings)

In the research paper I referred to, the authors trained a Triplet-Net-based network, with a CNN and ResNet as the backbone, to produce the image embeddings used for finding similar images. In our case, we do this differently, as discussed below.

Module 4: (Getting similar Images)

In the research paper, a Triplet-Net-based loss is used to get similar images. In our case, we retrieve images by the Euclidean distance between their embeddings.
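
To make the data flow between the four modules concrete, here is a minimal sketch of how they could be chained. Every name in it (`recommend` and the four callables passed to it) is hypothetical and only stands in for the corresponding module; it is not the code used in this project.

```python
from typing import Callable, Dict, List, Optional

import numpy as np


def recommend(
    image: np.ndarray,
    is_full_front: Callable[[np.ndarray], bool],                   # Module 1 (hypothetical)
    detect_and_crop: Callable[[np.ndarray], Dict[str, np.ndarray]],  # Module 2 (hypothetical)
    embed: Callable[[np.ndarray], np.ndarray],                     # Module 3 (hypothetical)
    search: Callable[[str, np.ndarray], List[int]],                # Module 4 (hypothetical)
) -> Optional[Dict[str, List[int]]]:
    """Run the four modules in order and return catalog ids per body part."""
    # Module 1: stop early if the image is not a full-front pose.
    if not is_full_front(image):
        return None
    # Module 2: one cropped image per body part, e.g. {"upper": ..., "lower": ...}.
    crops = detect_and_crop(image)
    # Modules 3 and 4: embed each crop and look up its nearest catalog items.
    return {part: search(part, embed(crop)) for part, crop in crops.items()}
```

Passing the modules in as callables keeps each one replaceable, for example swapping the Euclidean search for a different similarity measure without touching the rest.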

3. Data Acquisition and Analysis

I used data scraped from Myntra.


This data consists of 8 types of products.

From the figure above, we can see the products and the number of images we have for each. However, this data has no annotations (masks, bounding boxes) to train on, so for the article detection and localization part I took the data from the Kaggle competition iMaterialist (Fashion) 2019 at FGVC6.


This dataset has approximately 45.2k files, with annotations in Encoded_pixel format along with class labels.
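
The Encoded_pixel annotations are run-length encoded masks. The sketch below decodes one such string into a binary mask, assuming the usual Kaggle convention: space-separated `start length` pairs, 1-indexed starts, and pixels numbered top to bottom and then left to right. The function name `rle_decode` and the toy input are illustrative only.

```python
import numpy as np


def rle_decode(encoded_pixels: str, height: int, width: int) -> np.ndarray:
    """Decode a 'start length start length ...' string into a binary mask."""
    values = np.array(encoded_pixels.split(), dtype=np.int64)
    starts = values[0::2] - 1   # assumed 1-indexed in the annotation file
    lengths = values[1::2]
    flat = np.zeros(height * width, dtype=np.uint8)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = 1
    # Assumed column-major numbering: down each column first, then across.
    return flat.reshape((height, width), order="F")


# Toy example: a 4x3 image with two runs of foreground pixels.
mask = rle_decode("1 3 10 2", height=4, width=3)
print(mask)
```

The bounding box for an article can then be taken from the rows and columns where the decoded mask is non-zero.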

4. Explaining my Approach to this problem

I will explain my work in the order of the modules.

Module 1: In this module, we have to detect whether the image is a full-front-pose image or not. For this, I used PoseNet, a pre-trained model from TensorFlow, which gave very good results on my scraped Myntra data. It is important to be aware that pose estimation merely estimates where key body joints are and does not recognize who is in an image or video. The model takes an image as input and outputs information about the key points of the body in the image. The detected keypoints are indexed by a part ID, each with a confidence score between 0.0 and 1.0. The confidence score indicates the probability that a keypoint exists in that position.

To classify an image as full-front-pose or not, we require the following keypoints to be present with a confidence above a threshold of 0.5: the nose, at least one eye, at least one hip, and at least one ankle. If all of these keypoints are present, we classify the image as a full-front-pose image and send it to Module 2.

import numpy as np

def parse_output(heatmap_data, offset_data, threshold):
    '''
    Input:
        heatmap_data - heatmaps for an image. Three-dimensional array
        offset_data - offset vectors for an image. Three-dimensional array
        threshold - probability threshold for the keypoints. Scalar value
    Output:
        True only if the nose and (any one eye) and (any one hip) and
        (any one ankle) are present above the threshold, otherwise False
    '''
    joint_num = heatmap_data.shape[-1]
    # one row per keypoint: (row, col, present flag); signed so negative offsets do not overflow
    pose_kps = np.zeros((joint_num, 3), np.int32)
    for i in range(joint_num):
        joint_heatmap = heatmap_data[..., i]
        # cell with the strongest response (first one if there are ties)
        max_val_pos = np.array(np.unravel_index(np.argmax(joint_heatmap), joint_heatmap.shape))
        # map the heatmap cell back to the 257x257 input image
        remap_pos = np.array(max_val_pos / 8 * 257, dtype=np.int32)
        pose_kps[i, 0] = int(remap_pos[0] + offset_data[max_val_pos[0], max_val_pos[1], i])
        pose_kps[i, 1] = int(remap_pos[1] + offset_data[max_val_pos[0], max_val_pos[1], i + joint_num])
        max_prob = np.max(joint_heatmap)
        if max_prob > threshold:
            if 0 <= pose_kps[i, 0] < 257 and 0 <= pose_kps[i, 1] < 257:
                pose_kps[i, 2] = 1
    present = pose_kps[:, 2] == 1
    # part IDs: 0 nose, 1-2 eyes, 11-12 hips, 15-16 ankles
    is_full_front = bool(present[0] and (present[1] or present[2])
                         and (present[11] or present[12]) and (present[15] or present[16]))
    return is_full_front
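
The decision rule itself can be separated from the heatmap decoding. The sketch below applies the same condition to a plain list of 17 confidence scores, using the PoseNet part order (0 = nose, 1-2 = eyes, 11-12 = hips, 15-16 = ankles). The names `KEYPOINTS` and `is_full_front_pose` are illustrative and are not part of the project code.

```python
from typing import Sequence

# PoseNet part order; the list index is the part ID.
KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def is_full_front_pose(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """True if nose, one eye, one hip and one ankle all exceed the threshold."""
    seen = {name for name, score in zip(KEYPOINTS, scores) if score > threshold}
    return (
        "nose" in seen
        and bool(seen & {"left_eye", "right_eye"})
        and bool(seen & {"left_hip", "right_hip"})
        and bool(seen & {"left_ankle", "right_ankle"})
    )


full_body = [0.9] * 17
cropped_at_knees = [0.9] * 15 + [0.1, 0.1]   # both ankles below the threshold
print(is_full_front_pose(full_body))          # True
print(is_full_front_pose(cropped_at_knees))   # False
```

Requiring only one of each left/right pair makes the rule tolerant of a limb that is partly hidden, while the ankle requirement rejects images cropped above the feet.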

The results of this module are shown with images in the next section.

Module 2: In this module, we have to detect all the articles and localize them. For this task, I tried both my custom U-Net architecture and Mask R-CNN. Mask R-CNN gave much better results than U-Net, so I trained a Mask R-CNN model using the Matterport Mask R-CNN code. For training, I split the data into train and validation sets with KFold from the sklearn library. After splitting, the distribution of the train and validation data was as follows:

The top image is the distribution of the train data and the bottom image is the distribution of the validation data.

The horizontal axis shows the products and the vertical axis shows the count.

As these plots show, the distribution of products is the same for the train and validation data.
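
As an illustration of this kind of split and check, here is a small NumPy stand-in for what a shuffled K-fold split does, followed by a comparison of class shares in the two parts. It is not the sklearn call used in the project; the number of folds, the seed, and the toy labels are all assumptions.

```python
from collections import Counter

import numpy as np


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold, after shuffling."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


def class_shares(labels):
    """Fraction of samples per class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Toy labels standing in for the per-image article classes.
labels = np.array(["shoe"] * 500 + ["pants"] * 300 + ["dress"] * 200)
train_idx, val_idx = next(kfold_indices(len(labels)))
print(class_shares(labels[train_idx].tolist()))
print(class_shares(labels[val_idx].tolist()))
```

With a random shuffle the shares in the two parts come out close but not identical; a stratified split would be needed to match them exactly.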

I trained for 15 epochs, which took around 6 hours of wall time. The losses from the last epoch are:

Epoch 15/15
1000/1000 [==============================] - 1561s 2s/step - loss: 1.9558 - rpn_class_loss: 0.0605 - rpn_bbox_loss: 0.8799 - mrcnn_class_loss: 0.3666 - mrcnn_bbox_loss: 0.2983 - mrcnn_mask_loss: 0.3505 - val_loss: 1.9440 - val_rpn_class_loss: 0.0603 - val_rpn_bbox_loss: 0.8583 - val_mrcnn_class_loss: 0.3791 - val_mrcnn_bbox_loss: 0.2974 - val_mrcnn_mask_loss: 0.3488

Epoch 00015: saving model to /content/drive/MyDrive/mrcnn2/mask_rcnn_fashion_0005-0.34882.h5

According to the Matterport Git code, the losses are defined as follows:

loss is the sum of the five losses below; rpn_class_loss is the binary classification loss of the Region Proposal Network (RPN); rpn_bbox_loss is the Smooth L1 loss on the bounding box coordinates of the RPN; mrcnn_class_loss is the cross-entropy loss over the article classes, averaged; mrcnn_bbox_loss is the mean Smooth L1 loss on the bounding box coordinates of each object; mrcnn_mask_loss is the mean binary cross-entropy loss on the predicted masks.
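
As a quick sanity check that the total is the sum of its components, the five values from the Epoch 15/15 log line can be added up and compared with the logged totals. The dictionary names are just for illustration.

```python
# Values copied from the "Epoch 15/15" log line.
train_parts = {
    "rpn_class_loss": 0.0605,
    "rpn_bbox_loss": 0.8799,
    "mrcnn_class_loss": 0.3666,
    "mrcnn_bbox_loss": 0.2983,
    "mrcnn_mask_loss": 0.3505,
}
val_parts = {
    "val_rpn_class_loss": 0.0603,
    "val_rpn_bbox_loss": 0.8583,
    "val_mrcnn_class_loss": 0.3791,
    "val_mrcnn_bbox_loss": 0.2974,
    "val_mrcnn_mask_loss": 0.3488,
}

train_total = sum(train_parts.values())
val_total = sum(val_parts.values())

print(f"train: {train_total:.4f} (logged loss: 1.9558)")
print(f"val:   {val_total:.4f} (logged val_loss: 1.9440)")
```

Both sums agree with the logged totals to within rounding of the printed values. The breakdown also shows that rpn_bbox_loss is the largest single component, at roughly 45% of the total.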

I have written up my understanding of Mask R-CNN in a document; you can check it on my GitHub.

After localization and article detection, the detected articles are cropped and sent to Module 3.

Here I divided the input image into three parts as shown below

#combine categories for simplification
foot_wear = ['shoe']
upper_body_wear = ['vest','top, t-shirt, sweatshirt','sweater','sleeve','shirt, blouse','neckline','lapel','jacket','hood',
lower_body_wear =['pocket', 'pants', 'shorts', 'skirt']
wholebody = ['cape', 'coat', 'dress', 'jumpsuit']

We search for items based on upper wear, lower wear, and footwear, since these are the item types we scraped from Myntra. The upper, lower, and foot parts are cropped and sent to Module 3.

The results of this module are shown in the next section.
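
One possible way to turn per-article detections into one crop per body part is sketched below. It assumes each detection is a class name plus a box in (y1, x1, y2, x2) order, and that the boxes belonging to one group are merged by taking their union; both are assumptions, as is the name `crop_by_group`. The group sets contain only garment names visible in the lists above.

```python
import numpy as np

# Subset of the category lists above (garment classes only).
GROUPS = {
    "foot_wear": {"shoe"},
    "upper_body_wear": {"vest", "top, t-shirt, sweatshirt", "sweater",
                        "shirt, blouse", "jacket"},
    "lower_body_wear": {"pants", "shorts", "skirt"},
}


def crop_by_group(image, detections):
    """detections: list of (class_name, (y1, x1, y2, x2)); returns group -> crop."""
    boxes = {}
    for name, (y1, x1, y2, x2) in detections:
        for group, members in GROUPS.items():
            if name not in members:
                continue
            if group in boxes:
                # Grow the group's box so it covers this detection as well.
                py1, px1, py2, px2 = boxes[group]
                boxes[group] = (min(py1, y1), min(px1, x1), max(py2, y2), max(px2, x2))
            else:
                boxes[group] = (y1, x1, y2, x2)
    return {g: image[y1:y2, x1:x2] for g, (y1, x1, y2, x2) in boxes.items()}


image = np.zeros((100, 60, 3), dtype=np.uint8)
detections = [
    ("shirt, blouse", (10, 5, 45, 55)),
    ("jacket", (8, 0, 50, 60)),
    ("pants", (45, 10, 90, 50)),
    ("shoe", (90, 12, 100, 30)),
    ("shoe", (91, 32, 100, 50)),
]
crops = crop_by_group(image, detections)
for group, crop in crops.items():
    print(group, crop.shape)
```

Taking the union means both shoes end up in a single footwear crop, which matches the idea of one query image per body part.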

Module 3: (Image Embeddings)

To get the image embeddings, I tried DenseNet121 and ResNet50 from TensorFlow:

import numpy as np
import tensorflow as tf
from numpy.linalg import norm
from tensorflow.keras.preprocessing import image
# assumption: preprocess_input is the ResNet50 one; its import is not shown in the original
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

resnet = ResNet50(weights='imagenet', include_top=False,
                  input_shape=(512, 512, 3), pooling='max')
densenet = tf.keras.applications.DenseNet121(
    include_top=False, weights='imagenet', input_tensor=None,
    input_shape=(512, 512, 3), pooling='max')

def extract_features(img_path, model):
    input_shape = (512, 512, 3)
    # load the image and resize it to the model's input size
    img = image.load_img(img_path, target_size=(
        input_shape[0], input_shape[1]))
    img_array = image.img_to_array(img)
    expanded_img_array = np.expand_dims(img_array, axis=0)
    preprocessed_img = preprocess_input(expanded_img_array)
    features = model.predict(preprocessed_img)
    flattened_features = features.flatten()
    # L2-normalize so that every embedding has unit length
    normalized_features = flattened_features / norm(flattened_features)
    return normalized_features

a = extract_features('women_trousers/image59_4.jpg', resnet)
print('The Number of features with resnet {}'.format(len(a)))
a = extract_features('women_trousers/image59_4.jpg', densenet)
print('The Number of features with Densenet {}'.format(len(a)))

Output:
The Number of features with resnet 2048
The Number of features with Densenet 1024

As shown above, DenseNet gives a 1024-dimensional embedding, while ResNet gives 2048. I chose DenseNet because the smaller embedding lets me search for similar images faster. All these embeddings are stored in pickle files.

Based on my own judgment, I divided all the images in my scraped Myntra data into three categories. Upper_wear: women_shirts_tops_tess. lower_wear: women jeans juggings, women skirts, women trousers. Foot_wear: women casual shoes, flats, heels.

Module 4:

I fitted a nearest-neighbor model with the brute-force algorithm to find the 50 nearest neighbors by Euclidean distance, one model for each of the three categories (upper, lower, footwear). These models are stored in pickle files.

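import numpy as np
from sklearn.neighbors import NearestNeighbors

# assumption: lowerwear is a table whose 'Embedings' column holds one 1024-d vector per image
neighbors = NearestNeighbors(n_neighbors=50, algorithm='brute', metric='euclidean').fit(np.array(list(lowerwear['Embedings'])))
# query with the embedding of image 1603 to get its 50 nearest catalog images
distances, indices = neighbors.kneighbors([lowerwear['Embedings'][1603]])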
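
For intuition, a brute-force Euclidean search simply computes the distance from the query to every stored embedding and keeps the smallest ones. The NumPy sketch below shows that idea on random unit-length vectors standing in for real embeddings; `nearest_neighbors` and `catalog` are illustrative names, not part of the project code.

```python
import numpy as np


def nearest_neighbors(query, embeddings, k=50):
    """Return (distances, indices) of the k rows of embeddings closest to query."""
    dists = np.linalg.norm(embeddings - query, axis=1)   # one distance per row
    k = min(k, len(embeddings))
    idx = np.argsort(dists)[:k]                          # smallest distances first
    return dists[idx], idx


# Random unit-length vectors as stand-ins for 1024-d image embeddings.
rng = np.random.default_rng(0)
catalog = rng.normal(size=(200, 1024))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)

distances, indices = nearest_neighbors(catalog[7], catalog, k=5)
print(indices)
print(distances)
```

Because the embeddings are L2-normalized, the squared Euclidean distance equals 2 minus twice the cosine similarity, so ranking by Euclidean distance gives the same order as ranking by cosine similarity. The cost per query grows with the number of stored images times the embedding size, which is why a 1024-dimensional embedding is cheaper to search than a 2048-dimensional one.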

I got decent results, which are shown in the next section.

5. Experimentation Results:

The results are shown in the order of the modules.

Module 1 (Pose Estimation): Given an input image, this module returns a bool: True for a full-front pose, False otherwise.

If Module 1 returns True, the image is sent to Module 2.

Module-2(Localization and article Detection)

Localization and article detection part

The image is now cropped into the upper part, lower part, and footwear, as discussed in the Module 2 part of the ‘Explaining my Approach’ section above.

These crops are then resized to (512, 512) and sent to the next module to get the image embeddings.

Module-3(Image Embeddings)

We use the pre-trained DenseNet121 from TensorFlow to extract features (image embeddings) for the upper, lower, and foot images that we got from Module 2.

import cv2
import numpy as np
import tensorflow as tf
from numpy.linalg import norm
# assumption: preprocess_input is the ResNet50 one; its import is not shown in the original
from tensorflow.keras.applications.resnet50 import preprocess_input

class Module3:
    def __init__(self):
        self.model = tf.keras.applications.DenseNet121(
            include_top=False, weights='imagenet', input_tensor=None,
            input_shape=(512, 512, 3), pooling='max')

    def extract_features(self, img):
        '''Input : Image (uses the pretrained DenseNet121 model)
        Output : Image embedding of size 1024 for the image'''
        preprocessed_img = preprocess_input(np.expand_dims(cv2.resize(img, (512, 512)), axis=0))
        features = self.model.predict(preprocessed_img)
        flattened_features = features.flatten()
        normalized_features = flattened_features / norm(flattened_features)
        return normalized_features

Module -4(Getting Similar Images)

We use the fitted nearest-neighbor models discussed in the section above to find similar image embeddings. Images at a smaller distance are more similar, and those are the images we output.

The upper-part recommendations I got for the image above

Lower-part recommendations

Footwear recommendations

6. Deployment Using Streamlit

Here is the full video of the deployment.

7. Conclusion:

Finally, we are able to recommend the secondary articles or products worn by the model or person in the image, and the system is deployed end to end using Streamlit.

8. Future Works

1. Reduce the Latency

2. Train the Triplet-net Based Embedding layer network for getting similar images

3. Reduce the loss in Module 2

4. More data collection and articles

9. Links to GitHub and LinkedIn

My complete code is in the GitHub repository, and you can contact me on LinkedIn if you want to discuss this project.

10. References
