How to save time for human resources with machine learning – Towards Data Science

Last weekend I attended WNS Analytics Wizard 2018, an online machine learning hackathon. Within 72 hours, more than 1000 participants were solving a task from WNS Analytics. In this post I will describe the technical approach that helped me finish in 4th place in this event, and share some thoughts on the commercial value of my solution.

Introduction to the challenge

In this competition, our client is WNS Analytics, a large multinational company. They want us to identify the right people for promotion. Currently, this process consists of several steps:

  1. First, they identify a group of employees based on past recommendations and performance;
  2. Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills required by each vertical;
  3. At the end of the program, based on various factors such as training performance and KPI completion (only employees with KPI completion above 60% are considered), an employee receives a promotion.
The pipeline of a promotion process

As you can imagine, the whole process takes a lot of time. One way to speed it up, so that the company saves time and money, is to identify the right candidate at the checkpoint. We will use a data-driven approach to predict whether a candidate will be promoted or not. Our prediction will be based on the employee's performance from nomination for promotion up to the checkpoint, and on a number of demographic attributes.

The competition problem is formulated as a binary classification task, and the evaluation metric is the F1 score. The training set consists of 54,808 examples, the test set of 23,490 examples. Only 8.5% of employees are recommended for promotion.
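With only 8.5% positives, the F1 score is the sensible metric: it is the harmonic mean of precision and recall, so a model cannot look good just by predicting the majority class. A minimal sketch with toy labels (not the competition data) illustrating why accuracy would be misleading here:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with roughly the competition's imbalance: ~8.5% positives.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.085).astype(int)

# A degenerate model that always predicts "not promoted"
# scores high on accuracy but exactly 0 on F1.
y_all_zero = np.zeros_like(y_true)
print(accuracy_score(y_true, y_all_zero))            # high (roughly 0.9)
print(f1_score(y_true, y_all_zero, zero_division=0))  # 0.0
```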

A description of the data set

Validation strategy

A validation strategy is fundamental in any data-driven task, and in competitions in particular. At the beginning of the competition, I formed a hypothesis that the test data came from a different distribution than the train data. That could be the case if the train/test split had been carried out by a time variable that was not provided by the organizers. In this scenario, the scores obtained from a simple validation performed by StratifiedKFold could give overestimated expectations about our real score and position on the leaderboard.

I decided to split the training data set into folds based on the probability of belonging to the test set. It is a simple but powerful technique for handling different distributions in the train/test parts. The idea of this approach is simple: before solving the actual machine learning task, we estimate the probability that each training example appears in the test set. This allows us to:

  1. Determine the features that strongly separate the train and test data sets. These features often lead to overfitting (i.e. they are useless), and we could drop them to increase the models' score. Furthermore, we may gain some insight into the nature of these features;
  2. Determine the train samples that are least similar to the test examples (i.e. "outliers"). We can eliminate these samples, or give them a lower weight during training and validation. In my experience, this almost always increases the score on both validation and the leaderboard;
  3. Make a better validation. We want our validation folds to have the same distribution as the test set (the very idea of validation). By randomly splitting the train into folds, we could end up in a situation where some folds consist mostly of examples dissimilar to the test set. We could not rely on the scores of such a split, so our validation would be useless. Splitting by the probability of belonging to the test set helps overcome this problem.

To perform this technique, I added a new feature is_test, which equals 0 for the train part and 1 for the test part. I combined the two data sets together and predicted the new target is_test.

import pandas as pd
import numpy as np
from scipy.stats import rankdata
from sklearn.model_selection import cross_val_predict

train['is_test'] = 0
test['is_test'] = 1
train_examples = train.shape[0]
train = train.drop('is_promoted', axis=1)
data = pd.concat([train, test], axis=0).reset_index(drop=True)
data_x = data.drop('is_test', axis=1)
data_y = data['is_test']
is_test_probs = cross_val_predict(some_ml_model, data_x, data_y,
                                  method='predict_proba')[:train_examples, 1]
train['is_test'] = rankdata(is_test_probs)
bins = np.histogram(train['is_test'])[1][:-1]
train['is_test_bins'] = np.digitize(train['is_test'], bins)
# use 'is_test_bins' as the stratification target

However, this approach did not help in this competition. The AUC score for this task was about 0.51, which clearly showed that the classifier was not able to distinguish the train and test parts (the test data is similar to the train data). I decided to use StratifiedKFold with 11 folds instead.
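The fallback split can be sketched as follows (synthetic data standing in for the competition set; the fold count matches the one used here). Stratification guarantees that every fold keeps roughly the 8.5% positive rate of the full training set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the competition data: ~8.5% positives.
rng = np.random.default_rng(42)
X = rng.random((550, 5))
y = (rng.random(550) < 0.085).astype(int)

skf = StratifiedKFold(n_splits=11, shuffle=True, random_state=42)
folds = list(skf.split(X, y))
for train_idx, valid_idx in folds:
    # Each validation fold preserves the class ratio of the full set.
    assert abs(y[valid_idx].mean() - y.mean()) < 0.05
```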

Pre-processing and feature engineering

I filled the missing values in education and previous_year_rating with a new category, missing. All other features had no missing values or outliers; this data set was convenient to work with.
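That fill is a one-liner in pandas; a minimal sketch on a toy frame (column names from the data set, values invented):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "education": ["Bachelor's", np.nan, "Master's & above", np.nan],
    "previous_year_rating": [5.0, 3.0, np.nan, 1.0],
})

# Treat "not available" as its own category instead of dropping or imputing.
for col in ["education", "previous_year_rating"]:
    train[col] = train[col].astype(object).fillna("missing")

print(train["education"].tolist())
```

Keeping missingness as a level of its own lets tree models exploit the fact that a value was absent, which is often informative in itself.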

I added some features, such as combinations of categorical features and features related to the age of the employee; they slightly increased the score:

train['work_fraction'] = train['length_of_service'] / train['age']
test['work_fraction'] = test['length_of_service'] / test['age']
train['start_year'] = train['age'] - train['length_of_service']
test['start_year'] = test['age'] - test['length_of_service']

Also, I noticed that avg_training_score was an important feature for the classifier, so I created a lot of combinations of categorical features with avg_training_score. For example, a new feature avg_training_score_scaled_mean_department_region was the result of avg_training_score divided by the average score for the particular department and region. From the figures below it is clearly seen that this type of normalization produced a good feature for the classifier: a person with a score higher than the average of his department and region had a greater chance of being promoted.

The distributions of avg_training_score for the positive and negative target are presented in the left figure. The distributions of avg_training_score divided by the average score of each department and region are presented in the right figure
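This kind of group-wise normalization can be sketched with pandas groupby().transform() (toy values; the feature name follows the article):

```python
import pandas as pd

train = pd.DataFrame({
    "department": ["Sales", "Sales", "HR", "HR"],
    "region": ["region_1", "region_1", "region_2", "region_2"],
    "avg_training_score": [80.0, 60.0, 50.0, 50.0],
})

# Mean score per (department, region) group, broadcast back to each row.
group_mean = train.groupby(["department", "region"])["avg_training_score"].transform("mean")
train["avg_training_score_scaled_mean_department_region"] = (
    train["avg_training_score"] / group_mean
)
# Values above 1.0 mean "better than the average of your department and region".
print(train["avg_training_score_scaled_mean_department_region"].tolist())
```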

To make my models diverse, I used different approaches to handle categorical features: label encoding, one-hot encoding and mean encoding. Mean encoding often increases the score on tasks with many categorical features with many levels. On the other hand, incorrect use of mean encoding may damage the score (see the figure below). The correct approach to mean encoding is to split the data set into multiple folds and perform the mean encoding inside each fold separately.

Results of toy experiments with a click-through rate prediction data set and mean encoding (LogLoss metric). The wrong approach to mean encoding steadily improves the CV score but drastically reduces the leaderboard score

Trained models

I trained 3 CatBoost and 2 LightGBM models with different pre-processing strategies on 11 folds (55 models in total):

  1. CatBoost – 1
    Original features + label encoding
  2. CatBoost – 2
    Original features + new features + label encoding
  3. CatBoost – 3
    Original features + new features + mean encoding
  4. LightGBM – 1
    Original features + new features + mean encoding
  5. LightGBM – 2
    Original features + new features + OHE

I used a different StratifiedKFold seed for each model (although I did not intend to stack them, it turned out fine). For each fold, I determined the optimal threshold based on the F1 score on that fold. The final prediction was the majority vote of the 55 models. The cross-validation F1 score of the best single model was 0.5310, of the final model 0.5332, and the private score was 0.5318 (20th place on the public leaderboard and 4th place on the private one). Always trust your CV!
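The per-fold threshold search and the majority vote can be sketched like this (synthetic probabilities and only 3 voters for brevity; the real pipeline used 55 models):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, probs):
    """Scan a grid of thresholds and keep the one maximizing F1."""
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, (probs >= t).astype(int), zero_division=0)
              for t in thresholds]
    return thresholds[int(np.argmax(scores))]

rng = np.random.default_rng(1)
y_valid = (rng.random(500) < 0.085).astype(int)
# Crude synthetic model: noisy probabilities correlated with the target.
probs = np.clip(0.6 * y_valid + 0.2 * rng.random(500), 0.0, 1.0)

t = best_threshold(y_valid, probs)

# Majority vote over several thresholded predictions
# (here, 3 noisy copies of the same model stand in for 55 real models).
preds = np.stack([(probs + 0.05 * rng.standard_normal(500) >= t).astype(int)
                  for _ in range(3)])
final = (preds.sum(axis=0) >= 2).astype(int)
```

Tuning the threshold per fold matters because F1 is not maximized at 0.5 on imbalanced data; the optimum usually sits well below it.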

Row-normalized confusion matrix of the best submission; the F1 score is 0.5432

To conclude

How can we interpret the results of this competition? I think there is huge room for improvement. First, I think the goal of the problem should be shifted from "Who should be promoted?" to "What should employees do to be promoted?". From my point of view, a machine learning tool should show a person the path, so that he has a clear understanding of, and motivation for, succeeding in his work. A small change in focus would make a big difference.

Secondly, in my opinion, the score of the best model is rather poor. It could still be useful if its performance is better than human performance, but WNS Analytics might consider adding more data to the decision-making process. I am talking about adding features not related to the promotion process itself, but to the person's KPIs at work before the start of the promotion process.

In the end, I am happy with the results of this competition. For me, it was a good competition to be part of. During it, I tested various ideas on a real-world data set, set up a semi-automatic pipeline for blending and stacking, worked with the data and did quite a bit of feature engineering.

Not to mention the fact that it was fun 🙂
