Last weekend I took part in WNS Analytics Wizard 2018, an online machine learning hackathon. Over 72 hours, more than 1000 participants worked on a task from WNS Analytics. In this post I will describe the technical approach that helped me finish in 4th place in this event and share some thoughts on the commercial value of my solution.
Introduction to the challenge
In this competition, our client is WNS Analytics, a large multinational company. They want us to identify the right person for a promotion. Currently, this process consists of several steps:
- First they identify a group of employees based on past recommendations and performance;
- Selected employees go through a separate training and evaluation program for each vertical, based on the skills required by that vertical;
- At the end of the program, an employee receives a promotion based on various factors such as training performance and KPI completion (only employees with KPI completion above 60% are considered).
As you can imagine, the whole process takes a lot of time. One way to speed it up, so that the company saves time and money, is to identify the right candidates at a checkpoint. We will use a data-driven approach to predict whether a candidate will be promoted or not. Our prediction will be based on the employee's performance from the nomination for promotion up to the checkpoint, and on a number of demographic attributes.
The competition problem is formulated as a binary classification task, and the evaluation metric is the F1 score. The training set consists of 54,808 examples, the test set of 23,490 examples. Only 8.5% of the employees are recommended for promotion.
Validation strategy
A validation strategy is fundamental in any data-driven activity, and in competitions in particular. At the beginning of the competition I formed a hypothesis that the test data came from a different distribution than the train data. This could be the case if the train/test split had been made by some time variable that the organizers did not provide. In that scenario, scores obtained from a simple validation with `StratifiedKFold` could give an overly optimistic estimate of our real score and of the expected position on the leaderboard.
I decided to split the training data set into folds based on the probability of belonging to the test set. This is a simple but powerful technique for handling different distributions in the train and test parts. The idea of the approach is simple: before solving the actual machine learning task, we estimate the probability that each training example belongs to the test set. This allows us to:
- Identify the features that strongly separate the train and test sets. Such features often lead to overfitting (i.e. they are useless), and we could drop them to increase the models' score. Moreover, the nature of these features may give us some insights;
- Identify the train samples that look like outliers with respect to the test set. We can eliminate these samples or give them a lower weight during training and validation. In my experience, this always improves the score both in validation and on the leaderboard;
- Build a better validation. We want our validation folds to have the same distribution as the test set (this is the very idea of validation). By splitting the train set into folds randomly, we could end up in a situation where only some folds consist of examples similar to the test set. We could not rely on the scores from such a split, and our validation would be useless. Splitting by the probability of belonging to the test set helps us overcome this problem.
To apply this technique I added a new feature `is_test`, which was equal to 0 for the train part and 1 for the test part. I combined the two data sets together and predicted the new target `is_test`:
import pandas as pd
import numpy as np
from scipy.stats import rankdata
from sklearn.model_selection import cross_val_predict

train['is_test'] = 0
test['is_test'] = 1
train_examples = train.shape[0]

# drop the real target so it does not leak into the is_test classifier
train = train.drop('is_promoted', axis=1)
data = pd.concat([train, test], axis=0).reset_index(drop=True)

data_x = data.drop('is_test', axis=1)
data_y = data['is_test']

# out-of-fold probability of belonging to the test set
oof_probs = cross_val_predict(some_ml_model, data_x, data_y,
                              method='predict_proba')[:, 1]
is_test_probs = oof_probs[:train_examples]

train['is_test'] = rankdata(is_test_probs)
bins = np.histogram(train['is_test'])[1][:-1]
train['is_test_bins'] = np.digitize(train['is_test'], bins)
# use 'is_test_bins' as the stratification column
However, this approach did not help me in this competition. The AUC score for this task was about 0.51, which clearly showed that the classifier was not able to distinguish the train and test parts (the test data is similar to the train data). I decided to use `StratifiedKFold` with 11 folds instead.
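A short sketch of this check and of the fallback split (here `train_x` and `train_y` are placeholder names for the final feature matrix and the `is_promoted` target; this is an illustration, not the exact code from my solution):

```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# AUC of the adversarial classifier: a value close to 0.5 means the classifier
# cannot tell train from test, so the two distributions are similar
print('adversarial AUC:', roc_auc_score(data_y, oof_probs))  # ~0.51 in this competition

# fall back to a plain stratified split on the real target
skf = StratifiedKFold(n_splits=11, shuffle=True, random_state=42)
for fold, (fit_idx, val_idx) in enumerate(skf.split(train_x, train_y)):
    pass  # train a model on fit_idx, validate on val_idx
```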
Pre-processing and feature engineering
I filled the missing values in `education` and `previous_year_rating` with a new category `missing`. All other features had no missing values or outliers; this data set was convenient to work with.
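A minimal sketch of this imputation (the column names come from the competition data; casting `previous_year_rating` to an object column so it can hold the new category is my own choice for illustration):

```python
# replace NaNs with an explicit 'missing' category in both data sets
for df in (train, test):
    for col in ['education', 'previous_year_rating']:
        df[col] = df[col].astype(object).fillna('missing')
```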
I added some features, such as combinations of categorical features and features related to the employee's age; they slightly increased the score:
train['work_fraction'] = train['length_of_service'] / train['age']
test['work_fraction'] = test['length_of_service'] / test['age']
train['start_year'] = train['age'] - train['length_of_service']
test['start_year'] = test['age'] - test['length_of_service']
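For the categorical combinations, a sketch along these lines would work (the exact pairs I combined are not listed here, so the pairs below are only examples):

```python
# concatenate pairs of categorical columns into new combined categories
for col_a, col_b in [('department', 'region'), ('education', 'recruitment_channel')]:
    name = f'{col_a}_{col_b}'
    train[name] = train[col_a].astype(str) + '_' + train[col_b].astype(str)
    test[name] = test[col_a].astype(str) + '_' + test[col_b].astype(str)
```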
I also noticed that `avg_training_score` was an important feature for the classifier, so I created a lot of combinations of categorical features with `avg_training_score`. For example, a new feature `avg_training_score_scaled_mean_department_region` was the result of `avg_training_score` divided by the average score for the particular department and region. From the figures below it is clearly seen that this kind of normalization produced a good feature for the classifier: a person with a score above the average of the department in his region had a greater chance of being promoted.
Distributions of `avg_training_score` for the positive and negative target are presented in the left figure; distributions of `avg_training_score` divided by the average score of each department-region are presented in the right figure.
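Roughly, this normalization can be computed along these lines (computing the group means over train and test together is an assumption here; a sketch, not the exact code from my pipeline):

```python
import pandas as pd

# mean avg_training_score per (department, region) pair
group_cols = ['department', 'region']
full = pd.concat([train, test], axis=0, ignore_index=True)
group_mean = (full.groupby(group_cols)['avg_training_score']
              .mean().rename('dept_region_mean').reset_index())

for df in (train, test):
    merged = df[group_cols].merge(group_mean, on=group_cols, how='left')
    df['avg_training_score_scaled_mean_department_region'] = (
        df['avg_training_score'].values / merged['dept_region_mean'].values
    )
```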
To make my models diverse, I used different approaches to handle categorical features: label encoding, one-hot encoding and mean encoding. Mean encoding often increases the score on tasks with many categorical features with many levels. On the other hand, incorrect use of mean encoding may hurt the score (see figure below). The correct approach to mean encoding is to split the data set into several folds and perform the mean encoding inside each fold separately.
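A minimal sketch of such out-of-fold mean encoding (the fold count and the fallback to the global mean for unseen categories are illustrative choices, not necessarily what my pipeline did):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def mean_encode(train, test, col, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding: each train row gets the target mean of its
    category computed on the other folds; test rows get the mean over the full train."""
    new_col = f'{col}_mean_{target}'
    global_mean = train[target].mean()
    encoded = np.full(len(train), np.nan)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in skf.split(train, train[target]):
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded[enc_idx] = train[col].iloc[enc_idx].map(fold_means).values
    train[new_col] = np.where(np.isnan(encoded), global_mean, encoded)
    test[new_col] = test[col].map(train.groupby(col)[target].mean()).fillna(global_mean)
    return train, test

# e.g. train, test = mean_encode(train, test, 'department', 'is_promoted')
```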
Trained models
I trained 3 CatBoost and 2 LightGBM models with different pre-processing strategies on 11 folds (55 models in total):
- CatBoost – 1: original features + label encoding
- CatBoost – 2: original features + new features + label encoding
- CatBoost – 3: original features + new features + mean encoding
- LightGBM – 1: original features + new features + mean encoding
- LightGBM – 2: original features + new features + OHE
I used different `StratifiedKFold` seeds for different models (even though I did not plan to stack them, it did not hurt). For each fold, I determined the optimal threshold based on the F1 score on that fold. The final prediction was the majority vote of the 55 models. The cross-validation F1 score of the best single model was 0.5310, of the final blend – 0.5332, and the private score was 0.5318 (20th place on the public leaderboard and 4th on the private one). Always trust your CV!
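A sketch of the threshold search and the majority vote (a simplified illustration of the idea, not my full pipeline):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, grid=np.arange(0.05, 0.95, 0.01)):
    """Pick the probability threshold that maximizes the F1 score on a validation fold."""
    scores = [f1_score(y_true, (y_prob > t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

def majority_vote(binary_predictions):
    """Majority vote over an array of binary predictions, shape (n_models, n_samples)."""
    binary_predictions = np.asarray(binary_predictions)
    return (binary_predictions.mean(axis=0) > 0.5).astype(int)

# for each of the 55 (model, fold) pairs: threshold the test probabilities with that
# fold's best threshold, then take the majority vote over all 55 binary predictions
```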
To conclude
How can we interpret the results of this competition? I think there is huge room for improvement. First of all, I think the goal of the problem should be shifted from "Who should be promoted?" to "What should employees do to be promoted?". From my point of view, a machine learning tool should show a person the path forward, so that they have a clear understanding of, and motivation for, succeeding at their work. A small change in focus would make a big difference.
Secondly, in my opinion, the score of even the best model is rather poor. It could still be useful if its performance is better than human performance, but WNS Analytics might consider adding more data to the decision-making process. I am talking about adding features related not to the promotion process itself, but to the person's KPIs at work before the promotion process starts.
In the end, I am happy with the results of this competition. For me, it was a good competition to be part of. During this competition I tested various ideas on a real-world data set, set up a semi-automatic pipeline for blending and stacking, worked with the data and practiced feature engineering a bit more.
Not to mention the fact that it was fun 🙂