Preface:
I previously worked on a credit-overdue project using an XGBoost model; the details are in earlier articles.
Now I am working on a telecom fraud project. Given user data, the goal is to determine whether a record is fraudulent. Like the credit-overdue project, it is essentially a binary classification problem; only the data processing differs. I used both an XGBoost model and a LightGBM model for prediction. The experiments show that the LightGBM model performs better than the XGBoost model, so I record the LightGBM model here.
Experience:
As long as the parameters stay within a reasonable range, further hyperparameter tuning will not change the prediction results significantly. In my opinion, there are roughly two ways to improve: 1. Replace the model; perhaps the current model is not the best fit for the dataset, so try other types of models, such as random forests. 2. Select better data features for training; good features can noticeably improve the prediction results.
In short, good data combined with a good model yields the best prediction results.
1. Data cleaning
Clean the table according to the characteristics of the data: for example, drop null values, remove duplicate rows, or fill missing values with the median.
Note that the data also needs to be normalized; after normalization, the prediction results improve.
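A minimal sketch of this step, assuming the raw table has been read into a pandas DataFrame named df (the file name and column handling are placeholders; adjust to your own data):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# read the raw table (file name is a placeholder)
df = pd.read_csv('data.csv')

# remove duplicate rows and fill remaining missing values with the column median
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# normalize the numeric columns to [0, 1]; a 0/1 label column is unaffected
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])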
2. Split the data into X and Y
This is supervised learning: X holds the data features (feature), and Y is the target (target), i.e., whether the record is fraudulent. Fraud is labeled 1, otherwise 0.
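A minimal sketch, assuming the cleaned DataFrame is df and the fraud label is stored in a column named 'label' (the column name is a placeholder):

# Y (target): 1 for fraud, 0 otherwise
target = df['label']

# X (feature): all remaining columns
feature = df.drop(columns=['label'])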
3. Split into training and test sets
# Import the required package
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(feature, target, test_size=0.2)
4. Use the LightGBM model for prediction
import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# LightGBM model parameter settings, adjust according to your own needs
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'l2', 'auc', 'binary_logloss'},
    'num_leaves': 40,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'is_unbalance': True
}

# Training
gbm = lgb.train(params, lgb_train, num_boost_round=1000,
                valid_sets=lgb_eval, early_stopping_rounds=100)
5. Model prediction
Step 4 produces a trained model. You can now feed it X data in the same format as the training features and use the model to predict. Take X_test as an example.
# The input must be in the same data format as during training
lgb_pre = gbm.predict(X_test)
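Note that with the 'binary' objective, gbm.predict returns the predicted probability of fraud, not a 0/1 label. If hard labels are needed, a threshold can be applied (0.5 below is just an example value); when early stopping was used, the best iteration can also be passed explicitly:

# predict with the iteration selected by early stopping
lgb_pre = gbm.predict(X_test, num_iteration=gbm.best_iteration)

# convert probabilities to 0/1 labels with an example threshold of 0.5
lgb_label = (lgb_pre > 0.5).astype(int)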
6. Result evaluation
Compare the predicted results with the true labels to evaluate the quality of the model.
from sklearn.metrics import roc_auc_score

auc_score = roc_auc_score(y_test, lgb_pre)
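AUC is computed on the predicted probabilities. As an optional sketch, other metrics can be computed on the thresholded 0/1 labels from step 5:

from sklearn.metrics import accuracy_score, confusion_matrix

print('accuracy:', accuracy_score(y_test, lgb_label))
print('confusion matrix:')
print(confusion_matrix(y_test, lgb_label))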
7. Model saving and loading
Save the trained model so that it can be loaded directly wherever needed, without retraining.
# Save the model
gbm.save_model('model.txt')

# Load the model
import lightgbm as lgb
gbm = lgb.Booster(model_file='model.txt')