Causes and remedies when LightGBM predicts the same value for every sample


Cause

  • The training dataset is too small

Remedies

  • Lower the value of the min_child_samples hyperparameter.
  • Increase the amount of training data.

Notes

min_child_samples is the minimum number of samples a leaf node must contain (it is the scikit-learn-API name for LightGBM's min_data_in_leaf parameter).

Its default is 20, and no split that would leave fewer samples than this in a leaf is made.

Consequently, with the default setting, a dataset of only a few dozen rows cannot be trained properly, and the model may predict the same value for every sample.

The best remedy is to increase the amount of data, but lowering min_child_samples also improves things to some extent.

Reproducing the phenomenon

In [10]:
import lightgbm as lgb
import sklearn.datasets
from sklearn.model_selection import train_test_split

Predicting as usual

In [90]:
# Note: load_boston was removed in scikit-learn 1.2; this example predates that
boston = sklearn.datasets.load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)
In [91]:
model = lgb.LGBMRegressor(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred
Out[91]:
array([23.51779222, 25.1634114 , 23.74558738, 10.59833384, 19.64574118,
       19.96662897, 21.8582743 , 19.50020379, 21.26668172, 18.26415997,
        8.45298343, 13.04995005, 15.0873012 ,  8.2987718 , 47.36216404,
       34.86872717, 21.64090661, 39.99090241, 26.05482856, 21.79610728,
       23.73643801, 22.16598025, 19.13113231, 25.79010059, 21.72594814,
       18.70777934, 17.14465817, 15.13668063, 42.46717769, 18.69930545,
       15.73468722, 17.34674089, 19.65790067, 20.08222737, 25.83470178,
       16.61005936,  8.74898077, 23.39105625, 14.55845363, 14.58655287,
       23.36466612, 21.94871246, 21.81213129, 16.25078297, 22.04850307,
       21.7384976 , 20.10937895, 16.6653988 , 15.01326306, 25.60206669,
       16.90822172, 20.60323435, 22.29170513, 42.67729761, 13.84472805,
       19.03075463, 19.51608104, 18.51464692, 19.35011084, 20.69497763,
       20.95147061, 20.86294727, 32.8268614 , 32.5744496 , 18.95636996,
       26.08159908, 16.20687672, 18.44315019, 15.8077606 , 22.82257648,
       19.29345367, 22.34956776, 24.17311272, 30.35479374, 26.16227244,
        8.33653911, 45.01062402, 22.59214111, 23.3614851 , 20.57377855,
       25.51010287, 17.67929032, 17.85171494, 43.67630391, 39.42340579,
       24.40606366, 24.34236667, 14.66840737, 26.15887604, 16.22312834,
       18.5747412 , 11.50690741, 22.36027667, 30.05608215, 21.75762671,
       22.13243742, 10.95540418, 22.17537827, 15.43527962, 18.50935058,
       24.92223822, 19.95910786, 27.03080267, 21.16328981, 26.92852917,
       19.20265533,  8.39133907, 16.83714336, 21.54709105, 23.81013026,
       32.03670925, 14.55612791, 18.48223334, 17.97027692, 16.87002851,
       21.35367344,  8.55890384, 19.71166888,  9.75565535, 45.84800209,
       28.74251292, 11.60536039, 18.23442408, 21.66743716, 19.30663662,
       19.58682227, 38.05853668])
In [92]:
score = model.score(X_test, y_test)
score
Out[92]:
0.7410313895708159

When LightGBM predicts the same value for every sample

In [102]:
# Slice out a 30-row subset of the training data
X_train_short = X_train[:30,:]
y_train_short = y_train[:30,]
In [103]:
model = lgb.LGBMRegressor(random_state=0)
model.fit(X_train_short, y_train_short)
Out[103]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=0, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
In [104]:
y_pred = model.predict(X_test)
In [105]:
# All predictions are the same value
y_pred
Out[105]:
array([20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688, 20.55666688, 20.55666688, 20.55666688,
       20.55666688, 20.55666688])
In [106]:
score = model.score(X_test, y_test)
score
Out[106]:
-0.03746940245176966

Improving LightGBM predictions on a small dataset

In [107]:
params = {
    # lowered from the default of 20
    'min_child_samples': 3,
}
In [108]:
model2 = lgb.LGBMRegressor(**params, random_state=0)
model2.fit(X_train_short, y_train_short)
Out[108]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=3, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=0, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
In [109]:
y_pred2 = model2.predict(X_test)
y_pred2
Out[109]:
array([23.59842819, 21.04001598, 24.25679736, 11.29455799, 21.55054729,
       23.48508799, 23.70510083, 22.94101353, 23.88411442, 22.28766455,
        8.5588055 , 11.24756747, 10.26005531,  8.56017815, 32.23957481,
       32.8900881 , 22.88128202, 33.39904706, 23.8622733 , 22.52561999,
       23.69031291, 21.44949619, 18.6305379 , 23.59168777, 22.86287914,
       20.87245133, 23.22369243, 12.68071484, 32.26021952, 20.91819321,
       13.28596428, 12.16580672, 22.46707511, 21.68303515, 23.56861853,
       10.38498309, 11.03312465, 19.66626012, 13.65569641, 13.0036135 ,
       21.49628923, 18.87797782, 21.30628159, 13.30570893, 22.17998598,
       23.35148025, 18.38405521, 12.73570213, 10.99869785, 24.41279037,
       12.71766435, 19.99497057, 23.85239512, 33.48738661, 16.01253221,
       18.24084792, 21.62211109, 21.04990677, 20.59734574, 19.75487628,
       23.38190849, 23.47836046, 33.56947705, 24.17857852, 16.30286047,
       24.00790208, 10.61068404, 18.18502344, 12.26102171, 23.48729952,
       23.01519923, 21.6584925 , 22.67781617, 32.95334594, 24.1949926 ,
        7.89663717, 32.38892823, 22.9741865 , 23.57857828, 20.16457843,
       24.57191162, 15.54425963, 20.25132642, 32.74019494, 32.59918162,
       23.02732036, 24.11573041, 13.19993255, 23.41185163, 11.38306773,
       17.66397205,  9.49967577, 23.53279949, 33.30693645, 22.67443254,
       22.90987986,  9.89923151, 23.0662971 ,  8.39813457, 20.85335245,
       24.7617339 , 23.74815633, 31.75170601, 22.12626585, 30.77593589,
       21.0407497 ,  9.58668095, 19.30262003, 22.64276415, 22.03291882,
       31.01214356, 11.88899228, 17.97882543, 19.26124619, 16.47882763,
       18.66521901,  6.04301808, 21.96279296, 11.13531389, 36.52481394,
       24.13753206, 11.57834355, 18.45544057, 24.08193129, 23.04652952,
       20.44277207, 34.94595162])
In [110]:
score = model2.score(X_test, y_test)
score
Out[110]:
0.5042425544047608
