What 20,000 games of online chess can tell us

Photo by Hassan Pasha on Unsplash

It’s well known among all chess aficionados, from the casual observer to the highest rated grandmasters, that the white pieces have the advantage over the black pieces before the game even starts. Why? Because white has control of the first move.

But white’s initial advantage can easily be ceded to black if the opening moves are inaccurate. This observation led me to pose three questions:

  1. Do certain chess openings lead to white winning more often than black? Relatedly, if the game is rated, do the openings change significantly?
  2. Does time control affect the winner? Is there a bias towards white or black winning as the number of turns increases?
  3. Can we predict the winner based on multiple factors other than player rating? How does this compare to predictions based on player rating only?

The dataset used was provided by Kaggle and contains 20,058 rated and unrated online chess games from Lichess, a free and open-source Internet chess server.
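As a starting point, here is a minimal sketch of how the file could be loaded with Python's standard library. The column names match the ones listed later in this article; the two sample rows and the exact spelling of the boolean values are assumptions made for illustration, not rows copied from the dataset.

```python
import csv
import io

def load_games(fileobj):
    """Read the games CSV and convert the boolean and numeric columns."""
    games = []
    for row in csv.DictReader(fileobj):
        # Assumption: 'rated' is stored as text such as "TRUE" or "False".
        row["rated"] = row["rated"].strip().lower() == "true"
        for col in ("turns", "white_rating", "black_rating"):
            row[col] = int(row[col])
        games.append(row)
    return games

# Two made-up rows standing in for the real file.
sample_csv = io.StringIO(
    "rated,turns,winner,increment_code,white_rating,black_rating\n"
    "TRUE,13,white,15+2,1500,1191\n"
    "False,16,black,5+10,1322,1261\n"
)
games = load_games(sample_csv)
print(len(games), games[0]["rated"], games[1]["white_rating"])
```

With the real download, the same function would be called on an open file handle, for example `open("games.csv", newline="")`, where the filename is an assumption.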

Exploratory Data Analysis

Before looking into how effective chess openings were, I checked to see that the distribution of player ratings for white and black were approximately the same across all games.

White Ratings Distribution (left) vs. Black Ratings Distribution (right)
Win Counts for White (blue bar) vs. Black (black bar)

The upper graphs are histograms with 15 bins showing game counts by rating for white and black. The lower graph shows the number of wins for each color. The histograms have similar shapes and magnitudes, and the difference in the number of wins is only ~4% of the total number of games played, so the dataset is reasonably balanced in both rating distribution and win proportion.
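The balance check described above can be sketched in a few lines. This is an illustration rather than the code behind the plots: the helper names `histogram` and `win_gap` are hypothetical, and the five games are made up.

```python
from collections import Counter

def histogram(values, bins=15):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 if every value is identical
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

def win_gap(games):
    """Return win counts per result and the white/black gap as a share of all games."""
    counts = Counter(g["winner"] for g in games)
    return counts, abs(counts["white"] - counts["black"]) / len(games)

# Made-up sample; the article uses 15 bins, 3 keeps the toy output readable.
sample = [
    {"winner": "white", "white_rating": 1200},
    {"winner": "white", "white_rating": 1500},
    {"winner": "black", "white_rating": 1800},
    {"winner": "draw", "white_rating": 1300},
    {"winner": "white", "white_rating": 1600},
]
white_hist = histogram([g["white_rating"] for g in sample], bins=3)
counts, gap = win_gap(sample)
print(white_hist, counts["white"], counts["black"], gap)
```

Running the same two functions on the black ratings and comparing the bin counts side by side gives the numeric version of the visual comparison.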

Which openings lead to more victories?

Among all games (rated and non-rated), which openings lead to more victories for each color? The table below helps answer this question.

Top 10 Openings for all games (Rated and Non-Rated)

The table above shows the top 10 openings used out of 1,453. Although these openings only accounted for about 0.7% of the openings used (10 of 1,453), they represent 13% and 12.8% of victories for white and black, respectively.

The top 3 openings leading to the most victories (for both rated and non-rated games) for white were:

  • Scandinavian Defense: Mieses-Kotroc Variation (+75)
  • Philidor Defense #3 (+60)
  • Queen’s Pawn (+46)

The top 3 openings leading to the most victories (for both rated and non-rated games) for black were:

  • Van’t Kruijs Opening (+100)
  • Sicilian Defense (+45)
  • Sicilian Defense: Bowdler Attack (+45)

(The numbers in parentheses show how many more victories that color achieved than the opposing color.)
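The win differences in parentheses can be reproduced with a simple tally per opening. The sketch below assumes each game is a dict with `opening_name`, `winner`, and `rated` keys; the function names and the sample games are hypothetical.

```python
from collections import defaultdict

def opening_win_diff(games, rated_only=False):
    """White wins minus black wins per opening (draws count for neither side)."""
    diff = defaultdict(int)
    for g in games:
        if rated_only and not g["rated"]:
            continue
        if g["winner"] == "white":
            diff[g["opening_name"]] += 1
        elif g["winner"] == "black":
            diff[g["opening_name"]] -= 1
    return dict(diff)

def top_openings(diff, color, n=3):
    """Openings with the largest win surplus for the given color."""
    sign = 1 if color == "white" else -1
    ranked = sorted(diff.items(), key=lambda kv: sign * kv[1], reverse=True)
    return [(name, sign * d) for name, d in ranked[:n] if sign * d > 0]

# Made-up games for illustration only.
sample = [
    {"opening_name": "Van't Kruijs Opening", "winner": "black", "rated": True},
    {"opening_name": "Van't Kruijs Opening", "winner": "black", "rated": False},
    {"opening_name": "Queen's Pawn", "winner": "white", "rated": True},
    {"opening_name": "Queen's Pawn", "winner": "draw", "rated": True},
]
all_diff = opening_win_diff(sample)
rated_diff = opening_win_diff(sample, rated_only=True)
print(top_openings(all_diff, "white"), top_openings(all_diff, "black"))
```

Passing `rated_only=True` restricts the same tally to rated games, which is the comparison made in the next section.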

Is there a bias in openings if the game is rated?

Win or lose, in a non-rated game, the players’ ratings will not change. In a rated game, a player’s rating increases with a win and decreases with a loss. The magnitude of the change depends on both players’ ratings: a much lower rated player beating a higher rated player gains far more than a much higher rated player beating a lower rated one.
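To make that asymmetry concrete, here is a sketch using the classic Elo update formula. This is a simplification chosen for illustration: Lichess itself uses the Glicko-2 system, and the K-factor of 32 is an assumed value.

```python
def expected_score(rating, opponent):
    """Elo expected score (between 0 and 1) for a player against an opponent."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))

def elo_change(rating, opponent, score, k=32):
    """Rating change after one game; score is 1 for a win, 0.5 draw, 0 loss."""
    return k * (score - expected_score(rating, opponent))

# A 1400 player beating an 1800 player vs. the reverse result.
underdog_gain = elo_change(1400, 1800, score=1)
favorite_gain = elo_change(1800, 1400, score=1)
print(round(underdog_gain, 1), round(favorite_gain, 1))
```

The underdog's gain is roughly ten times the favorite's under these assumptions, which is the behavior described in the paragraph above.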

With chess rating at stake during a rated match, does this affect the opening distribution? The distribution for openings when the game is rated is shown below.

Top 10 Openings for all games (Rated Games Only)

The table above shows the top 10 openings out of the 1,360 used in rated games. Although these openings only accounted for 0.7% of openings used, they represent 13.1% and 14.2% of victories for white and black, respectively.

The top 3 openings leading to the most victories for white in rated games were:

  • Scandinavian Defense: Mieses-Kotroc Variation (+56)
  • Philidor Defense #3 (+48)
  • Queen’s Gambit Refused: Marshall Defense (+47)

The top 3 openings leading to the most victories for black in rated games were:

  • Van’t Kruijs Opening (+101)
  • Sicilian Defense: Bowdler Attack (+42)
  • Sicilian Defense: Old Sicilian (+32)

Let’s compare these results to our previous results when also taking into account non-rated games.

White Openings used in Wins: All Games (left), Rated Games Only (right)
Black Openings used in Wins: All Games (left), Rated Games Only (right)

Comparing rated games with all games, two of the top 3 openings by win difference appear in both lists, for white and for black alike.

For white, it is advisable to study and play these 2 openings to increase the probability of winning:

  1. Scandinavian Defense: Mieses-Kotroc Variation
  2. Philidor Defense #3

For black, it is advisable to study and play these 2 openings to increase the probability of winning:

  1. Van’t Kruijs Opening
  2. Sicilian Defense: Bowdler Attack

Does the number of turns determine a winner? And does the time control affect the winner?

Other than the opening used, I was also curious whether game length favored either side and whether the increment code affected gameplay. For the purposes of this section, only rated games will be taken into account.

Increment codes specify the time controls. An increment code has the form X+Y, where X is the starting time (in minutes) allotted to each player and Y is the number of seconds added to a player’s clock after each move they play. For example, if the increment code is 10+5, a player starts with 10 minutes, and 5 seconds are added to their clock after each move is played. The distribution of the top 15 increment codes and the difference in win results are shown below:

Top Increment Codes
Top Left Graph: Top 5 Increment Codes favoring Black. Top Right: Max Increment Difference favoring Black Bottom Left: Top 5 Increment Codes favoring White. Bottom Right: Max Increment Difference favoring White

As shown by the graphs above, the 10+0 increment was by far the most widely played and showed the greatest advantage for white in terms of the difference in the number of games won (+192 games). Conversely, the 7+9 increment favored black the most (+15 games), but that difference was less than 10% of white’s advantage.

Time increment code had a greater effect on the win difference than the opening used. This suggests that the middlegame and endgame had a far greater effect on the outcome than the opening.
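For readers unfamiliar with the X+Y notation used in this section, the sketch below parses an increment code and works out how much time a player has left after a number of moves. Both function names are hypothetical, and the fixed thinking time per move is an assumption made to keep the arithmetic simple.

```python
def parse_increment_code(code):
    """Split 'X+Y' into (starting minutes, seconds added per move)."""
    base, inc = code.split("+")
    return int(base), int(inc)

def clock_after_moves(code, moves, seconds_used_per_move):
    """Seconds left on one player's clock after they have played `moves` moves."""
    base_min, inc_s = parse_increment_code(code)
    return base_min * 60 + moves * (inc_s - seconds_used_per_move)

print(parse_increment_code("10+5"))
print(clock_after_moves("10+5", moves=4, seconds_used_per_move=20))
```

With 10+0 there is no increment, so the clock only ever goes down, whereas with 7+9 a player who moves in under 9 seconds gains time on every move.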

Relatedly, I also wondered if the number of turns biased the results. Box plots and summary statistics of the number of turns, split by which color won the game, are shown below.

Box Plots for # of Turns for White (left) and Black (right)
Statistics for # of Turns for White (left) and Black (right)

Although the maximum number of turns for white was significantly larger than black’s, this point was an outlier. The box plots and statistics above show that, for the majority of games, the number of turns did not have a significant effect on the game outcome for either color.
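The quantities a box plot displays can also be computed directly. The sketch below uses the common 1.5 × IQR rule to flag outliers; whether the plots above use exactly that rule is an assumption, and the turn counts are made up.

```python
import statistics

def turn_stats(games, color):
    """Quartiles, maximum, and upper outliers of turn counts for games won by `color`."""
    turns = sorted(g["turns"] for g in games if g["winner"] == color)
    q1, median, q3 = statistics.quantiles(turns, n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)  # points above this are treated as outliers
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": turns[-1],
        "outliers": [t for t in turns if t > upper_fence],
    }

# Made-up turn counts: one very long game won by white.
sample = [{"winner": "white", "turns": t} for t in (20, 30, 40, 50, 60, 300)]
sample += [{"winner": "black", "turns": t} for t in (25, 35, 45, 55)]
white_stats = turn_stats(sample, "white")
black_stats = turn_stats(sample, "black")
print(white_stats["median"], white_stats["outliers"], black_stats["median"])
```

In this toy data the single 300-turn game inflates white's maximum while leaving the median close to black's, which mirrors the situation described above.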

Winner Prediction using Logistic Regression

Lastly, can we predict the winner using variables other than player ratings? And how does this model perform against a model based on player ratings alone?

To answer this, I deployed a logistic regression model, using stochastic average gradient as the solver. Note that I did not consider drawn games for the analysis. My response vector was binary: 0 if black won, 1 if white won. I trained the model using the following columns (winner supplies the response; the rest are input features):

  • rated
  • turns
  • winner
  • increment_code
  • white_rating
  • black_rating
  • moves
  • opening_eco
  • opening_name
  • opening_ply

I did not use the remaining features ‘id’, ‘created_at’, ‘last_move_at’, ‘victory_status’, ‘white_id’, and ‘black_id’, as I felt these features do not contribute to the game outcome. When using the features noted above as my input matrix, the model achieved the following performance:

Logistic Regression Model using Multiple Features
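For readers who want to see the mechanics, here is a from-scratch sketch of binary logistic regression. It is not the code behind the results in this article: it uses plain stochastic gradient descent instead of the stochastic average gradient solver (in practice, scikit-learn's `LogisticRegression(solver="sag")` would be the usual tool), only two features, and synthetic games in which the higher rated player always wins.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)

def fit_logistic(X, y, lr=0.1, epochs=50, seed=0):
    """Fit weights and bias with plain SGD on the log-loss (not SAG)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log-loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

def predict(w, b, row):
    """Return 1 (white wins) or 0 (black wins)."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) >= 0.5 else 0

def accuracy(w, b, X, y):
    return sum(predict(w, b, r) == t for r, t in zip(X, y)) / len(y)

# Synthetic decisive games (no draws). Features: scaled rating gap and rated flag.
# The label comes from the winner and is never used as an input feature.
rng = random.Random(42)
rows = []
for _ in range(300):
    white, black = rng.randint(1000, 2000), rng.randint(1000, 2000)
    rated = rng.choice([0, 1])
    rows.append(([(white - black) / 400, rated], 1 if white > black else 0))

train, test = rows[:200], rows[200:]
w, b = fit_logistic([r[0] for r in train], [r[1] for r in train])
test_acc = accuracy(w, b, [r[0] for r in test], [r[1] for r in test])
print(f"held-out accuracy on synthetic data: {test_acc:.2f}")
```

Categorical columns such as the opening name or increment code would need to be encoded as numbers (for example one-hot) before they could be added to a feature row like this.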

Our model was correct ~72.1% of the time on our test data. I wanted to know how performance would differ if I had trained the model using player ratings only. Below is the performance:

Logistic Regression Model using Player Rating Only

The model trained on player rating only was correct ~65.2% of the time on the test data. Including the other input features raised accuracy by ~6.9 percentage points, a relative improvement of ~10%. This result agrees with initial expectations: chess game outcomes depend on multiple factors, not just ratings but also the opening used, time controls, etc., which the first model took into account.
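The improvement between the two accuracies can be stated either in percentage points or as a relative change, and the two numbers differ. A quick check of the arithmetic, using the accuracies quoted above (the function name is hypothetical):

```python
def accuracy_gain(baseline, improved):
    """Return (gain in percentage points, gain relative to the baseline)."""
    points = improved - baseline
    return points, points / baseline

points, relative = accuracy_gain(65.2, 72.1)
print(f"{points:.1f} percentage points, {relative:.1%} relative")
```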

Key Takeaways

  • Time increment code had a greater effect on the win difference than the opening used. Consequently, spend more time studying middlegame and endgame positions than openings.
  • If you’re playing as white, look into the Scandinavian Defense: Mieses-Kotroc Variation opening, with 10+0 increments.
  • If you’re playing as black, look into the Van’t Kruijs Opening, with 7+9 increments.
  • The number of turns per game does not show any effect on the overall outcome of the game.
  • Although ratings assist in the prediction of the winner, the inclusion of other factors improved our logistic regression model’s accuracy by ~7 percentage points compared to prediction by rating alone.

Follow-up questions:

  1. So far, we only analyzed games with decisive outcomes (i.e., either white or black won). What are the effects on the analysis if draws are taken into account?
  2. How much does rating difference account for wins? Does it skew the biases of openings played? If we limited the rating difference to below 100, would certain openings be more popular?
  3. Is there a bias in the opening played vs increment code?
  4. How different are the openings if the dataset were grandmaster rated games only? Would the distribution be more concentrated?
  5. Would model performance increase or decrease if I had included all features?
  6. Which input features contributed most to the correct prediction of the winner?

Engineer by day, aspiring data scientist by night