an analysis of Pro League of Legends matches from 2022. Homework for EECS 398 at the University of Michigan.
an analysis of Pro League of Legends matches from 2022. Homework for EECS 398 at the University of Michigan.
This is an analysis of all the professional League of Legends games played in 2022. There are 150180 rows in the dataset, and each individual game is split up into twelve rows: ten rows to represent each player, and two rows to represent each team. Each row contains either personal or team data, ranging from kills, gold earned, objectives, champions played, and much more. Most importantly for my personal question, there is also some time-based data, including data like gold at 10 minutes and kills at 10 minutes.
This analysis will be focusing on the “snowballing” effect inherent to League of Legends, where the team that initially gains the lead gains an advantage that further extends the leads. This is why the dataset is interesting; having a twenty point lead in the fourth quarter of a basketball game does not necessarily mean the team with the lead will score more points in the fourth quarter, but in a League of Legends game, a team with a significant gold advantage has a statistical advantage in their characters having a higher level and better items. I want to investigate the effect of snowballing by using the gold difference at 25 minutes to predict the gold difference at the end of the game. The columns of the dataset I will be using are as follows:
I started by removing all the rows about individual players, as I did not feel that this data was important relative to my goal of predicting teamwide gold data. I also removed team two (red side) in this step, as gold difference is a zero sum statistic and I only need team one’s data. I then removed all games that were not labelled as complete, as they did not contain enough data for me to run my model. Notably, they were missing the “golddiffat25” column, which is the main component of my model. I then also removed games that were less than 25 minutes long, for the same reason. Below is a sample of my dataset:
teamname | golddiffat10 | golddiffat25 | |
---|---|---|---|
10 | BRION Challengers | 1523 | 88 |
22 | T1 Esports Academy | -1619 | -7280 |
46 | KT Rolster Challengers | -103 | 4145 |
70 | Dplus KIA Challengers | 337 | -1160 |
94 | Kwangdong Freecs Challengers | -901 | 2516 |
Below is a histogram showing the distribution of gold differences at 15 minutes. The distribution follows a normal distribution, and my question is trying to essentially trying to predict what this histogram looks like at the end of game rather than at fifteen minutes.
Below is the same histogram as above, just showing the gold difference at 25 minutes instead of 15 minutes. Notice that the distribution has spread out further than compared to at 15 minutes, but still follows a normal distribution.
Below is a scatterplot comparing the gold difference at 15 minutes vs. 25 minutes. There is a clear positive linear association, i.e. having a gold lead at 15 minutes implies that team will have a gold lead at 25 minutes, and vice versa, but there are many outliers.
Below is a pivot table I constructed to show the correlation between having a positive and negative gold difference at ten minutes to winning. There is a strong correlation between winning and having a greater positive gold difference, so its important to see how often a positive gold difference actually converts into a win.
Negative Gold Difference | Positive Gold Difference | |
---|---|---|
Losses | 3031 | 1552 |
Wins | 1574 | 3355 |
I did not want to impute any values, such as golddiffat25, because I did not have the data necessary to do so, and using mean imputation or other kinds of imputation would be essentially making up data that would take away from the real data. Therefore, I did not impute any values.
As previously stated, I am trying to predict what the final gold difference is at the end of the game. This is a classic regression problem, and the response variable is the blue side final gold difference. I am using R-squared to measure the performance of my model, as I am using multiple features in my model. I will set my “time of prediction” to 25-30 minutes, so the information I have available will be the “snapshots” at 10, 15, 20, and 25 minutes, along with team information, but I will not have the result of the game or how long the game will last.
The base model I used for the prediction is a LinearRegression model using “golddiffat25” and “killsat25,” which are both quantitative variables. This base model had an R-squared of 0.708. This model is decent enough, but it lacks critical information about game states, assuming that gold and kills are the only things that matter.
For my final model, I added three variables:
I added these variables because I believe these variables help predict the game state better, as these are extra variables aside from a gold lead that are associated with winning. It should be noted that there is no timestamp associated with the “firstbaron” column, so it is possible that the first baron is taken after 25 minutes, but typically the first baron is taken close enough to 25 minutes that I felt comfortable including it. The region variable was added because there is an generally held association that certain regions convert leads better, and I wanted to factor this in.
Instead of using a LinearRegression model, I tested both Ridge and Lasso models so I could try extra hyperparameters that LinearRegression does not allow for. I used a GridSearchCV to iterate through three hyperparameters: alpha, tolerance, and max iterations. What I found was that an Alpha of 1, Max Iterations of 1000, and a tolerance of 0.001. This resulted in an R-squared of 0.775, compared to the baseline model performance of 0.708, a marked improvement.