How to Quantify Breakout Hitters

A machine learning approach to predicting a breakout hitter.

If you play fantasy baseball competitively, you should be using projections. Whether you like The BAT X, ATC, ZiPS, or you have your own system, if you are not factoring in projections to some extent in your draft strategy, you are probably putting yourself at a disadvantage against your league-mates. With that said, we have to understand the limitations of projections. Essentially, projections give us the average outcome of a given player, but we often need to consider the entire distribution.

If I told you Player A had a better projection than Player B, but Player B had a higher chance to be a top-25 player in the league, who would you take in the 15th round?  We have to remember, our ultimate goal is to win our fantasy leagues, not to maximize our chance to finish above average. Ariel Cohen provides both inter-projection and intra-projection standard deviations in his ATC projections, which helps us gain a better idea of this concept.
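To make the Player A versus Player B question concrete, here is a minimal sketch that assumes each player's season value is normally distributed around his projection. The means, standard deviations, and the top-25 cutoff are made-up numbers for illustration; none of them come from ATC or any other projection system.

```python
from statistics import NormalDist

# Hypothetical season values in z-score units; none of these are real projections.
TOP_25_CUTOFF = 6.0  # assumed value needed to finish as a top-25 player

player_a = NormalDist(mu=2.0, sigma=1.5)  # better mean, narrow range of outcomes
player_b = NormalDist(mu=1.5, sigma=3.0)  # worse mean, wide range of outcomes


def prob_top_25(dist: NormalDist, cutoff: float = TOP_25_CUTOFF) -> float:
    """Probability that a player's season value lands above the cutoff."""
    return 1.0 - dist.cdf(cutoff)


p_a = prob_top_25(player_a)
p_b = prob_top_25(player_b)
print(f"Player A: mean {player_a.mean:.1f}, P(top 25) = {p_a:.3f}")
print(f"Player B: mean {player_b.mean:.1f}, P(top 25) = {p_b:.3f}")
```

Under these assumed numbers, Player A has the higher average outcome while Player B has the better shot at clearing the cutoff, which is the trade-off a late-round pick has to weigh.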

My process was a little bit different. I looked at the players who have heavily overperformed their ADP in recent history and looked for potential trends we can use to our advantage when looking for breakout candidates in future seasons. I could have picked a statistical benchmark a player had to fall below in the previous season, but that would include breakout players the rest of the field may have already been in on, heading into the following season. Drafting players who eventually overperform their ADP is how you win your league.




Before I get into the specifics of my findings, let me first outline the steps I took in my process:

  1. I downloaded offensive statistics from various online sources, as well as NFBC ADP data for each season since 2015.
  2. I selected only those players who had an ADP of 120 or higher (that is, drafted at pick 120 or later on average) and defined “a breakout player” as one who finished in the top 25 the same season using a z-score method for a classic 12-team 5×5 rotisserie league.
  3. I used the players’ previous season statistics as predictor variables, thus eliminating players who did not qualify based on too little playing time.
  4. I tested numerous types of models, both supervised and unsupervised, to try and predict the probability of a player breaking out. I used repeated k-fold cross-validation to divide the data into test and training sets.
  5. After comparing multiple performance metrics, I eventually decided on a Bagged Tree Model as the best fit for these data.
  6. Finally, I used this model to find the most likely breakout candidates for 2022.
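Steps 4 and 5 can be sketched roughly as follows. This is not the model from this article: the data are synthetic, the library (scikit-learn), the stratified variant of repeated k-fold, the ROC AUC metric, and every parameter value are assumptions chosen only to show the general shape of the workflow.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for previous-season stats (rows = hitters, columns = predictors).
n_players, n_features = 400, 6
X = rng.normal(size=(n_players, n_features))

# Synthetic label: breakouts are rare and loosely tied to the first two columns.
signal = 1.2 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n_players)
y = (signal > np.quantile(signal, 0.90)).astype(int)  # roughly 10% positives

# Bagged trees: many decision trees, each fit on a bootstrap resample of the data.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Repeated k-fold: 5 folds, repeated 3 times with different shuffles.
# Stratification (an assumption here) keeps the rare positives spread across folds.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Mean ROC AUC over {len(auc_scores)} folds: {auc_scores.mean():.3f}")

# Fit on everything, then read off a breakout probability for each hitter.
# These are in-sample probabilities, so they are optimistic by construction.
model.fit(X, y)
breakout_prob = model.predict_proba(X)[:, 1]
```

In practice the final probabilities would come from scoring players the model has not seen, such as the upcoming season's draft pool, rather than the training rows used here.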




You may have noticed a few limitations to my methodology, so before I go any further, let me address them all.

First off, my categorization of a breakout player was rather arbitrary. However, it is pretty much impossible to use this process without making some sort of subjective decision. I wanted a sample large enough to reduce noise while still maintaining a significant gap between where the players were drafted and where they finished, so these cutoffs made sense.
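For readers unfamiliar with the z-score method, the breakout label can be sketched as below. The stat lines are invented, the top-N cutoff is shrunk from 25 to 2 to fit a six-player toy pool, and batting average is z-scored directly, which is a simplification (a fuller version would typically weight it by at-bats). Only the ADP floor of 120 comes from the article.

```python
from statistics import mean, pstdev

CATEGORIES = ["R", "HR", "RBI", "SB", "AVG"]
ADP_FLOOR = 120  # from the article: only players drafted at pick 120 or later
TOP_N = 2        # the article uses 25; shrunk here to fit a toy pool

# Hypothetical stat lines, invented for illustration.
players = [
    {"name": "Hitter A", "adp": 15,  "R": 105, "HR": 38, "RBI": 110, "SB": 8,  "AVG": 0.295},
    {"name": "Hitter B", "adp": 150, "R": 98,  "HR": 33, "RBI": 99,  "SB": 12, "AVG": 0.284},
    {"name": "Hitter C", "adp": 60,  "R": 80,  "HR": 22, "RBI": 75,  "SB": 20, "AVG": 0.270},
    {"name": "Hitter D", "adp": 210, "R": 62,  "HR": 15, "RBI": 58,  "SB": 3,  "AVG": 0.248},
    {"name": "Hitter E", "adp": 130, "R": 70,  "HR": 18, "RBI": 66,  "SB": 5,  "AVG": 0.255},
    {"name": "Hitter F", "adp": 90,  "R": 74,  "HR": 25, "RBI": 80,  "SB": 2,  "AVG": 0.262},
]


def total_z_scores(pool):
    """Sum each player's z-score across the five roto categories."""
    totals = {p["name"]: 0.0 for p in pool}
    for cat in CATEGORIES:
        values = [p[cat] for p in pool]
        mu, sigma = mean(values), pstdev(values)
        for p in pool:
            totals[p["name"]] += (p[cat] - mu) / sigma
    return totals


totals = total_z_scores(players)
ranked = sorted(totals, key=totals.get, reverse=True)
top_finishers = set(ranked[:TOP_N])

# A breakout is a late pick (ADP at or past the floor) who finished in the top N.
breakouts = [p["name"] for p in players
             if p["adp"] >= ADP_FLOOR and p["name"] in top_finishers]
print(breakouts)
```

Moving either cutoff changes how many positive examples the model gets to learn from, which is the sample-size trade-off described above.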

Next, you may notice that this model is only based on data from the previous year. This excludes rookie breakouts (this model won’t show it, but I am high on Oneil Cruz), players who are coming off an injury, and bounce-back candidates. However, with the sample I was working with, I was afraid the addition of more data would lead to issues with overfitting. As I adjust the model in the future, I may consider these factors.

My point here is that I will not be using my findings as an end-all-be-all, but rather a guide to help me once my draft reaches triple-digit pick numbers, just like projections.


Variable Importance


I won’t share my full model yet, but I will display the most important metrics to look at from the previous season when predicting a breakout. “Variable importance” is exactly what it sounds like: a measure of how much the model relies on a given variable. Here are the top 10 most important variables in my model.

Breakout Prediction Variable Importance
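As a rough illustration of how a variable importance ranking can be produced for bagged trees, the sketch below uses permutation importance from scikit-learn. That technique is an assumption swapped in for illustration, not necessarily the measure behind the chart above, and the data and column names are synthetic placeholders rather than real baseball stats.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
feature_names = ["signal_1", "signal_2", "noise_1", "noise_2", "noise_3"]

# Synthetic data: only the first two columns actually drive the label.
X = rng.normal(size=(500, len(feature_names)))
noise = rng.normal(scale=0.5, size=500)
y = (1.5 * X[:, 0] + 1.0 * X[:, 1] + noise > 1.0).astype(int)

# Hold out the last 150 rows so importance is measured on unseen data.
X_train, X_test = X[:350], X[350:]
y_train, y_test = y[:350], y[350:]

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: shuffle one column at a time and record the drop in score.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=0
)

importance = dict(zip(feature_names, result.importances_mean))
for name in sorted(importance, key=importance.get, reverse=True):
    print(f"{name:10s} {importance[name]:.3f}")
```

A column the model leans on heavily shows a large score drop when shuffled, while a column it ignores shows a drop near zero.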

The first takeaway I have is, unsurprisingly, that many of the variables shown here are strongly correlated. One of my favorite benefits of tree-based models is that, compared to logistic regression for example, their predictions are much less sensitive to multicollinearity, where a model effectively “double counts” several highly correlated variables. Essentially, the xStats and quality of contact metrics likely don’t offer much of an effect in isolation from each other, but the combination of all of them is very important.

The second-most important variable to look at when predicting a breakout is sd(la), or “launch angle tightness”, whose importance was discovered by Alex Chamberlain a few years ago. As Alex found, players with a tighter distribution of launch angles, which leads to more line drives, have a stronger chance of future success.
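For anyone who wants to compute sd(la) themselves, it is simply the standard deviation of a hitter's launch angles across his batted balls. The sketch below uses invented launch angles for two hypothetical hitters, and the 10 to 25 degree line-drive window is an assumed cutoff for illustration.

```python
from statistics import stdev

# Hypothetical launch angles in degrees for two hitters' batted balls.
tight_hitter = [8, 12, 15, 10, 18, 14, 9, 16, 11, 13]
loose_hitter = [-25, 40, 5, 55, -10, 30, 12, -35, 48, 20]


def launch_angle_tightness(angles):
    """Sample standard deviation of launch angle; lower means a tighter spread."""
    return stdev(angles)


def line_drive_share(angles, low=10, high=25):
    """Fraction of batted balls inside an assumed line-drive window (degrees)."""
    return sum(low <= a <= high for a in angles) / len(angles)


for label, angles in [("tight", tight_hitter), ("loose", loose_hitter)]:
    print(f"{label}: sd(la) = {launch_angle_tightness(angles):.1f}, "
          f"line-drive share = {line_drive_share(angles):.0%}")
```

In this toy data, the hitter with the smaller standard deviation also puts more of his batted balls in the line-drive window, which is the relationship described above.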

As always, playing time is important in fantasy baseball. Sometimes, it’s not a bad idea to use a pick on a player who is guaranteed to play nearly every day, even without a high ceiling from an efficiency standpoint. Remember, of the classic five offensive categories, four of them are counting stats. While there are obviously better ways to predict playing time, previous season playing time correlates fairly well with following season playing time. This is why previous season PA appears important to this model.

Finally, there were a few metrics I was surprised not to see on this list. Most of them have to do with plate discipline, such as walk rate, strikeout rate, and whiff rate. These stats are very sticky year-over-year, so I expected them to have a significant role in this model. Perhaps plate discipline metrics are more important in estimating the floor of a hitter than the ceiling, and the ceiling is what this exercise is all about. It is also worth noting that classic roto leagues use batting average as a category, so I would expect these metrics to be more important if we were working with OBP.


2022 Breakouts


This is probably what many of you are reading for. Who are the players my model likes as breakout candidates for this season?


Joey Votto, ADP: 147.47, Pr(Breakout): 15.6%

I feel like a broken record when I say this, but Joey Votto is undervalued once again in fantasy baseball this year. He’s not your typical breakout. You’re probably mostly looking for rookies and other young players on the come-up, but age, while considered in this model, proved to be not very important. Votto is 38, but he showed in 2021 that he still has it. He can still hit for power, plays in a hitter-friendly ballpark, and should still get on base at an elite clip, which will help him score runs. He is also projected to bat third, and since the Reds don’t have a top prospect waiting to take over his job, you should be able to bank on a lot of plate appearances.


Willy Adames, ADP: 131.54, Pr(Breakout): 10.7%

You can probably say Adames already broke out, depending on what your definition is. But according to mine, Adames is a top candidate to break out this coming season. Adames’ quality of contact metrics saw a massive boost in 2021: his .325 xwOBA and 11.4% barrel rate were both career highs. I like Adames at his current price as he enters his first full season as the starting shortstop for Milwaukee. With shortstop being so strong this year, waiting until the middle of your draft for a guy like him wouldn’t be a bad strategy.


Yuli Gurriel, ADP: 200.09, Pr(Breakout): 9.6%

Another veteran 1B lands on this list. That is probably my biggest takeaway from this research: don’t reach on first basemen at the top of your draft, and don’t be afraid to rely on an older player. Like Votto, Gurriel proved last season that he can still give you production, and he has a favorable playing time projection. With the Astros offense in great shape, even without Carlos Correa, I see Gurriel as a sleeper for 2022.


Miguel Sanó, ADP: 295.67, Pr(Breakout): 8.2%

I didn’t want to turn this into a “sleeper first basemen” article, but these really were the bulk of my findings. Sanó is not quite as old as Votto or Gurriel if that makes him more appealing. He has always had a strong batted ball profile, hits for power, and is going to give you plenty of playing time. I’m really not sure why Sanó is being drafted so late, as he is probably my most rostered player currently.


Clint Frazier, ADP: 427.17, Pr(Breakout): 6.3%

I felt obligated to include a deep sleeper in this analysis. There are few players in recent history that needed a change of scenery more than Frazier after his time with the Yankees. With the National League likely set to finally adopt the designated hitter, Roster Resource projects Frazier to hold that role for the Cubs this season. He still has some potential to contribute in virtually every category. If you are looking for a late-round dart throw, Frazier could be your guy.


One important disclaimer to note is that the probabilities of breakouts are all fairly small. The overall fantasy baseball community is pretty sharp, especially those playing in NFBC leagues. It’s not common to see a player picked in the mid-late rounds immediately become a superstar. However, if you have a sound process and continue to take shots on the right type of players, you maximize your chance to hit on a breakout.
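The arithmetic behind "taking shots" can be sketched with the five Pr(Breakout) values listed above. The big assumption, labeled in the code as well, is that the outcomes are independent, which real players on the same roster may not be.

```python
from math import prod

# Pr(Breakout) values quoted in this article for the five hitters.
breakout_probs = {
    "Joey Votto": 0.156,
    "Willy Adames": 0.107,
    "Yuli Gurriel": 0.096,
    "Miguel Sanó": 0.082,
    "Clint Frazier": 0.063,
}


def prob_at_least_one(probs):
    """Chance that at least one pick breaks out, ASSUMING independent outcomes."""
    return 1.0 - prod(1.0 - p for p in probs)


def expected_breakouts(probs):
    """Expected number of breakouts; this holds with or without independence."""
    return sum(probs)


probs = list(breakout_probs.values())
print(f"P(at least one breakout): {prob_at_least_one(probs):.3f}")
print(f"Expected breakouts:       {expected_breakouts(probs):.3f}")
```

Each individual probability is small, but the chance of landing at least one breakout grows with every additional well-chosen pick.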


Photo by Joe Robbins/Icon Sportswire | Adapted by Doug Carlin (@Bdougals on Twitter)

Jeremy Siegel

Jeremy is currently a senior studying Computer Science and Statistics at the University of Pittsburgh. He is a writer and data science staffer here at Pitcher List. His goal is to one day work in the analytics department of the front office of an MLB team.
