Advice on making the most of basketball three-point shot data
Authors: George Terhanian1
Corresponding Author:
George Terhanian, PhD
200 Hoover Avenue, #2101
Las Vegas NV, 89101
george.terhanian@gmail.com
646-430-3420
1George Terhanian founded Electric Insights after holding executive positions at The NPD Group, Toluna, and Harris Interactive. He has also served on boards or advisory groups for several organizations, including the US National Academy of Sciences, the Advertising Research Foundation, and the British Polling Society. He is known for conceiving how to make survey data, including pre-election forecasts, more accurate through statistical matching methods.
Making the most of basketball three-point shot data
ABSTRACT
This study’s primary goal is to help National Basketball Association (NBA) and other basketball teams worldwide increase their three-point shooting accuracy and decrease their opponents’, a key to winning more games. A related goal is to explain how a combination of good data, logistic regression analysis, likely effects reporting in probabilities or percentage points, and self-serve simulation can improve communication among data analysts, basketball coaches, and players, and enhance each group’s effectiveness. Logistic regression analysis of 32,511 NBA three-point shots shows six factors affect the three-point shooting percentage: closest defender’s distance to the shooter, time left on the 24-second shot clock, whether the player shot after dribbling or catching the ball, game period, shot distance, and venue. In the past, data analysts conveyed the results of such analyses to coaches and players using terms such as regression, logits, and odds. Some NBA executives say doing so again would be disastrous. An alternative is to emphasize probabilities and percentages in communication and create self-serve simulators coaches and players can use to predict how changes in critical factors affect three-point shooting percentages. NBA and other teams worldwide can apply this approach to new and existing datasets they maintain, enhance, and build.
Key Words: self-serve simulation, predicted probabilities, logistic regression, likely effects reporting, psychotherapy
INTRODUCTION
The National Basketball Association (NBA) releases specific three-point shot characteristics, such as shooter name and shot distance. Aside from the 2014-15 season’s first 903 of 1,230 games (and 2015-16’s first 631, though the latter data are no longer publicly available), the released data exclude a variety of individual shot characteristics such as the closest defender’s distance to the shooter, a crucial defensive effectiveness measure (14). Teams are said to consider the excluded characteristics proprietary. As Mike Zarren, assistant general manager and chief legal counsel for the NBA’s Boston Celtics, explained, “You can’t share stuff with other teams…We are not at an equilibrium point where all the teams know what everyone else is doing. There are some advantages that some teams have over others” (15) (51:47).
The analyses here use the 2014-15 shot dataset, the last and largest single-season one containing full shot data that is publicly available. The main goal is to help NBA and other basketball teams worldwide increase their three-point shooting accuracy and decrease their opponents’. Teams that do so should win more games. A related goal is to explain how a combination of good data, logistic regression analysis, likely effects reporting in probabilities or percentage points (e.g., “Shooting off the catch rather than the dribble is associated with a two-percentage-point increase in our three-point shot make percentage.”), and self-serve simulation can improve communication among data analysts, basketball coaches, and players, and enhance each group’s effectiveness. NBA and other teams worldwide can apply this approach to new and existing datasets they maintain, enhance, and build. Aspects of the approach are also portable to many other issues and areas where the key outcome variable is binary (26).
This paper has seven additional sections (excluding references and other ancillary information). The first summarizes basic rules and strategies for NBA basketball, highlighting the importance of the three-point shot. It also explains why data analysts seeking to communicate effectively with coaches and players should consider using non-technical language. The second section describes the three-point shot data used in this paper’s analyses. It then provides the rationale for relying on logistic regression analysis for model building and prediction. The third section reports the results of the analyses and suggests how data analysts might share them with coaches and players. It also explores why academic researchers tend not to report likely effects in probabilities or percentage points. The fourth details how data analysts can build self-serve simulators that report likely effects in probabilities or percentage points. The limitations of this paper’s analyses are discussed in the fifth section. The next-to-last section describes how teams might apply the approach described here, while the final section provides concluding remarks.
NBA Basketball: Basic Rules and Strategies
NBA games have two teams with five players competing for four 12-minute periods (excluding possible five-minute overtime periods). To score, a team needs to shoot the ball through the basket. With the clock running, a successful shot is worth three or two points, depending on the shooter’s distance from the basket. The clock stops for free throws, which are uncontested 15-foot shots worth a single point awarded for specific infringements. One can calculate each shot’s expected value (EV) by multiplying its potential value by its average make percentage. For the 2022-23 regular season, the expected value of a three- and two-point shot was almost identical: 1.08 points (3*.36) for a three-pointer and 1.10 (2*.55) for a two-pointer. Each free throw’s expected value was .78 points (1*.78) or 1.56 for a more typical pair (3). A recent example shows why the expected value measure can be strategically important.
In the second round of the 2020-21 playoffs, the Atlanta Hawks shocked the heavily favored Philadelphia 76ers, coming from behind to win the seven-game series four to three. The Hawks’ decision to foul Ben Simmons repeatedly to force him to shoot free throws contributed to the victory. As Hall-of-Fame player Earvin “Magic” Johnson observed, “…it fueled the Hawks’ comeback” (13).
Simmons shot just 33% (15 for 45) from the free-throw line for the series, far below his 61% (and the league’s 77%) regular season average. Simmons’s 33% figure suggests the Hawks expected him to score only .66 points for two free throws in a series in which his team made 40% of its three-pointers (for an expected value of 1.22 points) and 52% of its two-pointers (for a 1.05 expected value). That means the Hawks expected to gain .56 points (1.22 – .66) for a replaced three-point shot and .39 points (1.05 – .66) for a replaced two-pointer with the foul Simmons strategy. Perhaps more notably, it may have affected Simmons’s decision-making. To his team’s detriment, Simmons chose not to attempt an open lay-up or dunk with 3:30 remaining in game seven (4), arguably for fear of getting fouled and having to shoot free throws (21, 27).
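The expected value arithmetic introduced earlier in this section can be sketched in a few lines of Python. This is only an illustration: the function name `expected_value` is hypothetical, and the make rates are the rounded 2022-23 league figures quoted above.

```python
def expected_value(points: int, make_rate: float) -> float:
    """Expected points per attempt: the shot's value times its average make rate."""
    return points * make_rate


# Rounded 2022-23 regular-season make rates quoted in the text
three_ev = expected_value(3, 0.36)
two_ev = expected_value(2, 0.55)
free_throw_ev = expected_value(1, 0.78)
free_throw_pair_ev = 2 * free_throw_ev  # a typical trip to the line

for label, ev in [
    ("three-pointer", three_ev),
    ("two-pointer", two_ev),
    ("free throw", free_throw_ev),
    ("pair of free throws", free_throw_pair_ev),
]:
    print(f"{label}: {ev:.2f} points")
```

The same function applies to any player or series once the relevant make rates are known.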
Overstating three-point shooting’s significance is difficult. In 2022-23, the Toronto Raptors, Charlotte Hornets, and Houston Rockets won 41, 27, and 21 (of 82) regular season games, too few to qualify for the post-season playoffs; their three-point shooting percentages of 34%, 33%, and 33% were the league’s worst. The Philadelphia 76ers, Golden State Warriors, and Los Angeles Clippers won 54, 44, and 44 games, enough to compete in the playoffs; they were top performers in three-point shooting at 39%, 39%, and 38%. These data and separate multi-season analyses (18, 20) suggest that winning in the NBA hinges heavily on making (and defending) three-point shots.
Clear Communication
An excellent statistical model is “a simplified version of reality, like a street map that shows you how to travel from one part of a city to another” (28) (p. ix). But that map will not help you find your way if it includes esoteric terms or unfamiliar signs or symbols. Likewise, if data analysts use uncommon language when giving advice, coaches and players may feel lost. Mike Zarren would agree. If Celtics’ data analysts were to apply logistic regression to three-point shot data, he would tell them to communicate what they learn “without using the word regression because that’s a disaster” (15) (11:18). Terms like logits, standard deviations, odds, odds ratios, and z scores also would be off-limits. Zarren does not believe coaches and players are unintelligent. Even good data analysts can find aspects of logistic regression challenging. That is why DeMaris (7) (p. 1,057) observed, “…there is still considerable confusion about the interpretation of logistic regression results.” And why Gelman and Hill (11) (p. 83) commented, “…the concept of odds can be difficult to understand, and odds ratios are even more obscure.”
Washington Wizards’ assistant coach Dean Oliver’s views on clear communication resemble Zarren’s. “When I directed quantitative analysis for the Denver Nuggets and would prepare stuff for coaches,” he said, “there were actually very few numbers in there. It was usually words because it was easier for them to absorb…” (15) (48:54).
An alternative to avoiding numbers is to report key predictor variables’ likely effects with familiar ones, such as probabilities and percentages. Coaches and players already know these numbers because the NBA reports various descriptive statistics and cross-tabulations on its website, emphasizing percentages.
Methods
Data
The NBA has used technology to gather detailed player performance data since the 2013-14 season via SportVU, then Second Spectrum. The analyses here use SportVU data, described as “real-time and innovative statistics based on speed, distance, player separation, and ball possession for comprehensive analysis of players and teams” (25). How did the SportVU system work? In each arena’s rafters, six cameras recorded information throughout each game in .04-second intervals, producing 25 images per second. A computer algorithm then plotted the locations of the ball, basket, and 10 players. SportVU delivered data and reports to each team and the league as a last step.
As noted earlier, the NBA made available SportVU raw, shot-level data—including the defender distance variable—for three-quarters of the 2014-15 regular season. (The NBA also made available raw, shot-level data early in the 2015-16 season before discontinuing the practice entirely in January 2016. The latter dataset is no longer publicly available.) The 2014-15 dataset (17)—the last and largest single-season one publicly available—contains 21 variables and 128,069 three- and two-point shots, as described in the Appendix. After making minor changes (e.g., removing two-point shots), the remaining three-point shots totaled 32,511—11,426 makes and 21,085 misses—taken from October 28, 2014, through March 4, 2015.
Analysis Method
Logistic regression models the relationship between a binary outcome (e.g., made or missed three-point shots, or nearly anything with a yes or no interpretation) and, typically, several predictor or explanatory variables. It is ideal for identifying and estimating the effects of actions to increase or decrease the size or proportion of the group of interest, specifically, made three-point shots. It can also predict each three-point shot’s probability of belonging to the “made” rather than the “missed” group. Many academic researchers consider it “the standard way to model binary outcomes” (11) (p. 79), “dominating all other methods in both the social and biomedical sciences” (2) (para. 1).
RESULTS
The final logistic regression model comprises one dependent and six predictor variables. The predictor variables were selected based on their relationship with the dependent variable, one another, theory, availability, and their effect on the model’s predictive accuracy. Below are descriptions of the seven variables and brief explanations for how they may differ from the original ones described in the Appendix.
- ShotResult: The dependent variable: whether the shooter made the shot. (Values: 0=Missed, 1=Made; Original variable: Fgm)
- DefDist: The closest defender’s distance to the shooter in feet (ft.). Basketball players and coaches recommended a four-category variable after discussions and preliminary analyses. (Values: 1=0-3 ft., 2=3-6 ft., 3=6-9 ft., 4=9+ ft.; Original variable: Close_Def_Dist)
- ShotClock: The number of seconds (secs.) on the 24-second shot clock. Analyses showed steep drops in the make probability at the 4- and 2-second marks, thus the decision to create a variable with three categories. (Values: 1=0-2 secs., 2=2-4 secs., 3=4+ secs.; Original variable: Shot_Clock)
- Catch: Whether the shooter took the shot off the catch or dribble. The original variable reported the number of dribbles the shooter took before shooting. Basketball players and coaches recommended a two-category variable after discussions and preliminary analyses. (Values: 1=Off Catch, 2=Off Dribble; Original variable: Dribbles)
- Period: The game period when the shot was taken, with fourth period and overtime shots pooled because of their similar make percentages. (Values: 1=1, 2=2, 3=3, 4=4+; Original variable: Period)
- ShotDist: The distance in feet from the center of the basket to the shooter. Basketball players and coaches recommended a four-category variable after discussions and preliminary analyses. (Values: 1=22-24 ft., 2=24-25 ft., 3=25-26 ft., 4=26+ ft.; Original variable: Shot_Dist)
- Venue: Whether it was a home or away game for the shooter’s team. (Values: 0=Away, 1=Home; Original variable: Location)
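The recoding described in the list above could be sketched as follows. Several details are assumptions rather than the paper's documented procedure: the function names are hypothetical, a value that falls exactly on a category boundary (e.g., 3.0 ft.) is assigned to the higher category, and the original Location variable is assumed to be coded "H" for home and "A" for away.

```python
from bisect import bisect_right


def bin_index(value, cut_points):
    """Return a 1-based category number.

    Assumption: a value equal to a cut point goes to the higher category.
    """
    return bisect_right(cut_points, value) + 1


def recode_shot(close_def_dist, shot_clock, dribbles, period, shot_dist, location):
    """Map one shot's original values to the six categorical predictors."""
    return {
        "DefDist": bin_index(close_def_dist, [3, 6, 9]),    # 0-3, 3-6, 6-9, 9+ ft.
        "ShotClock": bin_index(shot_clock, [2, 4]),         # 0-2, 2-4, 4+ secs.
        "Catch": 1 if dribbles == 0 else 2,                 # 1=Off Catch, 2=Off Dribble
        "Period": min(period, 4),                           # overtime pooled with period 4
        "ShotDist": bin_index(shot_dist, [24, 25, 26]),     # 22-24, 24-25, 25-26, 26+ ft.
        "Venue": 1 if location == "H" else 0,               # assumed "H"/"A" coding
    }


# Example: a catch-and-shoot attempt from 22 ft. in the third period of an
# away game, 4.6 seconds on the shot clock, defender 3.9 ft. away
example = recode_shot(3.9, 4.6, 0, 3, 22, "A")
print(example)
```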
Table 1 reports the logistic regression analysis results, notably, standard information such as logit coefficients, odds, z scores, and a measure of statistical significance (i.e., p>z). It also reports useful non-standard information such as frequencies, (predicted) probabilities, and expected values. The rationale for reporting standard and non-standard information, to borrow from the statistician Frederick Mosteller, is to “let weaknesses from one…be buttressed by strength from another” (16) (Ch. 4, p. 116), a concept he referred to as “balancing biases.” As envisioned, data analysts can rely on standard information when building and evaluating logistic regression models, and non-standard when communicating the results and their implications to coaches and players.
Table 1.
Results of final logistic regression analysis
Variable | Frequency | Logit | Odds | z | p>z | Prob | EV |
DefDist | |||||||
0-3 ft. | 6% | – | 1 | – | – | .29 | 0.86 |
3-6 ft. | 54% | 0.25 | 1.29 | 4.74 | 0.00 | .34 | 1.02 |
6-9 ft. | 28% | 0.38 | 1.46 | 6.78 | 0.00 | .37 | 1.11 |
9+ ft. | 12% | 0.47 | 1.60 | 7.74 | 0.00 | .39 | 1.17 |
ShotClock | |||||||
0-2 secs. | 5% | – | 1 | – | – | .21 | 0.62 |
2-4 secs. | 7% | 0.63 | 1.88 | 8.02 | 0.00 | .33 | 0.99 |
4+ secs. | 88% | 0.77 | 2.17 | 11.87 | 0.00 | .36 | 1.08 |
Catch | |||||||
Off catch | 75% | – | 1 | – | – | .36 | 1.07 |
Off dribble | 25% | -0.09 | 0.91 | -3.21 | 0.00 | .34 | 1.01 |
Period | |||||||
1 | 24% | – | 1 | – | – | .37 | 1.11 |
2 | 24% | -0.11 | 0.89 | -3.34 | 0.00 | .34 | 1.03 |
3 | 25% | -0.05 | 0.96 | -1.34 | 0.18 | .36 | 1.08 |
4+ | 27% | -0.15 | 0.86 | -4.54 | 0.00 | .34 | 1.01 |
ShotDist | |||||||
22-24 ft. | 31% | – | 1 | – | – | .38 | 1.13 |
24-25 ft. | 36% | -0.09 | 0.91 | -3.25 | 0.01 | .36 | 1.06 |
25-26 ft. | 20% | -0.17 | 0.84 | -5.12 | 0.00 | .34 | 1.01 |
26+ ft. | 13% | -0.30 | 0.74 | -7.13 | 0.00 | .31 | 0.92 |
Venue | |||||||
Away | 50% | – | 1 | – | – | .35 | 1.04 |
Home | 50% | 0.05 | 1.05 | 2.14 | 0.03 | .36 | 1.07 |
…Constant | – | -1.46 | 0.23 | -17.42 | 0.00 | .19 | 0.56 |
Note. n=32,511. Log pseudolikelihood, starting value: -21,078.18; final value: -20,827.69. Likelihood ratio (degrees of freedom=13): 498.44, p > chi2 = 0.00. Tjur R2: 0.014; McFadden R2: 0.012. Stukel chi2(1) = 4.10, p > chi2 = 0.043
Standard versus Non-Standard Interpretations
Table 1 shows that the defender distance variable (DefDist) affects the outcome variable. A standard interpretation would emphasize odds ratios and statistical significance:
Controlling for other variables’ effects, three-point shots taken with the closest defender 9+ feet away have:
- 60% higher odds (i.e., 1.6/1) of going in than those taken with the closest defender 0-3 feet away,
- 24% higher odds (i.e., 1.6/1.29) than those with the defender 3-6 feet away, and
- 10% higher odds (i.e., 1.6/1.46) than those with the defender 6-9 feet away.
Each effect is statistically significant, as their z scores show.
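The odds comparisons in the standard interpretation can be checked with a short sketch that uses the rounded DefDist odds from Table 1. The dictionary and function names are hypothetical.

```python
# Rounded odds for the DefDist categories, copied from Table 1
odds = {"0-3 ft.": 1.00, "3-6 ft.": 1.29, "6-9 ft.": 1.46, "9+ ft.": 1.60}


def pct_higher_odds(category, reference):
    """Percentage by which the odds for `category` exceed those for `reference`."""
    return (odds[category] / odds[reference] - 1) * 100


for reference in ("0-3 ft.", "3-6 ft.", "6-9 ft."):
    gap = pct_higher_odds("9+ ft.", reference)
    print(f"9+ ft. vs. {reference}: {gap:.0f}% higher odds")
```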
Although the standard interpretation is correct from a technical standpoint, coaches and players may not understand or act on it, given Zarren’s and Oliver’s comments (as well as those of DeMaris, Gelman, and Hill). Now consider a non-standard interpretation (that relies on Table 1’s non-standard information). Note that each percentage’s associated expected value is in parentheses.
All else unchanged, the percentage of three-point makes would decrease from 35% (1.05 pts.) to:
- 29% (0.86 pts.) with the defender always 0-3 feet away from the shooter, and
- 34% (1.02 pts.) with the defender always 3-6 feet away.
It would increase from 35% to:
- 37% (1.11 pts.) with the defender always 6-9 feet away, and
- 39% (1.17 pts.) with the defender always 9+ feet away.
NBA coaches and players would probably prefer the non-standard interpretation. Arguably, reporting the likely effect in percentage points instead of odds is more intuitive and actionable (26, 30).
Calculating Each Shot’s Make Probability
Another number to note in Table 1 is the constant of -1.46 logits which translates to a predicted make probability of 19% (0.56 pts.). The -1.46 number represents a three-point shot with the lowest value on each predictor variable:
- Defender 0-3 feet away
- 0-2 seconds on the shot clock
- Off the catch
- First period
- Shot distance of 22-24 feet
- Away game
An implication is that it is possible to calculate the predicted make probability of each of the 32,511 shots. Such information can spark curiosity and foster improved performance for a player scrutinizing his own (or opponents’) shot data. For example, Row 1 of Table 2 reports the logit coefficients associated with the first three-point shot Klay Thompson of the Golden State Warriors attempted in 2014-15. In the third period of an away game versus the Sacramento Kings with 4.6 seconds on the shot clock, Thompson missed from 22 feet off the catch with the defender 3.9 feet away. As the column titled Prob shows, that shot’s predicted make probability was 38% (.38*100), calculated by applying the following formula to select Table 2 numbers: exp(sum of logit coefficients + constant) / (exp(sum of logit coefficients + constant) + 1).
Upon closer examination, Thompson could have asked the team’s data analysts how that shot’s make probability would have changed had the defender been 9+ rather than 3.9 feet away. To respond, an analyst could have replaced the DefDist logit coefficient of 0.25 with 0.47, the one corresponding to a 9+ feet value. As shown in Row 2, the make probability would have risen to 42%, a four-percentage-point increase or likely effect.
Thompson next might have asked how shooting off the dribble rather than the catch would have affected the 42% probability. After replacing the Catch logit coefficient of 0 with -0.09, an analyst could have reported that the probability would have dropped to 39%, as Row 3 of the Prob column shows.
Thompson, an excellent shooter, would probably work to improve specific aspects of his shooting if he had such data for all his three-point shots (31).
Table 2.
Simulating the effect of changes on a single shot’s make probability
Row | DefDist | ShotClock | Catch | Period | ShotDist | Venue | Constant | Total | Prob |
1 | 0.25 | 0.77 | 0 | -0.05 | 0 | 0 | -1.46 | -0.49 | 0.38 |
2 | 0.47 | 0.77 | 0 | -0.05 | 0 | 0 | -1.46 | -0.27 | 0.42 |
3 | 0.47 | 0.77 | -0.09 | -0.05 | 0 | 0 | -1.46 | -0.36 | 0.39 |
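The conversion from summed logit coefficients to a make probability, as applied in Table 2, can be sketched as follows. The function name `make_probability` is hypothetical, and the inputs are the two-decimal coefficients shown in the table. Because those inputs are rounded, Row 1 comes out at 0.38, while Rows 2 and 3 may differ by a point or two from the Prob column.

```python
import math


def make_probability(logit_coefficients, constant):
    """Apply exp(total) / (exp(total) + 1) to the summed coefficients plus constant."""
    total = sum(logit_coefficients) + constant
    return math.exp(total) / (math.exp(total) + 1)


CONSTANT = -1.46

# Order: DefDist, ShotClock, Catch, Period, ShotDist, Venue (rounded Table 2 values)
rows = {
    1: [0.25, 0.77, 0.0, -0.05, 0.0, 0.0],
    2: [0.47, 0.77, 0.0, -0.05, 0.0, 0.0],    # defender moved to 9+ ft.
    3: [0.47, 0.77, -0.09, -0.05, 0.0, 0.0],  # and the shot taken off the dribble
}

for row, coefficients in rows.items():
    total = sum(coefficients) + CONSTANT
    print(f"Row {row}: total {total:.2f}, probability {make_probability(coefficients, CONSTANT):.2f}")

# A shot at the reference value of every predictor has only the constant
reference_probability = make_probability([], CONSTANT)
```

Swapping a single coefficient and recomputing, as in Rows 2 and 3, is the basic operation that the simulators described later repeat across many shots.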
Predicting the Likely Effect of Multiple Changes to Multiple Predictor Variables
Coaches thinking more broadly might focus on all 32,511 shots and ask analysts to predict the likely effect of multiple changes to the values of multiple predictor variables. Building on the Thompson example, analysts could approach the task by conceptualizing changes as scenarios. Below, and graphically in Figure 1, are three illustrative ones.
Scenario 1. Players take all 32,511 three-point shots with the defender 9+ ft. away.
Prediction: 39% of all three-pointers will go in, an increase of four percentage points compared to the 35% baseline, translating to 1,297 more makes and 12,723 total ones.
Scenario 2. Players take all 32,511 three-point shots:
- with the defender 9+ feet away
- from 22-24 ft. away from the basket
Prediction: 42% of all shots will go in, a three-percentage-point gain vs. Scenario 1. This translates to 808 more makes and 13,531 total makes.
Scenario 3. Players take all 32,511 three-point shots:
- with the defender 9+ ft. away
- from 22-24 ft. away from the basket
- with 4+ seconds on the 24-second shot clock
Prediction: 43% of all shots will go in, an increase of another percentage point compared to Scenario 2, translating to 370 more makes and 13,901 total ones.
Figure 1.
Percentage of predicted makes by scenario
Each scenario’s likely effect results from all-or-nothing simulation. How does it work? For any predictor variable, such as Catch, data analysts select one target value—either “Off Catch” (occurring 75% of the time) or “Off Dribble” (25%). Assume they choose “Off Catch,” with a logit coefficient of 0, as Table 1 shows. For the 8,127 “Off Dribble” shots, they would replace the coefficient of -0.09, also shown in Table 1, with 0 and calculate the new likely effect: 158 more made three-pointers for the season, translating to 11,584 total makes.
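A minimal sketch of all-or-nothing simulation follows, assuming the rounded Table 1 coefficients and a small hypothetical sample of shots (not the 2014-15 data). The names `COEFFICIENTS`, `make_probability`, and `average_probability` are illustrative only.

```python
import math

# Logit coefficients copied from Table 1 (rounded to two decimals)
COEFFICIENTS = {
    "DefDist": {1: 0.0, 2: 0.25, 3: 0.38, 4: 0.47},
    "ShotClock": {1: 0.0, 2: 0.63, 3: 0.77},
    "Catch": {1: 0.0, 2: -0.09},
    "Period": {1: 0.0, 2: -0.11, 3: -0.05, 4: -0.15},
    "ShotDist": {1: 0.0, 2: -0.09, 3: -0.17, 4: -0.30},
    "Venue": {0: 0.0, 1: 0.05},
}
CONSTANT = -1.46


def make_probability(shot):
    """Predicted make probability for one shot given its category values."""
    total = CONSTANT + sum(COEFFICIENTS[name][shot[name]] for name in COEFFICIENTS)
    return math.exp(total) / (math.exp(total) + 1)


def average_probability(shots, targets=None):
    """Average predicted probability after forcing every shot to the target values."""
    targets = targets or {}
    simulated = [{**shot, **targets} for shot in shots]
    return sum(make_probability(shot) for shot in simulated) / len(simulated)


# Hypothetical four-shot sample, for illustration only
sample = [
    {"DefDist": 2, "ShotClock": 3, "Catch": 1, "Period": 3, "ShotDist": 1, "Venue": 0},
    {"DefDist": 1, "ShotClock": 1, "Catch": 2, "Period": 4, "ShotDist": 4, "Venue": 1},
    {"DefDist": 3, "ShotClock": 3, "Catch": 2, "Period": 2, "ShotDist": 2, "Venue": 1},
    {"DefDist": 4, "ShotClock": 2, "Catch": 1, "Period": 1, "ShotDist": 3, "Venue": 0},
]

baseline = average_probability(sample)
all_off_catch = average_probability(sample, {"Catch": 1})
print(f"baseline {baseline:.3f}, all off catch {all_off_catch:.3f}, "
      f"likely effect {100 * (all_off_catch - baseline):+.1f} percentage points")
```

Passing several entries in `targets`, for example `{"DefDist": 4, "ShotDist": 1}`, gives a multi-variable scenario of the kind listed above.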
Adopting a fine-tuning approach is another possibility. After examining the frequency distribution of the Catch values, analysts could specify a new distribution, such as 92% “Off Catch” and 8% “Off Dribble,” ensuring the total sums to 100%. They would keep the original 24,384 “Off Catch” values (i.e., 75%) and change the -0.09 coefficient to 0 for roughly another 5,526 selected randomly from the original 8,127 “Off Dribble” values, leaving about 2,600 (i.e., 8%) unchanged, to achieve the 92:8 ratio. The change would result in 11,530 made three-pointers, 54 fewer (i.e., 11,584-11,530) than if players had taken all shots off the catch.
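The random reassignment step in fine-tuning might look like the sketch below. It is an assumption-laden illustration: the function name `fine_tune` is hypothetical, it only handles raising one value's share, and it runs on a hypothetical 100-shot sample with the same 75:25 split rather than on the full dataset.

```python
import random


def fine_tune(values, target_value, target_share, seed=0):
    """Return a copy of `values` in which `target_value` makes up about `target_share`.

    Entries are switched at random from the other categories. This sketch only
    raises a share; it returns an unchanged copy if the share is already met.
    """
    values = list(values)
    wanted = round(target_share * len(values))
    current = sum(value == target_value for value in values)
    shortfall = wanted - current
    if shortfall <= 0:
        return values
    candidates = [i for i, value in enumerate(values) if value != target_value]
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    for i in rng.sample(candidates, shortfall):
        values[i] = target_value
    return values


# Hypothetical sample: 75 shots off the catch, 25 off the dribble
catch_values = ["Off Catch"] * 75 + ["Off Dribble"] * 25
tuned = fine_tune(catch_values, "Off Catch", 0.92)
print(tuned.count("Off Catch"), tuned.count("Off Dribble"))
```

After the reassignment, each switched shot would take the “Off Catch” coefficient, and the make probabilities would be recalculated as in the all-or-nothing case.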
If coaches and players embrace simulation, there could be too many scenarios for data analysts to handle. To stay ahead of demand, they could build self-serve simulators tailored explicitly for coaches’ and players’ use. Finding prototypes in academic research will be a struggle, however, arguably because of the non-linear relationship between logits and probabilities (26, 30) and its dampening effect on reporting likely effects in probabilities or percentage points. Figure 2 plots illustrative logit and probability values to cast light on that relationship.
Figure 2.
The non-linear relationship between logits (x-axis) and probabilities (y-axis)
Note how a one-logit increase from zero to one on the x-axis corresponds to a .23 probability increase (from .5 to .73) on the y-axis. Yet a one-logit increase from four to five (or minus five to minus four) translates to only a tiny probability increase (about .01). As shown in Table 1 (and later in Table 3), it is still possible to report the effect of a predictor variable, x, on a binary outcome, y, in probabilities or percentage points (e.g., a one-unit change in x is associated with a three-percentage-point increase in y, all else being equal). Arguably, it is also sensible to do so, not least because NBA players make roughly 35% of their three-point shots and the relationship between logits and probabilities is reasonably linear between .2 and .8 on the probability scale, as Figure 2 shows. But in more extreme cases, as Figure 2 suggests, the effect size will depend heavily on the value of y and the values of the model’s other predictor variables. More precisely, the size of the effect will decrease near 0 and 1. As a result, x’s effect on y in probabilities or percentage points “…cannot be fully represented by a single number” (19) (p. 23). That may be why some logistic regression experts (6-8) have advised against using probabilities or percentage points to report and interpret logistic regression coefficients’ overall effects. It also may be why most major statistical software packages do not produce effects in probabilities or percentage points through pre-packaged procedures or built-in modules. As an unintended consequence, some data analysts seeking guidance likely have had to fend for themselves.
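The non-linearity described above can be illustrated with a short sketch. The function names are hypothetical; the second function uses the standard result that the slope of the logistic curve at probability p equals p(1 - p).

```python
import math


def inv_logit(x):
    """Convert a logit to a probability."""
    return 1 / (1 + math.exp(-x))


def one_logit_gain(start):
    """Probability change produced by moving from `start` to `start + 1` logits."""
    return inv_logit(start + 1) - inv_logit(start)


def slope_at(p):
    """Approximate probability change per logit near probability p."""
    return p * (1 - p)


for start in (-5, 0, 4):
    print(f"{start} to {start + 1} logits: +{one_logit_gain(start):.3f} probability")

for p in (0.05, 0.35, 0.50, 0.95):
    print(f"near p = {p:.2f}, one logit is worth roughly {slope_at(p):.3f}")
```

The slope changes little across the middle of the probability scale and shrinks quickly toward 0 and 1, which is the pattern Figure 2 depicts.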
A GUIDE TO BUILDING SELF-SERVE SIMULATORS
Data analysts can use this guide to build simulators that report likely effects in probabilities or percentage points. (For convenience, references are made to the three-point shot data used in this paper’s analyses, although the guide is general and should work across areas of interest.) Several steps are involved in the process:
Step 1. Ensure sufficient three-point shot data are available to conduct logistic regression analysis, which should be a straightforward task for NBA teams given the league’s business relationship with Second Spectrum (which replaced SportVU). How does one define sufficient? As a rule of thumb, at least 10 shot attempts are needed for each predictor variable in a logistic regression model, adjusting for the expected shot make rate (or miss rate if it is lower than the make rate). For context, this paper’s main analysis with six predictor variables and a 35% expected make rate required a minimum of 171 three-point shot attempts: 10 * (6 /.35). For non-NBA teams requiring raw data, assistant coaches can record key shot characteristics with paper and pencil or specialized hand-held apps.
Step 2. Develop a model to predict successful three-point shots, the binary outcome of interest. Logistic regression produces a weight—a logit coefficient—for each category of each predictor variable. In an optimal model, those weights maximize the predicted probability gap between the mutually exclusive outcomes (1).
Step 3. To calculate a single 2014-15 three-point shot’s make probability, sum the weights corresponding to its characteristics and add the constant. After that, apply the formula shown earlier to the result: exp(sum of logit coefficients + constant) / (exp(sum of logit coefficients + constant) + 1). Alternatively, request the predicted probability from the statistical software.
Step 4. Do the same for the 32,510 remaining shots, sum all 32,511 probabilities, then take the average to compute the overall make probability. If the model predicts players will make 35% of all three-point shots, it translates to 11,426 makes (32,511 times the unrounded probability of .35145).
Step 5. To enable the simulator to work online or in a mobile app, develop an algorithm using JavaScript. The simulator’s purpose is to let users see how changes they make to the values of the predictor variables affect the .35 probability.
Step 6. Design a user interface, possibly by enlisting the support of someone familiar with website and app development.
Step 7. Keep things simple initially—permit users to change only one value of one predictor variable. If it has two response choices like Away and Home, let the user change every Away response to Home or vice versa. Think of this as the all-or-nothing option.
Step 8. For all 32,511 three-point shots, change the corresponding Away or Home logit coefficient (but no others) to align with the user’s selection, then recalculate the predicted make probability. The likely effect is the difference between the new and starting probability (and the new and starting makes).
Step 9. Follow the same process to let users change the values of several predictor variables simultaneously.
Step 10. Go further and allow users to change any predictor variable’s frequency distribution as they please, ensuring the distribution sums to 100%. Think of this as the fine-tuning option. The algorithm will need rules to accommodate the changes.
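Two pieces of arithmetic in the steps above can be sketched briefly: the Step 1 rule of thumb for sample size and the Step 8 likely effect. Python is used here only for illustration (Step 5 suggests JavaScript for a deployed simulator), and the function names are hypothetical. The make counts in the example are the baseline and Scenario 1 figures reported earlier.

```python
def minimum_attempts(n_predictors, make_rate, per_predictor=10):
    """Rule of thumb from Step 1: attempts per predictor, adjusted for the rarer outcome."""
    rarer_rate = min(make_rate, 1 - make_rate)
    return per_predictor * n_predictors / rarer_rate


def likely_effect(starting_makes, new_makes, n_shots):
    """Step 8: difference between new and starting makes, and in percentage points."""
    return {
        "makes": new_makes - starting_makes,
        "percentage_points": 100 * (new_makes - starting_makes) / n_shots,
    }


needed = minimum_attempts(n_predictors=6, make_rate=0.35)
print(f"minimum attempts: about {needed:.0f}")

effect = likely_effect(starting_makes=11_426, new_makes=12_723, n_shots=32_511)
print(f"{effect['makes']} more makes, {effect['percentage_points']:+.1f} percentage points")
```

A simulator would call something like `likely_effect` after every user change and display both numbers side by side.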
What would all-or-nothing and fine-tuning self-serve simulators look like, and how would they function? Figure 3 shows a screenshot of a working all-or-nothing simulator (accessible at https://www.electricinsights.com/hoops1). The first column contains the predictor variables and their values. Column 2 shows the changes (in blue) the user made to the 2014-15 frequencies; the third column displays the original frequencies.
Figure 3
All-or-nothing simulation
As Figure 3 shows, the user selected values of “0-3 ft.” for “Defender Distance,” “0-2 secs.” for “Time Left on Shot Clock,” “Dribble” for “Off Catch or Dribble?” and “26+ ft.” for “Shot Distance.” The likely effect is a 22-percentage-point decrease in the make probability, translating to 7,229 fewer makes and 4,197 total ones.
Personalized simulators for players like Klay Thompson and Stephen Curry could be more beneficial (and accurate) than a generic, all-player one. To support this point, Table 3 reports the results of a new analysis of Curry’s 2014-15 three-point shots. Note how the values of many key measures, such as frequencies and expected values, differ substantially from their Table 1 counterparts. Table 3 shows, for instance, that Curry took 54% of his three-pointers off the dribble with an expected value of 1.32 points per shot. But Table 1 showed NBA players (including Curry) took only 25% of their three-pointers off the dribble with a 1.01 points-per-shot expected value. Curry is not your average three-point shooter, hence the need for personalization.
Table 3.
Results of Steph Curry logistic regression analysis
Variable | Frequency | Logit | Odds | z | p>z | Prob | EV |
DefDist | |||||||
0-3 ft. | 11% | – | 1 | – | – | .24 | 0.72 |
3-6 ft. | 55% | 0.89 | 2.44 | 2.44 | 0.02 | .43 | 1.29 |
6-9 ft. | 24% | 0.97 | 2.65 | 2.48 | 0.01 | .45 | 1.35 |
9+ ft. | 10% | 1.26 | 3.51 | 2.75 | 0.00 | .52 | 1.55 |
ShotClock | |||||||
0-2 secs. | 2% | – | 1 | – | – | .25 | 0.75 |
2-4 secs. | 3% | 2.10 | 8.17 | 2.19 | 0.03 | .72 | 2.15 |
4+ secs. | 95% | 0.79 | 2.21 | 1.10 | 0.27 | .42 | 1.25 |
Catch | |||||||
Off catch | 46% | – | 1 | – | – | .40 | 1.21 |
Off dribble | 54% | 0.15 | 1.17 | 0.75 | 0.46 | .44 | 1.32 |
Period | |||||||
1 | 33% | – | 1 | – | – | .44 | 1.30 |
2 | 19% | 0.01 | 1.01 | 0.05 | 0.963 | .44 | 1.31 |
3 | 29% | -0.03 | 0.97 | -0.12 | 0.902 | .43 | 1.28 |
4+ | 19% | -0.26 | 0.77 | -0.91 | 0.364 | .37 | 1.12 |
ShotDist | |||||||
22-24 ft. | 16% | – | 1 | – | – | .55 | 1.65 |
24-25 ft. | 31% | -0.75 | 0.47 | -2.46 | 0.01 | .37 | 1.11 |
25-26 ft. | 24% | -0.51 | 0.60 | -1.58 | 0.11 | .43 | 1.28 |
26+ ft. | 28% | -0.65 | 0.52 | -2.04 | 0.04 | .40 | 1.12 |
Venue | |||||||
Away | 54% | – | 1 | – | – | .41 | 1.23 |
Home | 46% | 0.11 | 1.12 | 0.56 | 0.58 | .44 | 1.31 |
…Constant | – | -1.53 | 0.22 | -1.8 | 0.07 | .19 | 0.56 |
Note. n=j. Log pseudolikelihood, starting value: -305.04; final value: -294.46. Likelihood ratio (degrees of freedom=13): 21.16, p > chi2 = 0.07. Tjur R2: 0.047; McFadden R2: 0.035. Stukel chi2(1) = 4.38, p > chi2 = 0.11.
A working fine-tuning simulator—a complement to the Curry analysis—is available at https://www.electricinsights.com/curry1. It lets users change any value of any predictor variable by any amount and see the likely effect. In the screenshot shown in Figure 4, the user changed Curry’s 2014-15 season frequencies (in parentheses) for “Defender Distance,” “Off Catch or Dribble?” and “Shot Distance.” The likely effect is a seven-percentage-point increase to his 42% average make probability, translating to 31 more makes (i.e., 220-189).
Figure 4
Steph Curry’s fine-tuning simulator
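The core idea of a fine-tuning simulator can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the simulator at the URL above. It assumes the Prob column of Table 3 gives adjusted make probabilities that can be reweighted by user-chosen frequencies, and it varies one factor (defender distance) at a time. The 448 attempts come from the text; the scenario frequencies and all names are hypothetical.

```python
def average_make_probability(frequencies, probabilities):
    """Frequency-weighted average of the per-level make probabilities."""
    total = sum(frequencies.values())
    return sum(frequencies[k] * probabilities[k] for k in frequencies) / total


# Defender-distance rows from the Curry table (Prob and Frequency columns).
DEF_DIST_PROB = {"0-3 ft.": 0.24, "3-6 ft.": 0.43, "6-9 ft.": 0.45, "9+ ft.": 0.52}
DEF_DIST_FREQ = {"0-3 ft.": 0.11, "3-6 ft.": 0.55, "6-9 ft.": 0.24, "9+ ft.": 0.10}
ATTEMPTS = 448  # Curry's three-point attempts in the analysis

baseline = average_make_probability(DEF_DIST_FREQ, DEF_DIST_PROB)

# Hypothetical fine-tuning: move every tightly guarded shot to 6-9 ft.
scenario_freq = dict(DEF_DIST_FREQ)
scenario_freq["0-3 ft."] = 0.0
scenario_freq["6-9 ft."] += DEF_DIST_FREQ["0-3 ft."]
scenario = average_make_probability(scenario_freq, DEF_DIST_PROB)

# All-or-nothing limiting case: every shot taken with the defender 0-3 ft. away.
all_tight = average_make_probability({"0-3 ft.": 1.0}, DEF_DIST_PROB)

print(f"baseline: {baseline:.3f} ({round(baseline * ATTEMPTS)} makes)")
print(f"scenario: {scenario:.3f}")
print(f"all tightly guarded: {all_tight:.2f}")
```

Under these assumptions the baseline reproduces the 42% average and 189 makes reported above, and the all-or-nothing case gives 24%.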
Discussion
If the sample size of three-point shots allows, data analysts can build all-or-nothing and fine-tuning simulators that include all teams and players, each team, and each player. Given sufficient demand, they can also do so with data for other major shot types (i.e., two-pointers and free throws).
Several caveats are in order before describing how basketball teams might act on the results of the approach described here, using the results (and simulators) shown earlier for illustration. First, inferences drawn from the 2014-15 dataset may no longer apply because of the time gap. Second, the dataset did not include several three-point shot characteristics (e.g., closest defender’s height and reach, the game score at each shot) that could be important.
A third caveat concerns the “all else the same” assumption, a logistic regression analysis theoretical staple. In practice, it may not hold up. Giving excellent three-point shooters more playing time, for example, could worsen teams defensively. Deciding who plays and why, a type of optimization, lies outside this paper’s scope.
A fourth caveat involves ease of implementation. Building and updating simulators like Curry’s for NBA players who attempt, say, 175 or more three-point shots per season may require automation. To characterize the task as trivial would be misleading.
The fifth caveat calls for humility about what the author may not know. Some NBA data analysts may have already adopted an approach combining good data, logistic regression, likely effects reporting in probabilities or percentage points, and self-serve simulation. As noted earlier, they work mainly in secrecy. And when they make comments at analytics conferences or similar forums, some are instructed “to go up on stage and talk about something without saying anything” (15) (51:37), according to Zarren.
Application In Sports
Good basketball coaches position their players to make the highest percentage of three-pointers possible, all else equal. They also implement a defense to minimize opponents’ three-point make percentage. The analyses presented here suggest six factors affect the make percentage:
- Closest defender’s distance to the shooter
- Time left on the 24-second shot clock
- Whether the player shot off the dribble or catch
- Game period
- Shot distance
- Venue
How might coaches act on these findings? There are numerous possibilities, starting with game pace. Fast ball movement from defense to offense (e.g., before the defense sets) gives the offensive team more time to find an open three-point shot, preferably before the four-second mark on the shot clock where shooting percentages dip, and unquestionably before the two-second mark where they plummet. As the NBA’s all-time leading three-point shooter, Steph Curry understands this well. Table 3 showed he attempted only two percent (compared to a five percent NBA average) of his three-point shots with less than two seconds on the shot clock.
Coaches should design offensive plays and patterns to create at least three feet of space between the shooter and defender. A 22-24-foot shot’s make probability with the defender 0-3 feet away is only 29%, all else equal. It increases to 34% with the defender 3-6 feet away. Space is critical for Curry, too. He shot 11% of his three-pointers with the defender 0-3 feet away versus the NBA average of 6%, reducing his overall make percentage. It could have been worse. Had he taken all 448 of his shots with the defender 0-3 feet away, all other factors being equal, his make probability would have dropped from 42% to 24%.
Making sure players understand the characteristics of a desirable three-point shot is another opportunity. Personalized simulators like Curry’s can make each player’s shooting strengths and weaknesses obvious. For instance, some players may make a higher percentage of three-pointers off the dribble than off the catch. Others may suffer only a slight percentage-point decline when guarded tightly or shooting from 26+ rather than 22-24 feet. And if those simulators contain opponents’ shot data, coaches could use them to determine how to exploit specific opponents’ weaknesses.
Analyses show the three-point make percentage drops in the fourth period. Player fitness could be a contributing factor. Without applicable data (e.g., feet, meters, or miles logged since tip-off), it is difficult or impossible to test the hypothesis. Maybe the players on the court lack the skills needed to shoot higher percentages. Or game stress could affect shooting performance—data on the game score at each shot would clarify the matter. For context, the all-or-nothing simulator would show that the highest probability three-point shot (46%) has these characteristics:
- Defender 9+ feet away
- 4+ seconds on the shot clock
- Off the catch
- First period
- 22-24 feet from the basket
- At home
The simulator would also show that the 46% make probability drops to 42% in the fourth period, changing nothing else. That means players have grown tired, different players are on the court, game pressure has taken its toll, or unknown variables caused the drop. So how should head coaches make sense of this? Working with assistant coaches and data analysts, they can explore ways to increase players’ fitness levels, optimize substitution patterns, and help players cope better with pressure. If teams can access variables that were unavailable for analysis here, their analysts can include them in new models to estimate their likely effect.
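The fourth-period drop also illustrates why percentage points are easier to communicate than odds. The short Python sketch below expresses the same change, from 46% to 42%, in three ways. It uses only the two probabilities quoted above and the standard definitions of odds and logits; the variable names are illustrative.

```python
import math


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


first_period, fourth_period = 0.46, 0.42

pp_change = (fourth_period - first_period) * 100       # percentage points
odds_ratio = odds(fourth_period) / odds(first_period)  # ratio of odds
logit_change = math.log(odds_ratio)                    # difference in logits

print(f"percentage-point change: {pp_change:.0f}")  # -4
print(f"odds ratio: {odds_ratio:.2f}")              # 0.85
print(f"logit change: {logit_change:.2f}")          # -0.16
```

A coach can act on "four percentage points lower" directly, whereas "an odds ratio of 0.85" or "a logit change of -0.16" requires translation.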
Players make a higher percentage of three-point shots at home than on the road, all else equal. Crowd noise, characteristics (e.g., lighting) of the less familiar setting, travel effects (e.g., uncomfortable hotel beds), or some combination of these may explain why. Coaches can look outside the league for ideas to help players overcome such obstacles. For instance, former US Navy SEAL commander Mark Divine prepares SEAL candidates for training by replicating the challenges they are likely to encounter, including Hell Week during which “each candidate sleeps only about four total hours but runs more than 200 miles and does physical training for more than 20 hours per day” (5).
Contrary to conventional wisdom, Divine’s SEALFIT program places particular emphasis on skills like positive visualization, breath control, and meditation because, as he said, “People who haven’t learned to control their mind and emotions quit or they get hurt” (10). Does SEALFIT work? Divine reports that nine of 10 SEAL candidates who complete SEALFIT training become SEALs (versus a 20% norm). He is confident that NBA players would benefit from the program (M. Divine, personal communication, March 11, 2022).
A complementary tool for improving performance is psychotherapy. As described earlier, Ben Simmons’s decision to avoid attempting an open lay-up or dunk (arguably) for fear of being fouled and having to shoot free throws may have cost his team, the 76ers, a 2021 playoff series against the Hawks. As his teammate Joel Embiid declared, “That was the turning point” (12) (1:08). Psychotherapist Richard Schwartz, who developed the Internal Family Systems (IFS) therapeutic model (23), would probably concur and then speculate that Simmons’s widely criticized decision (21, 27) originated from past trauma linked to his poor free-throw shooting. After citing evidence (24) of IFS’s effectiveness, Schwartz might posit that a protective part of Simmons’s mind, a “guardian of [his] inner world” (23) (p. 184), compelled him to pass rather than shoot to prevent a traumatized part (think of it as a deeply wounded child) from re-experiencing pain or shame at the free-throw line. Were Schwartz to work with Simmons, he would likely try to communicate with his mind’s traumatized part as if it were an actual person, restore its faith in Simmons’s free-throw shooting abilities, and encourage the protective part to undertake different tasks. The more traditional coaching approach of advising, or even requiring, Simmons to practice harder with expert guidance did not work and may never work. As Early (9) observed, “Simmons has been reluctant to seek help from top shooting coaches…He reportedly clashed with his former team (the 76ers) years ago over who he would work with, preferring to practice with his brother rather than team shooting coach John Townsend.”
Coaches can use the same strategies they use to improve their own three-point shooting percentage to reduce their opponents’. Table 1 data (and the all-or-nothing simulator) suggest the key lies in forcing opponents to shoot with less than four seconds on the clock, off the dribble, and from long distances while being closely guarded. Stepping up the defensive intensity in the first and third periods, where the likelihood of making a three-point shot is relatively high, and motivating the home crowd to unsettle opponents make sense, too.
Coaches can also think about implementing a full- or three-quarter court press more often, maybe for entire games. The goals of a 2-2-1 three-quarter court press, for example, are control and containment, not turnover generation. As envisioned, its use would slow down the game and force opponents to shoot a higher percentage of difficult three-pointers with less time on the clock, reducing their make percentage. As Hall-of-Fame coach Jack Ramsay explained in Pressure Basketball, “The tempo of the game is controlled by the defensive team and the best manner of control is through the exertion of pressure at some point on the court” (22) (p. 80).
Good data, logistic regression analysis, and self-serve simulation can also promote truth and trust, positive attributes for any coach or leader. Maybe tongue in cheek, former NBA coach Jeff Van Gundy (15) (17:40) confessed to lying to his players. “If I saw what I wanted to change,” he said, “I would either use numbers to support it or make them up because the players are not going to know the difference.” Giving players tools that predict the likely effects of their potential actions would be more truthful and potentially more effective, too.
Conclusions
Keeping things simple is critical in basketball. According to Zarren (15) (7:00), “There are 20 things in (the coach’s) head that will get us X number of wins per season, but you can only focus on six of them in practice, and the players might only remember four and actually execute one in a game. So you’ve got to pick your battles if you’re a stats guy who…needs to talk to a coach. But if you’re a coach, you need to pick your battles, too.”
Van Gundy (15) (16:51) offered data analysts and coaches strong advice related to this point from his coaching experience. “I wouldn’t tell a guy you’re 38% on three to four dribbles so dribble a fifth time because you go up to 40%,” he said. “You better be pretty sure about what you’re saying…You want players to feel confident. You don’t want them out there saying, ‘Was that [four] dribbles or [five] when I pull up?’”
To mitigate the risk of generating harmful insights, data analysts should actively engage coaches and players in making key analytical decisions (e.g., ensuring predictor variables and their levels are meaningful), not least because Van Gundy and others who share his philosophy consider basketball sense—the capacity to make wise choices that benefit the team—to be of paramount importance.
Arguably, self-serve simulation with likely effects reporting in probabilities or percentage points is steeped in such basketball sense. As a benefit, data analysts will not need to rely on technical terms (e.g., “he shoots two standard deviations below the league average when you force him to the left” (15) (48:20)), as former Memphis Grizzlies executive John Hollinger once did. Instead, they can speak with more authority using plain language (e.g., “his make probability drops to 28% when you force him to the left”). Or they can make self-serve simulators available to players (and coaches) and let them figure it out on their own. Players and coaches may appreciate it, even cynics who share Hall-of-Fame player Charles Barkley’s views: “Analytics don’t work at all. It’s just the crap that some people who are really smart made up to try to get in the game because they had no talent” (29) (2:05).
NBA and other basketball teams worldwide should consider adopting an approach that combines good data, logistic regression analysis, likely effects reporting in probabilities or percentage points, and self-serve simulation. The possible benefits are myriad. It can help teams increase their three-point shooting percentages while lowering their opponents’; improve communication among data analysts, coaches, and players; enhance each group’s effectiveness; and lead to more wins.
Appendix
Variables in the 2014-15 NBA shot dataset
- Game_Id: The game’s unique identifier.
- Matchup: The teams competing.
- Location: Whether it was a home or away game for the shooter’s team.
- Outcome: Whether the shooter’s team won or lost.
- Final_Margin: By how many points the shooter’s team won or lost.
- Shot_Number: The shooter’s nth shot that game.
- Period: The period in which the shooter took the shot.
- Game_Clock: Minutes and seconds left in the period in which the shooter took the shot.
- Shot_Clock: Seconds remaining on the 24-second shot clock when the shooter took the shot.
- Dribbles: Number of dribbles the shooter took before shooting.
- Touch_Time: Number of seconds the shooter had the ball before shooting.
- Shot_Dist: Distance in feet from the center of the basket to the shooter.
- Pts_Type: Whether the shooter attempted a two- or three-point shot.
- Shot_Result: Whether the shooter made the shot.
- Closest Defender: Name of the defender closest to the shooter.
- Closest_Defender_Player_Id: The closest defender’s unique identifier.
- Close_Def_Dist: The closest defender’s distance to the shooter in feet.
- Fgm: Whether the shooter made the shot.
- Pts: The shot’s point value (0, 2 or 3).
- Player_Name: The shooter’s first and last name.
- Player_Id: The shooter’s unique identifier.
Note: The original dataset contained 128,069 two- and three-point shots. After removing all two-point shots, and all three-point shots with a missing (or unimputable) value on the Shot_Clock variable, the size decreased to 32,511. For a value to be imputable, there had to be 24 seconds or less on the game clock when the player took the shot. In that case, the decision was made to replace the missing Shot_Clock value with the Game_Clock value.
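The filtering and imputation rule in the note above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's actual preparation script: it assumes each shot is a dictionary keyed by the variable names listed in this appendix, that a missing Shot_Clock is an empty string or None, and that Game_Clock is an "M:SS" string. The function names and sample rows are hypothetical.

```python
def game_clock_seconds(game_clock):
    """Convert an 'M:SS' game-clock string to seconds (format is an assumption)."""
    minutes, seconds = game_clock.split(":")
    return int(minutes) * 60 + int(seconds)


def prepare_three_point_shots(rows):
    """Keep three-point shots; impute a missing Shot_Clock where possible."""
    kept = []
    for row in rows:
        if int(row["Pts_Type"]) != 3:
            continue  # drop two-point shots
        shot_clock = row.get("Shot_Clock")
        if shot_clock in (None, ""):
            remaining = game_clock_seconds(row["Game_Clock"])
            if remaining > 24:
                continue  # unimputable: more than 24 seconds left in the period
            shot_clock = remaining  # replace with the Game_Clock value
        kept.append({**row, "Shot_Clock": float(shot_clock)})
    return kept


# Hypothetical rows, for illustration only.
sample = [
    {"Pts_Type": "3", "Shot_Clock": "10.8", "Game_Clock": "1:09"},
    {"Pts_Type": "3", "Shot_Clock": "", "Game_Clock": "0:03"},  # imputed as 3.0
    {"Pts_Type": "3", "Shot_Clock": "", "Game_Clock": "5:12"},  # dropped
    {"Pts_Type": "2", "Shot_Clock": "7.0", "Game_Clock": "8:00"},  # dropped
]
cleaned = prepare_three_point_shots(sample)
print(len(cleaned), [shot["Shot_Clock"] for shot in cleaned])
```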
ACKNOWLEDGEMENTS
The author would like to thank David Clemm, Robert Eisinger, Ward Fonrose, John Geraci, Ryan Heaton, Adam Hoeflich, Priam Lacassagne, Roxane Lacassagne, and Mark Naples for reviewing earlier versions of this paper, and for providing helpful comments and suggestions. The author is particularly thankful to Dan Dougherty (who passed away in 2022) and Tom Northrup for their indirect contribution. Their longstanding beliefs and ideas about how basketball should be played permeate this paper’s “implications for coaches” section.
References
- Allison, P. (2013, February 13). What’s the Best R-Squared for Logistic Regression? Statistical Horizons. https://statisticalhorizons.com/r2logistic/
- Allison, P. (2015, April 1). What’s So Special About Logit? Statistical Horizons. https://statisticalhorizons.com/whats-so-special-about-logit
- Basketball Reference. (2023). Basketball-Reference.com. https://www.basketball-reference.com/
- Ben Simmons passes up a wide-open dunk Sixers vs Hawks Game 7. (2021, June 20). Www.youtube.com. https://www.youtube.com/watch?v=-EHA4UhYuQY
- BUD/S Hell Week. (2015, February 25). Navy SEALs. https://navyseals.com/3930/buds-hell-week/#:~:text=In%20this%20grueling%20five%2Dand
- DeMaris, A. (1992). Logit modeling: practical applications. Sage Publications.
- DeMaris, A. (1993). Odds versus Probabilities in Logit Equations: A Reply to Roncek. Social Forces, 71(3), 1057-1065.
- DeMaris, A.; Teachman, J.; Morgan, S. P. (1990). Interpreting Logistic Regression Results: A Critical Commentary. Journal of Marriage and the Family, 52(1), 271-277. https://doi.org/10.2307/352857.
- Early, D. (2022, February 24). Ben Simmons Savagely Roasted by Legendary Philly “Shot Doctor.” ClutchPoints. https://clutchpoints.com/ben-simmons-savagely-roasted-by-legendary-philly-shot-doctor
- Eighty Percent of Navy SEAL Candidates Fail for a Reason. (2017, September 14). SEALFIT. https://sealfit.com/80-navy-seal-candidates-fail-reason/
- Gelman, A. B., & Hill, J. (2009). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
- Joel Embiid blames Ben Simmons for game 7 loss…. (2021, June 20). Www.youtube.com. https://www.youtube.com/watch?v=sJtyx6TOPvs
- Johnson, E. [@MagicJohnson]. (2021, June 16). Give Hawks coach Nate McMillan a lot of credit he did the hack-a-Shaq on Ben Simmons to send him to the free throw. [Tweet]. Twitter. https://twitter.com/MagicJohnson/status/1405355621726162954
- Meehan, B. (2017). Predicting NBA Shots. http://cs229.stanford.edu/proj2017/final-reports/5132133.pdf
- MIT SLOAN Analytics Conference: Basketball Analytics. (2012, March 12). Www.sloansportsconference.com. Retrieved November 20, 2023, from https://www.sloansportsconference.com/event/basketball-analytics
- Mosteller, F. M. (1996). Discussant comments for So what? The implications of new analytic methods for designing NCES surveys by Robert F. Boruch and George Terhanian. In From Data to Information: New Directions for the National Center for Education Statistics, Hoachlander, G.; Griffith, J.E.; Ralph, J.H.; US Department of Education, National Center for Education Statistics: NCES 96–901, pp. 4-116-4-118.
- NBA shot logs. (2016). Kaggle.com. https://www.kaggle.com/dansbecker/nba-shot-logs
- Nourayi, M; Singhvi, M. (2021, January 15). The Impact of NBA New Rules on Games. The Sport Journal. https://thesportjournal.org/article/the-impact-of-nba-new-rules-on-games/
- Pampel, F. C. (2000). Logistic Regression. SAGE Publications.
- Peterson, D. (2020, May 28). How Different Metrics Correlate with Winning in the NBA over 30 Years. Medium. https://towardsdatascience.com/how-different-metrics-correlate-with-winning-in-the-nba-over-30-years-57219d3d1c8
- Pina, M. (2021, June 20). Ben Simmons’s Flaws Laid Bare in Potential End of the Process. Sports Illustrated. https://www.si.com/nba/2021/06/21/sixers-hawks-game-7-ben-simmons-flaws-trae-young
- Ramsay, J. (1963). Pressure Basketball.
- Schwartz, R. C. (2023). Introduction to Internal Family Systems therapy (2nd ed.). Sounds True.
- Shadick, N. A.; Sowell, N. F.; Frits, M. L.; Hoffman, S. M.; Hartz, S. A.; Booth, F. D.; Sweezy, M.; Rogers, P. R.; Dubin, R. L.; Atkinson, J. C.; Friedman, A. L.; Augusto, F.; Iannaccone, C. K.; Fossel, A. H.; Quinn, G.; Cui, J.; Losina, E.; Schwartz, R. C. (2013). A Randomized Controlled Trial of an Internal Family Systems-based Psychotherapeutic Intervention on Outcomes in Rheumatoid Arthritis: A Proof-of-Concept Study. The Journal of Rheumatology.
- Stats LLC and NBA to make STATS SportVU Player Tracking data available to more fans than ever before. (2016, January 19). NBA.com: NBA Communications. https://pr.nba.com/stats-llc-nba-sportvu-player-tracking-data/
- Terhanian, G. (2019). The Possible Benefits of Reporting Percentage Point Effects. International Journal of Market Research, 61(6), 635–650.
- Thomas, L. (2021, October 3). Ben Simmons and the Acceptance of Failure. The New Yorker. https://www.newyorker.com/sports/sporting-scene/ben-simmons-and-the-acceptance-of-failure
- Thorp, E. O. (2018). A man for all markets: from Las Vegas to Wall Street, how I beat the dealer and the market. Random House.
- TNT’s Charles Barkley rants about analytics in NBA, Houston Rockets GM Daryl Morey. (2015, February 10). Www.youtube.com. https://www.youtube.com/watch?v=2asGeItzGWM
- Williams, R. (2012). Using the Margins Command to Estimate and Interpret Adjusted Predictions and Marginal Effects. The Stata Journal, 12(2), 308–331.
- Zwerling, J. (2014, August 27). Team USA’s Klay Thompson Breaks Down the Skills That Make Him a Shooting Star. Bleacher Report. https://bleacherreport.com/articles/2173236-team-usas-klay-thompson-breaks-down-the-skills-that-make-him-a-shooting-star