Much attention has been paid to the challenges faced by the airline industry. Patterns in customer demand are an important variable to watch. The scatterplot below shows the number of passengers departing from Oakland (CA) airport month by month from 1990 to 2007. Time is shown as years since 1990, with fractional years used to represent each month. (Data selected from Oakland passengers 2016)
Here's a regression with the residuals plotted against the predicted values.
a) Interpret the slope and intercept of the regression model.
b) What does the value of R² say about how successful the model is?
c) Interpret s_e in this context.
d) Would you use this model to predict the number of passengers in 2010 (Years - 1990 = 20)? Explain.
[Figure: scatterplot of Passengers vs. Year-1990]
Regression output: R squared = 85.2%, s = 100359
Variable     Coefficient
Intercept    455650
Year-1990    46131.3
[Figure: Residuals vs. Predicted]
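For part d), it can help to see what the fitted line actually predicts. The sketch below plugs Year-1990 = 20 into the reported coefficients and flags whether that value lies outside the observed data. The function names are illustrative, and the upper bound of 18 for the data range is an assumption read off the plot axis.

```python
# Coefficients copied from the regression output above.
INTERCEPT = 455650.0   # predicted passengers at Year-1990 = 0
SLOPE = 46131.3        # change in predicted passengers per additional year

# Assumption: the observed x-values run from 0 to about 18 (1990 to 2007).
DATA_MIN, DATA_MAX = 0.0, 18.0


def predict_passengers(years_since_1990: float) -> float:
    """Return the fitted value from the linear model."""
    return INTERCEPT + SLOPE * years_since_1990


def is_extrapolation(years_since_1990: float) -> bool:
    """True when the x-value lies outside the range of the observed data."""
    return not (DATA_MIN <= years_since_1990 <= DATA_MAX)


prediction_2010 = predict_passengers(20)
print(f"Predicted passengers for 2010: {prediction_2010:,.0f}")
print(f"Extrapolation beyond the data: {is_extrapolation(20)}")
```

The model gives roughly 1,378,276 passengers for 2010, but that is an extrapolation past the end of the data, so whether to trust it still depends on the residual plot and on whether the trend plausibly continues.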
> Your manager learned all about databases, and frequently makes queries such as: “What fraction of our customers who bought a product in the last six months are female and live within 5 miles of the store?” He says that he is data mining. Is he right? Exp
> Your manager is confused by Big Data. Explain what makes “big data” different than “data.”
> A producer of beverage containers wants to ensure that a liquid at 90C will lose no more than 4C after 30 minutes. Containers are selected at random and subjected to testing. Historical data shows the standard deviation to be 0.2C. The quality control te
> These boxplots from Ex. 23 Chapter 9 show the relationship between the number of cylinders in a car’s engine and its fuel economy from a study conducted by a major car manufacturer. a) What are the null and alternative hypotheses? Talk
> Your manager wants to just find and use “the best” model, but you have found that a combined model (boosting) is better. Explain why boosting might help, and why it may be better than trying to find “the best.”
> What are the pros and cons of combining multiple models to produce a prediction? Should we always combine models?
> Your manager wants to use the total accurate classification rate (percent of all cases properly classified) as the metric to evaluate the division’s models. Is this a good idea? Why or why not?
> What are the advantages and disadvantages of using tree vs. neural network models?
> Is any one portion of the CRISP-DM more important than the others? Why?
> For which one of the following situations would Spearman’s rho be appropriate? a) The Mohs scale rates the hardness of minerals. If one mineral can scratch another, it is judged to be harder. (Diamond, the hardest mineral, is a 10.) Is hardness related
> For which one of the following situations would Spearman’s rho be appropriate? a) Comparing the ratings of a new product on a 5-point Likert scale by a panel of consumers to their ratings of a competitor’s product on the same scale. b) Comparing the sw
> For the probabilities of Exercise 8 and the decision tree of Exercise 4, using the expected values found in Exercise 8, compute the standard deviations of the values associated with each action and the corresponding coefficient of variation.
> For the probabilities of Exercise 7 and the cost matrix of Exercise 3, using the expected values you found in Exercise 7, compute the standard deviation of values associated with each action and the corresponding coefficient of variation.
> Cyanoacrylates, the generic name for several compounds with strong adhesive properties, were invented during WWII during experiments to make a special extra-clear plastic suitable for gun sights. They didn’t work for gun sights, however, because they stu
> A company that specializes in developing concrete for construction strives to continually improve the properties of its materials. To increase the compressive strength of one of its new formulations, they varied the amount of alkali content (low, medium,
> For the experiment you designed in the Brief Case of Chapter 9, analyze the results of your experiment and write up your analysis and conclusions, including any recommendations for further testing.
> Here are some diagnostic plots for the home prices data from Exercise 29. These were generated by a computer package and may look different from the plots generated by the packages you use. (In particular, note that the axes of the Normal probability plo
> Many variables have an impact on determining the price of a house. A few of these are size of the house (square feet), lot size, and number of bathrooms. Information for a random sample of homes for sale in the Statesboro, Georgia, area was obtained from
> A study by the U.S. Small Business Administration used historical data to model the GDP per capita of 24 of the countries in the Organization for Economic Cooperation and Development (OECD) (Crain, M. W., The Impact of Regulatory Costs on Small Firms, ava
> What is the financial impact of pollution abatement on small firms? The U.S. government’s Small Business Administration studied this and reported the following model. Pollution abatement/employee = -2.494 - 0.431 ln(Number of Employees) + 0.698 ln(Sales
> Here are some more interpretations of the regression model to predict the price of wine developed in Exercise 24. One of these interpretations is correct. Which is it? Explain what is wrong with the others. a) The minimum price for a bottle of wine that
> A household appliance manufacturer wants to analyze the relationship between total sales and the company’s three primary means of advertising (television, magazines, and radio). All values were in millions of dollars. They found the following regression
> Many factors affect the price of wine, including such qualitative characteristics as the variety of grape, location of winery, and label. Researchers developed a regression model considering two quantitative variables: the tasting score of the wine and t
> A regression was performed to predict selling Price of houses in dollars from their Area in square feet, Lotsize in square feet, and Age in years. The R² is 92%. The equation from this regression is given here. Price = 169,328 + 35.3 Area + 0.718 Lotsiz
> We really should have examined the residuals. Here is a scatterplot of the residuals from the regression of Exercise 14. a) Which assumptions and conditions for regression can you check with this plot? What do you conclude? Perhaps we should re-express
> Here are some plots of residuals for the regression of Exercise 13. Which of the regression conditions can you check with these plots? Do you find that those conditions are met? [Figures: residuals vs. Predicted values, and a second residual plot]
> A second-order autoregressive model for the gas prices is: Using values from the table, what is the predicted value for January 2007 (the value just past those given in the table)? Dependent variable is: Gas R squared = 82.2% R squared (adjusted) =
> The investor in Exercise 18 now accepts your analysis but claims that it demonstrates that it doesn’t matter how many weeks a show plays on Broadway; receipts will be essentially the same. Explain why this interpretation is not a valid use of this regres
> A police union leader accepts your analysis in Exercise 17 but claims that it proves that paying police more will reduce violent crime. Explain why this interpretation is not a valid use of this regression model. Offer some alternative explanations.
> Consider the coefficient of Playing Weeks in the regression table of Exercise 14. a) State the standard null and alternative hypotheses for the true coefficient of Playing Weeks. b) Test the null hypothesis (at α = 0.05) and state your conclusion. c) A
> Suppose you have fit a linear model to some data and now take a look at the residuals. For each of the following possible residuals plots, tell whether you would try a re-expression and, if so, why. a) b) c)
> Suppose you have fit a linear model to some data and now take a look at the residuals. For each of the following possible residuals plots, tell whether you would try a re-expression and, if so, why. a) b) c)
> A real estate agent collects data to develop a model that will use the Size of a new home (in square feet) to predict its Sale Price (in thousands of dollars). Which of these is most likely to be the slope of the regression line: 0.008, 0.08, 0.8, or 8?
> Although some women are colorblind, this condition is found primarily in men. An advertisement for socks marked so they were easy for someone who was colorblind to match started out “There’s a strong correlation between sex and colorblindness.” Explain i
> Here is a regression of Women’s age vs. Men’s age, and a plot of the residuals. a) The residual plot shows 4 outliers, labeled according to the years they correspond to. Explain what they say about the data for those
> In Exercise 39 you investigated the federal rate on 3-month Treasury bills between 1950 and 1980. The scatterplot below shows that the trend changed dramatically after 1980, so we’ve built a new regression model that includes only the d
> A second-order autoregressive model for the apple prices (for all 4 years of data) is / Using the values from the table, what is the predicted value for January 2007 (the value just past those given in the table)?
> In Exercise 21 we looked at the age at which women married as one of the variables considered by those selling wedding services. Another variable of concern is the difference in age of the two partners. The graph shows the ages of both men and women at f
> Here’s a plot showing the federal rate on 3-month Treasury bills from 1950 to 1980, and a regression model fit to the relationship between the Rate (in %) and Years since 1950. (www.gpoaccess.gov/eop/) a) What is the correlation betw
> How does the speed at which a car drives affect fuel economy? Owners of a taxi fleet, watching their bottom line sink beneath fuel costs, hired a research firm to tell them the optimal speed for their taxis to drive. Researchers drove a compact car for 2
> Small businesses must track every expense. A flower shop owner tracked her costs for heating and related it to the average daily Fahrenheit temperature, finding the model Cost = 133 - 2.13 Temp. The residuals plot for her data is shown. a) Interpret
> Published reports about violence in computer games have become a concern to developers and distributors of these games. One firm commissioned a study of violent behavior in elementary-school children. The researcher asked the children’s parents how much
> A researcher gathering data for a pharmaceutical firm measures blood pressure and the percentage of body fat for several adult males and finds a strong positive association. Describe three different possible cause-and-effect relationships that might be p
> The original five points in Exercise 33 produce a regression line with slope 0. Match each of the green points (a–e) with the slope of the line after that one point is added: 1) -0.45 2) -0.30 3) 0.00 4) 0.05 5) 0.85
> The scatterplot shows five blue data points at the left. Not surprisingly, the correlation for these points is r = 0. Suppose one additional data point is added at one of the five positions suggested below in green. Match each point (a–
> Each of the following scatterplots a–d shows a cluster of points and one “stray” point. For each, answer questions 1–4: 1) In what way is the point unusual? Does it have high leverag
> Each of the four scatterplots a–d that follow shows a cluster of points and one “stray” point. For each, answer questions 1–4: 1) In what way is the point unusual? Does it have high
> For the Gas prices of Exercise 6, find the lag2 version of the prices.
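A lagged series simply shifts each value forward by k time periods, leaving the first k positions empty. The sketch below is a minimal illustration; the `lag` helper and the price list are hypothetical stand-ins, since the Exercise 6 table is not reproduced here.

```python
from typing import List, Optional


def lag(series: List[float], k: int) -> List[Optional[float]]:
    """Return the lag-k version of a series.

    Position t of the result holds the value from position t - k,
    and the first k positions are None because no earlier value exists.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(series)
    if k >= n:
        return [None] * n
    return [None] * k + list(series[: n - k])


# Hypothetical monthly prices, NOT the values from the exercise table.
prices = [2.00, 2.10, 2.25, 2.40, 2.30]
print(lag(prices, 2))
```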
> Like many businesses, The National Hurricane Center also participates in a program to improve the quality of data and predictions by government agencies. They report their errors in predicting the path of hurricanes. The following scatterplot shows the t
> How does what a movie earns relate to its run time? Will audiences pay more for a longer film? Does the relationship depend on the type of film? The scatterplot shows the relationship for the films in Exercise 27 between U.S. Gross earnings and Run Time.
> Here’s a scatterplot of the production budgets (in millions of dollars) vs. the running time (in minutes) for a collection of major movies. Dramas are plotted in red and all other genres are plotted in blue. A separate least squares reg
> An intern who has created a linear model is disappointed to find that her R² value is a very low 13%. a) Does this mean that a linear model is not appropriate? Explain. b) Does this model allow the intern to make accurate predictions? Explain.
> In justifying his choice of a model, a consultant says “I know this is the correct model because R² = 99.4%.” a) Is this reasoning correct? Explain. b) Does this model allow the consultant to make accurate predictions? Explain.
> The United Nations Development Programme (UNDP) uses the Human Development Index (HDI) in an attempt to summarize in one number the progress in health, education, and economics of a country. The mean years of schooling is positively associated with HDI.
> The United Nations Development Programme (UNDP) collects data in the developing world to help countries solve global and national development challenges. In the UNDP annual Human Development Report, you can find data on over 100 variables for each of 197
> Even with campaigns to reduce smoking, Americans still consume more than four packs of cigarettes per month per adult (libraries.ucsd.edu/ssds/pub/CTS/tobacco/sales). The Centers for Disease Control and Prevention track cigarette smoking in the United S
> Weddings are one of the fastest growing businesses; about $40 billion is spent on weddings in the United States each year. But demographics may be changing, and this could affect wedding retailers’ marketing plans. Is there evidence tha
> For the Apple prices of Exercise 5, find the lag1 version of the prices.
> Orange growers know that the larger an orange the higher the price it will bring. But as the number of oranges on a tree increases, the fruit tends to be smaller. Here’s a table of that relationship. Create a model for this relationship
> The Organization for Economic Cooperation and Development (OECD) is an organization comprised of thirty countries. To belong, a country must support the principles of representative democracy and a free market economy. How have these countries grown in t
> For the regression model in Exercise 8, the leverage values look like this: The movie with the highest leverage of 0.219 is Walt Disney’s John Carter, which grossed $66M but had a budget of $300M. If the budget for John Carter had bee
> Here is the scatterplot of the variables in Exercise 7 with regression lines added for each kind of movie: The regression model is: a) Write out the regression model. b) In this regression, the variable Budget*R Rating is an interaction term. How wou
> Are R rated movies as profitable as those rated PG-13? Here’s a scatterplot of USGross ($M) vs. Budget ($M) for PG-13 (green) and R (purple) rated movies. a) How would you code the indicator variable? (Use PG-13 as the base level.) b) H
> A marketing manager has developed a regression model to predict quarterly sales of his company’s mid-weight microfiber jackets based on price and amount spent on advertising. An intern suggests that he include indicator (dummy) variables for each quarter
> For each of the following, show how you would code dummy (indicator) variables to include in a regression model. a) Type of residence (Apartment, Condominium, Townhouse, Single family home) b) Employment status (Full-time, Part-time, Unemployed)
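A categorical variable with k levels is usually coded with k - 1 indicator variables, leaving one level as the base whose indicators are all 0. The sketch below shows that coding for the residence types; `dummy_code` is a hypothetical helper, and choosing Apartment as the base level is an assumption made only for illustration.

```python
from typing import Dict, List


def dummy_code(values: List[str], levels: List[str], base: str) -> List[Dict[str, int]]:
    """Code each value as k - 1 indicator (0/1) variables.

    The base level gets no indicator of its own, so a case at the
    base level has every indicator equal to 0.
    """
    if base not in levels:
        raise ValueError("base must be one of the levels")
    indicator_names = [level for level in levels if level != base]
    return [{name: int(value == name) for name in indicator_names} for value in values]


residence_levels = ["Apartment", "Condominium", "Townhouse", "Single family home"]
sample = ["Apartment", "Townhouse", "Single family home"]

# Assumption: Apartment is used as the base level.
for row in dummy_code(sample, residence_levels, base="Apartment"):
    print(row)
```

Four residence types need three indicators, and three employment statuses would need two.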
> Here is the regression for Exercise 3 with an indicator variable: a) Write out the regression model. b) In this regression, the variable R Rating is an indicator variable that is 1 for movies that have an R rating. How would you interpret the coeffici
> Do movies of different types have different rates of return on their budgets? Here’s a scatterplot of Gross Revenue in US ($M) vs. Budget ($M) for recent movies whose MPAA Rating is either PG (blue) or R (red): a) Why might a research
> A marketing manager has developed a regression model to predict quarterly sales of his company’s down jackets based on price and amount spent on advertising. An intern suggests that he include an indicator (dummy) variable for the Fall quarter. a) How wo
> For the Gas prices of Exercise 6, the actual value for January 2007 was 2.321. Find the absolute percentage error of your forecast.
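The absolute percentage error compares the size of the forecast error with the actual value. In the sketch below, the actual value 2.321 comes from the exercise, but the forecast of 2.250 is a made-up placeholder because the forecast from Exercise 6 is not shown here.

```python
def absolute_percentage_error(actual: float, forecast: float) -> float:
    """Return |actual - forecast| / |actual| expressed as a percentage."""
    if actual == 0:
        raise ValueError("APE is undefined when the actual value is 0")
    return abs(actual - forecast) / abs(actual) * 100


actual_jan_2007 = 2.321      # given in the exercise
hypothetical_forecast = 2.250  # placeholder, NOT the exercise's forecast

ape = absolute_percentage_error(actual_jan_2007, hypothetical_forecast)
print(f"APE = {ape:.2f}%")
```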
> If the VIF for Networth in the regression of Exercise 11 was 20.83, what would the R² be from the regression of Networth on Age, Income, and Past Spending?
> The analyst from Exercise 11, worried about collinearity, regresses Age against Past Spending, Income, and Networth. The output shows: What is the VIF for Age? Response Variable: Age R² = 98.75% Adjusted R² = 98.74% s = 2.112 with 908 – 4 = 904 deg
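Both VIF questions above rest on the same identity: VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. A minimal sketch of the conversion in each direction follows; the function names are illustrative.

```python
def vif_from_r2(r_squared: float) -> float:
    """Variance inflation factor from the R² of a predictor regressed on the others."""
    if not 0 <= r_squared < 1:
        raise ValueError("R² must satisfy 0 <= R² < 1")
    return 1.0 / (1.0 - r_squared)


def r2_from_vif(vif: float) -> float:
    """Invert the identity: R² = 1 - 1 / VIF."""
    if vif < 1:
        raise ValueError("VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


print(f"VIF for Age (R² = 98.75%): {vif_from_r2(0.9875):.1f}")
print(f"R² for Networth (VIF = 20.83): {r2_from_vif(20.83):.3f}")
```

An R² of 98.75% corresponds to a VIF of about 80, and a VIF of 20.83 corresponds to an R² of about 95.2%.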
> The analyst in Exercise 11 fits the model with the four predictor variables. The regression output shows: a) How many observations were used in the regression? b) What might you do next? c) Is it clear that Income is more important to predicting Spe
> An analyst wants to build a regression model to predict spending from the following four predictor variables: a) How many observations were used in the regression? b) What might you do next? c) Is it clear that Income is more important to predicting Sp
> For the same regression as in Exercise 9, the Cook’s Distances look like this: The outlier, once again, is John Carter, whose budget was more than $200M more than its gross revenue in the U.S. Setting this movie aside and rerunning th
> For each of the following, show how you would code dummy (or indicator) variables to include in a regression model. a) Company unionization status (Unionized, No Union) b) Gender (Female, Male) c) Account Status (Paid on time, Past Due) d) Political part
> In the regression model of Exercise 3, a) What is the R² for this regression? What does it mean? b) Why is the “Adjusted R Square” in the table different from the “R Square”?
> a) What is the null hypothesis tested for the coefficient of Run Time in the regression of Exercise 3? b) What is the t-statistic corresponding to this test? c) Why is this t-statistic negative? d) What is the P-value corresponding to this t-statistic? e
> In the regression output for the movies of Exercise 3, a) What is the null hypothesis tested for the coefficient of Stars in this table? b) What is the t-statistic corresponding to this test? c) What is the P-value corresponding to this t-statistic? d) C
> For the movies regression, here is a histogram of the residuals. What does it tell us about these assumptions and conditions? a) Linearity Condition b) Nearly Normal Condition c) Equal Spread Condition [Figure: histogram of Residuals (U)]
> For the Apple prices smoothed in Exercise 5, the actual value for January 2007 was 1.034. Find the absolute percentage error of your forecast.
> For the movies examined in Exercise 4, here is a scatterplot of USGross vs. Budget: What (if anything) does this scatterplot tell us about the following assumptions and conditions for the regression? a) Linearity Condition b) Equal Spread Condition c)
> A middle manager at an entertainment company, upon seeing the analysis of Exercise 3, concludes that the longer you make a movie, the less money it will make. He argues that his company’s films should all be cut by 30 minutes to improve their gross. Expl
> What can predict how much a motion picture will make? We have data on a number of recent releases that includes the USGross (in $M), the Budget ($M), the Run Time (minutes), and the average number of Stars awarded by reviewers. The first several entries
> A candy maker surveyed chocolate bars available in a local supermarket and found the following least squares regression model: a) The hand-crafted chocolate she makes has 15 g of fat and 20 g of sugar. How many calories does the model predict for a ser
> A study of homes looking at the relationship between Age of a home and Price produced the following scatterplot. A regression was fit to the data as shown below. On the basis of this plot, would you advise using this regression? Explain. [Figure: scatterplot of Price vs. Age with fitted regression line]
> A scatterplot of Salary against Years Experience for some employees, and the scatterplot of residuals against predicted Salary from the regression line are shown in the figures. On the basis of these plots, would you recommend a re-expression of either S
> The regression of Total Revenue on Total Expenses for the concerts of Exercise 13 gives the following model: a) The Durbin-Watson statistic for this analysis is 0.73. Consult Table D in Appendix B and complete the test at α = 0.05. b) Wha
> The manager of the concert production company considered in earlier exercises considers the regression of Total Revenue on Ticket Sales (see Exercise 4) and computes the Durbin-Watson statistic, obtaining a value of 0.51. a) Consult Table D in Appendix B
> A company fits a regression to predict monthly Orders over a period of 48 months. The Durbin-Watson statistic on the residuals is 0.875. a) At α = 0.01, using k = 1 and n = 50, what are the values of dL and dU? b) Is there evidence of positive autocorrel
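The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals, and the test for positive autocorrelation compares it with the table bounds dL and dU. The sketch below shows both steps. The residuals and the bounds used in the demo are hypothetical placeholders; the real dL and dU must be looked up in the table for the chosen α, k, and n.

```python
from typing import Sequence


def durbin_watson(residuals: Sequence[float]) -> float:
    """D = sum((e_t - e_{t-1})^2) / sum(e_t^2) over residuals in time order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


def positive_autocorrelation_test(d: float, d_lower: float, d_upper: float) -> str:
    """Decision rule for the one-sided test of positive autocorrelation."""
    if d < d_lower:
        return "evidence of positive autocorrelation"
    if d > d_upper:
        return "no evidence of positive autocorrelation"
    return "inconclusive"


# Hypothetical residuals and hypothetical bounds, for illustration only.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
print(positive_autocorrelation_test(0.875, d_lower=1.0, d_upper=1.5))
```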
> A beverage company specializing in sales of champagne reports four years of quarterly sales as follows (in millions of $): The regression equation is Predicted Sales = 14.15 + 4.87 Quarter. a) Find the residuals. b) Plot the residuals against Quarter.
> Here are data on the monthly price of Delicious apples and gas, which are both components of the Consumer Price Index. The timeplot shows the years 2006–2009 for apples; the data table shows just 2006, for both. For the Gas prices: (D
> Here is another part of the regression output for the movies in Exercise 3: a) Using the values from the table, show how the value of R² could be computed. Don’t try to do the calculation, just show what is computed. b) What is the F-
> A house in the upstate New York area from which the chapter data was drawn has 2 bedrooms and 1000 square feet of living area. Using the multiple regression model found in the chapter, a) Find the price that this model estimates. b) The house just sold
> The bookstore in Exercise 5 decides to have a gala event in an attempt to drum up business. They hire 100 employees for the day and bring in a total of $42,000. a) Find the regression line predicting Sales from Number of people working with the new point
> The production company of Exercise 7 offers advanced sales to “Frequent Buyers” through its website. Here’s a relevant scatterplot: One performer refused to permit advanced sales. What effect has th
> A regression of Total Revenue on Ticket Sales by the concert production company of Exercises 2 and 4 finds the model a) Management is considering adding a stadium-style venue that would seat 10,000. What does this model predict that revenue would be if
> Here are prices for the external disk drives we saw in Chapter 15, Exercise 10: The least squares line is The assumptions and conditions for regression are met. a) Disk drives keep growing in capacity. Some tech experts now talk about Petabyte (PB
> Here are the data from the small bookstore we saw in Chapter 15, Exercise 9. The regression line is: and we can assume that the assumptions and conditions for regression are met. Calculations with technology find that a) Find the predicted sales on
> The concert production company of Exercise 2 made a second scatterplot, this time relating Total Revenue to Ticket Sales. a) Describe the relationship between Ticket Sales and Total Revenue. b) How are the results for the two venues similar? c) How are they d
> The analyst in Exercise 1 tried fitting the regression line to each market segment separately and found the following: What does this say about her concern in Exercise 1? Was she justified in worrying that the overall model might not accurately summ